[jira] [Commented] (SPARK-26741) Analyzer incorrectly resolves aggregate function outside of Aggregate operators
[ https://issues.apache.org/jira/browse/SPARK-26741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753310#comment-16753310 ] Kris Mok commented on SPARK-26741: -- Note that there's a similar issue with non-aggregate functions. Here's an example: {code:none} spark.sql("create table foo (id int, blob binary)") val df = spark.sql("select length(blob) from foo where id = 1 order by length(blob) limit 10") df.explain(true) {code} {code:none} == Parsed Logical Plan == 'GlobalLimit 10 +- 'LocalLimit 10 +- 'Sort ['length('blob) ASC NULLS FIRST], true +- 'Project [unresolvedalias('length('blob), None)] +- 'Filter ('id = 1) +- 'UnresolvedRelation `foo` == Analyzed Logical Plan == length(blob): int GlobalLimit 10 +- LocalLimit 10 +- Project [length(blob)#25] +- Sort [length(blob#24) ASC NULLS FIRST], true +- Project [length(blob#24) AS length(blob)#25, blob#24] +- Filter (id#23 = 1) +- SubqueryAlias `default`.`foo` +- HiveTableRelation `default`.`foo`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#23, blob#24] == Optimized Logical Plan == GlobalLimit 10 +- LocalLimit 10 +- Project [length(blob)#25] +- Sort [length(blob#24) ASC NULLS FIRST], true +- Project [length(blob#24) AS length(blob)#25, blob#24] +- Filter (isnotnull(id#23) && (id#23 = 1)) +- HiveTableRelation `default`.`foo`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#23, blob#24] == Physical Plan == TakeOrderedAndProject(limit=10, orderBy=[length(blob#24) ASC NULLS FIRST], output=[length(blob)#25]) +- *(1) Project [length(blob#24) AS length(blob)#25, blob#24] +- *(1) Filter (isnotnull(id#23) && (id#23 = 1)) +- Scan hive default.foo [blob#24, id#23], HiveTableRelation `default`.`foo`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#23, blob#24] {code} Note how the {{Sort}} operator computes {{length()}} again, even though the projection right below it already produces that value. The root cause in the Analyzer is the same as for the main example in this ticket, although this example is not as harmful (at least it still runs...) > Analyzer incorrectly resolves aggregate function outside of Aggregate > operators > --- > > Key: SPARK-26741 > URL: https://issues.apache.org/jira/browse/SPARK-26741 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kris Mok >Priority: Major > > The analyzer can sometimes hit issues with resolving functions. e.g. > {code:sql} > select max(id) > from range(10) > group by id > having count(1) >= 1 > order by max(id) > {code} > The analyzed plan of this query is: > {code:none} > == Analyzed Logical Plan == > max(id): bigint > Project [max(id)#91L] > +- Sort [max(id#88L) ASC NULLS FIRST], true >+- Project [max(id)#91L, id#88L] > +- Filter (count(1)#93L >= cast(1 as bigint)) > +- Aggregate [id#88L], [max(id#88L) AS max(id)#91L, count(1) AS > count(1)#93L, id#88L] > +- Range (0, 10, step=1, splits=None) > {code} > Note how an aggregate function is outside of {{Aggregate}} operators in the > fully analyzed plan: > {{Sort [max(id#88L) ASC NULLS FIRST], true}}, which makes the plan invalid. 
> Trying to run this query will lead to weird issues in codegen, but the root > cause is in the analyzer: > {code:none} > java.lang.UnsupportedOperationException: Cannot generate code for expression: > max(input[1, bigint, false]) > at > org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode(Expression.scala:291) > at > org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode$(Expression.scala:290) > at > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:87) > at > org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:138) > at scala.Option.getOrElse(Option.scala:138) > at > org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:133) > at > org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.$anonfun$createOrderKeys$1(GenerateOrdering.scala:82) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237) > at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at scala.collection.TraversableLike.map(TraversableLike.scala:237) > at scala.collection.TraversableLike.map$(TraversableLike.scala:230) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at >
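As an aside on the {{length()}} example above: one way to sidestep the duplicated evaluation is to alias the computed column in the SELECT list and sort by the alias, so the {{Sort}} binds to the projected attribute instead of resolving the function a second time. A minimal workaround sketch (assuming the {{foo}} table from the comment above; it does not fix the underlying analyzer bug):

{code:scala}
// Name the computed column and order by that name; the ORDER BY should then
// reference the projected attribute blob_len rather than a second length(blob).
val df = spark.sql(
  """select length(blob) as blob_len
    |from foo
    |where id = 1
    |order by blob_len
    |limit 10""".stripMargin)
df.explain(true)
{code}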
[jira] [Resolved] (SPARK-26735) Verify plan integrity for special expressions
[ https://issues.apache.org/jira/browse/SPARK-26735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-26735. - Resolution: Fixed Assignee: Kris Mok Fix Version/s: 3.0.0 > Verify plan integrity for special expressions > - > > Key: SPARK-26735 > URL: https://issues.apache.org/jira/browse/SPARK-26735 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kris Mok >Assignee: Kris Mok >Priority: Major > Fix For: 3.0.0 > > > Add verification of plan integrity with regards to special expressions being > hosted only in supported operators. Specifically: > * AggregateExpression: should only be hosted in Aggregate, or indirectly in > Window > * WindowExpression: should only be hosted in Window > * Generator: should only be hosted in Generate -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
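As a rough illustration of the kind of integrity check this ticket asks for (a sketch only; the real rule lives inside Spark's analyzer and differs in detail), one could walk a logical plan and collect aggregate expressions hosted outside the supported operators:

{code:scala}
import org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression
import org.apache.spark.sql.catalyst.plans.logical.{Aggregate, LogicalPlan, Window}

// Sketch of a plan-integrity predicate: gather AggregateExpressions that appear
// in any operator other than Aggregate or Window. The WindowExpression and
// Generator checks would follow the same shape.
def aggregatesInUnsupportedOperators(plan: LogicalPlan): Seq[AggregateExpression] =
  plan.collect {
    case _: Aggregate | _: Window => Seq.empty[AggregateExpression]
    case other =>
      other.expressions.flatMap(_.collect { case ae: AggregateExpression => ae })
  }.flatten
{code}

Applied to the analyzed plan in SPARK-26741, such a check would surface the {{max(id#88L)}} hosted by the {{Sort}} operator.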
[jira] [Created] (SPARK-26744) Create API supportDataType in File Source V2 framework
Gengliang Wang created SPARK-26744: -- Summary: Create API supportDataType in File Source V2 framework Key: SPARK-26744 URL: https://issues.apache.org/jira/browse/SPARK-26744 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Gengliang Wang -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
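The ticket gives no design details, so purely as a hypothetical sketch of what a per-format {{supportDataType}} hook could look like (the trait name and shape here are assumptions, not the actual Spark API):

{code:scala}
import org.apache.spark.sql.types._

// Hypothetical trait a File Source V2 format could implement -- NOT the real API.
trait SupportsDataType {
  /** Whether this format can read/write values of the given type. */
  def supportDataType(dataType: DataType): Boolean
}

// Example: a format that handles a few atomic types plus nested arrays/structs.
object ExampleFileFormat extends SupportsDataType {
  override def supportDataType(dataType: DataType): Boolean = dataType match {
    case StringType | IntegerType | LongType | DoubleType | BooleanType => true
    case ArrayType(elementType, _) => supportDataType(elementType)
    case StructType(fields) => fields.forall(f => supportDataType(f.dataType))
    case _ => false
  }
}
{code}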
[jira] [Assigned] (SPARK-26743) Add a test to check the actual resource limit set via 'spark.executor.pyspark.memory'
[ https://issues.apache.org/jira/browse/SPARK-26743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26743: Assignee: (was: Apache Spark) > Add a test to check the actual resource limit set via > 'spark.executor.pyspark.memory' > - > > Key: SPARK-26743 > URL: https://issues.apache.org/jira/browse/SPARK-26743 > Project: Spark > Issue Type: Test > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > Looks the test that checks the actual resource limit set (by > 'spark.executor.pyspark.memory') is missing. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26743) Add a test to check the actual resource limit set via 'spark.executor.pyspark.memory'
[ https://issues.apache.org/jira/browse/SPARK-26743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26743: Assignee: Apache Spark > Add a test to check the actual resource limit set via > 'spark.executor.pyspark.memory' > - > > Key: SPARK-26743 > URL: https://issues.apache.org/jira/browse/SPARK-26743 > Project: Spark > Issue Type: Test > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > Looks the test that checks the actual resource limit set (by > 'spark.executor.pyspark.memory') is missing. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26743) Add a test to check the actual resource limit set via 'spark.executor.pyspark.memory'
[ https://issues.apache.org/jira/browse/SPARK-26743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26743: - Description: Looks the test that checks the actual resource limit set (by 'spark.executor.pyspark.memory') is missing. (was: Looks the test that checks the actual resource limit set (by 'spark.executor.pyspark.memory'0 is missing.) > Add a test to check the actual resource limit set via > 'spark.executor.pyspark.memory' > - > > Key: SPARK-26743 > URL: https://issues.apache.org/jira/browse/SPARK-26743 > Project: Spark > Issue Type: Test > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > Looks the test that checks the actual resource limit set (by > 'spark.executor.pyspark.memory') is missing. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26080) Unable to run worker.py on Windows
[ https://issues.apache.org/jira/browse/SPARK-26080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26080: - Labels: release-notes (was: ) > Unable to run worker.py on Windows > -- > > Key: SPARK-26080 > URL: https://issues.apache.org/jira/browse/SPARK-26080 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 > Environment: Windows 10 Education 64 bit >Reporter: Hayden Jeune >Assignee: Hyukjin Kwon >Priority: Blocker > Labels: release-notes > Fix For: 2.4.1, 3.0.0 > > > Use of the resource module in python means worker.py cannot run on a windows > system. This package is only available in unix based environments. > [https://github.com/apache/spark/blob/9a5fda60e532dc7203d21d5fbe385cd561906ccb/python/pyspark/worker.py#L25] > {code:python} > textFile = sc.textFile("README.md") > textFile.first() > {code} > When the above commands are run I receive the error 'worker failed to connect > back', and I can see an exception in the console coming from worker.py saying > 'ModuleNotFoundError: No module named resource' > I do not really know enough about what I'm doing to fix this myself. > Apologies if there's something simple I'm missing here. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26743) Add a test to check the actual resource limit set via 'spark.executor.pyspark.memory'
Hyukjin Kwon created SPARK-26743: Summary: Add a test to check the actual resource limit set via 'spark.executor.pyspark.memory' Key: SPARK-26743 URL: https://issues.apache.org/jira/browse/SPARK-26743 Project: Spark Issue Type: Test Components: PySpark Affects Versions: 3.0.0 Reporter: Hyukjin Kwon Looks the test that checks the actual resource limit set (by 'spark.executor.pyspark.memory'0 is missing. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25981) Arrow optimization for conversion from R DataFrame to Spark DataFrame
[ https://issues.apache.org/jira/browse/SPARK-25981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-25981. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 22954 [https://github.com/apache/spark/pull/22954] > Arrow optimization for conversion from R DataFrame to Spark DataFrame > - > > Key: SPARK-25981 > URL: https://issues.apache.org/jira/browse/SPARK-25981 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.0 > > > PySpark introduced an optimization for toPandas and createDataFrame with > Pandas DataFrame. > This was leveraged by PyArrow API. > R Arrow API is under developement > (https://github.com/apache/arrow/tree/master/r) and about to be released via > CRAN (https://issues.apache.org/jira/browse/ARROW-3204). > Once it's released, we can reuse PySpark's Arrow optimization code path and > leverage it with minimised codes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26630) Support reading Hive-serde tables whose INPUTFORMAT is org.apache.hadoop.mapreduce
[ https://issues.apache.org/jira/browse/SPARK-26630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753248#comment-16753248 ] Dongjoon Hyun commented on SPARK-26630: --- Thank you, [~Deegue] and [~smilegator]. > Support reading Hive-serde tables whose INPUTFORMAT is > org.apache.hadoop.mapreduce > -- > > Key: SPARK-26630 > URL: https://issues.apache.org/jira/browse/SPARK-26630 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 3.0.0 >Reporter: Deegue >Assignee: Deegue >Priority: Major > Fix For: 3.0.0 > > > This bug found in [link title|https://github.com/apache/spark/pull/23506] (PR > #23506). > It will throw ClassCastException when we use new input format (eg. > `org.apache.hadoop.mapreduce.InputFormat`) to create HadoopRDD.So we need to > use NewHadoopRDD to deal with this input format in TableReader.scala. > Exception : > {noformat} > Caused by: java.lang.ClassCastException: > org.apache.hadoop.mapreduce.lib.input.TextInputFormat cannot be cast to > org.apache.hadoop.mapred.InputFormat > at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:190) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at org.apache.spark.ShuffleDependency.(Dependency.scala:96) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.prepareShuffleDependency(ShuffleExchangeExec.scala:343) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:101) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.$anonfun$doExecute$1(ShuffleExchangeExec.scala:137) > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52) > ... 87 more > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
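For context, the idea behind the fix can be outlined as follows (a sketch of the approach, not the actual TableReader change): dispatch on which Hadoop InputFormat API the configured class implements, and build the matching RDD type instead of unconditionally casting to the old {{org.apache.hadoop.mapred}} API.

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Sketch: use NewHadoopRDD (via newAPIHadoopFile) for new-API input formats and
// plain HadoopRDD (via hadoopFile) otherwise, avoiding the ClassCastException.
def rddForInputFormat(
    sc: SparkContext,
    conf: Configuration,
    inputFormatClass: Class[_],
    path: String): RDD[(LongWritable, Text)] = {
  if (classOf[org.apache.hadoop.mapreduce.InputFormat[_, _]]
      .isAssignableFrom(inputFormatClass)) {
    sc.newAPIHadoopFile(
      path,
      inputFormatClass
        .asInstanceOf[Class[org.apache.hadoop.mapreduce.lib.input.TextInputFormat]],
      classOf[LongWritable], classOf[Text], conf)
  } else {
    sc.hadoopFile(
      path,
      inputFormatClass.asInstanceOf[Class[org.apache.hadoop.mapred.TextInputFormat]],
      classOf[LongWritable], classOf[Text])
  }
}
{code}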
[jira] [Assigned] (SPARK-25981) Arrow optimization for conversion from R DataFrame to Spark DataFrame
[ https://issues.apache.org/jira/browse/SPARK-25981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-25981: Assignee: Hyukjin Kwon > Arrow optimization for conversion from R DataFrame to Spark DataFrame > - > > Key: SPARK-25981 > URL: https://issues.apache.org/jira/browse/SPARK-25981 > Project: Spark > Issue Type: Improvement > Components: SparkR >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > PySpark introduced an optimization for toPandas and createDataFrame with > Pandas DataFrame. > This was leveraged by PyArrow API. > R Arrow API is under developement > (https://github.com/apache/arrow/tree/master/r) and about to be released via > CRAN (https://issues.apache.org/jira/browse/ARROW-3204). > Once it's released, we can reuse PySpark's Arrow optimization code path and > leverage it with minimised codes. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26663) Cannot query a Hive table with subdirectories
[ https://issues.apache.org/jira/browse/SPARK-26663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753218#comment-16753218 ] Dongjoon Hyun commented on SPARK-26663: --- I'm closing this issue as `Cannot Reproduce`. It may depend on your environment. > Cannot query a Hive table with subdirectories > - > > Key: SPARK-26663 > URL: https://issues.apache.org/jira/browse/SPARK-26663 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Aäron >Priority: Major > > Hello, > > I want to report the following issue (my first one :) ) > When I create a table in Hive based on a union all then Spark 2.4 is unable > to query this table. > To reproduce: > *Hive 1.2.1* > {code:java} > hive> creat table a(id int); > insert into a values(1); > hive> creat table b(id int); > insert into b values(2); > hive> create table c(id int) as select id from a union all select id from b; > {code} > > *Spark 2.3.1* > > {code:java} > scala> spark.table("c").show > +---+ > | id| > +---+ > | 1| > | 2| > +---+ > scala> spark.table("c").count > res5: Long = 2 > {code} > > *Spark 2.4.0* > {code:java} > scala> spark.table("c").show > 19/01/18 17:00:49 WARN HiveMetastoreCatalog: Unable to infer schema for table > perftest_be.c from file format ORC (inference mode: INFER_AND_SAVE). Using > metastore schema. > +---+ > | id| > +---+ > +---+ > scala> spark.table("c").count > res3: Long = 0 > {code} > I did not find an existing issue for this. Might be important to investigate. > > +Extra info:+ Spark 2.3.1 and 2.4.0 use the same spark-defaults.conf. > > Kind regards. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26663) Cannot query a Hive table with subdirectories
[ https://issues.apache.org/jira/browse/SPARK-26663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-26663. --- Resolution: Cannot Reproduce > Cannot query a Hive table with subdirectories > - > > Key: SPARK-26663 > URL: https://issues.apache.org/jira/browse/SPARK-26663 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Aäron >Priority: Major > > Hello, > > I want to report the following issue (my first one :) ) > When I create a table in Hive based on a union all then Spark 2.4 is unable > to query this table. > To reproduce: > *Hive 1.2.1* > {code:java} > hive> creat table a(id int); > insert into a values(1); > hive> creat table b(id int); > insert into b values(2); > hive> create table c(id int) as select id from a union all select id from b; > {code} > > *Spark 2.3.1* > > {code:java} > scala> spark.table("c").show > +---+ > | id| > +---+ > | 1| > | 2| > +---+ > scala> spark.table("c").count > res5: Long = 2 > {code} > > *Spark 2.4.0* > {code:java} > scala> spark.table("c").show > 19/01/18 17:00:49 WARN HiveMetastoreCatalog: Unable to infer schema for table > perftest_be.c from file format ORC (inference mode: INFER_AND_SAVE). Using > metastore schema. > +---+ > | id| > +---+ > +---+ > scala> spark.table("c").count > res3: Long = 0 > {code} > I did not find an existing issue for this. Might be important to investigate. > > +Extra info:+ Spark 2.3.1 and 2.4.0 use the same spark-defaults.conf. > > Kind regards. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26663) Cannot query a Hive table with subdirectories
[ https://issues.apache.org/jira/browse/SPARK-26663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753216#comment-16753216 ] Dongjoon Hyun commented on SPARK-26663: --- Hi, [~pomptuintje]. Thank you for reporting. Actually, the given example has incorrect syntax like `creat table`. It would be better if you report the exact script you used. I did the following but cannot reproduce the issue. {code:java} Logging initialized using configuration in jar:file:/Users/dongjoon/APACHE/hive-release/apache-hive-1.2.2-bin/lib/hive-common-1.2.2.jar!/hive-log4j.properties hive> create table a(id int); OK Time taken: 1.299 seconds hive> create table b(id int); OK Time taken: 0.046 seconds hive> insert into a values(1); Query ID = dongjoon_20190126145804_2c0252e1-d07c-4213-a387-90efe26d450b Total jobs = 3 Launching Job 1 out of 3 Number of reduce tasks is set to 0 since there's no reduce operator Job running in-process (local Hadoop) 2019-01-26 14:58:06,272 Stage-1 map = 100%, reduce = 0% Ended Job = job_local2005651311_0001 Stage-4 is selected by condition resolver. Stage-3 is filtered out by condition resolver. Stage-5 is filtered out by condition resolver. Moving data to: file:/user/hive/warehouse/a/.hive-staging_hive_2019-01-26_14-58-04_030_4426868381325183205-1/-ext-1 Loading data to table default.a Table default.a stats: [numFiles=1, numRows=1, totalSize=2, rawDataSize=1] MapReduce Jobs Launched: Stage-Stage-1: HDFS Read: 0 HDFS Write: 0 SUCCESS Total MapReduce CPU Time Spent: 0 msec OK Time taken: 2.436 seconds hive> insert into b values(1); Query ID = dongjoon_20190126145810_034d9c36-0f23-42a6-ac0a-681839335bd6 Total jobs = 3 Launching Job 1 out of 3 Number of reduce tasks is set to 0 since there's no reduce operator Job running in-process (local Hadoop) 2019-01-26 14:58:11,941 Stage-1 map = 100%, reduce = 0% Ended Job = job_local966105199_0002 Stage-4 is selected by condition resolver. Stage-3 is filtered out by condition resolver. Stage-5 is filtered out by condition resolver. Moving data to: file:/user/hive/warehouse/b/.hive-staging_hive_2019-01-26_14-58-10_554_693159949912597124-1/-ext-1 Loading data to table default.b Table default.b stats: [numFiles=1, numRows=1, totalSize=2, rawDataSize=1] MapReduce Jobs Launched: Stage-Stage-1: HDFS Read: 0 HDFS Write: 0 SUCCESS Total MapReduce CPU Time Spent: 0 msec OK Time taken: 1.551 seconds hive> create table c as select id from a union all select id from b; Query ID = dongjoon_20190126145831_c2b31651-c88b-47ab-9081-2375cf064b15 Total jobs = 3 Launching Job 1 out of 3 Number of reduce tasks is set to 0 since there's no reduce operator Job running in-process (local Hadoop) 2019-01-26 14:58:33,130 Stage-1 map = 100%, reduce = 0% Ended Job = job_local1725928125_0003 Stage-4 is filtered out by condition resolver. Stage-3 is selected by condition resolver. Stage-5 is filtered out by condition resolver. 
Launching Job 3 out of 3 Number of reduce tasks is set to 0 since there's no reduce operator Job running in-process (local Hadoop) 2019-01-26 14:58:34,449 Stage-3 map = 100%, reduce = 0% Ended Job = job_local1246940820_0004 Moving data to: file:/user/hive/warehouse/c Table default.c stats: [numFiles=1, numRows=2, totalSize=4, rawDataSize=2] MapReduce Jobs Launched: Stage-Stage-1: HDFS Read: 0 HDFS Write: 0 SUCCESS Stage-Stage-3: HDFS Read: 0 HDFS Write: 0 SUCCESS Total MapReduce CPU Time Spent: 0 msec OK Time taken: 2.806 seconds {code} {code:java} 19/01/26 14:58:42 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel). Spark context available as 'sc' (master = local[*], app id = local-1548543527800). Spark session available as 'spark'. Welcome to __ / __/__ ___ _/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 2.4.0 /_/ Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_201) Type in expressions to have them evaluated. Type :help for more information. scala> sql("select * from c").show 19/01/26 14:58:57 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException +---+ | id| +---+ | 1| | 1| +---+ {code} > Cannot query a Hive table with subdirectories > - > > Key: SPARK-26663 > URL: https://issues.apache.org/jira/browse/SPARK-26663 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Aäron >Priority: Major > > Hello, > > I want to report the following issue (my first one :) ) > When I create a table in Hive based on a union all then Spark 2.4 is unable > to query this table. > To reproduce:
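One more diagnostic idea for this report, since the title points at subdirectories (Hive's CTAS over UNION ALL can write its output into subdirectories under the table location on some configurations). Whether this helps is environment-dependent; it is an assumption for diagnosis, not a confirmed fix:

{code:scala}
// Diagnostic sketch (assumption, not a confirmed fix): make the Hadoop input
// layer descend into subdirectories under the table location, then re-query.
spark.sparkContext.hadoopConfiguration
  .set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
spark.table("c").show()
spark.table("c").count()
{code}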
[jira] [Commented] (SPARK-26675) Error happened during creating avro files
[ https://issues.apache.org/jira/browse/SPARK-26675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753213#comment-16753213 ] Dongjoon Hyun commented on SPARK-26675: --- [~tony0918]. Do you have a problem in Spark Scala Shell environment, too? > Error happened during creating avro files > - > > Key: SPARK-26675 > URL: https://issues.apache.org/jira/browse/SPARK-26675 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Tony Mao >Priority: Major > > Run cmd > {code:java} > spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.0 > /nke/reformat.py > {code} > code in reformat.py > {code:java} > df = spark.read.option("multiline", "true").json("file:///nke/example1.json") > df.createOrReplaceTempView("traffic") > a = spark.sql("""SELECT store.*, store.name as store_name, > store.dataSupplierId as store_dataSupplierId, trafficSensor.*, > trafficSensor.name as trafficSensor_name, trafficSensor.dataSupplierId as > trafficSensor_dataSupplierId, readings.* > FROM (SELECT explode(stores) as store, explode(store.trafficSensors) as > trafficSensor, > explode(trafficSensor.trafficSensorReadings) as readings FROM traffic)""") > b = a.drop("trafficSensors", "trafficSensorReadings", "name", > "dataSupplierId") > b.write.format("avro").save("file:///nke/curated/namesAndFavColors.avro") > {code} > Error message: > {code:java} > Traceback (most recent call last): > File "/nke/reformat.py", line 18, in > b.select("store_name", > "store_dataSupplierId").write.format("avro").save("file:///nke/curated/namesAndFavColors.avro") > File "/usr/spark-2.4.0/python/lib/pyspark.zip/pyspark/sql/readwriter.py", > line 736, in save > File "/usr/spark-2.4.0/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1257, in __call__ > File "/usr/spark-2.4.0/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, > in deco > File "/usr/spark-2.4.0/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line > 328, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o45.save. 
> : java.lang.NoSuchMethodError: > org.apache.avro.Schema.createUnion([Lorg/apache/avro/Schema;)Lorg/apache/avro/Schema; > at > org.apache.spark.sql.avro.SchemaConverters$.toAvroType(SchemaConverters.scala:185) > at > org.apache.spark.sql.avro.SchemaConverters$$anonfun$5.apply(SchemaConverters.scala:176) > at > org.apache.spark.sql.avro.SchemaConverters$$anonfun$5.apply(SchemaConverters.scala:174) > at scala.collection.Iterator$class.foreach(Iterator.scala:891) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99) > at > org.apache.spark.sql.avro.SchemaConverters$.toAvroType(SchemaConverters.scala:174) > at > org.apache.spark.sql.avro.AvroFileFormat$$anonfun$7.apply(AvroFileFormat.scala:118) > at > org.apache.spark.sql.avro.AvroFileFormat$$anonfun$7.apply(AvroFileFormat.scala:118) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.avro.AvroFileFormat.prepareWrite(AvroFileFormat.scala:118) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:103) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) > at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at >
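For what it's worth, a {{NoSuchMethodError}} on {{org.apache.avro.Schema.createUnion([Lorg/apache/avro/Schema;)}} typically indicates that an older Avro jar (1.7.x, which lacks the varargs {{createUnion}} overload added in 1.8) is shadowing the Avro 1.8.x that spark-avro 2.4.0 compiles against. A quick classpath check from the Scala shell:

{code:scala}
// Print which jar the Avro Schema class was actually loaded from; if this points
// at an avro-1.7.x jar (e.g. one bundled by Hadoop/Hive), that is the conflict.
val avroSource = classOf[org.apache.avro.Schema]
  .getProtectionDomain.getCodeSource.getLocation
println(s"org.apache.avro.Schema loaded from: $avroSource")
{code}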
[jira] [Commented] (SPARK-26675) Error happened during creating avro files
[ https://issues.apache.org/jira/browse/SPARK-26675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753211#comment-16753211 ] Dongjoon Hyun commented on SPARK-26675: --- cc [~Gengliang.Wang] > Error happened during creating avro files > - > > Key: SPARK-26675 > URL: https://issues.apache.org/jira/browse/SPARK-26675 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Tony Mao >Priority: Major > > Run cmd > {code:java} > spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.0 > /nke/reformat.py > {code} > code in reformat.py > {code:java} > df = spark.read.option("multiline", "true").json("file:///nke/example1.json") > df.createOrReplaceTempView("traffic") > a = spark.sql("""SELECT store.*, store.name as store_name, > store.dataSupplierId as store_dataSupplierId, trafficSensor.*, > trafficSensor.name as trafficSensor_name, trafficSensor.dataSupplierId as > trafficSensor_dataSupplierId, readings.* > FROM (SELECT explode(stores) as store, explode(store.trafficSensors) as > trafficSensor, > explode(trafficSensor.trafficSensorReadings) as readings FROM traffic)""") > b = a.drop("trafficSensors", "trafficSensorReadings", "name", > "dataSupplierId") > b.write.format("avro").save("file:///nke/curated/namesAndFavColors.avro") > {code} > Error message: > {code:java} > Traceback (most recent call last): > File "/nke/reformat.py", line 18, in > b.select("store_name", > "store_dataSupplierId").write.format("avro").save("file:///nke/curated/namesAndFavColors.avro") > File "/usr/spark-2.4.0/python/lib/pyspark.zip/pyspark/sql/readwriter.py", > line 736, in save > File "/usr/spark-2.4.0/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1257, in __call__ > File "/usr/spark-2.4.0/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, > in deco > File "/usr/spark-2.4.0/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line > 328, in get_return_value > py4j.protocol.Py4JJavaError: An error occurred while calling o45.save. 
> : java.lang.NoSuchMethodError: > org.apache.avro.Schema.createUnion([Lorg/apache/avro/Schema;)Lorg/apache/avro/Schema; > at > org.apache.spark.sql.avro.SchemaConverters$.toAvroType(SchemaConverters.scala:185) > at > org.apache.spark.sql.avro.SchemaConverters$$anonfun$5.apply(SchemaConverters.scala:176) > at > org.apache.spark.sql.avro.SchemaConverters$$anonfun$5.apply(SchemaConverters.scala:174) > at scala.collection.Iterator$class.foreach(Iterator.scala:891) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99) > at > org.apache.spark.sql.avro.SchemaConverters$.toAvroType(SchemaConverters.scala:174) > at > org.apache.spark.sql.avro.AvroFileFormat$$anonfun$7.apply(AvroFileFormat.scala:118) > at > org.apache.spark.sql.avro.AvroFileFormat$$anonfun$7.apply(AvroFileFormat.scala:118) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.avro.AvroFileFormat.prepareWrite(AvroFileFormat.scala:118) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:103) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) > at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at >
[jira] [Commented] (SPARK-26742) Bump Kubernetes Client Version to 4.1.1
[ https://issues.apache.org/jira/browse/SPARK-26742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753207#comment-16753207 ] Dongjoon Hyun commented on SPARK-26742: --- BTW, we are already using 4.1.0. {code} 4.1.0 {code} > Bump Kubernetes Client Version to 4.1.1 > --- > > Key: SPARK-26742 > URL: https://issues.apache.org/jira/browse/SPARK-26742 > Project: Spark > Issue Type: Dependency upgrade > Components: Kubernetes >Affects Versions: 2.4.0, 3.0.0 >Reporter: Steve Davids >Priority: Major > Labels: easyfix > > Spark 2.x is using Kubernetes Client 3.x which is pretty old, the master > branch has 4.0, the client should be upgraded to 4.1.1 to have the broadest > Kubernetes compatibility support for newer clusters: > https://github.com/fabric8io/kubernetes-client#compatibility-matrix -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-26742) Bump Kubernetes Client Version to 4.1.1
[ https://issues.apache.org/jira/browse/SPARK-26742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753207#comment-16753207 ] Dongjoon Hyun edited comment on SPARK-26742 at 1/26/19 10:38 PM: - BTW, we are already using 4.1.0 for Spark 3.0.0. {code} 4.1.0 {code} was (Author: dongjoon): BTW, we are already using 4.1.0. {code} 4.1.0 {code} > Bump Kubernetes Client Version to 4.1.1 > --- > > Key: SPARK-26742 > URL: https://issues.apache.org/jira/browse/SPARK-26742 > Project: Spark > Issue Type: Dependency upgrade > Components: Kubernetes >Affects Versions: 2.4.0, 3.0.0 >Reporter: Steve Davids >Priority: Major > Labels: easyfix > > Spark 2.x is using Kubernetes Client 3.x which is pretty old, the master > branch has 4.0, the client should be upgraded to 4.1.1 to have the broadest > Kubernetes compatibility support for newer clusters: > https://github.com/fabric8io/kubernetes-client#compatibility-matrix -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26742) Bump Kubernetes Client Version to 4.1.1
[ https://issues.apache.org/jira/browse/SPARK-26742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753206#comment-16753206 ] Dongjoon Hyun commented on SPARK-26742: --- SPARK-26603 is tracking work on the testing environment to embrace newer K8S versions. > Bump Kubernetes Client Version to 4.1.1 > --- > > Key: SPARK-26742 > URL: https://issues.apache.org/jira/browse/SPARK-26742 > Project: Spark > Issue Type: Dependency upgrade > Components: Kubernetes >Affects Versions: 2.4.0, 3.0.0 >Reporter: Steve Davids >Priority: Major > Labels: easyfix > > Spark 2.x is using Kubernetes Client 3.x which is pretty old, the master > branch has 4.0, the client should be upgraded to 4.1.1 to have the broadest > Kubernetes compatibility support for newer clusters: > https://github.com/fabric8io/kubernetes-client#compatibility-matrix -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26742) Bump Kubernetes Client Version to 4.1.1
Steve Davids created SPARK-26742: Summary: Bump Kubernetes Client Version to 4.1.1 Key: SPARK-26742 URL: https://issues.apache.org/jira/browse/SPARK-26742 Project: Spark Issue Type: Dependency upgrade Components: Kubernetes Affects Versions: 2.4.0, 3.0.0 Reporter: Steve Davids Spark 2.x is using Kubernetes Client 3.x which is pretty old, the master branch has 4.0, the client should be upgraded to 4.1.1 to have the broadest Kubernetes compatibility support for newer clusters: https://github.com/fabric8io/kubernetes-client#compatibility-matrix -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26741) Analyzer incorrectly resolves aggregate function outside of Aggregate operators
Kris Mok created SPARK-26741: Summary: Analyzer incorrectly resolves aggregate function outside of Aggregate operators Key: SPARK-26741 URL: https://issues.apache.org/jira/browse/SPARK-26741 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Kris Mok The analyzer can sometimes hit issues with resolving functions. e.g. {code:sql} select max(id) from range(10) group by id having count(1) >= 1 order by max(id) {code} The analyzed plan of this query is: {code:none} == Analyzed Logical Plan == max(id): bigint Project [max(id)#91L] +- Sort [max(id#88L) ASC NULLS FIRST], true +- Project [max(id)#91L, id#88L] +- Filter (count(1)#93L >= cast(1 as bigint)) +- Aggregate [id#88L], [max(id#88L) AS max(id)#91L, count(1) AS count(1)#93L, id#88L] +- Range (0, 10, step=1, splits=None) {code} Note how an aggregate function is outside of {{Aggregate}} operators in the fully analyzed plan: {{Sort [max(id#88L) ASC NULLS FIRST], true}}, which makes the plan invalid. Trying to run this query will lead to weird issues in codegen, but the root cause is in the analyzer: {code:none} java.lang.UnsupportedOperationException: Cannot generate code for expression: max(input[1, bigint, false]) at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode(Expression.scala:291) at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode$(Expression.scala:290) at org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:87) at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:138) at scala.Option.getOrElse(Option.scala:138) at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:133) at org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.$anonfun$createOrderKeys$1(GenerateOrdering.scala:82) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at scala.collection.TraversableLike.map(TraversableLike.scala:237) at scala.collection.TraversableLike.map$(TraversableLike.scala:230) at scala.collection.AbstractTraversable.map(Traversable.scala:108) at org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.createOrderKeys(GenerateOrdering.scala:82) at org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.genComparisons(GenerateOrdering.scala:91) at org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:152) at org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:44) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1194) at org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.(GenerateOrdering.scala:195) at org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.(GenerateOrdering.scala:192) at org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:153) at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3302) at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2470) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3291) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87) at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:147) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3287) at org.apache.spark.sql.Dataset.head(Dataset.scala:2470) at org.apache.spark.sql.Dataset.take(Dataset.scala:2684) at org.apache.spark.sql.Dataset.getRows(Dataset.scala:262) at org.apache.spark.sql.Dataset.showString(Dataset.scala:299) at org.apache.spark.sql.Dataset.show(Dataset.scala:753) at org.apache.spark.sql.Dataset.show(Dataset.scala:712) at org.apache.spark.sql.Dataset.show(Dataset.scala:721) {code} The test case {{SPARK-23957 Remove redundant sort from subquery plan(scalar subquery)}} in {{SubquerySuite}} has been disabled because of hitting this issue, caught by SPARK-26735. We should re-enable that test once this bug is fixed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26740) Statistics for date and timestamp columns depend on system time zone
[ https://issues.apache.org/jira/browse/SPARK-26740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26740: Assignee: Apache Spark > Statistics for date and timestamp columns depend on system time zone > > > Key: SPARK-26740 > URL: https://issues.apache.org/jira/browse/SPARK-26740 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > > While saving statistics for timestamp/date columns, default time zone is used > in conversion of internal type (microseconds or days since epoch) to textual > representation. The textual representation doesn't contain time zone. So, > when it is converted back to internal types (Long for TimestampType or > DateType), the Timestamp.valueOf and Date.valueOf are used in conversions. > The methods use current system time zone. > If system time zone is different while saving and retrieving statistics for > timestamp/date columns, restored microseconds/days since epoch will be > different. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26740) Statistics for date and timestamp columns depend on system time zone
[ https://issues.apache.org/jira/browse/SPARK-26740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26740: Assignee: (was: Apache Spark) > Statistics for date and timestamp columns depend on system time zone > > > Key: SPARK-26740 > URL: https://issues.apache.org/jira/browse/SPARK-26740 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Major > > While saving statistics for timestamp/date columns, default time zone is used > in conversion of internal type (microseconds or days since epoch) to textual > representation. The textual representation doesn't contain time zone. So, > when it is converted back to internal types (Long for TimestampType or > DateType), the Timestamp.valueOf and Date.valueOf are used in conversions. > The methods use current system time zone. > If system time zone is different while saving and retrieving statistics for > timestamp/date columns, restored microseconds/days since epoch will be > different. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26740) Statistics for date and timestamp columns depend on system time zone
Maxim Gekk created SPARK-26740: -- Summary: Statistics for date and timestamp columns depend on system time zone Key: SPARK-26740 URL: https://issues.apache.org/jira/browse/SPARK-26740 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Maxim Gekk While saving statistics for timestamp/date columns, default time zone is used in conversion of internal type (microseconds or days since epoch) to textual representation. The textual representation doesn't contain time zone. So, when it is converted back to internal types (Long for TimestampType or DateType), the Timestamp.valueOf and Date.valueOf are used in conversions. The methods use current system time zone. If system time zone is different while saving and retrieving statistics for timestamp/date columns, restored microseconds/days since epoch will be different. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
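A minimal JVM-only demonstration of the dependence described above (assuming nothing beyond {{java.sql.Timestamp}}): the same textual statistic maps to different epoch values depending on the default time zone in effect when it is parsed back.

{code:scala}
import java.sql.Timestamp
import java.util.TimeZone

// The statistics text carries no zone, so Timestamp.valueOf interprets it in
// the JVM's current default time zone -- write and read under different zones
// and the restored epoch value drifts.
val text = "2019-01-26 00:00:00"

TimeZone.setDefault(TimeZone.getTimeZone("UTC"))
val savedMillis = Timestamp.valueOf(text).getTime      // parsed as UTC

TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
val restoredMillis = Timestamp.valueOf(text).getTime   // parsed as PST

println(restoredMillis - savedMillis)  // 28800000 ms = 8 hours of drift
{code}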
[jira] [Commented] (SPARK-26630) Support reading Hive-serde tables whose INPUTFORMAT is org.apache.hadoop.mapreduce
[ https://issues.apache.org/jira/browse/SPARK-26630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753146#comment-16753146 ] Xiao Li commented on SPARK-26630: - This is an improvement. Thanks! > Support reading Hive-serde tables whose INPUTFORMAT is > org.apache.hadoop.mapreduce > -- > > Key: SPARK-26630 > URL: https://issues.apache.org/jira/browse/SPARK-26630 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 3.0.0 >Reporter: Deegue >Assignee: Deegue >Priority: Major > Fix For: 3.0.0 > > > This bug found in [link title|https://github.com/apache/spark/pull/23506] (PR > #23506). > It will throw ClassCastException when we use new input format (eg. > `org.apache.hadoop.mapreduce.InputFormat`) to create HadoopRDD.So we need to > use NewHadoopRDD to deal with this input format in TableReader.scala. > Exception : > {noformat} > Caused by: java.lang.ClassCastException: > org.apache.hadoop.mapreduce.lib.input.TextInputFormat cannot be cast to > org.apache.hadoop.mapred.InputFormat > at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:190) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at org.apache.spark.ShuffleDependency.(Dependency.scala:96) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.prepareShuffleDependency(ShuffleExchangeExec.scala:343) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:101) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.$anonfun$doExecute$1(ShuffleExchangeExec.scala:137) > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52) > ... 87 more > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26630) Support reading Hive-serde tables whose INPUTFORMAT is org.apache.hadoop.mapreduce
[ https://issues.apache.org/jira/browse/SPARK-26630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-26630. - Resolution: Fixed Fix Version/s: 3.0.0 > Support reading Hive-serde tables whose INPUTFORMAT is > org.apache.hadoop.mapreduce > -- > > Key: SPARK-26630 > URL: https://issues.apache.org/jira/browse/SPARK-26630 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 3.0.0 >Reporter: Deegue >Assignee: Deegue >Priority: Major > Fix For: 3.0.0 > > > This bug found in [link title|https://github.com/apache/spark/pull/23506] (PR > #23506). > It will throw ClassCastException when we use new input format (eg. > `org.apache.hadoop.mapreduce.InputFormat`) to create HadoopRDD.So we need to > use NewHadoopRDD to deal with this input format in TableReader.scala. > Exception : > {noformat} > Caused by: java.lang.ClassCastException: > org.apache.hadoop.mapreduce.lib.input.TextInputFormat cannot be cast to > org.apache.hadoop.mapred.InputFormat > at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:190) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at org.apache.spark.ShuffleDependency.(Dependency.scala:96) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.prepareShuffleDependency(ShuffleExchangeExec.scala:343) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:101) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.$anonfun$doExecute$1(ShuffleExchangeExec.scala:137) > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52) > ... 87 more > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-26630) Support reading Hive-serde tables whose INPUTFORMAT is org.apache.hadoop.mapreduce
[ https://issues.apache.org/jira/browse/SPARK-26630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-26630: --- Assignee: Deegue > Support reading Hive-serde tables whose INPUTFORMAT is > org.apache.hadoop.mapreduce > -- > > Key: SPARK-26630 > URL: https://issues.apache.org/jira/browse/SPARK-26630 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 3.0.0 >Reporter: Deegue >Assignee: Deegue >Priority: Major > > This bug found in [link title|https://github.com/apache/spark/pull/23506] (PR > #23506). > It will throw ClassCastException when we use new input format (eg. > `org.apache.hadoop.mapreduce.InputFormat`) to create HadoopRDD.So we need to > use NewHadoopRDD to deal with this input format in TableReader.scala. > Exception : > {noformat} > Caused by: java.lang.ClassCastException: > org.apache.hadoop.mapreduce.lib.input.TextInputFormat cannot be cast to > org.apache.hadoop.mapred.InputFormat > at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:190) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at org.apache.spark.ShuffleDependency.(Dependency.scala:96) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.prepareShuffleDependency(ShuffleExchangeExec.scala:343) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:101) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.$anonfun$doExecute$1(ShuffleExchangeExec.scala:137) > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52) > ... 87 more > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26630) Support reading Hive-serde tables whose INPUTFORMAT is org.apache.hadoop.mapreduce
[ https://issues.apache.org/jira/browse/SPARK-26630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-26630: Summary: Support reading Hive-serde tables whose INPUTFORMAT is org.apache.hadoop.mapreduce (was: ClassCastException in TableReader while creating HadoopRDD) > Support reading Hive-serde tables whose INPUTFORMAT is > org.apache.hadoop.mapreduce > -- > > Key: SPARK-26630 > URL: https://issues.apache.org/jira/browse/SPARK-26630 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 3.0.0 >Reporter: Deegue >Priority: Major > > This bug was found in [PR #23506|https://github.com/apache/spark/pull/23506]. > It will throw a ClassCastException when we use a new-style input format (e.g. > `org.apache.hadoop.mapreduce.InputFormat`) to create a HadoopRDD, so we need to > use NewHadoopRDD to deal with this input format in TableReader.scala. > Exception: > {noformat} > Caused by: java.lang.ClassCastException: > org.apache.hadoop.mapreduce.lib.input.TextInputFormat cannot be cast to > org.apache.hadoop.mapred.InputFormat > at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:190) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:96) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.prepareShuffleDependency(ShuffleExchangeExec.scala:343) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:101) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.$anonfun$doExecute$1(ShuffleExchangeExec.scala:137) > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52) > ... 87 more > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26630) Support reading Hive-serde tables whose INPUTFORMAT is org.apache.hadoop.mapreduce
[ https://issues.apache.org/jira/browse/SPARK-26630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-26630: Issue Type: Improvement (was: Bug) > Support reading Hive-serde tables whose INPUTFORMAT is > org.apache.hadoop.mapreduce > -- > > Key: SPARK-26630 > URL: https://issues.apache.org/jira/browse/SPARK-26630 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0, 2.4.1, 3.0.0 >Reporter: Deegue >Priority: Major > > This bug was found in [PR #23506|https://github.com/apache/spark/pull/23506]. > It will throw a ClassCastException when we use a new-style input format (e.g. > `org.apache.hadoop.mapreduce.InputFormat`) to create a HadoopRDD, so we need to > use NewHadoopRDD to deal with this input format in TableReader.scala. > Exception: > {noformat} > Caused by: java.lang.ClassCastException: > org.apache.hadoop.mapreduce.lib.input.TextInputFormat cannot be cast to > org.apache.hadoop.mapred.InputFormat > at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:190) > at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:204) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at > org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:49) > at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:254) > at scala.Option.getOrElse(Option.scala:138) > at org.apache.spark.rdd.RDD.partitions(RDD.scala:252) > at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:96) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$.prepareShuffleDependency(ShuffleExchangeExec.scala:343) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:101) > at > org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.$anonfun$doExecute$1(ShuffleExchangeExec.scala:137) > at > org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52) > ... 87 more > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26739) Standardized Join Types for DataFrames
[ https://issues.apache.org/jira/browse/SPARK-26739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Skyler Lehan updated SPARK-26739: - Description: h3. *Q1.* What are you trying to do? Articulate your objectives using absolutely no jargon. Currently, in the join functions on [DataFrames|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset], the join types are defined via a string parameter called joinType. In order for a developer to know which joins are possible, they must look up the API call for join. While this works fine, it can cause the developer to make a typo, resulting in improper joins and/or unexpected errors. The objective of this improvement would be to allow developers to use a common definition for join types (by enum or constants) called JoinTypes. This would contain the possible joins and remove the possibility of a typo. It would also allow Spark to alter the names of the joins in the future without impacting end-users. h3. *Q2.* What problem is this proposal NOT designed to solve? The problem this solves is extremely narrow; it would not solve anything other than providing a common definition for join types. h3. *Q3.* How is it done today, and what are the limits of current practice? Currently, developers must join two DataFrames like so: {code:java} val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), "left_outer") {code} Here the developer manually types the join type as a string. As stated above, this: * Requires developers to manually type in the join type * Creates the possibility of typos * Restricts renaming of join types, as each is a literal string * Does not restrict or compile-time check the join type being used, leading to runtime errors h3. *Q4.* What is new in your approach and why do you think it will be successful? The new approach would use constants, something like this: {code:java} val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), JoinType.LEFT_OUTER) {code} This would provide: * In-code references/definitions of the possible join types ** This subsequently allows the addition of scaladoc describing what each join type does and how it operates * Removal of the possibility of a typo in the join type * Compile-time checking of the join type (only if an enum is used) To clarify, if JoinType is a constant, it would just fill in the joinType string parameter for users. If an enum is used, it would restrict the domain of possible join types to whatever is defined in the future JoinType enum. The enum is preferred; however, it would take longer to implement. h3. *Q5.* Who cares? If you are successful, what difference will it make? Developers using Apache Spark will care. This will make the join function easier to use and lead to fewer runtime errors. It will save time by bringing join type validation to compile time. It will also provide an in-code reference to the join types, saving developers the time of looking up and navigating the multiple join functions to find the possible join types. In addition, the resulting constants/enum would carry documentation on how each join type works. h3. *Q6.* What are the risks? Users of Apache Spark who currently use strings to define their join types could be impacted if an enum is chosen as the common definition. This risk can be mitigated by using string constants. The string constants would be the exact same strings as the string literals used today. For example: {code:java} JoinType.INNER = "inner" {code} If an enum is still the preferred way of defining the join types, new join functions could be added that take in these enums, and the join calls that take string parameters for joinType could be deprecated. This would give developers a chance to change over to the new join types. h3. *Q7.* How long will it take? A few days for a seasoned Spark developer. h3. *Q8.* What are the mid-term and final "exams" to check for success? The mid-term exam would be the addition of a common definition of the join types and of additional join functions that take in the join type enum/constant. The final exam would be working tests checking the functionality of these new join functions, with the string-based joinType functions deprecated. h3. *Appendix A.* Proposed API Changes. Optional section defining API changes, if any. Backward and forward compatibility must be taken into account. *If enums are used:* The following join function signature would be added to the Dataset API: {code:java} def join(right: Dataset[_], joinExprs: Column, joinType: JoinType): DataFrame {code} The following functions would be deprecated: {code:java} def join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame def join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame {code} A new enum would be created called JoinType. Developers would be
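To make the proposal above concrete, here is one possible shape for the string-constant variant. The object and constant names are illustrative, not an agreed-upon API; each constant reduces to exactly the string literal accepted today, which is what keeps the change backward compatible:

{code:java}
// Illustrative sketch of the proposed common definition (string-constant variant).
object JoinType {
  val INNER = "inner"
  val CROSS = "cross"
  val FULL_OUTER = "full_outer"
  val LEFT_OUTER = "left_outer"
  val RIGHT_OUTER = "right_outer"
  val LEFT_SEMI = "left_semi"
  val LEFT_ANTI = "left_anti"
}

// Usage, assuming DataFrames leftDF and rightDF as in the examples above;
// the constant simply fills in the existing joinType string parameter:
// val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), JoinType.LEFT_OUTER)
{code}

With constants, existing string-based calls keep working unchanged; the enum variant would additionally give compile-time checking at the cost of new overloads.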
[jira] [Updated] (SPARK-26739) Standardized Join Types for DataFrames
[ https://issues.apache.org/jira/browse/SPARK-26739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Skyler Lehan updated SPARK-26739: - Description: h3. *Q1.* What are you trying to do? Articulate your objectives using absolutely no jargon. Currently, in the join functions on [DataFrames|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset], the join types are defined via a string parameter called joinType. In order for a developer to know which joins are possible, they must look up the API call for join. While this works fine, it can cause the developer to make a typo, resulting in improper joins and/or unexpected errors. The objective of this improvement would be to allow developers to use a common definition for join types (by enum or constants) called JoinTypes. This would contain the possible joins and remove the possibility of a typo. It would also allow Spark to alter the names of the joins in the future without impacting end-users. h3. *Q2.* What problem is this proposal NOT designed to solve? The problem this solves is extremely narrow; it would not solve anything other than providing a common definition for join types. h3. *Q3.* How is it done today, and what are the limits of current practice? Currently, developers must join two DataFrames like so: {code:java} val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), "left_outer") {code} Here the developer manually types the join type as a string. As stated above, this: * Requires developers to manually type in the join type * Creates the possibility of typos * Restricts renaming of join types, as each is a literal string * Does not restrict or compile-time check the join type being used, leading to runtime errors h3. *Q4.* What is new in your approach and why do you think it will be successful? The new approach would use constants, something like this: {code:java} val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), JoinType.LEFT_OUTER) {code} This would provide: * In-code references/definitions of the possible join types ** This subsequently allows the addition of scaladoc describing what each join type does and how it operates * Removal of the possibility of a typo in the join type * Compile-time checking of the join type (only if an enum is used) To clarify, if JoinType is a constant, it would just fill in the joinType string parameter for users. If an enum is used, it would restrict the domain of possible join types to whatever is defined in the future JoinType enum. The enum is preferred; however, it would take longer to implement. h3. *Q5.* Who cares? If you are successful, what difference will it make? Developers using Apache Spark will care. This will make the join function easier to use and lead to fewer runtime errors. It will save time by bringing join type validation to compile time. It will also provide an in-code reference to the join types, saving developers the time of looking up and navigating the multiple join functions to find the possible join types. In addition, the resulting constants/enum would carry documentation on how each join type works. h3. *Q6.* What are the risks? Users of Apache Spark who currently use strings to define their join types could be impacted if an enum is chosen as the common definition. This risk can be mitigated by using string constants. The string constants would be the exact same strings as the string literals used today. For example: {code:java} JoinType.INNER = "inner" {code} If an enum is still the preferred way of defining the join types, new join functions could be added that take in these enums, and the join calls that take string parameters for joinType could be deprecated. This would give developers a chance to change over to the new join types. h3. *Q7.* How long will it take? A few days for a seasoned Spark developer. h3. *Q8.* What are the mid-term and final "exams" to check for success? The mid-term exam would be the addition of a common definition of the join types and of additional join functions that take in the join type enum/constant. The final exam would be working tests checking the functionality of these new join functions, with the string-based joinType functions deprecated. h3. *Appendix A.* Proposed API Changes. Optional section defining API changes, if any. Backward and forward compatibility must be taken into account. *If enums are used:* The following join function signature would be added to the Dataset API: {code:java} def join(right: Dataset[_], joinExprs: Column, joinType: JoinType): DataFrame {code} The following functions would be deprecated: {code:java} def join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame def join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame {code} A new enum would be created called JoinType. Developers would be
[jira] [Updated] (SPARK-26739) Standardized Join Types for DataFrames
[ https://issues.apache.org/jira/browse/SPARK-26739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Skyler Lehan updated SPARK-26739: - Description: h3. *Q1.* What are you trying to do? Articulate your objectives using absolutely no jargon. Currently, in the [join functions|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset@join(right:org.apache.spark.sql.Dataset[_],joinExprs:org.apache.spark.sql.Column,joinType:String):org.apache.spark.sql.DataFrame] on DataFrames, the join types are defined via a string parameter called joinType. In order for a developer to know which joins are possible, they must look up the API call for join. While this works fine, it can cause the developer to make a typo, resulting in improper joins and/or unexpected errors. The objective of this improvement would be to allow developers to use a common definition for join types (by enum or constants) called JoinTypes. This would contain the possible joins and remove the possibility of a typo. It would also allow Spark to alter the names of the joins in the future without impacting end-users. h3. *Q2.* What problem is this proposal NOT designed to solve? The problem this solves is extremely narrow; it would not solve anything other than providing a common definition for join types. h3. *Q3.* How is it done today, and what are the limits of current practice? Currently, developers must join two DataFrames like so: {code:java} val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), "left_outer") {code} Here the developer manually types the join type as a string. As stated above, this: * Requires developers to manually type in the join type * Creates the possibility of typos * Restricts renaming of join types, as each is a literal string * Does not restrict or compile-time check the join type being used, leading to runtime errors h3. *Q4.* What is new in your approach and why do you think it will be successful? The new approach would use constants, something like this: {code:java} val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), JoinType.LEFT_OUTER) {code} This would provide: * In-code references/definitions of the possible join types ** This subsequently allows the addition of scaladoc describing what each join type does and how it operates * Removal of the possibility of a typo in the join type * Compile-time checking of the join type (only if an enum is used) To clarify, if JoinType is a constant, it would just fill in the joinType string parameter for users. If an enum is used, it would restrict the domain of possible join types to whatever is defined in the future JoinType enum. The enum is preferred; however, it would take longer to implement. h3. *Q5.* Who cares? If you are successful, what difference will it make? Developers using Apache Spark will care. This will make the join function easier to use and lead to fewer runtime errors. It will save time by bringing join type validation to compile time. It will also provide an in-code reference to the join types, saving developers the time of looking up and navigating the multiple join functions to find the possible join types. In addition, the resulting constants/enum would carry documentation on how each join type works. h3. *Q6.* What are the risks? Users of Apache Spark who currently use strings to define their join types could be impacted if an enum is chosen as the common definition. This risk can be mitigated by using string constants. The string constants would be the exact same strings as the string literals used today. For example: {code:java} JoinType.INNER = "inner" {code} If an enum is still the preferred way of defining the join types, new join functions could be added that take in these enums, and the join calls that take string parameters for joinType could be deprecated. This would give developers a chance to change over to the new join types. h3. *Q7.* How long will it take? A few days for a seasoned Spark developer. h3. *Q8.* What are the mid-term and final "exams" to check for success? The mid-term exam would be the addition of a common definition of the join types and of additional join functions that take in the join type enum/constant. The final exam would be working tests checking the functionality of these new join functions, with the string-based joinType functions deprecated. h3. *Appendix A.* Proposed API Changes. Optional section defining API changes, if any. Backward and forward compatibility must be taken into account. *If enums are used:* The following join function signature would be added to the Dataset API: {code:java} def join(right: Dataset[_], joinExprs: Column, joinType: JoinType): DataFrame {code} The following functions would be deprecated: {code:java} def join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame def join(right: Dataset[_],
[jira] [Updated] (SPARK-26739) Standardized Join Types for DataFrames
[ https://issues.apache.org/jira/browse/SPARK-26739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Skyler Lehan updated SPARK-26739: - Description: h3. *Q1.* What are you trying to do? Articulate your objectives using absolutely no jargon. Currently, in the [join functions|"http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset@join(right:org.apache.spark.sql.Dataset[_],joinExprs:org.apache.spark.sql.Column,joinType:String):org.apache.spark.sql.DataFrame"] on DataFrames, the join types are defined via a string parameter called joinType. In order for a developer to know which joins are possible, they must look up the API call for join. While this works fine, it can cause the developer to make a typo, resulting in improper joins and/or unexpected errors. The objective of this improvement would be to allow developers to use a common definition for join types (by enum or constants) called JoinTypes. This would contain the possible joins and remove the possibility of a typo. It would also allow Spark to alter the names of the joins in the future without impacting end-users. h3. *Q2.* What problem is this proposal NOT designed to solve? The problem this solves is extremely narrow; it would not solve anything other than providing a common definition for join types. h3. *Q3.* How is it done today, and what are the limits of current practice? Currently, developers must join two DataFrames like so: {code:java} val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), "left_outer") {code} Here the developer manually types the join type as a string. As stated above, this: * Requires developers to manually type in the join type * Creates the possibility of typos * Restricts renaming of join types, as each is a literal string * Does not restrict or compile-time check the join type being used, leading to runtime errors h3. *Q4.* What is new in your approach and why do you think it will be successful? The new approach would use constants, something like this: {code:java} val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), JoinType.LEFT_OUTER) {code} This would provide: * In-code references/definitions of the possible join types ** This subsequently allows the addition of scaladoc describing what each join type does and how it operates * Removal of the possibility of a typo in the join type * Compile-time checking of the join type (only if an enum is used) To clarify, if JoinType is a constant, it would just fill in the joinType string parameter for users. If an enum is used, it would restrict the domain of possible join types to whatever is defined in the future JoinType enum. The enum is preferred; however, it would take longer to implement. h3. *Q5.* Who cares? If you are successful, what difference will it make? Developers using Apache Spark will care. This will make the join function easier to use and lead to fewer runtime errors. It will save time by bringing join type validation to compile time. It will also provide an in-code reference to the join types, saving developers the time of looking up and navigating the multiple join functions to find the possible join types. In addition, the resulting constants/enum would carry documentation on how each join type works. h3. *Q6.* What are the risks? Users of Apache Spark who currently use strings to define their join types could be impacted if an enum is chosen as the common definition. This risk can be mitigated by using string constants. The string constants would be the exact same strings as the string literals used today. For example: {code:java} JoinType.INNER = "inner" {code} If an enum is still the preferred way of defining the join types, new join functions could be added that take in these enums, and the join calls that take string parameters for joinType could be deprecated. This would give developers a chance to change over to the new join types. h3. *Q7.* How long will it take? A few days for a seasoned Spark developer. h3. *Q8.* What are the mid-term and final "exams" to check for success? The mid-term exam would be the addition of a common definition of the join types and of additional join functions that take in the join type enum/constant. The final exam would be working tests checking the functionality of these new join functions, with the string-based joinType functions deprecated. h3. *Appendix A.* Proposed API Changes. Optional section defining API changes, if any. Backward and forward compatibility must be taken into account. *If enums are used:* The following join function signature would be added to the Dataset API: {code:java} def join(right: Dataset[_], joinExprs: Column, joinType: JoinType): DataFrame {code} The following functions would be deprecated: {code:java} def join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame def join(right: Dataset[_],
[jira] [Created] (SPARK-26739) Standardized Join Types for DataFrames
Skyler Lehan created SPARK-26739: Summary: Standardized Join Types for DataFrames Key: SPARK-26739 URL: https://issues.apache.org/jira/browse/SPARK-26739 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Skyler Lehan Fix For: 2.4.1 h3. *Q1.* What are you trying to do? Articulate your objectives using absolutely no jargon. Currently, in the [join functions|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset@join(right:org.apache.spark.sql.Dataset[_],joinExprs:org.apache.spark.sql.Column,joinType:String):org.apache.spark.sql.DataFrame] on DataFrames, the join types are defined via a string parameter called joinType. In order for a developer to know which joins are possible, they must look up the API call for join. While this works fine, it can cause the developer to make a typo, resulting in improper joins and/or unexpected errors. The objective of this improvement would be to allow developers to use a common definition for join types (by enum or constants) called JoinTypes. This would contain the possible joins and remove the possibility of a typo. It would also allow Spark to alter the names of the joins in the future without impacting end-users. h3. *Q2.* What problem is this proposal NOT designed to solve? The problem this solves is extremely narrow; it would not solve anything other than providing a common definition for join types. h3. *Q3.* How is it done today, and what are the limits of current practice? Currently, developers must join two DataFrames like so: {code:java} val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), "left_outer") {code} Here the developer manually types the join type as a string. As stated above, this: * Requires developers to manually type in the join type * Creates the possibility of typos * Restricts renaming of join types, as each is a literal string * Does not restrict or compile-time check the join type being used, leading to runtime errors h3. *Q4.* What is new in your approach and why do you think it will be successful? The new approach would use constants, something like this: {code:java} val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), JoinType.LEFT_OUTER) {code} This would provide: * In-code references/definitions of the possible join types ** This subsequently allows the addition of scaladoc describing what each join type does and how it operates * Removal of the possibility of a typo in the join type * Compile-time checking of the join type (only if an enum is used) To clarify, if JoinType is a constant, it would just fill in the joinType string parameter for users. If an enum is used, it would restrict the domain of possible join types to whatever is defined in the future JoinType enum. The enum is preferred; however, it would take longer to implement. h3. *Q5.* Who cares? If you are successful, what difference will it make? Developers using Apache Spark will care. This will make the join function easier to use and lead to fewer runtime errors. It will save time by bringing join type validation to compile time. It will also provide an in-code reference to the join types, saving developers the time of looking up and navigating the multiple join functions to find the possible join types. In addition, the resulting constants/enum would carry documentation on how each join type works. h3. *Q6.* What are the risks? Users of Apache Spark who currently use strings to define their join types could be impacted if an enum is chosen as the common definition. This risk can be mitigated by using string constants. The string constants would be the exact same strings as the string literals used today. For example: {code:java} JoinType.INNER = "inner" {code} If an enum is still the preferred way of defining the join types, new join functions could be added that take in these enums, and the join calls that take string parameters for joinType could be deprecated. This would give developers a chance to change over to the new join types. h3. *Q7.* How long will it take? A few days for a seasoned Spark developer. h3. *Q8.* What are the mid-term and final "exams" to check for success? The mid-term exam would be the addition of a common definition of the join types and of additional join functions that take in the join type enum/constant. The final exam would be working tests checking the functionality of these new join functions, with the string-based joinType functions deprecated. h3. *Appendix A.* Proposed API Changes. Optional section defining API changes, if any. Backward and forward compatibility must be taken into account. *If enums are used:* The following join function signatures would be added to the Dataset API: {code:java} def join(right: Dataset[_], joinExprs: Column,
[jira] [Assigned] (SPARK-26656) Benchmark for date/time functions and expressions
[ https://issues.apache.org/jira/browse/SPARK-26656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26656: Assignee: Apache Spark > Benchmark for date/time functions and expressions > - > > Key: SPARK-26656 > URL: https://issues.apache.org/jira/browse/SPARK-26656 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > > Write benchmarks for datetimeExpressions -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
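As a rough illustration of what such a benchmark measures, here is a minimal ad-hoc timing sketch over two datetime expressions. It assumes a running SparkSession named {{spark}}; the real change would use Spark's internal benchmark harness rather than this hand-rolled timing:

{code:java}
// Ad-hoc timing sketch for datetime expressions (illustrative only).
val n = 10000000L

def time(name: String)(f: => Unit): Unit = {
  val start = System.nanoTime()
  f
  println(s"$name took ${(System.nanoTime() - start) / 1e6} ms")
}

// Wrapping the expression in count(...) forces it to be evaluated per row.
time("to_date") {
  spark.range(n).selectExpr("count(to_date(cast(id as timestamp)))").collect()
}
time("date_add") {
  spark.range(n).selectExpr("count(date_add(current_date(), cast(id % 30 as int)))").collect()
}
{code}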
[jira] [Assigned] (SPARK-26656) Benchmark for date/time functions and expressions
[ https://issues.apache.org/jira/browse/SPARK-26656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26656: Assignee: (was: Apache Spark) > Benchmark for date/time functions and expressions > - > > Key: SPARK-26656 > URL: https://issues.apache.org/jira/browse/SPARK-26656 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Maxim Gekk >Priority: Major > > Write benchmarks for datetimeExpressions -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26688) Provide configuration of initially blacklisted YARN nodes
[ https://issues.apache.org/jira/browse/SPARK-26688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753005#comment-16753005 ] Sergey commented on SPARK-26688: Hi Imran, thanks for your reply. "Meanwhile devs start to apply this willy-nilly, as these configs tend to just keep getting built up over time" An SLA could mitigate this problem: every node blacklisted for a specific job slows it down in the long run, so devs would have to report the node issue to ops and communicate with them to resolve it. "Ideally, blacklisting and speculation should be able to prevent that problem" We are going to try out speculation, but we are not there yet. > Provide configuration of initially blacklisted YARN nodes > - > > Key: SPARK-26688 > URL: https://issues.apache.org/jira/browse/SPARK-26688 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.0.0 >Reporter: Attila Zsolt Piros >Priority: Major > > Introducing a new config for initially blacklisted YARN nodes. > This came up on the Apache Spark user mailing list: > [http://apache-spark-user-list.1001560.n3.nabble.com/Spark-on-Yarn-is-it-possible-to-manually-blacklist-nodes-before-running-spark-job-td34395.html] > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
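For illustration, such a config would presumably be set like any other Spark configuration at submit time. The key name below is purely hypothetical; the actual key is whatever the ticket's PR defines:

{code:java}
// Hypothetical configuration key, for illustration only; the real key name
// is defined by the PR attached to this ticket.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("example")
  .set("spark.yarn.initially.blacklisted.nodes", "bad-host-1,bad-host-2")
{code}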
[jira] [Updated] (SPARK-26738) Pyspark random forest classifier feature importance with column names
[ https://issues.apache.org/jira/browse/SPARK-26738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Praveen updated SPARK-26738: Description: I am trying to plot the feature importances of a random forest classifier with column names. I am using Spark 2.3.2 and PySpark. The input X is sentences and I am using TF-IDF (HashingTF + IDF) + StringIndexer to generate the feature vectors. I have included all the stages in a Pipeline {code:java} regexTokenizer = RegexTokenizer(gaps=False, inputCol= raw_data_col, outputCol= "words", pattern="[a-zA-Z_]+", toLowercase=True, minTokenLength=minimum_token_size) hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=number_of_feature) idf = IDF(inputCol="rawFeatures", outputCol= feature_vec_col) indexer = StringIndexer(inputCol= label_col_name, outputCol= label_vec_name) converter = IndexToString(inputCol='prediction', outputCol="original_label", labels=indexer.fit(df).labels) feature_pipeline = Pipeline(stages=[regexTokenizer, hashingTF, idf, indexer]) estimator = RandomForestClassifier(labelCol=label_col, featuresCol=features_col, numTrees=100) pipeline = Pipeline(stages=[feature_pipeline, estimator, converter]) model = pipeline.fit(df) {code} I generate the feature importances as {code:java} rdc = model.stages[-2] print (rdc.featureImportances) {code} So far so good, but when I try to map the feature importances to the feature columns as below {code:java} attrs = sorted((attr["idx"], attr["name"]) for attr in (chain(*df_pred.schema["featurescol"].metadata["ml_attr"]["attrs"].values()))) [(name, rdc.featureImportances[idx]) for idx, name in attrs if rdc.featureImportances[idx]]{code} I get a KeyError on ml_attr {code:java} KeyError: 'ml_attr'{code} I printed the dictionary, {code:java} print (df_pred.schema["featurescol"].metadata){code} and it's empty: {}. Any thoughts on what I am doing wrong? How can I map the feature importances to the column names? Thanks was: I am trying to plot the feature importances of a random forest classifier with column names. I am using Spark 2.3.2 and PySpark. The input X is sentences and I am using TF-IDF (HashingTF + IDF) + StringIndexer to generate the feature vectors. I have included all the stages in a Pipeline {{regexTokenizer = RegexTokenizer(gaps=False, inputCol= raw_data_col, outputCol= "words", pattern="[a-zA-Z_]+", toLowercase=True, minTokenLength=minimum_token_size) hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=number_of_feature) idf = IDF(inputCol="rawFeatures", outputCol= feature_vec_col) indexer = StringIndexer(inputCol= label_col_name, outputCol= label_vec_name) converter = IndexToString(inputCol='prediction', outputCol="original_label", labels=indexer.fit(df).labels) feature_pipeline = Pipeline(stages=[regexTokenizer, hashingTF, idf, indexer]) estimator = RandomForestClassifier(labelCol=label_col, featuresCol=features_col, numTrees=100) pipeline = Pipeline(stages=[feature_pipeline, estimator, converter]) model = pipeline.fit(df)}} I generate the feature importances as {code:java} rdc = model.stages[-2] print (rdc.featureImportances) {code} So far so good, but when I try to map the feature importances to the feature columns as below {code:java} attrs = sorted((attr["idx"], attr["name"]) for attr in (chain(*df_pred.schema["featurescol"].metadata["ml_attr"]["attrs"].values()))) [(name, rdc.featureImportances[idx]) for idx, name in attrs if rdc.featureImportances[idx]]{code} I get a KeyError on ml_attr {code:java} KeyError: 'ml_attr'{code} I printed the dictionary, {code:java} print (df_pred.schema["featurescol"].metadata){code} and it's empty: {}. Any thoughts on what I am doing wrong? How can I map the feature importances to the column names? Thanks > Pyspark random forest classifier feature importance with column names > - > > Key: SPARK-26738 > URL: https://issues.apache.org/jira/browse/SPARK-26738 > Project: Spark > Issue Type: Question > Components: ML >Affects Versions: 2.3.2 > Environment: {code:java} > {code} >Reporter: Praveen >Priority: Major > Labels: RandomForest, pyspark > > I am trying to plot the feature importances of a random forest classifier > with column names. I am using Spark 2.3.2 and PySpark. > The input X is sentences and I am using TF-IDF (HashingTF + IDF) + > StringIndexer to generate the feature vectors. > I have included all the stages in a Pipeline > > {code:java} > regexTokenizer = RegexTokenizer(gaps=False, inputCol= raw_data_col, > outputCol= "words", pattern="[a-zA-Z_]+", toLowercase=True, > minTokenLength=minimum_token_size) > hashingTF = HashingTF(inputCol="words",
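A side note on the question above: {{ml_attr}} metadata is only attached when the features vector is assembled from named columns (for example via VectorAssembler); hashed TF-IDF features carry no per-slot names, which would explain the empty metadata here. For reference, a small Scala sketch of reading slot names when they do exist, assuming a DataFrame {{df}} with an assembled {{features}} column:

{code:java}
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.sql.DataFrame

// Returns (slot index, column name) pairs from the vector column's metadata,
// or an empty sequence when no ml_attr metadata is present (as with HashingTF).
def slotNames(df: DataFrame, featuresCol: String = "features"): Seq[(Int, String)] = {
  val group = AttributeGroup.fromStructField(df.schema(featuresCol))
  group.attributes.toSeq.flatten.flatMap { attr =>
    for (i <- attr.index; n <- attr.name) yield (i, n)
  }
}
{code}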
[jira] [Updated] (SPARK-26731) remove EOLed spark jobs from jenkins
[ https://issues.apache.org/jira/browse/SPARK-26731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Bogan updated SPARK-26731: Attachment: LICENSE > remove EOLed spark jobs from jenkins > > > Key: SPARK-26731 > URL: https://issues.apache.org/jira/browse/SPARK-26731 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 1.6.3, 2.0.2, 2.1.3 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > Attachments: LICENSE, activemq-cli-tools-4a984ec.tar.gz > > > i will disable, but not remove (yet), the branch-specific builds for 1.6, 2.0 > and 2.1 on jenkins. > these include all test builds, as well as docs, lint, compile, and packaging. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-26738) Pyspark random forest classifier feature importance with column names
Praveen created SPARK-26738: --- Summary: Pyspark random forest classifier feature importance with column names Key: SPARK-26738 URL: https://issues.apache.org/jira/browse/SPARK-26738 Project: Spark Issue Type: Question Components: ML Affects Versions: 2.3.2 Environment: {code:java} {code} Reporter: Praveen I am trying to plot the feature importances of a random forest classifier with column names. I am using Spark 2.3.2 and PySpark. The input X is sentences and I am using TF-IDF (HashingTF + IDF) + StringIndexer to generate the feature vectors. I have included all the stages in a Pipeline {{regexTokenizer = RegexTokenizer(gaps=False, inputCol= raw_data_col, outputCol= "words", pattern="[a-zA-Z_]+", toLowercase=True, minTokenLength=minimum_token_size) hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=number_of_feature) idf = IDF(inputCol="rawFeatures", outputCol= feature_vec_col) indexer = StringIndexer(inputCol= label_col_name, outputCol= label_vec_name) converter = IndexToString(inputCol='prediction', outputCol="original_label", labels=indexer.fit(df).labels) feature_pipeline = Pipeline(stages=[regexTokenizer, hashingTF, idf, indexer]) estimator = RandomForestClassifier(labelCol=label_col, featuresCol=features_col, numTrees=100) pipeline = Pipeline(stages=[feature_pipeline, estimator, converter]) model = pipeline.fit(df)}} I generate the feature importances as {code:java} rdc = model.stages[-2] print (rdc.featureImportances) {code} So far so good, but when I try to map the feature importances to the feature columns as below {code:java} attrs = sorted((attr["idx"], attr["name"]) for attr in (chain(*df_pred.schema["featurescol"].metadata["ml_attr"]["attrs"].values()))) [(name, rdc.featureImportances[idx]) for idx, name in attrs if rdc.featureImportances[idx]]{code} I get a KeyError on ml_attr {code:java} KeyError: 'ml_attr'{code} I printed the dictionary, {code:java} print (df_pred.schema["featurescol"].metadata){code} and it's empty: {}. Any thoughts on what I am doing wrong? How can I map the feature importances to the column names? Thanks -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26722) add SPARK_TEST_KEY=1 to pull request builder and spark-master-test-sbt-hadoop-2.7
[ https://issues.apache.org/jira/browse/SPARK-26722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Bogan updated SPARK-26722: Attachment: SPARK-26731.doc > add SPARK_TEST_KEY=1 to pull request builder and > spark-master-test-sbt-hadoop-2.7 > - > > Key: SPARK-26722 > URL: https://issues.apache.org/jira/browse/SPARK-26722 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > Attachments: SPARK-26731.doc > > > from https://github.com/apache/spark/pull/23117: > we need to add the {{SPARK_TEST_KEY=1}} env var to both the GHPRB and > {{spark-master-test-sbt-hadoop-2.7}} builds. > this is done for the PRB, and was manually added to the > {{spark-master-test-sbt-hadoop-2.7}} build. > i will leave this open until i finish porting the JJB configs in to the main > spark repo (for the {{spark-master-test-sbt-hadoop-2.7}} build). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-26731) remove EOLed spark jobs from jenkins
[ https://issues.apache.org/jira/browse/SPARK-26731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Bogan updated SPARK-26731: Comment: was deleted (was: {code:java} // code placeholder {code} $ cd infrastructure/content/dev/ $ $EDITOR infra-site.mdtext $ svn commit ... wait a short while for the page to be rebuilt ... ... **ONLY IF** you are an ASF Member, then publish: ... $ curl -sL http://s.apache.org/cms-cli | perl}) > remove EOLed spark jobs from jenkins > > > Key: SPARK-26731 > URL: https://issues.apache.org/jira/browse/SPARK-26731 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 1.6.3, 2.0.2, 2.1.3 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > Attachments: LICENSE, activemq-cli-tools-4a984ec.tar.gz > > > i will disable, but not remove (yet), the branch-specific builds for 1.6, 2.0 > and 2.1 on jenkins. > these include all test builds, as well as docs, lint, compile, and packaging. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26731) remove EOLed spark jobs from jenkins
[ https://issues.apache.org/jira/browse/SPARK-26731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16752988#comment-16752988 ] Chris Bogan commented on SPARK-26731: - {code:java} // code placeholder {code} $ cd infrastructure/content/dev/ $ $EDITOR infra-site.mdtext $ svn commit ... wait a short while for the page to be rebuilt ... ... **ONLY IF** you are an ASF Member, then publish: ... $ curl -sL http://s.apache.org/cms-cli | perl} > remove EOLed spark jobs from jenkins > > > Key: SPARK-26731 > URL: https://issues.apache.org/jira/browse/SPARK-26731 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 1.6.3, 2.0.2, 2.1.3 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > Attachments: LICENSE, activemq-cli-tools-4a984ec.tar.gz > > > i will disable, but not remove (yet), the branch-specific builds for 1.6, 2.0 > and 2.1 on jenkins. > these include all test builds, as well as docs, lint, compile, and packaging. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26327) Metrics in FileSourceScanExec not update correctly while relation.partitionSchema is set
[ https://issues.apache.org/jira/browse/SPARK-26327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Bogan updated SPARK-26327: Attachment: request_handler_interface.json > Metrics in FileSourceScanExec not update correctly while > relation.partitionSchema is set > > > Key: SPARK-26327 > URL: https://issues.apache.org/jira/browse/SPARK-26327 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuanjian Li >Assignee: Yuanjian Li >Priority: Major > Fix For: 2.2.3, 2.3.3, 2.4.1, 3.0.0 > > Attachments: Homepage - Material Design, > apache-opennlp-1.9.1-bin.tar.gz, request_handler_interface.json > > > In the current approach in `FileSourceScanExec`, the "numFiles" and > "metadataTime" (file listing time) metrics are updated when the lazy val > `selectedPartitions` is initialized, in the scenario where relation.partitionSchema > is set. But `selectedPartitions` is first initialized through `metadata`, which is > called by `queryExecution.toString` in `SQLExecution.withNewExecutionId`. So when > `SQLMetrics.postDriverMetricUpdates` is called, there is no corresponding live > execution in SQLAppStatusListener, and the metrics update does not work. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
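To spell out the mechanism behind this bug: a lazy val with a side effect fires wherever it is first touched, for example from a {{toString}} call, not from the execution path that expects it. A self-contained, generic illustration of that hazard (plain Scala, not the actual Spark classes):

{code:java}
// Generic illustration: the side effect inside the lazy val fires on first
// touch, here from toString, before any "execution" bookkeeping has happened.
class Scan {
  var metricPosted: Boolean = false  // stand-in for SQLMetrics.postDriverMetricUpdates
  lazy val selectedPartitions: Seq[String] = {
    metricPosted = true
    Seq("part=0", "part=1")
  }
  override def toString: String = s"Scan(${selectedPartitions.size} partitions)"
}

object LazyValDemo extends App {
  val scan = new Scan
  println(scan)               // printing the "plan" initializes the lazy val...
  println(scan.metricPosted)  // ...so the "metric" was already posted: true
}
{code}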
[jira] [Updated] (SPARK-26731) remove EOLed spark jobs from jenkins
[ https://issues.apache.org/jira/browse/SPARK-26731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Bogan updated SPARK-26731: Attachment: activemq-cli-tools-4a984ec.tar.gz > remove EOLed spark jobs from jenkins > > > Key: SPARK-26731 > URL: https://issues.apache.org/jira/browse/SPARK-26731 > Project: Spark > Issue Type: Task > Components: Build >Affects Versions: 1.6.3, 2.0.2, 2.1.3 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > Attachments: LICENSE, activemq-cli-tools-4a984ec.tar.gz > > > i will disable, but not remove (yet), the branch-specific builds for 1.6, 2.0 > and 2.1 on jenkins. > these include all test builds, as well as docs, lint, compile, and packaging. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26327) Metrics in FileSourceScanExec not update correctly while relation.partitionSchema is set
[ https://issues.apache.org/jira/browse/SPARK-26327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Bogan updated SPARK-26327: Attachment: apache-opennlp-1.9.1-bin.tar.gz > Metrics in FileSourceScanExec not update correctly while > relation.partitionSchema is set > > > Key: SPARK-26327 > URL: https://issues.apache.org/jira/browse/SPARK-26327 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuanjian Li >Assignee: Yuanjian Li >Priority: Major > Fix For: 2.2.3, 2.3.3, 2.4.1, 3.0.0 > > Attachments: Homepage - Material Design, > apache-opennlp-1.9.1-bin.tar.gz > > > In the current approach in `FileSourceScanExec`, the "numFiles" and > "metadataTime" (file listing time) metrics are updated when the lazy val > `selectedPartitions` is initialized, in the scenario where relation.partitionSchema > is set. But `selectedPartitions` is first initialized through `metadata`, which is > called by `queryExecution.toString` in `SQLExecution.withNewExecutionId`. So when > `SQLMetrics.postDriverMetricUpdates` is called, there is no corresponding live > execution in SQLAppStatusListener, and the metrics update does not work. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26327) Metrics in FileSourceScanExec not update correctly while relation.partitionSchema is set
[ https://issues.apache.org/jira/browse/SPARK-26327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris Bogan updated SPARK-26327: Attachment: Homepage - Material Design > Metrics in FileSourceScanExec not update correctly while > relation.partitionSchema is set > > > Key: SPARK-26327 > URL: https://issues.apache.org/jira/browse/SPARK-26327 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.4.0 >Reporter: Yuanjian Li >Assignee: Yuanjian Li >Priority: Major > Fix For: 2.2.3, 2.3.3, 2.4.1, 3.0.0 > > Attachments: Homepage - Material Design > > > In the current approach in `FileSourceScanExec`, the "numFiles" and > "metadataTime" (file listing time) metrics are updated when the lazy val > `selectedPartitions` is initialized, in the scenario where relation.partitionSchema > is set. But `selectedPartitions` is first initialized through `metadata`, which is > called by `queryExecution.toString` in `SQLExecution.withNewExecutionId`. So when > `SQLMetrics.postDriverMetricUpdates` is called, there is no corresponding live > execution in SQLAppStatusListener, and the metrics update does not work. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26689) Disk broken causing broadcast failure
[ https://issues.apache.org/jira/browse/SPARK-26689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16752961#comment-16752961 ] liupengcheng commented on SPARK-26689: -- [~tgraves] In production environments, yarn.nodemanager.local-dirs is always configured with multiple directories mounted on different disks, so I think that since we use this parameter in Spark, we should also make use of this layout and should not expect job failure when encountering only a single disk error. The PR I put up can also reduce FetchFailures, and even job failures caused by FetchFailed, when the blacklist is not enabled or the node is not blacklisted (tasks may be repeatedly scheduled to the unhealthy node). > Disk broken causing broadcast failure > - > > Key: SPARK-26689 > URL: https://issues.apache.org/jira/browse/SPARK-26689 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.4.0 > Environment: Spark on Yarn > Multiple Disks >Reporter: liupengcheng >Priority: Major > > We encountered an application failure in our production cluster which was > caused by bad disk problems. > {code:java} > Job aborted due to stage failure: Task serialization failed: > java.io.IOException: Failed to create local dir in > /home/work/hdd5/yarn/c3prc-hadoop/nodemanager/usercache/h_user_profile/appcache/application_1463372393999_144979/blockmgr-1f96b724-3e16-4c09-8601-1a2e3b758185/3b. > org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:73) > org.apache.spark.storage.DiskStore.contains(DiskStore.scala:173) > org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$getCurrentBlockStatus(BlockManager.scala:391) > org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:801) > org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:629) > org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:987) > org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:99) > org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:85) > org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34) > org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63) > org.apache.spark.SparkContext.broadcast(SparkContext.scala:1332) > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:863) > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:1090) > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply(DAGScheduler.scala:1086) > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply(DAGScheduler.scala:1086) > scala.Option.foreach(Option.scala:236) > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14.apply(DAGScheduler.scala:1086) > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14.apply(DAGScheduler.scala:1085) > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1085) > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1528) > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1493) > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1482) > org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > {code} > We have multiple disks on our cluster nodes; however, it still fails. I think > it's because Spark does not currently handle bad disks in `DiskBlockManager`. > We could handle bad disks in a multi-disk environment to avoid application failure. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
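A minimal sketch of the kind of defensive handling being proposed: probe each configured local dir once and keep only the writable ones, instead of failing the whole application on the first bad disk. This is illustrative only, not the actual {{DiskBlockManager}} change:

{code:java}
import java.io.File

// Keep only local dirs where a file can actually be created; a stand-in for
// how DiskBlockManager could skip broken disks instead of failing the job.
def healthyLocalDirs(configuredDirs: Seq[String]): Seq[File] = {
  configuredDirs.map(new File(_)).filter { dir =>
    try {
      dir.mkdirs()  // ensure the directory exists; ignore the result
      val probe = File.createTempFile("probe", ".tmp", dir)
      probe.delete()  // probe succeeded: the disk is writable
    } catch {
      case _: Exception => false  // broken or read-only disk: drop it
    }
  }
}

// Usage sketch: pick block directories only from the healthy set.
// val dirs = healthyLocalDirs(sys.env.getOrElse("LOCAL_DIRS", "/tmp").split(",").toSeq)
{code}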