[jira] [Assigned] (SPARK-26747) Makes GetMapValue nullability more precise
[ https://issues.apache.org/jira/browse/SPARK-26747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26747: Assignee: Apache Spark > Makes GetMapValue nullability more precise > -- > > Key: SPARK-26747 > URL: https://issues.apache.org/jira/browse/SPARK-26747 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Assignee: Apache Spark >Priority: Minor > > In master, GetMapValue's nullability is always true; > https://github.com/apache/spark/blob/cf133e611020ed178f90358464a1b88cdd9b7889/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala#L371 > But if the input key is foldable, we could make its nullability more precise. > This fix is the same as SPARK-26637. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
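The narrowing described in the ticket can be illustrated with a plain-Python sketch (hypothetical names, not the actual Catalyst code): when the map comes from a CreateMap over literal keys and the lookup key is a foldable literal matching one of them, the result's nullability can be narrowed to that value's nullability; otherwise it stays true.

```python
def get_map_value_nullable(map_entries, key, key_is_foldable):
    """Sketch of a more precise nullability rule for GetMapValue.

    map_entries: list of (literal_key, value_nullable) pairs, standing in
    for a CreateMap over literal keys (a simplifying assumption).
    Returns whether GetMapValue(map, key) may evaluate to null.
    """
    if not key_is_foldable:
        return True  # key unknown at plan time: the lookup may miss, so nullable
    for k, value_nullable in map_entries:
        if k == key:
            # key provably present: nullability is the matched value's
            return value_nullable
    return True  # key provably absent: the lookup yields null

entries = [("a", False), ("b", True)]
print(get_map_value_nullable(entries, "a", key_is_foldable=True))   # False
print(get_map_value_nullable(entries, "b", key_is_foldable=True))   # True
print(get_map_value_nullable(entries, "a", key_is_foldable=False))  # True
```

The first case is the improvement: with today's behavior, all three lookups would be reported as nullable.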
[jira] [Commented] (SPARK-26741) Analyzer incorrectly resolves aggregate function outside of Aggregate operators
[ https://issues.apache.org/jira/browse/SPARK-26741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16753748#comment-16753748 ] Xiao Li commented on SPARK-26741: - Could we re-enable the test by making the test case valid? That means, we just need to remove the aggregate function in the order by clause? > Analyzer incorrectly resolves aggregate function outside of Aggregate > operators > --- > > Key: SPARK-26741 > URL: https://issues.apache.org/jira/browse/SPARK-26741 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Kris Mok >Priority: Major > > The analyzer can sometimes hit issues with resolving functions. e.g. > {code:sql} > select max(id) > from range(10) > group by id > having count(1) >= 1 > order by max(id) > {code} > The analyzed plan of this query is: > {code:none} > == Analyzed Logical Plan == > max(id): bigint > Project [max(id)#91L] > +- Sort [max(id#88L) ASC NULLS FIRST], true >+- Project [max(id)#91L, id#88L] > +- Filter (count(1)#93L >= cast(1 as bigint)) > +- Aggregate [id#88L], [max(id#88L) AS max(id)#91L, count(1) AS > count(1)#93L, id#88L] > +- Range (0, 10, step=1, splits=None) > {code} > Note how an aggregate function is outside of {{Aggregate}} operators in the > fully analyzed plan: > {{Sort [max(id#88L) ASC NULLS FIRST], true}}, which makes the plan invalid. 
> Trying to run this query will lead to weird issues in codegen, but the root > cause is in the analyzer: > {code:none} > java.lang.UnsupportedOperationException: Cannot generate code for expression: > max(input[1, bigint, false]) > at > org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode(Expression.scala:291) > at > org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode$(Expression.scala:290) > at > org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:87) > at > org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:138) > at scala.Option.getOrElse(Option.scala:138) > at > org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:133) > at > org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.$anonfun$createOrderKeys$1(GenerateOrdering.scala:82) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237) > at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) > at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) > at scala.collection.TraversableLike.map(TraversableLike.scala:237) > at scala.collection.TraversableLike.map$(TraversableLike.scala:230) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.createOrderKeys(GenerateOrdering.scala:82) > at > org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.genComparisons(GenerateOrdering.scala:91) > at > org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:152) > at > org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:44) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1194) > at > 
org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.(GenerateOrdering.scala:195) > at > org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.(GenerateOrdering.scala:192) > at > org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:153) > at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3302) > at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2470) > at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3291) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:147) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3287) > at org.apache.spark.sql.Dataset.head(Dataset.scala:2470) > at org.apache.spark.sql.Dataset.take(Dataset.scala:2684) > at org.apache.spark.sql.Dataset.getRows(Dataset.scala:262) > at org.apache.spark.sql.Dataset.showString(Dataset.scala:299) > at org.apache.spark.sql.Dataset.show(Dataset.scala:753) > at org.apache.spark.sql.Dataset.show(Dataset.scala:712) > at org.apache.spark.sql.Dataset.show(Dataset.scala:
[jira] [Assigned] (SPARK-26747) Makes GetMapValue nullability more precise
[ https://issues.apache.org/jira/browse/SPARK-26747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26747: Assignee: (was: Apache Spark) > Makes GetMapValue nullability more precise > -- > > Key: SPARK-26747 > URL: https://issues.apache.org/jira/browse/SPARK-26747 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Takeshi Yamamuro >Priority: Minor > > In master, GetMapValue's nullability is always true; > https://github.com/apache/spark/blob/cf133e611020ed178f90358464a1b88cdd9b7889/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala#L371 > But if the input key is foldable, we could make its nullability more precise. > This fix is the same as SPARK-26637.
[jira] [Created] (SPARK-26747) Makes GetMapValue nullability more precise
Takeshi Yamamuro created SPARK-26747: Summary: Makes GetMapValue nullability more precise Key: SPARK-26747 URL: https://issues.apache.org/jira/browse/SPARK-26747 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Takeshi Yamamuro In master, GetMapValue's nullability is always true; https://github.com/apache/spark/blob/cf133e611020ed178f90358464a1b88cdd9b7889/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala#L371 But if the input key is foldable, we could make its nullability more precise. This fix is the same as SPARK-26637.
[jira] [Commented] (SPARK-24959) Do not invoke the CSV/JSON parser for empty schema
[ https://issues.apache.org/jira/browse/SPARK-24959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16753697#comment-16753697 ] Apache Spark commented on SPARK-24959: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/23667 > Do not invoke the CSV/JSON parser for empty schema > -- > > Key: SPARK-24959 > URL: https://issues.apache.org/jira/browse/SPARK-24959 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 2.4.0 > > > Currently the JSON and CSV parsers are invoked even if the required schema is empty. Invoking the parser for each line has some non-zero overhead, and it can be skipped. Such an optimization should speed up count(), for example.
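The idea behind the optimization can be sketched in plain Python (hypothetical names, not Spark's code): when the required schema is empty, every input line contributes exactly one empty row, so the per-line parser can be bypassed and the row count is simply the line count.

```python
import json

def parse(line, schema):
    """Toy per-line JSON parser: returns a row tuple, or None on bad input."""
    try:
        rec = json.loads(line)
    except ValueError:
        return None
    return tuple(rec.get(f) for f in schema)

def count_rows(lines, required_schema, parse_fn):
    """Sketch: skip the per-line parser when no columns are required."""
    if not required_schema:
        # Empty schema: every line maps to an empty row, so just count
        # lines without paying the per-line parsing overhead.
        return sum(1 for _ in lines)
    return sum(1 for line in lines if parse_fn(line, required_schema) is not None)

data = ['{"a": 1}', '{"a": 2}', '{"a": 3}']
print(count_rows(data, [], parse))      # 3, parser never invoked
print(count_rows(data, ["a"], parse))   # 3, parser invoked per line
```

Note that the two paths can disagree on malformed or empty lines, which is exactly the side effect later reported in SPARK-26745.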
[jira] [Resolved] (SPARK-26725) Fix the input values of UnifiedMemoryManager constructor in test suites
[ https://issues.apache.org/jira/browse/SPARK-26725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-26725. - Resolution: Fixed Fix Version/s: 3.0.0 > Fix the input values of UnifiedMemoryManager constructor in test suites > --- > > Key: SPARK-26725 > URL: https://issues.apache.org/jira/browse/SPARK-26725 > Project: Spark > Issue Type: Test > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Xiao Li >Assignee: Sean Owen >Priority: Minor > Labels: starter > Fix For: 3.0.0 > > > Addressed the comments in > https://github.com/apache/spark/pull/23457#issuecomment-457409976
[jira] [Updated] (SPARK-26745) Non-parsing Dataset.count() optimization causes inconsistent results for JSON inputs with empty lines
[ https://issues.apache.org/jira/browse/SPARK-26745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26745: - Priority: Blocker (was: Major) > Non-parsing Dataset.count() optimization causes inconsistent results for JSON > inputs with empty lines > - > > Key: SPARK-26745 > URL: https://issues.apache.org/jira/browse/SPARK-26745 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: Branden Smith >Priority: Blocker > Labels: correctness > > The optimization introduced by > [SPARK-24959|https://issues.apache.org/jira/browse/SPARK-24959] (improving > performance of {{{color:#FF}count(){color}}} for DataFrames read from > non-multiline JSON in {{{color:#FF}PERMISSIVE{color}}} mode) appears to > cause {{{color:#FF}count(){color}}} to erroneously include empty lines in > its result total if run prior to JSON parsing taking place. > For the following input: > {code:json} > { "a" : 1 , "b" : 2 , "c" : 3 } > { "a" : 4 , "b" : 5 , "c" : 6 } > > { "a" : 7 , "b" : 8 , "c" : 9 } > {code} > *+Spark 2.3:+* > {code:scala} > scala> val df = > spark.read.json("sql/core/src/test/resources/test-data/with-empty-line.json") > df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 1 more field] > scala> df.count > res0: Long = 3 > scala> df.cache.count > res3: Long = 3 > {code} > *+Spark 2.4:+* > {code:scala} > scala> val df = > spark.read.json("sql/core/src/test/resources/test-data/with-empty-line.json") > df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 1 more field] > scala> df.count > res0: Long = 7 > scala> df.cache.count > res1: Long = 3 > {code} > Since the count is apparently updated and cached when the Jackson parser > runs, the optimization also causes the count to appear to be unstable upon > cache/persist operations, as shown above. 
> CSV inputs, also optimized via > [SPARK-24959|https://issues.apache.org/jira/browse/SPARK-24959], do not > appear to be impacted by this effect.
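The discrepancy described above can be modeled with a plain-Python sketch (an illustration, not Spark's code): the optimized count path counts raw input lines, while the parsing path that runs on caching drops lines that yield no record, so an input containing an empty line produces two different answers.

```python
import json

def fast_count(lines):
    # Optimized path: no parsing, every line (including blanks) counts.
    return sum(1 for _ in lines)

def parsed_count(lines):
    # Parsing path: blank lines produce no record, mirroring what happens
    # once the Jackson parser actually runs over the input.
    records = []
    for line in lines:
        if not line.strip():
            continue
        records.append(json.loads(line))
    return len(records)

data = ['{"a": 1}', '{"a": 2}', '', '{"a": 3}']
print(fast_count(data))    # 4, like df.count on Spark 2.4
print(parsed_count(data))  # 3, like df.cache.count
```

The sketch reproduces the reported instability: the same dataset answers 4 before parsing and 3 after, just as `df.count` and `df.cache.count` disagree in the 2.4 transcript above.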
[jira] [Resolved] (SPARK-26743) Add a test to check the actual resource limit set via 'spark.executor.pyspark.memory'
[ https://issues.apache.org/jira/browse/SPARK-26743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26743. -- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 23663 [https://github.com/apache/spark/pull/23663] > Add a test to check the actual resource limit set via > 'spark.executor.pyspark.memory' > - > > Key: SPARK-26743 > URL: https://issues.apache.org/jira/browse/SPARK-26743 > Project: Spark > Issue Type: Test > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.0 > > > It looks like the test that checks the actual resource limit set (by > 'spark.executor.pyspark.memory') is missing.
[jira] [Created] (SPARK-26746) Adaptive causes non-action operations to trigger computation
eaton created SPARK-26746: - Summary: Adaptive causes non-action operations to trigger computation Key: SPARK-26746 URL: https://issues.apache.org/jira/browse/SPARK-26746 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0, 2.3.2, 2.3.1 Reporter: eaton When we turn on the spark.sql.adaptive.enabled switch, the following non-action operation triggers the shuffle computation, but it does not when the switch is off:
{code:scala}
sql("select a, sum(a) from test group by a").rdd
{code}
The reason is that 'ExchangeCoordinator' calls submitMapStage too early; the code looks like this:
{code:scala}
while (j < submittedStageFutures.length) {
  // This call is a blocking call. If the stage has not finished, we will wait at here.
  mapOutputStatistics(j) = submittedStageFutures(j).get()
  j += 1
}
{code}
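The effect can be sketched in plain Python (hypothetical names, not Spark's code): if the coordinator submits its map stages and then blocks on every future while the plan is merely being prepared, the expensive work runs even though no action was requested.

```python
from concurrent.futures import ThreadPoolExecutor

executed = []

def run_shuffle_stage(i):
    # Stands in for the actual shuffle computation of one map stage.
    executed.append(i)
    return f"map-output-stats-{i}"

def prepare_plan_eagerly(executor, num_stages):
    # Mirrors the quoted loop: submit every map stage, then block on each
    # future, so the shuffles run during planning, before any action.
    futures = [executor.submit(run_shuffle_stage, i) for i in range(num_stages)]
    return [f.result() for f in futures]  # blocking, like submittedStageFutures(j).get()

with ThreadPoolExecutor() as pool:
    stats = prepare_plan_eagerly(pool, 2)
print(executed)  # both stages already ran, although no action was triggered
```

A lazy design would hand the unresolved futures to the action instead of calling `result()` during preparation, which is why `.rdd` alone should not have started any shuffle.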
[jira] [Assigned] (SPARK-26743) Add a test to check the actual resource limit set via 'spark.executor.pyspark.memory'
[ https://issues.apache.org/jira/browse/SPARK-26743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-26743: Assignee: Hyukjin Kwon > Add a test to check the actual resource limit set via > 'spark.executor.pyspark.memory' > - > > Key: SPARK-26743 > URL: https://issues.apache.org/jira/browse/SPARK-26743 > Project: Spark > Issue Type: Test > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > It looks like the test that checks the actual resource limit set (by > 'spark.executor.pyspark.memory') is missing.
[jira] [Commented] (SPARK-26745) Non-parsing Dataset.count() optimization causes inconsistent results for JSON inputs with empty lines
[ https://issues.apache.org/jira/browse/SPARK-26745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16753644#comment-16753644 ] Takeshi Yamamuro commented on SPARK-26745: -- I checked I could reproduce this in master/branch-2.4, so I added correctness in the label. > Non-parsing Dataset.count() optimization causes inconsistent results for JSON > inputs with empty lines > - > > Key: SPARK-26745 > URL: https://issues.apache.org/jira/browse/SPARK-26745 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: Branden Smith >Priority: Major > Labels: correctness > > The optimization introduced by > [SPARK-24959|https://issues.apache.org/jira/browse/SPARK-24959] (improving > performance of {{{color:#FF}count(){color}}} for DataFrames read from > non-multiline JSON in {{{color:#FF}PERMISSIVE{color}}} mode) appears to > cause {{{color:#FF}count(){color}}} to erroneously include empty lines in > its result total if run prior to JSON parsing taking place. > For the following input: > {code:json} > { "a" : 1 , "b" : 2 , "c" : 3 } > { "a" : 4 , "b" : 5 , "c" : 6 } > > { "a" : 7 , "b" : 8 , "c" : 9 } > {code} > *+Spark 2.3:+* > {code:scala} > scala> val df = > spark.read.json("sql/core/src/test/resources/test-data/with-empty-line.json") > df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 1 more field] > scala> df.count > res0: Long = 3 > scala> df.cache.count > res3: Long = 3 > {code} > *+Spark 2.4:+* > {code:scala} > scala> val df = > spark.read.json("sql/core/src/test/resources/test-data/with-empty-line.json") > df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 1 more field] > scala> df.count > res0: Long = 7 > scala> df.cache.count > res1: Long = 3 > {code} > Since the count is apparently updated and cached when the Jackson parser > runs, the optimization also causes the count to appear to be unstable upon > cache/persist operations, as shown above. 
> CSV inputs, also optimized via > [SPARK-24959|https://issues.apache.org/jira/browse/SPARK-24959], do not > appear to be impacted by this effect.
[jira] [Updated] (SPARK-26745) Non-parsing Dataset.count() optimization causes inconsistent results for JSON inputs with empty lines
[ https://issues.apache.org/jira/browse/SPARK-26745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-26745: - Labels: correctness (was: ) > Non-parsing Dataset.count() optimization causes inconsistent results for JSON > inputs with empty lines > - > > Key: SPARK-26745 > URL: https://issues.apache.org/jira/browse/SPARK-26745 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: Branden Smith >Priority: Major > Labels: correctness > > The optimization introduced by > [SPARK-24959|https://issues.apache.org/jira/browse/SPARK-24959] (improving > performance of {{{color:#FF}count(){color}}} for DataFrames read from > non-multiline JSON in {{{color:#FF}PERMISSIVE{color}}} mode) appears to > cause {{{color:#FF}count(){color}}} to erroneously include empty lines in > its result total if run prior to JSON parsing taking place. > For the following input: > {code:json} > { "a" : 1 , "b" : 2 , "c" : 3 } > { "a" : 4 , "b" : 5 , "c" : 6 } > > { "a" : 7 , "b" : 8 , "c" : 9 } > {code} > *+Spark 2.3:+* > {code:scala} > scala> val df = > spark.read.json("sql/core/src/test/resources/test-data/with-empty-line.json") > df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 1 more field] > scala> df.count > res0: Long = 3 > scala> df.cache.count > res3: Long = 3 > {code} > *+Spark 2.4:+* > {code:scala} > scala> val df = > spark.read.json("sql/core/src/test/resources/test-data/with-empty-line.json") > df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 1 more field] > scala> df.count > res0: Long = 7 > scala> df.cache.count > res1: Long = 3 > {code} > Since the count is apparently updated and cached when the Jackson parser > runs, the optimization also causes the count to appear to be unstable upon > cache/persist operations, as shown above. 
> CSV inputs, also optimized via > [SPARK-24959|https://issues.apache.org/jira/browse/SPARK-24959], do not > appear to be impacted by this effect.
[jira] [Commented] (SPARK-26745) Non-parsing Dataset.count() optimization causes inconsistent results for JSON inputs with empty lines
[ https://issues.apache.org/jira/browse/SPARK-26745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16753635#comment-16753635 ] Branden Smith commented on SPARK-26745: --- opened PR with proposed resolution: https://github.com/apache/spark/pull/23665 > Non-parsing Dataset.count() optimization causes inconsistent results for JSON > inputs with empty lines > - > > Key: SPARK-26745 > URL: https://issues.apache.org/jira/browse/SPARK-26745 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: Branden Smith >Priority: Major > > The optimization introduced by > [SPARK-24959|https://issues.apache.org/jira/browse/SPARK-24959] (improving > performance of {{{color:#FF}count(){color}}} for DataFrames read from > non-multiline JSON in {{{color:#FF}PERMISSIVE{color}}} mode) appears to > cause {{{color:#FF}count(){color}}} to erroneously include empty lines in > its result total if run prior to JSON parsing taking place. > For the following input: > {code:json} > { "a" : 1 , "b" : 2 , "c" : 3 } > { "a" : 4 , "b" : 5 , "c" : 6 } > > { "a" : 7 , "b" : 8 , "c" : 9 } > {code} > *+Spark 2.3:+* > {code:scala} > scala> val df = > spark.read.json("sql/core/src/test/resources/test-data/with-empty-line.json") > df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 1 more field] > scala> df.count > res0: Long = 3 > scala> df.cache.count > res3: Long = 3 > {code} > *+Spark 2.4:+* > {code:scala} > scala> val df = > spark.read.json("sql/core/src/test/resources/test-data/with-empty-line.json") > df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 1 more field] > scala> df.count > res0: Long = 7 > scala> df.cache.count > res1: Long = 3 > {code} > Since the count is apparently updated and cached when the Jackson parser > runs, the optimization also causes the count to appear to be unstable upon > cache/persist operations, as shown above. 
> CSV inputs, also optimized via > [SPARK-24959|https://issues.apache.org/jira/browse/SPARK-24959], do not > appear to be impacted by this effect.
[jira] [Assigned] (SPARK-26745) Non-parsing Dataset.count() optimization causes inconsistent results for JSON inputs with empty lines
[ https://issues.apache.org/jira/browse/SPARK-26745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26745: Assignee: (was: Apache Spark) > Non-parsing Dataset.count() optimization causes inconsistent results for JSON > inputs with empty lines > - > > Key: SPARK-26745 > URL: https://issues.apache.org/jira/browse/SPARK-26745 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: Branden Smith >Priority: Major > > The optimization introduced by > [SPARK-24959|https://issues.apache.org/jira/browse/SPARK-24959] (improving > performance of {{{color:#FF}count(){color}}} for DataFrames read from > non-multiline JSON in {{{color:#FF}PERMISSIVE{color}}} mode) appears to > cause {{{color:#FF}count(){color}}} to erroneously include empty lines in > its result total if run prior to JSON parsing taking place. > For the following input: > {code:json} > { "a" : 1 , "b" : 2 , "c" : 3 } > { "a" : 4 , "b" : 5 , "c" : 6 } > > { "a" : 7 , "b" : 8 , "c" : 9 } > {code} > *+Spark 2.3:+* > {code:scala} > scala> val df = > spark.read.json("sql/core/src/test/resources/test-data/with-empty-line.json") > df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 1 more field] > scala> df.count > res0: Long = 3 > scala> df.cache.count > res3: Long = 3 > {code} > *+Spark 2.4:+* > {code:scala} > scala> val df = > spark.read.json("sql/core/src/test/resources/test-data/with-empty-line.json") > df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 1 more field] > scala> df.count > res0: Long = 7 > scala> df.cache.count > res1: Long = 3 > {code} > Since the count is apparently updated and cached when the Jackson parser > runs, the optimization also causes the count to appear to be unstable upon > cache/persist operations, as shown above. > CSV inputs, also optimized via > [SPARK-24959|https://issues.apache.org/jira/browse/SPARK-24959], do not > appear to be impacted by this effect. 
[jira] [Assigned] (SPARK-26745) Non-parsing Dataset.count() optimization causes inconsistent results for JSON inputs with empty lines
[ https://issues.apache.org/jira/browse/SPARK-26745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-26745: Assignee: Apache Spark > Non-parsing Dataset.count() optimization causes inconsistent results for JSON > inputs with empty lines > - > > Key: SPARK-26745 > URL: https://issues.apache.org/jira/browse/SPARK-26745 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0, 3.0.0 >Reporter: Branden Smith >Assignee: Apache Spark >Priority: Major > > The optimization introduced by > [SPARK-24959|https://issues.apache.org/jira/browse/SPARK-24959] (improving > performance of {{{color:#FF}count(){color}}} for DataFrames read from > non-multiline JSON in {{{color:#FF}PERMISSIVE{color}}} mode) appears to > cause {{{color:#FF}count(){color}}} to erroneously include empty lines in > its result total if run prior to JSON parsing taking place. > For the following input: > {code:json} > { "a" : 1 , "b" : 2 , "c" : 3 } > { "a" : 4 , "b" : 5 , "c" : 6 } > > { "a" : 7 , "b" : 8 , "c" : 9 } > {code} > *+Spark 2.3:+* > {code:scala} > scala> val df = > spark.read.json("sql/core/src/test/resources/test-data/with-empty-line.json") > df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 1 more field] > scala> df.count > res0: Long = 3 > scala> df.cache.count > res3: Long = 3 > {code} > *+Spark 2.4:+* > {code:scala} > scala> val df = > spark.read.json("sql/core/src/test/resources/test-data/with-empty-line.json") > df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 1 more field] > scala> df.count > res0: Long = 7 > scala> df.cache.count > res1: Long = 3 > {code} > Since the count is apparently updated and cached when the Jackson parser > runs, the optimization also causes the count to appear to be unstable upon > cache/persist operations, as shown above. > CSV inputs, also optimized via > [SPARK-24959|https://issues.apache.org/jira/browse/SPARK-24959], do not > appear to be impacted by this effect. 
[jira] [Created] (SPARK-26745) Non-parsing Dataset.count() optimization causes inconsistent results for JSON inputs with empty lines
Branden Smith created SPARK-26745: - Summary: Non-parsing Dataset.count() optimization causes inconsistent results for JSON inputs with empty lines Key: SPARK-26745 URL: https://issues.apache.org/jira/browse/SPARK-26745 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0, 3.0.0 Reporter: Branden Smith The optimization introduced by [SPARK-24959|https://issues.apache.org/jira/browse/SPARK-24959] (improving performance of {{{color:#FF}count(){color}}} for DataFrames read from non-multiline JSON in {{{color:#FF}PERMISSIVE{color}}} mode) appears to cause {{{color:#FF}count(){color}}} to erroneously include empty lines in its result total if run prior to JSON parsing taking place. For the following input: {code:json} { "a" : 1 , "b" : 2 , "c" : 3 } { "a" : 4 , "b" : 5 , "c" : 6 } { "a" : 7 , "b" : 8 , "c" : 9 } {code} *+Spark 2.3:+* {code:scala} scala> val df = spark.read.json("sql/core/src/test/resources/test-data/with-empty-line.json") df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 1 more field] scala> df.count res0: Long = 3 scala> df.cache.count res3: Long = 3 {code} *+Spark 2.4:+* {code:scala} scala> val df = spark.read.json("sql/core/src/test/resources/test-data/with-empty-line.json") df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 1 more field] scala> df.count res0: Long = 7 scala> df.cache.count res1: Long = 3 {code} Since the count is apparently updated and cached when the Jackson parser runs, the optimization also causes the count to appear to be unstable upon cache/persist operations, as shown above. CSV inputs, also optimized via [SPARK-24959|https://issues.apache.org/jira/browse/SPARK-24959], do not appear to be impacted by this effect.
[jira] [Closed] (SPARK-26724) Non negative coefficients for LinearRegression
[ https://issues.apache.org/jira/browse/SPARK-26724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-26724. - > Non negative coefficients for LinearRegression > -- > > Key: SPARK-26724 > URL: https://issues.apache.org/jira/browse/SPARK-26724 > Project: Spark > Issue Type: Question > Components: MLlib >Affects Versions: 2.4.0 >Reporter: Alex Chang >Priority: Major > > Hi, > > For > [pyspark.ml.regression.LinearRegression|http://spark.apache.org/docs/latest/api/python/pyspark.ml.html?highlight=linearregression#pyspark.ml.regression.LinearRegression], > is there any approach or API to force coefficients to be non-negative? > > Thanks > Alex
[jira] [Closed] (SPARK-26717) Support PodPriority for spark driver and executor on kubernetes
[ https://issues.apache.org/jira/browse/SPARK-26717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-26717. - > Support PodPriority for spark driver and executor on kubernetes > --- > > Key: SPARK-26717 > URL: https://issues.apache.org/jira/browse/SPARK-26717 > Project: Spark > Issue Type: Wish > Components: Kubernetes >Affects Versions: 2.4.0 >Reporter: Li Gao >Priority: Major > > Hi, > I'd like to see whether we could enhance the current Spark 2.4 on k8s support > to bring in the `PodPriority` feature that's available since k8s v1.11+ > [https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/] > This is helpful in a shared cluster environment to ensure SLA for jobs with > different priorities. > Thanks! > Li
[jira] [Updated] (SPARK-26379) Use dummy TimeZoneId for CurrentTimestamp to avoid UnresolvedException in CurrentBatchTimestamp
[ https://issues.apache.org/jira/browse/SPARK-26379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26379: -- Summary: Use dummy TimeZoneId for CurrentTimestamp to avoid UnresolvedException in CurrentBatchTimestamp (was: Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp) > Use dummy TimeZoneId for CurrentTimestamp to avoid UnresolvedException in > CurrentBatchTimestamp > --- > > Key: SPARK-26379 > URL: https://issues.apache.org/jira/browse/SPARK-26379 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0, 3.0.0 >Reporter: Kailash Gupta >Assignee: Jungtaek Lim >Priority: Major > Fix For: 2.4.1, 3.0.0 > > > While using withColumn to add a column to a structured streaming Dataset, I > am getting the following exception: > org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to > dataType on unresolved object, tree: 'timestamp > Following is sample code > {code:java} > final String path = "path_to_input_directory"; > final StructType schema = new StructType(new StructField[] { new > StructField("word", StringType, false, Metadata.empty()), new > StructField("count", DataTypes.IntegerType, false, Metadata.empty()) }); > SparkSession sparkSession = > SparkSession.builder().appName("StructuredStreamingIssue").master("local").getOrCreate(); > Dataset words = sparkSession.readStream().option("sep", > ",").schema(schema).csv(path); > Dataset wordsWithTimestamp = words.withColumn("timestamp", > functions.current_timestamp()); > // wordsWithTimestamp.explain(true); > StreamingQuery query = > wordsWithTimestamp.writeStream().outputMode("update").option("truncate", > "false").format("console").trigger(Trigger.ProcessingTime("2 > seconds")).start(); > query.awaitTermination();{code} > Following are the contents of the file present at _path_ > {code:java} > a,2 > c,4 > d,2 > r,1 > t,9 > {code} > This seems to work with the 2.2.0
release, but not with 2.3.0 and 2.4.0
[jira] [Updated] (SPARK-26379) Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp
[ https://issues.apache.org/jira/browse/SPARK-26379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26379: -- Summary: Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp (was: Fix issue on adding current_timestamp to streaming query)
[jira] [Updated] (SPARK-26379) Fix issue on adding current_timestamp to streaming query
[ https://issues.apache.org/jira/browse/SPARK-26379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-26379: -- Summary: Fix issue on adding current_timestamp to streaming query (was: Fix issue on adding current_timestamp/current_date to streaming query)
[jira] [Resolved] (SPARK-26379) Fix issue on adding current_timestamp/current_date to streaming query
[ https://issues.apache.org/jira/browse/SPARK-26379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-26379. --- Resolution: Fixed
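The four notifications above all track the same defect. The mechanics can be illustrated with a small, purely hypothetical Python model (none of these classes are Spark's real implementation): an expression whose time zone id is unset stays unresolved, and calling dataType on it raises the kind of UnresolvedException reported here; seeding a dummy time zone id up front, as the fix's title suggests, keeps the expression resolvable regardless of analyzer ordering.

```python
# Toy model of the failure mode -- NOT Spark's actual classes.
class UnresolvedException(Exception):
    pass

class CurrentBatchTimestamp:
    def __init__(self, timezone_id=None):
        self.timezone_id = timezone_id

    @property
    def resolved(self):
        # In this model, resolution requires a time zone to be set.
        return self.timezone_id is not None

    @property
    def data_type(self):
        if not self.resolved:
            raise UnresolvedException(
                "Invalid call to dataType on unresolved object")
        return "timestamp"

# Before the fix: no time zone yet, so asking for dataType blows up.
broken = CurrentBatchTimestamp()
# After the fix: a dummy time zone id keeps the expression resolved.
fixed = CurrentBatchTimestamp(timezone_id="UTC")
```

Accessing `fixed.data_type` succeeds, while `broken.data_type` raises, mirroring the reported stack trace.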
[jira] [Updated] (SPARK-26739) Standardized Join Types for DataFrames
[ https://issues.apache.org/jira/browse/SPARK-26739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-26739: - Fix Version/s: (was: 2.4.1) > Standardized Join Types for DataFrames > -- > > Key: SPARK-26739 > URL: https://issues.apache.org/jira/browse/SPARK-26739 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Skyler Lehan >Priority: Minor > Original Estimate: 48h > Remaining Estimate: 48h > > h3. *Q1.* What are you trying to do? Articulate your objectives using > absolutely no jargon. > Currently, in the join functions on > [DataFrames|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset], > the join types are defined via a string parameter called joinType. In order > for a developer to know which joins are possible, they must look up the API > call for join. While this works fine, it can cause the developer to make a > typo resulting in improper joins and/or unexpected errors. The objective of > this improvement would be to allow developers to use a common definition for > join types (by enum or constants) called JoinTypes. This would contain the > possible joins and remove the possibility of a typo. It would also allow > Spark to alter the names of the joins in the future without impacting > end-users. > h3. *Q2.* What problem is this proposal NOT designed to solve? > The problem this solves is extremely narrow, it would not solve anything > other than providing a common definition for join types. > h3. *Q3.* How is it done today, and what are the limits of current practice? > Currently, developers must join two DataFrames like so: > {code:java} > val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), > "left_outer") > {code} > Where they manually type the join type. 
As stated above, this: > * Requires developers to manually type in the join > * Can cause possibility of typos > * Restricts renaming of join types as its a literal string > * Does not restrict and/or compile check the join type being used, leading > to runtime errors > h3. *Q4.* What is new in your approach and why do you think it will be > successful? > The new approach would use constants, something like this: > {code:java} > val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), > JoinType.LEFT_OUTER) > {code} > This would provide: > * In code reference/definitions of the possible join types > ** This subsequently allows the addition of scaladoc of what each join type > does and how it operates > * Removes possibilities of a typo on the join type > * Provides compile time checking of the join type (only if an enum is used) > To clarify, if JoinType is a constant, it would just fill in the joinType > string parameter for users. If an enum is used, it would restrict the domain > of possible join types to whatever is defined in the future JoinType enum. > The enum is preferred, however it would take longer to implement. > h3. *Q5.* Who cares? If you are successful, what difference will it make? > Developers using Apache Spark will care. This will make the join function > easier to wield and lead to less runtime errors. It will save time by > bringing join type validation at compile time. It will also provide in code > reference to the join types, which saves the developer time of having to look > up and navigate the multiple join functions to find the possible join types. > In addition to that, the resulting constants/enum would have documentation on > how that join type works. > h3. *Q6.* What are the risks? > Users of Apache Spark who currently use strings to define their join types > could be impacted if an enum is chosen as the common definition. This risk > can be mitigated by using string constants. 
The string constants would be the > exact same string as the string literals used today. For example: > {code:java} > JoinType.INNER = "inner" > {code} > If an enum is still the preferred way of defining the join types, new join > functions could be added that take in these enums and the join calls that > contain string parameters for joinType could be deprecated. This would give > developers a chance to change over to the new join types. > h3. *Q7.* How long will it take? > A few days for a seasoned Spark developer. > h3. *Q8.* What are the mid-term and final "exams" to check for success? > Mid-term exam would be the addition of a common definition of the join types > and additional join functions that take in the join type enum/constant. The > final exam would be working tests written to check the functionality of these > new join functions and the join functions that take a string for joinType > would be deprecated. > h3. *Appendix A.* Proposed API Changes. Optional sect
[jira] [Commented] (SPARK-19254) Support Seq, Map, and Struct in functions.lit
[ https://issues.apache.org/jira/browse/SPARK-19254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16753582#comment-16753582 ] Artem Rukavytsia commented on SPARK-19254: -- It looks like PySpark doesn't support `typedLit`: `from pyspark.sql.functions import typedLit` fails. This seems like unexpected behavior. > Support Seq, Map, and Struct in functions.lit > - > > Key: SPARK-19254 > URL: https://issues.apache.org/jira/browse/SPARK-19254 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.1.0 >Reporter: Takeshi Yamamuro >Assignee: Takeshi Yamamuro >Priority: Minor > Fix For: 2.2.0 > > > In the current implementation, functions.lit does not support Seq, Map, and > Struct. This ticket is intended to support them. This is a follow-up of > https://issues.apache.org/jira/browse/SPARK-17683.
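For readers hitting the PySpark gap noted in the comment, the essence of typedLit versus lit can be sketched in plain Python with no Spark dependency. The tags below (`"literal"`, `"array"`, `"map"`) are illustrative, not Spark APIs: the point is that a typedLit-style helper must recurse into containers and emit nested literal expressions rather than rejecting them.

```python
# Illustrative sketch: recursively convert Python values into a nested
# literal representation, the way typedLit handles Seq/Map/Struct on the
# Scala side. Tag names here are hypothetical, not Spark's.
def typed_lit(value):
    if isinstance(value, (list, tuple)):
        return ("array", [typed_lit(v) for v in value])
    if isinstance(value, dict):
        return ("map", {k: typed_lit(v) for k, v in value.items()})
    return ("literal", value)
```

A plain `lit`-style helper would only handle the final scalar case; the two container branches are what the ticket added.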
[jira] [Resolved] (SPARK-26716) Refactor supportDataType API: the supported types of read/write should be consistent
[ https://issues.apache.org/jira/browse/SPARK-26716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-26716. --- Resolution: Fixed Assignee: Gengliang Wang Fix Version/s: 3.0.0 This is resolved via https://github.com/apache/spark/pull/23639 > Refactor supportDataType API: the supported types of read/write should be > consistent > - > > Key: SPARK-26716 > URL: https://issues.apache.org/jira/browse/SPARK-26716 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.0.0 > > > 1. rename to supportsDataType > 2. remove parameter isReadPath. The supported types of read/write should be > consistent
[jira] [Updated] (SPARK-26739) Standardized Join Types for DataFrames
[ https://issues.apache.org/jira/browse/SPARK-26739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Skyler Lehan updated SPARK-26739: - Description: h3. *Q1.* What are you trying to do? Articulate your objectives using absolutely no jargon. Currently, in the join functions on [DataFrames|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset], the join types are defined via a string parameter called joinType. In order for a developer to know which joins are possible, they must look up the API call for join. While this works fine, it can cause the developer to make a typo resulting in improper joins and/or unexpected errors. The objective of this improvement would be to allow developers to use a common definition for join types (by enum or constants) called JoinTypes. This would contain the possible joins and remove the possibility of a typo. It would also allow Spark to alter the names of the joins in the future without impacting end-users. h3. *Q2.* What problem is this proposal NOT designed to solve? The problem this solves is extremely narrow, it would not solve anything other than providing a common definition for join types. h3. *Q3.* How is it done today, and what are the limits of current practice? Currently, developers must join two DataFrames like so: {code:java} val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), "left_outer") {code} Where they manually type the join type. As stated above, this: * Requires developers to manually type in the join * Can cause possibility of typos * Restricts renaming of join types as its a literal string * Does not restrict and/or compile check the join type being used, leading to runtime errors h3. *Q4.* What is new in your approach and why do you think it will be successful? 
The new approach would use constants, something like this: {code:java} val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), JoinType.LEFT_OUTER) {code} This would provide: * In code reference/definitions of the possible join types ** This subsequently allows the addition of scaladoc of what each join type does and how it operates * Removes possibilities of a typo on the join type * Provides compile time checking of the join type (only if an enum is used) To clarify, if JoinType is a constant, it would just fill in the joinType string parameter for users. If an enum is used, it would restrict the domain of possible join types to whatever is defined in the future JoinType enum. The enum is preferred, however it would take longer to implement. h3. *Q5.* Who cares? If you are successful, what difference will it make? Developers using Apache Spark will care. This will make the join function easier to wield and lead to less runtime errors. It will save time by bringing join type validation at compile time. It will also provide in code reference to the join types, which saves the developer time of having to look up and navigate the multiple join functions to find the possible join types. In addition to that, the resulting constants/enum would have documentation on how that join type works. h3. *Q6.* What are the risks? Users of Apache Spark who currently use strings to define their join types could be impacted if an enum is chosen as the common definition. This risk can be mitigated by using string constants. The string constants would be the exact same string as the string literals used today. For example: {code:java} JoinType.INNER = "inner" {code} If an enum is still the preferred way of defining the join types, new join functions could be added that take in these enums and the join calls that contain string parameters for joinType could be deprecated. This would give developers a chance to change over to the new join types. h3. 
*Q7.* How long will it take? A few days for a seasoned Spark developer. h3. *Q8.* What are the mid-term and final "exams" to check for success? Mid-term exam would be the addition of a common definition of the join types and additional join functions that take in the join type enum/constant. The final exam would be working tests written to check the functionality of these new join functions and the join functions that take a string for joinType would be deprecated. h3. *Appendix A.* Proposed API Changes. Optional section defining APIs changes, if any. Backward and forward compatibility must be taken into account. *If enums are used:* The following join function signatures would be added to the Dataset API: {code:java} def join(right: Dataset[_], joinExprs: Column, joinType: JoinType): DataFrame def join(right: Dataset[_], usingColumns: Seq[String], joinType: JoinType): DataFrame {code} The following functions would be deprecated: {code:java} def join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame def join(right: Dataset[_], usingColumns: Seq[String], joinType: String): Data
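The string-constants mitigation from Q6 can be sketched as follows. This is an illustrative plain-Python model of the proposal (the real change would target the Scala Dataset API); each constant carries the exact string literal Spark accepts today, so the constant is typo-proof while staying backward compatible.

```python
# Hypothetical sketch of the proposed JoinType constants; the string values
# are the join-type strings Spark's join(joinType: String) accepts today.
class JoinType:
    INNER = "inner"
    CROSS = "cross"
    FULL_OUTER = "full_outer"
    LEFT_OUTER = "left_outer"
    RIGHT_OUTER = "right_outer"
    LEFT_SEMI = "left_semi"
    LEFT_ANTI = "left_anti"

def join(left, right, on, join_type=JoinType.INNER):
    # A real implementation would forward join_type to the engine; here we
    # only validate it, which is exactly the typo check the proposal wants.
    valid = {v for k, v in vars(JoinType).items() if not k.startswith("_")}
    if join_type not in valid:
        raise ValueError(f"Unsupported join type: {join_type!r}")
    return (left, right, on, join_type)
```

With constants, a misspelling like `"left_oter"` fails loudly at the call site instead of producing an improper join at runtime; with a true enum the check would move to compile time.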
[jira] [Updated] (SPARK-26689) Single disk broken causing broadcast failure
[ https://issues.apache.org/jira/browse/SPARK-26689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] liupengcheng updated SPARK-26689: - Summary: Single disk broken causing broadcast failure (was: Disk broken causing broadcast failure) > Single disk broken causing broadcast failure > > > Key: SPARK-26689 > URL: https://issues.apache.org/jira/browse/SPARK-26689 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.4.0 > Environment: Spark on Yarn > Mutliple Disk >Reporter: liupengcheng >Priority: Major > > We encoutered an application failure in our production cluster which caused > by the bad disk problems. It will incur application failure. > {code:java} > Job aborted due to stage failure: Task serialization failed: > java.io.IOException: Failed to create local dir in > /home/work/hdd5/yarn/c3prc-hadoop/nodemanager/usercache/h_user_profile/appcache/application_1463372393999_144979/blockmgr-1f96b724-3e16-4c09-8601-1a2e3b758185/3b. > org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:73) > org.apache.spark.storage.DiskStore.contains(DiskStore.scala:173) > org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$getCurrentBlockStatus(BlockManager.scala:391) > org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:801) > org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:629) > org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:987) > org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:99) > org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:85) > org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34) > org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63) > org.apache.spark.SparkContext.broadcast(SparkContext.scala:1332) > 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:863) > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:1090) > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply(DAGScheduler.scala:1086) > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply(DAGScheduler.scala:1086) > scala.Option.foreach(Option.scala:236) > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14.apply(DAGScheduler.scala:1086) > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14.apply(DAGScheduler.scala:1085) > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1085) > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1528) > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1493) > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1482) > org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) > {code} > We have multiple disks on our cluster nodes; however, it still fails. I think > it's because Spark does not currently handle bad disks in `DiskBlockManager`. > We could handle a bad disk in a multi-disk environment to avoid > application failure.
[jira] [Commented] (SPARK-26675) Error happened during creating avro files
[ https://issues.apache.org/jira/browse/SPARK-26675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16753354#comment-16753354 ] Hyukjin Kwon commented on SPARK-26675: -- Also, please make the code as minimised as possible. > Error happened during creating avro files > - > > Key: SPARK-26675 > URL: https://issues.apache.org/jira/browse/SPARK-26675 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.4.0 >Reporter: Tony Mao >Priority: Major > > Run cmd > {code:java} > spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.0 > /nke/reformat.py > {code} > code in reformat.py > {code:java} > df = spark.read.option("multiline", "true").json("file:///nke/example1.json") > df.createOrReplaceTempView("traffic") > a = spark.sql("""SELECT store.*, store.name as store_name, > store.dataSupplierId as store_dataSupplierId, trafficSensor.*, > trafficSensor.name as trafficSensor_name, trafficSensor.dataSupplierId as > trafficSensor_dataSupplierId, readings.* > FROM (SELECT explode(stores) as store, explode(store.trafficSensors) as > trafficSensor, > explode(trafficSensor.trafficSensorReadings) as readings FROM traffic)""") > b = a.drop("trafficSensors", "trafficSensorReadings", "name", > "dataSupplierId") > b.write.format("avro").save("file:///nke/curated/namesAndFavColors.avro") > {code} > Error message: > {code:java} > Traceback (most recent call last): > File "/nke/reformat.py", line 18, in > b.select("store_name", > "store_dataSupplierId").write.format("avro").save("file:///nke/curated/namesAndFavColors.avro") > File "/usr/spark-2.4.0/python/lib/pyspark.zip/pyspark/sql/readwriter.py", > line 736, in save > File "/usr/spark-2.4.0/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1257, in __call__ > File "/usr/spark-2.4.0/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, > in deco > File "/usr/spark-2.4.0/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line > 328, in get_return_value > 
py4j.protocol.Py4JJavaError: An error occurred while calling o45.save. > : java.lang.NoSuchMethodError: > org.apache.avro.Schema.createUnion([Lorg/apache/avro/Schema;)Lorg/apache/avro/Schema; > at > org.apache.spark.sql.avro.SchemaConverters$.toAvroType(SchemaConverters.scala:185) > at > org.apache.spark.sql.avro.SchemaConverters$$anonfun$5.apply(SchemaConverters.scala:176) > at > org.apache.spark.sql.avro.SchemaConverters$$anonfun$5.apply(SchemaConverters.scala:174) > at scala.collection.Iterator$class.foreach(Iterator.scala:891) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99) > at > org.apache.spark.sql.avro.SchemaConverters$.toAvroType(SchemaConverters.scala:174) > at > org.apache.spark.sql.avro.AvroFileFormat$$anonfun$7.apply(AvroFileFormat.scala:118) > at > org.apache.spark.sql.avro.AvroFileFormat$$anonfun$7.apply(AvroFileFormat.scala:118) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.avro.AvroFileFormat.prepareWrite(AvroFileFormat.scala:118) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:103) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > 
at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) > at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConf
[jira] [Updated] (SPARK-26675) Error happened during creating avro files
[ https://issues.apache.org/jira/browse/SPARK-26675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26675: - Component/s: SQL
[jira] [Commented] (SPARK-26727) CREATE OR REPLACE VIEW query fails with TableAlreadyExistsException
[ https://issues.apache.org/jira/browse/SPARK-26727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16753353#comment-16753353 ] Hyukjin Kwon commented on SPARK-26727: -- I can't reproduce it either. Since this issue is hard to reproduce, please fill in the details as thoroughly as possible. As things stand, it looks like no one except you is able to investigate further. > CREATE OR REPLACE VIEW query fails with TableAlreadyExistsException > --- > > Key: SPARK-26727 > URL: https://issues.apache.org/jira/browse/SPARK-26727 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Srinivas Yarra >Priority: Major > > We experienced that sometimes the Hive query "CREATE OR REPLACE VIEW <view name> AS SELECT <columns> FROM <table>" fails with the following exception: > {code:java} > org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or > view '<view name>' already exists in database 'default'; at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:314) > at > org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:165) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) > at > org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) > at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195) at > org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195) at > org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > at 
org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364) at > org.apache.spark.sql.Dataset.<init>(Dataset.scala:195) at > org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:80) at > org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642) ... 49 elided > {code} > {code} > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res1: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res2: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res3: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res4: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res5: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res6: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res7: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res8: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res9: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res10: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") res11: org.apache.spark.sql.DataFrame = [] > scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy > FROM ae_dual") > org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or > view 'testsparkreplace' already exists in database 'default'; at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:246) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:236) > at > org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:236) > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97) > at > org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:236) > at > org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94) > at > org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:319)
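The transcript above shows eleven successful CREATE OR REPLACE VIEW calls followed by one failure, which is the signature of a non-atomic check-then-create somewhere in the view code path; whether that is the actual root cause here is unconfirmed. A minimal Python simulation (a hypothetical `catalog` model, not Spark's implementation) of how a stale existence check can produce this intermittent "already exists" behavior:

```python
# Hypothetical model: "create or replace" chooses between the create and
# replace paths using a snapshot of the catalog. If the snapshot is stale,
# the create path runs even though the view already exists, and the backing
# store raises "already exists" -- mirroring the intermittent REPL failure.
catalog = {}

def create_or_replace(name, definition, snapshot):
    if name in snapshot:                 # replace path: overwrite in place
        catalog[name] = definition
    else:                                # create path: must not exist yet
        if name in catalog:
            raise RuntimeError(
                f"Table or view '{name}' already exists in database 'default'")
        catalog[name] = definition

# With a fresh snapshot, repeated calls succeed:
create_or_replace("testsparkreplace", "SELECT dummy FROM ae_dual", dict(catalog))
create_or_replace("testsparkreplace", "SELECT dummy FROM ae_dual", dict(catalog))
# A stale (empty) snapshot taken before the view existed reproduces the error:
try:
    create_or_replace("testsparkreplace", "SELECT dummy FROM ae_dual", {})
except RuntimeError as err:
    print(err)
```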
[jira] [Updated] (SPARK-26722) Set SPARK_TEST_KEY to pull request builder and spark-master-test-sbt-hadoop-2.7
[ https://issues.apache.org/jira/browse/SPARK-26722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26722: - Description: See https://github.com/apache/spark/pull/23117#issuecomment-456083489 (was: from https://github.com/apache/spark/pull/23117: we need to add the {{SPARK_TEST_KEY=1}} env var to both the GHPRB and {{spark-master-test-sbt-hadoop-2.7}} builds. this is done for the PRB, and was manually added to the {{spark-master-test-sbt-hadoop-2.7}} build. i will leave this open until i finish porting the JJB configs in to the main spark repo (for the {{spark-master-test-sbt-hadoop-2.7}} build).) > Set SPARK_TEST_KEY to pull request builder and > spark-master-test-sbt-hadoop-2.7 > --- > > Key: SPARK-26722 > URL: https://issues.apache.org/jira/browse/SPARK-26722 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > Attachments: SPARK-26731.doc > > > See https://github.com/apache/spark/pull/23117#issuecomment-456083489 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-26722) Set SPARK_TEST_KEY to pull request builder and spark-master-test-sbt-hadoop-2.7
[ https://issues.apache.org/jira/browse/SPARK-26722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16753352#comment-16753352 ] Hyukjin Kwon commented on SPARK-26722: -- We're synced now and I think it's (mostly) fixed now. There was a bit of miscommunication between me and Shane - sorry I had to clarify it. > Set SPARK_TEST_KEY to pull request builder and > spark-master-test-sbt-hadoop-2.7 > --- > > Key: SPARK-26722 > URL: https://issues.apache.org/jira/browse/SPARK-26722 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > Attachments: SPARK-26731.doc > > > See https://github.com/apache/spark/pull/23117#issuecomment-456083489 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26722) Set SPARK_TEST_KEY to pull request builder and spark-master-test-sbt-hadoop-2.7
[ https://issues.apache.org/jira/browse/SPARK-26722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-26722: - Summary: Set SPARK_TEST_KEY to pull request builder and spark-master-test-sbt-hadoop-2.7 (was: add SPARK_TEST_KEY=1 to pull request builder and spark-master-test-sbt-hadoop-2.7) > Set SPARK_TEST_KEY to pull request builder and > spark-master-test-sbt-hadoop-2.7 > --- > > Key: SPARK-26722 > URL: https://issues.apache.org/jira/browse/SPARK-26722 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > Attachments: SPARK-26731.doc > > > from https://github.com/apache/spark/pull/23117: > we need to add the {{SPARK_TEST_KEY=1}} env var to both the GHPRB and > {{spark-master-test-sbt-hadoop-2.7}} builds. > this is done for the PRB, and was manually added to the > {{spark-master-test-sbt-hadoop-2.7}} build. > i will leave this open until i finish porting the JJB configs in to the main > spark repo (for the {{spark-master-test-sbt-hadoop-2.7}} build). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-26722) add SPARK_TEST_KEY=1 to pull request builder and spark-master-test-sbt-hadoop-2.7
[ https://issues.apache.org/jira/browse/SPARK-26722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-26722. -- Resolution: Fixed Thank you Shane so much for doing this! > add SPARK_TEST_KEY=1 to pull request builder and > spark-master-test-sbt-hadoop-2.7 > - > > Key: SPARK-26722 > URL: https://issues.apache.org/jira/browse/SPARK-26722 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.0.0 >Reporter: shane knapp >Assignee: shane knapp >Priority: Major > Attachments: SPARK-26731.doc > > > from https://github.com/apache/spark/pull/23117: > we need to add the {{SPARK_TEST_KEY=1}} env var to both the GHPRB and > {{spark-master-test-sbt-hadoop-2.7}} builds. > this is done for the PRB, and was manually added to the > {{spark-master-test-sbt-hadoop-2.7}} build. > i will leave this open until i finish porting the JJB configs in to the main > spark repo (for the {{spark-master-test-sbt-hadoop-2.7}} build). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
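For reference, the change this ticket tracks is purely environmental: the PRB and the spark-master-test-sbt-hadoop-2.7 build must run with the variable set. A trivial sketch of the effect (the real wiring is one `export SPARK_TEST_KEY=1` in the Jenkins/JJB job config, not application code):

```python
# Sketch: tests gated on SPARK_TEST_KEY see it via the process environment.
import os

os.environ["SPARK_TEST_KEY"] = "1"   # what the Jenkins job config provides
print(os.environ["SPARK_TEST_KEY"])  # → 1
```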
[jira] [Commented] (SPARK-15368) Spark History Server does not pick up extraClasspath
[ https://issues.apache.org/jira/browse/SPARK-15368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16753339#comment-16753339 ] Sandeep Nemuri commented on SPARK-15368: Just for the record, {{SPARK_DIST_CLASSPATH}} works. > Spark History Server does not pick up extraClasspath > > > Key: SPARK-15368 > URL: https://issues.apache.org/jira/browse/SPARK-15368 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1 > Environment: HDP-2.4 > CentOS >Reporter: Dawson Choong >Priority: Major > > We've encountered a problem where the Spark History Server is not picking up > on the {{spark.driver.extraClassPath}} parameter in the {{Custom > spark-defaults}} inside Ambari. Because the needed JARs are not being picked > up, this is leading to {{ClassNotFoundException}}. (Our current workaround is > to manually export the JARs in the Spark-env.) > Log file: > Spark Command: /usr/java/default/bin/java -Dhdp.version=2.4.0.0-169 -cp > /usr/hdp/2.4.0.0-169/spark/sbin/../conf/:/usr/hdp/2.4.0.0-169/spark/lib/spark-assembly-1.6.0.2.4.0.0-169-hadoop2.7.1.2.4.0.0-169.jar:/usr/hdp/2.4.0.0-169/spark/lib/datanucleus-core-3.2.10.jar:/usr/hdp/2.4.0.0-169/spark/lib/datanucleus-rdbms-3.2.9.jar:/usr/hdp/2.4.0.0-169/spark/lib/datanucleus-api-jdo-3.2.6.jar:/usr/hdp/current/hadoop-client/conf/ > -Xms1g -Xmx1g -XX:MaxPermSize=256m > org.apache.spark.deploy.history.HistoryServer > > 16/04/12 12:23:44 INFO HistoryServer: Registered signal handlers for [TERM, > HUP, INT] > 16/04/12 12:23:45 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... 
using builtin-java classes where applicable > 16/04/12 12:23:45 INFO SecurityManager: Changing view acls to: spark > 16/04/12 12:23:45 INFO SecurityManager: Changing modify acls to: spark > 16/04/12 12:23:45 INFO SecurityManager: SecurityManager: authentication > disabled; ui acls disabled; users with view permissions: Set(spark); users > with modify permissions: Set(spark) > Exception in thread "main" java.lang.reflect.InvocationTargetException > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:526) > at > org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:235) > at > org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala) > Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: > Class com.wandisco.fs.client.FusionHdfs not found > at > org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195) > at > org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2638) > at > org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2651) > at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92) > at > org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687) > at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371) > at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:362) > at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1650) > at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1657) > at > org.apache.spark.deploy.history.FsHistoryProvider.<init>(FsHistoryProvider.scala:71) > at > org.apache.spark.deploy.history.FsHistoryProvider.<init>(FsHistoryProvider.scala:49) > ... 
6 more > Caused by: java.lang.ClassNotFoundException: Class > com.wandisco.fs.client.FusionHdfs not found > at > org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101) > at > org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193) > ... 17 more -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
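The workaround in the comment above amounts to putting the jar that provides com.wandisco.fs.client.FusionHdfs on {{SPARK_DIST_CLASSPATH}} before the history server starts, typically via spark-env.sh. A sketch of the same classpath construction in Python; the jar path is an illustrative assumption, not the reporter's actual location:

```python
# Sketch: prepend the missing jar to SPARK_DIST_CLASSPATH, preserving any
# existing value. In practice this is a single `export` line in spark-env.sh.
import os

extra_jar = "/opt/wandisco/fusion-client.jar"  # hypothetical jar location
existing = os.environ.get("SPARK_DIST_CLASSPATH", "")
os.environ["SPARK_DIST_CLASSPATH"] = extra_jar + (":" + existing if existing else "")
print(os.environ["SPARK_DIST_CLASSPATH"])
```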
[jira] [Commented] (SPARK-26675) Error happened during creating avro files
[ https://issues.apache.org/jira/browse/SPARK-26675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16753326#comment-16753326 ] Gengliang Wang commented on SPARK-26675: [~tony0918] Can you provide a sample input file? > Error happened during creating avro files > - > > Key: SPARK-26675 > URL: https://issues.apache.org/jira/browse/SPARK-26675 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.0 >Reporter: Tony Mao >Priority: Major > > Run cmd > {code:java} > spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.0 > /nke/reformat.py > {code} > code in reformat.py > {code:java} > df = spark.read.option("multiline", "true").json("file:///nke/example1.json") > df.createOrReplaceTempView("traffic") > a = spark.sql("""SELECT store.*, store.name as store_name, > store.dataSupplierId as store_dataSupplierId, trafficSensor.*, > trafficSensor.name as trafficSensor_name, trafficSensor.dataSupplierId as > trafficSensor_dataSupplierId, readings.* > FROM (SELECT explode(stores) as store, explode(store.trafficSensors) as > trafficSensor, > explode(trafficSensor.trafficSensorReadings) as readings FROM traffic)""") > b = a.drop("trafficSensors", "trafficSensorReadings", "name", > "dataSupplierId") > b.write.format("avro").save("file:///nke/curated/namesAndFavColors.avro") > {code} > Error message: > {code:java} > Traceback (most recent call last): > File "/nke/reformat.py", line 18, in > b.select("store_name", > "store_dataSupplierId").write.format("avro").save("file:///nke/curated/namesAndFavColors.avro") > File "/usr/spark-2.4.0/python/lib/pyspark.zip/pyspark/sql/readwriter.py", > line 736, in save > File "/usr/spark-2.4.0/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", > line 1257, in __call__ > File "/usr/spark-2.4.0/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, > in deco > File "/usr/spark-2.4.0/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line > 328, in get_return_value > 
py4j.protocol.Py4JJavaError: An error occurred while calling o45.save. > : java.lang.NoSuchMethodError: > org.apache.avro.Schema.createUnion([Lorg/apache/avro/Schema;)Lorg/apache/avro/Schema; > at > org.apache.spark.sql.avro.SchemaConverters$.toAvroType(SchemaConverters.scala:185) > at > org.apache.spark.sql.avro.SchemaConverters$$anonfun$5.apply(SchemaConverters.scala:176) > at > org.apache.spark.sql.avro.SchemaConverters$$anonfun$5.apply(SchemaConverters.scala:174) > at scala.collection.Iterator$class.foreach(Iterator.scala:891) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1334) > at scala.collection.IterableLike$class.foreach(IterableLike.scala:72) > at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99) > at > org.apache.spark.sql.avro.SchemaConverters$.toAvroType(SchemaConverters.scala:174) > at > org.apache.spark.sql.avro.AvroFileFormat$$anonfun$7.apply(AvroFileFormat.scala:118) > at > org.apache.spark.sql.avro.AvroFileFormat$$anonfun$7.apply(AvroFileFormat.scala:118) > at scala.Option.getOrElse(Option.scala:121) > at > org.apache.spark.sql.avro.AvroFileFormat.prepareWrite(AvroFileFormat.scala:118) > at > org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:103) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102) > at > org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > 
at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) > at > org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668) > at > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfProp
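The missing method in the trace is the varargs {{Schema.createUnion(Schema...)}} overload, which the Avro 1.7.x line does not have, so this error typically indicates an older avro jar shadowing the one spark-avro 2.4.0 expects. A hedged diagnostic sketch for spotting competing avro jars in a Spark install; the default path mirrors the reporter's {{/usr/spark-2.4.0}} and is an assumption:

```python
# Sketch: list avro jars bundled with a Spark install. An avro-1.7.x jar here
# (or earlier on the application classpath) would lack the varargs
# Schema.createUnion overload that SchemaConverters.toAvroType calls.
import glob
import os

spark_home = os.environ.get("SPARK_HOME", "/usr/spark-2.4.0")  # assumed layout
jars = sorted(glob.glob(os.path.join(spark_home, "jars", "avro-*.jar")))
for jar in jars:
    print(jar)
if not jars:
    print(f"no avro jars found under {spark_home}/jars")
```

If two versions show up, forcing a single Avro >= 1.8 on the classpath (or aligning the Hadoop-provided Avro with spark-avro's) is the usual fix.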