[jira] [Assigned] (SPARK-26747) Makes GetMapValue nullability more precise

2019-01-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26747:


Assignee: Apache Spark

> Makes GetMapValue nullability more precise
> --
>
> Key: SPARK-26747
> URL: https://issues.apache.org/jira/browse/SPARK-26747
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Takeshi Yamamuro
>Assignee: Apache Spark
>Priority: Minor
>
> In master, the nullability of GetMapValue is always true:
> https://github.com/apache/spark/blob/cf133e611020ed178f90358464a1b88cdd9b7889/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala#L371
> But, if the input key is foldable, we could make its nullability more precise.
> This fix is the same as SPARK-26637.






[jira] [Commented] (SPARK-26741) Analyzer incorrectly resolves aggregate function outside of Aggregate operators

2019-01-27 Thread Xiao Li (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16753748#comment-16753748
 ] 

Xiao Li commented on SPARK-26741:
-

Could we re-enable the test by making the test case valid? That is, would we just
need to remove the aggregate function from the ORDER BY clause?
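
For example, one way to keep the test while moving the aggregate out of the ORDER
BY clause would be the following (an illustrative sketch, assuming a spark-shell
session; not a committed fix):

{code:scala}
// Sort by the output column position instead of re-stating the aggregate
// in the ORDER BY clause (illustrative only).
spark.sql(
  """select max(id)
    |from range(10)
    |group by id
    |having count(1) >= 1
    |order by 1""".stripMargin).show()
{code}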

> Analyzer incorrectly resolves aggregate function outside of Aggregate 
> operators
> ---
>
> Key: SPARK-26741
> URL: https://issues.apache.org/jira/browse/SPARK-26741
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kris Mok
>Priority: Major
>
> The analyzer can sometimes hit issues with resolving functions. e.g.
> {code:sql}
> select max(id)
> from range(10)
> group by id
> having count(1) >= 1
> order by max(id)
> {code}
> The analyzed plan of this query is:
> {code:none}
> == Analyzed Logical Plan ==
> max(id): bigint
> Project [max(id)#91L]
> +- Sort [max(id#88L) ASC NULLS FIRST], true
>+- Project [max(id)#91L, id#88L]
>   +- Filter (count(1)#93L >= cast(1 as bigint))
>  +- Aggregate [id#88L], [max(id#88L) AS max(id)#91L, count(1) AS 
> count(1)#93L, id#88L]
> +- Range (0, 10, step=1, splits=None)
> {code}
> Note how an aggregate function is outside of {{Aggregate}} operators in the 
> fully analyzed plan:
> {{Sort [max(id#88L) ASC NULLS FIRST], true}}, which makes the plan invalid.
> Trying to run this query will lead to weird issues in codegen, but the root 
> cause is in the analyzer:
> {code:none}
> java.lang.UnsupportedOperationException: Cannot generate code for expression: 
> max(input[1, bigint, false])
>   at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode(Expression.scala:291)
>   at 
> org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode$(Expression.scala:290)
>   at 
> org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:87)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:138)
>   at scala.Option.getOrElse(Option.scala:138)
>   at 
> org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:133)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.$anonfun$createOrderKeys$1(GenerateOrdering.scala:82)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:237)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:237)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:230)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.createOrderKeys(GenerateOrdering.scala:82)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.genComparisons(GenerateOrdering.scala:91)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:152)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateOrdering$.create(GenerateOrdering.scala:44)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.generate(CodeGenerator.scala:1194)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.<init>(GenerateOrdering.scala:195)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.LazilyGeneratedOrdering.<init>(GenerateOrdering.scala:192)
>   at 
> org.apache.spark.sql.execution.TakeOrderedAndProjectExec.executeCollect(limit.scala:153)
>   at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3302)
>   at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2470)
>   at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3291)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:147)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:74)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3287)
>   at org.apache.spark.sql.Dataset.head(Dataset.scala:2470)
>   at org.apache.spark.sql.Dataset.take(Dataset.scala:2684)
>   at org.apache.spark.sql.Dataset.getRows(Dataset.scala:262)
>   at org.apache.spark.sql.Dataset.showString(Dataset.scala:299)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:753)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:712)
>   at org.apache.spark.sql.Dataset.show(Dataset.scala:

[jira] [Assigned] (SPARK-26747) Makes GetMapValue nullability more precise

2019-01-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26747:


Assignee: (was: Apache Spark)

> Makes GetMapValue nullability more precise
> --
>
> Key: SPARK-26747
> URL: https://issues.apache.org/jira/browse/SPARK-26747
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Takeshi Yamamuro
>Priority: Minor
>
> In master, the nullability of GetMapValue is always true:
> https://github.com/apache/spark/blob/cf133e611020ed178f90358464a1b88cdd9b7889/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala#L371
> But, if the input key is foldable, we could make its nullability more precise.
> This fix is the same as SPARK-26637.






[jira] [Created] (SPARK-26747) Makes GetMapValue nullability more precise

2019-01-27 Thread Takeshi Yamamuro (JIRA)
Takeshi Yamamuro created SPARK-26747:


 Summary: Makes GetMapValue nullability more precise
 Key: SPARK-26747
 URL: https://issues.apache.org/jira/browse/SPARK-26747
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Takeshi Yamamuro


In master, the nullability of GetMapValue is always true:
https://github.com/apache/spark/blob/cf133e611020ed178f90358464a1b88cdd9b7889/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeExtractors.scala#L371

But, if the input key is foldable, we could make its nullability more precise.
This fix is the same as SPARK-26637.
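
For illustration, this is where the coarse nullability is visible (a spark-shell
sketch assuming an active session with spark.implicits._ in scope; the column
names are illustrative and not part of this ticket):

{code:scala}
val df = Seq(Map("a" -> 1)).toDF("m")

// The lookup key "a" is a foldable literal, yet the extracted value is still
// reported as nullable because the nullability of GetMapValue is hard-coded to true.
df.select($"m".getItem("a").as("v")).printSchema()
// root
//  |-- v: integer (nullable = true)
{code}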






[jira] [Commented] (SPARK-24959) Do not invoke the CSV/JSON parser for empty schema

2019-01-27 Thread Apache Spark (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16753697#comment-16753697
 ] 

Apache Spark commented on SPARK-24959:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/23667

> Do not invoke the CSV/JSON parser for empty schema
> --
>
> Key: SPARK-24959
> URL: https://issues.apache.org/jira/browse/SPARK-24959
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently, the JSON and CSV parsers are called even if the required schema is
> empty. Invoking the parser for each line has some non-zero overhead, and this
> work can be skipped. Such an optimization should speed up count(), for example.
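
A minimal sketch of the idea described above (simplified stand-in types; this is
not Spark's actual parsing code path):

{code:scala}
// When no columns are required (e.g. for count()), emit an empty row per input
// line instead of invoking the parser at all.
final case class Row(values: Seq[Any])

def parse(line: String): Row = Row(line.split(",").toSeq) // stand-in for a real CSV parser

def readLines(lines: Iterator[String], requiredFieldCount: Int): Iterator[Row] =
  if (requiredFieldCount == 0) lines.map(_ => Row(Nil)) // parser skipped entirely
  else lines.map(parse)
{code}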






[jira] [Resolved] (SPARK-26725) Fix the input values of UnifiedMemoryManager constructor in test suites

2019-01-27 Thread Wenchen Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-26725.
-
   Resolution: Fixed
Fix Version/s: 3.0.0

> Fix the input values of UnifiedMemoryManager constructor in test suites
> ---
>
> Key: SPARK-26725
> URL: https://issues.apache.org/jira/browse/SPARK-26725
> Project: Spark
>  Issue Type: Test
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Assignee: Sean Owen
>Priority: Minor
>  Labels: starter
> Fix For: 3.0.0
>
>
> Addressed the comments in 
> https://github.com/apache/spark/pull/23457#issuecomment-457409976






[jira] [Updated] (SPARK-26745) Non-parsing Dataset.count() optimization causes inconsistent results for JSON inputs with empty lines

2019-01-27 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26745:
-
Priority: Blocker  (was: Major)

> Non-parsing Dataset.count() optimization causes inconsistent results for JSON 
> inputs with empty lines
> -
>
> Key: SPARK-26745
> URL: https://issues.apache.org/jira/browse/SPARK-26745
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Branden Smith
>Priority: Blocker
>  Labels: correctness
>
> The optimization introduced by 
> [SPARK-24959|https://issues.apache.org/jira/browse/SPARK-24959] (improving 
> performance of {{{color:#FF}count(){color}}} for DataFrames read from 
> non-multiline JSON in {{{color:#FF}PERMISSIVE{color}}} mode) appears to 
> cause {{{color:#FF}count(){color}}} to erroneously include empty lines in 
> its result total if run prior to JSON parsing taking place.
> For the following input:
> {code:json}
> { "a" : 1 , "b" : 2 , "c" : 3 }
> { "a" : 4 , "b" : 5 , "c" : 6 }
>  
> { "a" : 7 , "b" : 8 , "c" : 9 }
> {code}
> *+Spark 2.3:+*
> {code:scala}
> scala> val df = 
> spark.read.json("sql/core/src/test/resources/test-data/with-empty-line.json")
> df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 1 more field]
> scala> df.count
> res0: Long = 3
> scala> df.cache.count
> res3: Long = 3
> {code}
> *+Spark 2.4:+*
> {code:scala}
> scala> val df = 
> spark.read.json("sql/core/src/test/resources/test-data/with-empty-line.json")
> df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 1 more field]
> scala> df.count
> res0: Long = 7
> scala> df.cache.count
> res1: Long = 3
> {code}
> Since the count is apparently updated and cached when the Jackson parser 
> runs, the optimization also causes the count to appear to be unstable upon 
> cache/persist operations, as shown above.
> CSV inputs, also optimized via 
> [SPARK-24959|https://issues.apache.org/jira/browse/SPARK-24959], do not 
> appear to be impacted by this effect.






[jira] [Resolved] (SPARK-26743) Add a test to check the actual resource limit set via 'spark.executor.pyspark.memory'

2019-01-27 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26743.
--
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 23663
[https://github.com/apache/spark/pull/23663]

> Add a test to check the actual resource limit set via 
> 'spark.executor.pyspark.memory'
> -
>
> Key: SPARK-26743
> URL: https://issues.apache.org/jira/browse/SPARK-26743
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> It looks like the test that checks the actual resource limit set (by 
> 'spark.executor.pyspark.memory') is missing.






[jira] [Created] (SPARK-26746) Adaptive causes non-action operations to trigger computation

2019-01-27 Thread eaton (JIRA)
eaton created SPARK-26746:
-

 Summary: Adaptive causes non-action operations to trigger 
computation
 Key: SPARK-26746
 URL: https://issues.apache.org/jira/browse/SPARK-26746
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0, 2.3.2, 2.3.1
Reporter: eaton


When we turn on the spark.sql.adaptive.enabled switch, the following operation
triggers the shuffle computation, but it does not when the switch is off:

sql("select a, sum(a) from test group by a").rdd

The reason is that 'ExchangeCoordinator' calls submitMapStage too early; the code
looks like this:

while (j < submittedStageFutures.length) {
  // This call is a blocking call. If the stage has not finished, we will wait here.
  mapOutputStatistics(j) = submittedStageFutures(j).get()
  j += 1
}
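
A hedged reproduction sketch based on the report above (assumes a spark-shell
session; the table name and contents are illustrative):

{code:scala}
// Reproduction sketch only; not a fix.
spark.conf.set("spark.sql.adaptive.enabled", "true")
Seq(1, 2, 3).toDF("a").createOrReplaceTempView("test")

// Obtaining the RDD is not an action, but with adaptive execution enabled it
// reportedly submits the shuffle map stages already.
val rdd = spark.sql("select a, sum(a) from test group by a").rdd
{code}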






[jira] [Assigned] (SPARK-26743) Add a test to check the actual resource limit set via 'spark.executor.pyspark.memory'

2019-01-27 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-26743:


Assignee: Hyukjin Kwon

> Add a test to check the actual resource limit set via 
> 'spark.executor.pyspark.memory'
> -
>
> Key: SPARK-26743
> URL: https://issues.apache.org/jira/browse/SPARK-26743
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> It looks like the test that checks the actual resource limit set (by 
> 'spark.executor.pyspark.memory') is missing.






[jira] [Commented] (SPARK-26745) Non-parsing Dataset.count() optimization causes inconsistent results for JSON inputs with empty lines

2019-01-27 Thread Takeshi Yamamuro (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16753644#comment-16753644
 ] 

Takeshi Yamamuro commented on SPARK-26745:
--

I checked that I could reproduce this in master/branch-2.4, so I added the
'correctness' label.

> Non-parsing Dataset.count() optimization causes inconsistent results for JSON 
> inputs with empty lines
> -
>
> Key: SPARK-26745
> URL: https://issues.apache.org/jira/browse/SPARK-26745
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Branden Smith
>Priority: Major
>  Labels: correctness
>
> The optimization introduced by 
> [SPARK-24959|https://issues.apache.org/jira/browse/SPARK-24959] (improving 
> performance of {{{color:#FF}count(){color}}} for DataFrames read from 
> non-multiline JSON in {{{color:#FF}PERMISSIVE{color}}} mode) appears to 
> cause {{{color:#FF}count(){color}}} to erroneously include empty lines in 
> its result total if run prior to JSON parsing taking place.
> For the following input:
> {code:json}
> { "a" : 1 , "b" : 2 , "c" : 3 }
> { "a" : 4 , "b" : 5 , "c" : 6 }
>  
> { "a" : 7 , "b" : 8 , "c" : 9 }
> {code}
> *+Spark 2.3:+*
> {code:scala}
> scala> val df = 
> spark.read.json("sql/core/src/test/resources/test-data/with-empty-line.json")
> df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 1 more field]
> scala> df.count
> res0: Long = 3
> scala> df.cache.count
> res3: Long = 3
> {code}
> *+Spark 2.4:+*
> {code:scala}
> scala> val df = 
> spark.read.json("sql/core/src/test/resources/test-data/with-empty-line.json")
> df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 1 more field]
> scala> df.count
> res0: Long = 7
> scala> df.cache.count
> res1: Long = 3
> {code}
> Since the count is apparently updated and cached when the Jackson parser 
> runs, the optimization also causes the count to appear to be unstable upon 
> cache/persist operations, as shown above.
> CSV inputs, also optimized via 
> [SPARK-24959|https://issues.apache.org/jira/browse/SPARK-24959], do not 
> appear to be impacted by this effect.






[jira] [Updated] (SPARK-26745) Non-parsing Dataset.count() optimization causes inconsistent results for JSON inputs with empty lines

2019-01-27 Thread Takeshi Yamamuro (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-26745:
-
Labels: correctness  (was: )

> Non-parsing Dataset.count() optimization causes inconsistent results for JSON 
> inputs with empty lines
> -
>
> Key: SPARK-26745
> URL: https://issues.apache.org/jira/browse/SPARK-26745
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Branden Smith
>Priority: Major
>  Labels: correctness
>
> The optimization introduced by 
> [SPARK-24959|https://issues.apache.org/jira/browse/SPARK-24959] (improving 
> performance of {{{color:#FF}count(){color}}} for DataFrames read from 
> non-multiline JSON in {{{color:#FF}PERMISSIVE{color}}} mode) appears to 
> cause {{{color:#FF}count(){color}}} to erroneously include empty lines in 
> its result total if run prior to JSON parsing taking place.
> For the following input:
> {code:json}
> { "a" : 1 , "b" : 2 , "c" : 3 }
> { "a" : 4 , "b" : 5 , "c" : 6 }
>  
> { "a" : 7 , "b" : 8 , "c" : 9 }
> {code}
> *+Spark 2.3:+*
> {code:scala}
> scala> val df = 
> spark.read.json("sql/core/src/test/resources/test-data/with-empty-line.json")
> df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 1 more field]
> scala> df.count
> res0: Long = 3
> scala> df.cache.count
> res3: Long = 3
> {code}
> *+Spark 2.4:+*
> {code:scala}
> scala> val df = 
> spark.read.json("sql/core/src/test/resources/test-data/with-empty-line.json")
> df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 1 more field]
> scala> df.count
> res0: Long = 7
> scala> df.cache.count
> res1: Long = 3
> {code}
> Since the count is apparently updated and cached when the Jackson parser 
> runs, the optimization also causes the count to appear to be unstable upon 
> cache/persist operations, as shown above.
> CSV inputs, also optimized via 
> [SPARK-24959|https://issues.apache.org/jira/browse/SPARK-24959], do not 
> appear to be impacted by this effect.






[jira] [Commented] (SPARK-26745) Non-parsing Dataset.count() optimization causes inconsistent results for JSON inputs with empty lines

2019-01-27 Thread Branden Smith (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16753635#comment-16753635
 ] 

Branden Smith commented on SPARK-26745:
---

opened PR with proposed resolution: https://github.com/apache/spark/pull/23665

> Non-parsing Dataset.count() optimization causes inconsistent results for JSON 
> inputs with empty lines
> -
>
> Key: SPARK-26745
> URL: https://issues.apache.org/jira/browse/SPARK-26745
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Branden Smith
>Priority: Major
>
> The optimization introduced by 
> [SPARK-24959|https://issues.apache.org/jira/browse/SPARK-24959] (improving 
> performance of {{{color:#FF}count(){color}}} for DataFrames read from 
> non-multiline JSON in {{{color:#FF}PERMISSIVE{color}}} mode) appears to 
> cause {{{color:#FF}count(){color}}} to erroneously include empty lines in 
> its result total if run prior to JSON parsing taking place.
> For the following input:
> {code:json}
> { "a" : 1 , "b" : 2 , "c" : 3 }
> { "a" : 4 , "b" : 5 , "c" : 6 }
>  
> { "a" : 7 , "b" : 8 , "c" : 9 }
> {code}
> *+Spark 2.3:+*
> {code:scala}
> scala> val df = 
> spark.read.json("sql/core/src/test/resources/test-data/with-empty-line.json")
> df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 1 more field]
> scala> df.count
> res0: Long = 3
> scala> df.cache.count
> res3: Long = 3
> {code}
> *+Spark 2.4:+*
> {code:scala}
> scala> val df = 
> spark.read.json("sql/core/src/test/resources/test-data/with-empty-line.json")
> df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 1 more field]
> scala> df.count
> res0: Long = 7
> scala> df.cache.count
> res1: Long = 3
> {code}
> Since the count is apparently updated and cached when the Jackson parser 
> runs, the optimization also causes the count to appear to be unstable upon 
> cache/persist operations, as shown above.
> CSV inputs, also optimized via 
> [SPARK-24959|https://issues.apache.org/jira/browse/SPARK-24959], do not 
> appear to be impacted by this effect.






[jira] [Assigned] (SPARK-26745) Non-parsing Dataset.count() optimization causes inconsistent results for JSON inputs with empty lines

2019-01-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26745:


Assignee: (was: Apache Spark)

> Non-parsing Dataset.count() optimization causes inconsistent results for JSON 
> inputs with empty lines
> -
>
> Key: SPARK-26745
> URL: https://issues.apache.org/jira/browse/SPARK-26745
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Branden Smith
>Priority: Major
>
> The optimization introduced by 
> [SPARK-24959|https://issues.apache.org/jira/browse/SPARK-24959] (improving 
> performance of {{{color:#FF}count(){color}}} for DataFrames read from 
> non-multiline JSON in {{{color:#FF}PERMISSIVE{color}}} mode) appears to 
> cause {{{color:#FF}count(){color}}} to erroneously include empty lines in 
> its result total if run prior to JSON parsing taking place.
> For the following input:
> {code:json}
> { "a" : 1 , "b" : 2 , "c" : 3 }
> { "a" : 4 , "b" : 5 , "c" : 6 }
>  
> { "a" : 7 , "b" : 8 , "c" : 9 }
> {code}
> *+Spark 2.3:+*
> {code:scala}
> scala> val df = 
> spark.read.json("sql/core/src/test/resources/test-data/with-empty-line.json")
> df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 1 more field]
> scala> df.count
> res0: Long = 3
> scala> df.cache.count
> res3: Long = 3
> {code}
> *+Spark 2.4:+*
> {code:scala}
> scala> val df = 
> spark.read.json("sql/core/src/test/resources/test-data/with-empty-line.json")
> df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 1 more field]
> scala> df.count
> res0: Long = 7
> scala> df.cache.count
> res1: Long = 3
> {code}
> Since the count is apparently updated and cached when the Jackson parser 
> runs, the optimization also causes the count to appear to be unstable upon 
> cache/persist operations, as shown above.
> CSV inputs, also optimized via 
> [SPARK-24959|https://issues.apache.org/jira/browse/SPARK-24959], do not 
> appear to be impacted by this effect.






[jira] [Assigned] (SPARK-26745) Non-parsing Dataset.count() optimization causes inconsistent results for JSON inputs with empty lines

2019-01-27 Thread Apache Spark (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-26745:


Assignee: Apache Spark

> Non-parsing Dataset.count() optimization causes inconsistent results for JSON 
> inputs with empty lines
> -
>
> Key: SPARK-26745
> URL: https://issues.apache.org/jira/browse/SPARK-26745
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Branden Smith
>Assignee: Apache Spark
>Priority: Major
>
> The optimization introduced by 
> [SPARK-24959|https://issues.apache.org/jira/browse/SPARK-24959] (improving 
> performance of {{{color:#FF}count(){color}}} for DataFrames read from 
> non-multiline JSON in {{{color:#FF}PERMISSIVE{color}}} mode) appears to 
> cause {{{color:#FF}count(){color}}} to erroneously include empty lines in 
> its result total if run prior to JSON parsing taking place.
> For the following input:
> {code:json}
> { "a" : 1 , "b" : 2 , "c" : 3 }
> { "a" : 4 , "b" : 5 , "c" : 6 }
>  
> { "a" : 7 , "b" : 8 , "c" : 9 }
> {code}
> *+Spark 2.3:+*
> {code:scala}
> scala> val df = 
> spark.read.json("sql/core/src/test/resources/test-data/with-empty-line.json")
> df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 1 more field]
> scala> df.count
> res0: Long = 3
> scala> df.cache.count
> res3: Long = 3
> {code}
> *+Spark 2.4:+*
> {code:scala}
> scala> val df = 
> spark.read.json("sql/core/src/test/resources/test-data/with-empty-line.json")
> df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 1 more field]
> scala> df.count
> res0: Long = 7
> scala> df.cache.count
> res1: Long = 3
> {code}
> Since the count is apparently updated and cached when the Jackson parser 
> runs, the optimization also causes the count to appear to be unstable upon 
> cache/persist operations, as shown above.
> CSV inputs, also optimized via 
> [SPARK-24959|https://issues.apache.org/jira/browse/SPARK-24959], do not 
> appear to be impacted by this effect.






[jira] [Created] (SPARK-26745) Non-parsing Dataset.count() optimization causes inconsistent results for JSON inputs with empty lines

2019-01-27 Thread Branden Smith (JIRA)
Branden Smith created SPARK-26745:
-

 Summary: Non-parsing Dataset.count() optimization causes 
inconsistent results for JSON inputs with empty lines
 Key: SPARK-26745
 URL: https://issues.apache.org/jira/browse/SPARK-26745
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0, 3.0.0
Reporter: Branden Smith


The optimization introduced by 
[SPARK-24959|https://issues.apache.org/jira/browse/SPARK-24959] (improving 
performance of {{{color:#FF}count(){color}}} for DataFrames read from 
non-multiline JSON in {{{color:#FF}PERMISSIVE{color}}} mode) appears to 
cause {{{color:#FF}count(){color}}} to erroneously include empty lines in 
its result total if run prior to JSON parsing taking place.

For the following input:

{code:json}
{ "a" : 1 , "b" : 2 , "c" : 3 }

{ "a" : 4 , "b" : 5 , "c" : 6 }
 
{ "a" : 7 , "b" : 8 , "c" : 9 }


{code}

*+Spark 2.3:+*

{code:scala}
scala> val df = 
spark.read.json("sql/core/src/test/resources/test-data/with-empty-line.json")
df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 1 more field]

scala> df.count
res0: Long = 3

scala> df.cache.count
res3: Long = 3
{code}

*+Spark 2.4:+*

{code:scala}
scala> val df = 
spark.read.json("sql/core/src/test/resources/test-data/with-empty-line.json")
df: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 1 more field]

scala> df.count
res0: Long = 7

scala> df.cache.count
res1: Long = 3
{code}

Since the count is apparently updated and cached when the Jackson parser runs, 
the optimization also causes the count to appear to be unstable upon 
cache/persist operations, as shown above.

CSV inputs, also optimized via 
[SPARK-24959|https://issues.apache.org/jira/browse/SPARK-24959], do not appear 
to be impacted by this effect.
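
A hedged sketch of a regression check for the behaviour described above (assumes a
spark-shell session; the assertion style is an assumption, not the actual test
suite):

{code:scala}
// The count computed before parsing and the count computed after caching
// (which forces parsing) should agree.
val path = "sql/core/src/test/resources/test-data/with-empty-line.json"
val df = spark.read.json(path)
assert(df.count() == df.cache().count())
{code}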






[jira] [Closed] (SPARK-26724) Non negative coefficients for LinearRegression

2019-01-27 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26724?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-26724.
-

> Non negative coefficients for LinearRegression
> --
>
> Key: SPARK-26724
> URL: https://issues.apache.org/jira/browse/SPARK-26724
> Project: Spark
>  Issue Type: Question
>  Components: MLlib
>Affects Versions: 2.4.0
>Reporter: Alex Chang
>Priority: Major
>
> Hi, 
>  
> For 
> [pyspark.ml.regression.LinearRegression|http://spark.apache.org/docs/latest/api/python/pyspark.ml.html?highlight=linearregression#pyspark.ml.regression.LinearRegression],
>  is there any approach or API to force coefficients to be non-negative?
>  
> Thanks
> Alex






[jira] [Closed] (SPARK-26717) Support PodPriority for spark driver and executor on kubernetes

2019-01-27 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun closed SPARK-26717.
-

> Support PodPriority for spark driver and executor on kubernetes
> ---
>
> Key: SPARK-26717
> URL: https://issues.apache.org/jira/browse/SPARK-26717
> Project: Spark
>  Issue Type: Wish
>  Components: Kubernetes
>Affects Versions: 2.4.0
>Reporter: Li Gao
>Priority: Major
>
> Hi,
> I'd like to see whether we could enhance the current Spark 2.4 on k8s support 
> to bring in the `PodPriority` feature that's available since k8s v1.11+
> [https://kubernetes.io/docs/concepts/configuration/pod-priority-preemption/]
> This is helpful in a shared cluster environment to ensure SLA for jobs with 
> different priorities.
> Thanks!
> Li






[jira] [Updated] (SPARK-26379) Use dummy TimeZoneId for CurrentTimestamp to avoid UnresolvedException in CurrentBatchTimestamp

2019-01-27 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26379:
--
Summary: Use dummy TimeZoneId for CurrentTimestamp to avoid 
UnresolvedException in CurrentBatchTimestamp  (was: Use dummy TimeZoneId to 
avoid UnresolvedException in CurrentBatchTimestamp)

> Use dummy TimeZoneId for CurrentTimestamp to avoid UnresolvedException in 
> CurrentBatchTimestamp
> ---
>
> Key: SPARK-26379
> URL: https://issues.apache.org/jira/browse/SPARK-26379
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0, 3.0.0
>Reporter: Kailash Gupta
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
>
> While using withColumn to add a column to a structured streaming Dataset, I 
> am getting the following exception: 
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> dataType on unresolved object, tree: 'timestamp
> Following is sample code
> {code:java}
> final String path = "path_to_input_directory";
> final StructType schema = new StructType(new StructField[] { new 
> StructField("word", StringType, false, Metadata.empty()), new 
> StructField("count", DataTypes.IntegerType, false, Metadata.empty()) });
> SparkSession sparkSession = 
> SparkSession.builder().appName("StructuredStreamingIssue").master("local").getOrCreate();
> Dataset words = sparkSession.readStream().option("sep", 
> ",").schema(schema).csv(path);
> Dataset wordsWithTimestamp = words.withColumn("timestamp", 
> functions.current_timestamp());
> // wordsWithTimestamp.explain(true);
> StreamingQuery query = 
> wordsWithTimestamp.writeStream().outputMode("update").option("truncate", 
> "false").format("console").trigger(Trigger.ProcessingTime("2 
> seconds")).start();
> query.awaitTermination();{code}
> Following are the contents of the file present at _path_
> {code:java}
> a,2
> c,4
> d,2
> r,1
> t,9
> {code}
> This seems to work with the 2.2.0 release, but not with 2.3.0 or 2.4.0






[jira] [Updated] (SPARK-26379) Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp

2019-01-27 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26379:
--
Summary: Use dummy TimeZoneId to avoid UnresolvedException in 
CurrentBatchTimestamp  (was: Fix issue on adding current_timestamp to streaming 
query)

> Use dummy TimeZoneId to avoid UnresolvedException in CurrentBatchTimestamp
> --
>
> Key: SPARK-26379
> URL: https://issues.apache.org/jira/browse/SPARK-26379
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0, 3.0.0
>Reporter: Kailash Gupta
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
>
> While using withColumn to add a column to a structured streaming Dataset, I 
> am getting the following exception: 
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> dataType on unresolved object, tree: 'timestamp
> Following is sample code
> {code:java}
> final String path = "path_to_input_directory";
> final StructType schema = new StructType(new StructField[] { new 
> StructField("word", StringType, false, Metadata.empty()), new 
> StructField("count", DataTypes.IntegerType, false, Metadata.empty()) });
> SparkSession sparkSession = 
> SparkSession.builder().appName("StructuredStreamingIssue").master("local").getOrCreate();
> Dataset words = sparkSession.readStream().option("sep", 
> ",").schema(schema).csv(path);
> Dataset wordsWithTimestamp = words.withColumn("timestamp", 
> functions.current_timestamp());
> // wordsWithTimestamp.explain(true);
> StreamingQuery query = 
> wordsWithTimestamp.writeStream().outputMode("update").option("truncate", 
> "false").format("console").trigger(Trigger.ProcessingTime("2 
> seconds")).start();
> query.awaitTermination();{code}
> Following are the contents of the file present at _path_
> {code:java}
> a,2
> c,4
> d,2
> r,1
> t,9
> {code}
> This seems to work with the 2.2.0 release, but not with 2.3.0 or 2.4.0






[jira] [Updated] (SPARK-26379) Fix issue on adding current_timestamp to streaming query

2019-01-27 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-26379:
--
Summary: Fix issue on adding current_timestamp to streaming query  (was: 
Fix issue on adding current_timestamp/current_date to streaming query)

> Fix issue on adding current_timestamp to streaming query
> 
>
> Key: SPARK-26379
> URL: https://issues.apache.org/jira/browse/SPARK-26379
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0, 3.0.0
>Reporter: Kailash Gupta
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
>
> While using withColumn to add a column to a structured streaming Dataset, I 
> am getting the following exception: 
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> dataType on unresolved object, tree: 'timestamp
> Following is sample code
> {code:java}
> final String path = "path_to_input_directory";
> final StructType schema = new StructType(new StructField[] { new 
> StructField("word", StringType, false, Metadata.empty()), new 
> StructField("count", DataTypes.IntegerType, false, Metadata.empty()) });
> SparkSession sparkSession = 
> SparkSession.builder().appName("StructuredStreamingIssue").master("local").getOrCreate();
> Dataset words = sparkSession.readStream().option("sep", 
> ",").schema(schema).csv(path);
> Dataset wordsWithTimestamp = words.withColumn("timestamp", 
> functions.current_timestamp());
> // wordsWithTimestamp.explain(true);
> StreamingQuery query = 
> wordsWithTimestamp.writeStream().outputMode("update").option("truncate", 
> "false").format("console").trigger(Trigger.ProcessingTime("2 
> seconds")).start();
> query.awaitTermination();{code}
> Following are the contents of the file present at _path_
> {code:java}
> a,2
> c,4
> d,2
> r,1
> t,9
> {code}
> This seems to work with the 2.2.0 release, but not with 2.3.0 or 2.4.0






[jira] [Resolved] (SPARK-26379) Fix issue on adding current_timestamp/current_date to streaming query

2019-01-27 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26379?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-26379.
---
Resolution: Fixed

> Fix issue on adding current_timestamp/current_date to streaming query
> -
>
> Key: SPARK-26379
> URL: https://issues.apache.org/jira/browse/SPARK-26379
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.0, 3.0.0
>Reporter: Kailash Gupta
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 2.4.1, 3.0.0
>
>
> While using withColumn to add a column to a structured streaming Dataset, I 
> am getting the following exception: 
> org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
> dataType on unresolved object, tree: 'timestamp
> Following is sample code
> {code:java}
> final String path = "path_to_input_directory";
> final StructType schema = new StructType(new StructField[] { new 
> StructField("word", StringType, false, Metadata.empty()), new 
> StructField("count", DataTypes.IntegerType, false, Metadata.empty()) });
> SparkSession sparkSession = 
> SparkSession.builder().appName("StructuredStreamingIssue").master("local").getOrCreate();
> Dataset words = sparkSession.readStream().option("sep", 
> ",").schema(schema).csv(path);
> Dataset wordsWithTimestamp = words.withColumn("timestamp", 
> functions.current_timestamp());
> // wordsWithTimestamp.explain(true);
> StreamingQuery query = 
> wordsWithTimestamp.writeStream().outputMode("update").option("truncate", 
> "false").format("console").trigger(Trigger.ProcessingTime("2 
> seconds")).start();
> query.awaitTermination();{code}
> Following are the contents of the file present at _path_
> {code:java}
> a,2
> c,4
> d,2
> r,1
> t,9
> {code}
> This seems to work with the 2.2.0 release, but not with 2.3.0 or 2.4.0






[jira] [Updated] (SPARK-26739) Standardized Join Types for DataFrames

2019-01-27 Thread Jungtaek Lim (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-26739:
-
Fix Version/s: (was: 2.4.1)

> Standardized Join Types for DataFrames
> --
>
> Key: SPARK-26739
> URL: https://issues.apache.org/jira/browse/SPARK-26739
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Skyler Lehan
>Priority: Minor
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> h3. *Q1.* What are you trying to do? Articulate your objectives using 
> absolutely no jargon.
> Currently, in the join functions on 
> [DataFrames|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset],
>  the join types are defined via a string parameter called joinType. In order 
> for a developer to know which joins are possible, they must look up the API 
> call for join. While this works fine, it can cause the developer to make a 
> typo resulting in improper joins and/or unexpected errors. The objective of 
> this improvement would be to allow developers to use a common definition for 
> join types (by enum or constants) called JoinTypes. This would contain the 
> possible joins and remove the possibility of a typo. It would also allow 
> Spark to alter the names of the joins in the future without impacting 
> end-users.
> h3. *Q2.* What problem is this proposal NOT designed to solve?
> The problem this solves is extremely narrow, it would not solve anything 
> other than providing a common definition for join types.
> h3. *Q3.* How is it done today, and what are the limits of current practice?
> Currently, developers must join two DataFrames like so:
> {code:java}
> val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), 
> "left_outer")
> {code}
> Where they manually type the join type. As stated above, this:
>  * Requires developers to manually type in the join
>  * Can cause possibility of typos
>  * Restricts renaming of join types as it's a literal string
>  * Does not restrict and/or compile check the join type being used, leading 
> to runtime errors
> h3. *Q4.* What is new in your approach and why do you think it will be 
> successful?
> The new approach would use constants, something like this:
> {code:java}
> val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), 
> JoinType.LEFT_OUTER)
> {code}
> This would provide:
>  * In code reference/definitions of the possible join types
>  ** This subsequently allows the addition of scaladoc of what each join type 
> does and how it operates
>  * Removes possibilities of a typo on the join type
>  * Provides compile time checking of the join type (only if an enum is used)
> To clarify, if JoinType is a constant, it would just fill in the joinType 
> string parameter for users. If an enum is used, it would restrict the domain 
> of possible join types to whatever is defined in the future JoinType enum. 
> The enum is preferred, however it would take longer to implement.
> h3. *Q5.* Who cares? If you are successful, what difference will it make?
> Developers using Apache Spark will care. This will make the join function 
> easier to wield and lead to fewer runtime errors. It will save time by 
> bringing join type validation at compile time. It will also provide in code 
> reference to the join types, which saves the developer time of having to look 
> up and navigate the multiple join functions to find the possible join types. 
> In addition to that, the resulting constants/enum would have documentation on 
> how that join type works.
> h3. *Q6.* What are the risks?
> Users of Apache Spark who currently use strings to define their join types 
> could be impacted if an enum is chosen as the common definition. This risk 
> can be mitigated by using string constants. The string constants would be the 
> exact same string as the string literals used today. For example:
> {code:java}
> JoinType.INNER = "inner"
> {code}
> If an enum is still the preferred way of defining the join types, new join 
> functions could be added that take in these enums and the join calls that 
> contain string parameters for joinType could be deprecated. This would give 
> developers a chance to change over to the new join types.
> h3. *Q7.* How long will it take?
> A few days for a seasoned Spark developer.
> h3. *Q8.* What are the mid-term and final "exams" to check for success?
> Mid-term exam would be the addition of a common definition of the join types 
> and additional join functions that take in the join type enum/constant. The 
> final exam would be working tests written to check the functionality of these 
> new join functions and the join functions that take a string for joinType 
> would be deprecated.
> h3. *Appendix A.* Proposed API Changes. Optional sect

[jira] [Commented] (SPARK-19254) Support Seq, Map, and Struct in functions.lit

2019-01-27 Thread Artem Rukavytsia (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16753582#comment-16753582
 ] 

Artem Rukavytsia commented on SPARK-19254:
--

It looks like PySpark doesn't support `typedLit`; `from pyspark.sql.functions
import typedLit` fails. I suggest this is unexpected behavior.
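
For reference, the Scala API added by this ticket covers these cases (illustrative
usage, assuming a Spark 2.2.0+ dependency); closing the PySpark gap noted above
would need a separate change:

{code:scala}
import org.apache.spark.sql.functions.typedLit

// Literal columns for complex types, which plain lit() does not handle.
val seqCol = typedLit(Seq(1, 2, 3))
val mapCol = typedLit(Map("a" -> 1, "b" -> 2))
{code}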

> Support Seq, Map, and Struct in functions.lit
> -
>
> Key: SPARK-19254
> URL: https://issues.apache.org/jira/browse/SPARK-19254
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Takeshi Yamamuro
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 2.2.0
>
>
> In the current implementation, function.lit does not support Seq, Map, and 
> Struct. This ticket is intended to support them. This is the follow-up of 
> https://issues.apache.org/jira/browse/SPARK-17683.






[jira] [Resolved] (SPARK-26716) Refactor supportDataType API: the supported types of read/write should be consistent

2019-01-27 Thread Dongjoon Hyun (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26716?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-26716.
---
   Resolution: Fixed
 Assignee: Gengliang Wang
Fix Version/s: 3.0.0

This is resolved via https://github.com/apache/spark/pull/23639

> Refactor supportDataType API:  the supported types of read/write should be 
> consistent
> -
>
> Key: SPARK-26716
> URL: https://issues.apache.org/jira/browse/SPARK-26716
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> 1. Rename it to supportsDataType.
> 2. Remove the parameter isReadPath; the supported types for read and write
> should be consistent.
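
A hedged sketch of the shape of this change (simplified traits for illustration;
the real method lives on Spark's file source classes and may differ):

{code:scala}
import org.apache.spark.sql.types.DataType

// Before: the read and write paths could give different answers.
trait SupportBefore {
  def supportDataType(dataType: DataType, isReadPath: Boolean): Boolean
}

// After: a single answer, used for both read and write.
trait SupportAfter {
  def supportsDataType(dataType: DataType): Boolean
}
{code}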






[jira] [Updated] (SPARK-26739) Standardized Join Types for DataFrames

2019-01-27 Thread Skyler Lehan (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Skyler Lehan updated SPARK-26739:
-
Description: 
h3. *Q1.* What are you trying to do? Articulate your objectives using 
absolutely no jargon.

Currently, in the join functions on 
[DataFrames|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset],
 the join types are defined via a string parameter called joinType. In order 
for a developer to know which joins are possible, they must look up the API 
call for join. While this works fine, it can cause the developer to make a typo 
resulting in improper joins and/or unexpected errors. The objective of this 
improvement would be to allow developers to use a common definition for join 
types (by enum or constants) called JoinTypes. This would contain the possible 
joins and remove the possibility of a typo. It would also allow Spark to alter 
the names of the joins in the future without impacting end-users.
h3. *Q2.* What problem is this proposal NOT designed to solve?

The problem this solves is extremely narrow, it would not solve anything other 
than providing a common definition for join types.
h3. *Q3.* How is it done today, and what are the limits of current practice?

Currently, developers must join two DataFrames like so:
{code:java}
val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), "left_outer")
{code}
Where they manually type the join type. As stated above, this:
 * Requires developers to manually type in the join
 * Can cause possibility of typos
 * Restricts renaming of join types as it's a literal string
 * Does not restrict and/or compile check the join type being used, leading to 
runtime errors

h3. *Q4.* What is new in your approach and why do you think it will be 
successful?

The new approach would use constants, something like this:
{code:java}
val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), 
JoinType.LEFT_OUTER)
{code}
This would provide:
 * In code reference/definitions of the possible join types
 ** This subsequently allows the addition of scaladoc of what each join type 
does and how it operates
 * Removes possibilities of a typo on the join type
 * Provides compile time checking of the join type (only if an enum is used)

To clarify, if JoinType is a constant, it would just fill in the joinType 
string parameter for users. If an enum is used, it would restrict the domain of 
possible join types to whatever is defined in the future JoinType enum. The 
enum is preferred, however it would take longer to implement.
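
A hedged sketch of what the two options could look like (names and placement are
illustrative only, not an agreed-upon API):

{code:scala}
// Option 1: string constants that simply fill in today's joinType parameter.
object JoinTypes {
  val Inner = "inner"
  val LeftOuter = "left_outer"
  val FullOuter = "full_outer"
}

// Option 2: an enum-like sealed trait, enabling compile-time checking.
sealed trait JoinType { def sql: String }
case object InnerJoin extends JoinType { val sql = "inner" }
case object LeftOuterJoin extends JoinType { val sql = "left_outer" }
{code}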
h3. *Q5.* Who cares? If you are successful, what difference will it make?

Developers using Apache Spark will care. This will make the join function 
easier to wield and lead to fewer runtime errors. It will save time by bringing 
join type validation at compile time. It will also provide in code reference to 
the join types, which saves the developer time of having to look up and 
navigate the multiple join functions to find the possible join types. In 
addition to that, the resulting constants/enum would have documentation on how 
that join type works.
h3. *Q6.* What are the risks?

Users of Apache Spark who currently use strings to define their join types 
could be impacted if an enum is chosen as the common definition. This risk can 
be mitigated by using string constants. The string constants would be the exact 
same string as the string literals used today. For example:
{code:java}
JoinType.INNER = "inner"
{code}
If an enum is still the preferred way of defining the join types, new join 
functions could be added that take in these enums and the join calls that 
contain string parameters for joinType could be deprecated. This would give 
developers a chance to change over to the new join types.
h3. *Q7.* How long will it take?

A few days for a seasoned Spark developer.
h3. *Q8.* What are the mid-term and final "exams" to check for success?

Mid-term exam would be the addition of a common definition of the join types 
and additional join functions that take in the join type enum/constant. The 
final exam would be working tests written to check the functionality of these 
new join functions and the join functions that take a string for joinType would 
be deprecated.
h3. *Appendix A.* Proposed API Changes. Optional section defining APIs changes, 
if any. Backward and forward compatibility must be taken into account.

*If enums are used:*

The following join function signatures would be added to the Dataset API:
{code:java}
def join(right: Dataset[_], joinExprs: Column, joinType: JoinType): DataFrame
def join(right: Dataset[_], usingColumns: Seq[String], joinType: JoinType): 
DataFrame
{code}
The following functions would be deprecated:
{code:java}
def join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame
def join(right: Dataset[_], usingColumns: Seq[String], joinType: String): 
> DataFrame

[jira] [Updated] (SPARK-26689) Single disk broken causing broadcast failure

2019-01-27 Thread liupengcheng (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26689?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

liupengcheng updated SPARK-26689:
-
Summary: Single disk broken causing broadcast failure  (was: Disk broken 
causing broadcast failure)

> Single disk broken causing broadcast failure
> 
>
> Key: SPARK-26689
> URL: https://issues.apache.org/jira/browse/SPARK-26689
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.4.0
> Environment: Spark on Yarn
> Mutliple Disk
>Reporter: liupengcheng
>Priority: Major
>
> We encountered an application failure in our production cluster which was caused
> by bad disk problems; a single bad disk can incur application failure.
> {code:java}
> Job aborted due to stage failure: Task serialization failed: 
> java.io.IOException: Failed to create local dir in 
> /home/work/hdd5/yarn/c3prc-hadoop/nodemanager/usercache/h_user_profile/appcache/application_1463372393999_144979/blockmgr-1f96b724-3e16-4c09-8601-1a2e3b758185/3b.
> org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:73)
> org.apache.spark.storage.DiskStore.contains(DiskStore.scala:173)
> org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$getCurrentBlockStatus(BlockManager.scala:391)
> org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:801)
> org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:629)
> org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:987)
> org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:99)
> org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:85)
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
> org.apache.spark.SparkContext.broadcast(SparkContext.scala:1332)
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:863)
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply$mcVI$sp(DAGScheduler.scala:1090)
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply(DAGScheduler.scala:1086)
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14$$anonfun$apply$1.apply(DAGScheduler.scala:1086)
> scala.Option.foreach(Option.scala:236)
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14.apply(DAGScheduler.scala:1086)
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$14.apply(DAGScheduler.scala:1085)
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1085)
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1528)
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1493)
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1482)
> org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {code}
> We have multiple disks on our cluster nodes, yet the application still fails. I think 
> this is because Spark does not currently handle bad disks in `DiskBlockManager`. 
> In a multi-disk environment, a bad disk could be tolerated instead of failing the 
> whole application, for example along the lines of the sketch below.
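
For illustration, a rough sketch of the idea only (not Spark's actual {{DiskBlockManager}} code; {{localDirs}} stands in for the configured {{spark.local.dir}} roots): try the hashed directory first and fall back to the remaining directories when a disk turns out to be broken.
{code:java}
import java.io.{File, IOException}

// sketch only: pick the hashed local dir, but skip dirs whose disk cannot be written to
def getFileWithFallback(localDirs: Array[File], filename: String): File = {
  val hash = filename.hashCode & Int.MaxValue                  // non-negative hash of the block file name
  val order = localDirs.indices.map(i => (hash + i) % localDirs.length)
  order.iterator
    .map(i => new File(localDirs(i), filename))
    .find { f =>
      val dir = f.getParentFile
      dir.isDirectory || dir.mkdirs()                          // mkdirs failing usually means a bad disk
    }
    .getOrElse(throw new IOException(s"No healthy local dir available for $filename"))
}
{code}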



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26675) Error happened during creating avro files

2019-01-27 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16753354#comment-16753354
 ] 

Hyukjin Kwon commented on SPARK-26675:
--

Also, please minimise the code as much as possible.
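
For example, a minimal check like the sketch below (illustrative only, written for spark-shell rather than the PySpark script in the report) takes the JSON/explode logic out of the picture; if it fails with the same {{NoSuchMethodError}}, the problem is the Avro jar on the classpath rather than the query:
{code:java}
// run in spark-shell started with: --packages org.apache.spark:spark-avro_2.11:2.4.0
import spark.implicits._

Seq(("a", 1), ("b", 2)).toDF("name", "value")
  .write.format("avro")
  .mode("overwrite")
  .save("/tmp/avro-minimal-check")   // hypothetical output path
{code}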

> Error happened during creating avro files
> -
>
> Key: SPARK-26675
> URL: https://issues.apache.org/jira/browse/SPARK-26675
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Tony Mao
>Priority: Major
>
> Run cmd
> {code:java}
> spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.0 
> /nke/reformat.py
> {code}
> code in reformat.py
> {code:java}
> df = spark.read.option("multiline", "true").json("file:///nke/example1.json")
> df.createOrReplaceTempView("traffic")
> a = spark.sql("""SELECT store.*, store.name as store_name, 
> store.dataSupplierId as store_dataSupplierId, trafficSensor.*,
> trafficSensor.name as trafficSensor_name, trafficSensor.dataSupplierId as 
> trafficSensor_dataSupplierId, readings.*
> FROM (SELECT explode(stores) as store, explode(store.trafficSensors) as 
> trafficSensor,
> explode(trafficSensor.trafficSensorReadings) as readings FROM traffic)""")
> b = a.drop("trafficSensors", "trafficSensorReadings", "name", 
> "dataSupplierId")
> b.write.format("avro").save("file:///nke/curated/namesAndFavColors.avro")
> {code}
> Error message:
> {code:java}
> Traceback (most recent call last):
> File "/nke/reformat.py", line 18, in 
> b.select("store_name", 
> "store_dataSupplierId").write.format("avro").save("file:///nke/curated/namesAndFavColors.avro")
> File "/usr/spark-2.4.0/python/lib/pyspark.zip/pyspark/sql/readwriter.py", 
> line 736, in save
> File "/usr/spark-2.4.0/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", 
> line 1257, in __call__
> File "/usr/spark-2.4.0/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, 
> in deco
> File "/usr/spark-2.4.0/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 
> 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o45.save.
> : java.lang.NoSuchMethodError: 
> org.apache.avro.Schema.createUnion([Lorg/apache/avro/Schema;)Lorg/apache/avro/Schema;
> at 
> org.apache.spark.sql.avro.SchemaConverters$.toAvroType(SchemaConverters.scala:185)
> at 
> org.apache.spark.sql.avro.SchemaConverters$$anonfun$5.apply(SchemaConverters.scala:176)
> at 
> org.apache.spark.sql.avro.SchemaConverters$$anonfun$5.apply(SchemaConverters.scala:174)
> at scala.collection.Iterator$class.foreach(Iterator.scala:891)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99)
> at 
> org.apache.spark.sql.avro.SchemaConverters$.toAvroType(SchemaConverters.scala:174)
> at 
> org.apache.spark.sql.avro.AvroFileFormat$$anonfun$7.apply(AvroFileFormat.scala:118)
> at 
> org.apache.spark.sql.avro.AvroFileFormat$$anonfun$7.apply(AvroFileFormat.scala:118)
> at scala.Option.getOrElse(Option.scala:121)
> at 
> org.apache.spark.sql.avro.AvroFileFormat.prepareWrite(AvroFileFormat.scala:118)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:103)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
> at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConf

[jira] [Updated] (SPARK-26675) Error happened during creating avro files

2019-01-27 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26675?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26675:
-
Component/s: SQL

> Error happened during creating avro files
> -
>
> Key: SPARK-26675
> URL: https://issues.apache.org/jira/browse/SPARK-26675
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.4.0
>Reporter: Tony Mao
>Priority: Major
>
> Run cmd
> {code:java}
> spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.0 
> /nke/reformat.py
> {code}
> code in reformat.py
> {code:java}
> df = spark.read.option("multiline", "true").json("file:///nke/example1.json")
> df.createOrReplaceTempView("traffic")
> a = spark.sql("""SELECT store.*, store.name as store_name, 
> store.dataSupplierId as store_dataSupplierId, trafficSensor.*,
> trafficSensor.name as trafficSensor_name, trafficSensor.dataSupplierId as 
> trafficSensor_dataSupplierId, readings.*
> FROM (SELECT explode(stores) as store, explode(store.trafficSensors) as 
> trafficSensor,
> explode(trafficSensor.trafficSensorReadings) as readings FROM traffic)""")
> b = a.drop("trafficSensors", "trafficSensorReadings", "name", 
> "dataSupplierId")
> b.write.format("avro").save("file:///nke/curated/namesAndFavColors.avro")
> {code}
> Error message:
> {code:java}
> Traceback (most recent call last):
> File "/nke/reformat.py", line 18, in 
> b.select("store_name", 
> "store_dataSupplierId").write.format("avro").save("file:///nke/curated/namesAndFavColors.avro")
> File "/usr/spark-2.4.0/python/lib/pyspark.zip/pyspark/sql/readwriter.py", 
> line 736, in save
> File "/usr/spark-2.4.0/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", 
> line 1257, in __call__
> File "/usr/spark-2.4.0/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, 
> in deco
> File "/usr/spark-2.4.0/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 
> 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o45.save.
> : java.lang.NoSuchMethodError: 
> org.apache.avro.Schema.createUnion([Lorg/apache/avro/Schema;)Lorg/apache/avro/Schema;
> at 
> org.apache.spark.sql.avro.SchemaConverters$.toAvroType(SchemaConverters.scala:185)
> at 
> org.apache.spark.sql.avro.SchemaConverters$$anonfun$5.apply(SchemaConverters.scala:176)
> at 
> org.apache.spark.sql.avro.SchemaConverters$$anonfun$5.apply(SchemaConverters.scala:174)
> at scala.collection.Iterator$class.foreach(Iterator.scala:891)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99)
> at 
> org.apache.spark.sql.avro.SchemaConverters$.toAvroType(SchemaConverters.scala:174)
> at 
> org.apache.spark.sql.avro.AvroFileFormat$$anonfun$7.apply(AvroFileFormat.scala:118)
> at 
> org.apache.spark.sql.avro.AvroFileFormat$$anonfun$7.apply(AvroFileFormat.scala:118)
> at scala.Option.getOrElse(Option.scala:121)
> at 
> org.apache.spark.sql.avro.AvroFileFormat.prepareWrite(AvroFileFormat.scala:118)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:103)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
> at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
> at 
> org.apache.spark.sql.execution.SQLExecution$.wi

[jira] [Commented] (SPARK-26727) CREATE OR REPLACE VIEW query fails with TableAlreadyExistsException

2019-01-27 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16753353#comment-16753353
 ] 

Hyukjin Kwon commented on SPARK-26727:
--

I can't reproduce it either. Since this issue is hard to reproduce, please fill in the 
details as thoroughly as possible. As things stand, no one except 
you is able to investigate further.
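
If it helps, a loop like the sketch below (assuming the same one-column {{ae_dual}} table as in the report) could at least record which attempt first hits the exception:
{code:java}
// hedged sketch for gathering repro details in spark-shell
import org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException

(1 to 100).foreach { i =>
  try {
    spark.sql("CREATE OR REPLACE VIEW testSparkReplace AS SELECT dummy FROM ae_dual")
  } catch {
    case e: TableAlreadyExistsException =>
      println(s"CREATE OR REPLACE VIEW failed on attempt $i: ${e.getMessage}")
  }
}
{code}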

> CREATE OR REPLACE VIEW query fails with TableAlreadyExistsException
> ---
>
> Key: SPARK-26727
> URL: https://issues.apache.org/jira/browse/SPARK-26727
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Srinivas Yarra
>Priority: Major
>
> We experienced that sometimes the Hive query "CREATE OR REPLACE VIEW <view name> AS SELECT ... FROM ..." fails with the following exception:
> {code:java}
> // code placeholder
> org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or 
> view '' already exists in database 'default'; at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:314)
>  at 
> org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:165) 
> at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>  at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>  at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195) at 
> org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:195) at 
> org.apache.spark.sql.Dataset$$anonfun$53.apply(Dataset.scala:3365) at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3364) at 
> org.apache.spark.sql.Dataset.<init>(Dataset.scala:195) at 
> org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:80) at 
> org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642) ... 49 elided
> {code}
> {code}
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res1: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res2: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res3: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res4: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res5: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res6: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res7: org.apache.spark.sql.DataFrame = []
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res8: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res9: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res10: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") res11: org.apache.spark.sql.DataFrame = [] 
> scala> spark.sql("CREATE OR REPLACE VIEW testSparkReplace as SELECT dummy 
> FROM ae_dual") 
> org.apache.spark.sql.catalyst.analysis.TableAlreadyExistsException: Table or 
> view 'testsparkreplace' already exists in database 'default'; at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply$mcV$sp(HiveExternalCatalog.scala:246)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:236)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$createTable$1.apply(HiveExternalCatalog.scala:236)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
>  at 
> org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:236)
>  at 
> org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94)
>  at 
> org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:319)

[jira] [Updated] (SPARK-26722) Set SPARK_TEST_KEY to pull request builder and spark-master-test-sbt-hadoop-2.7

2019-01-27 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26722:
-
Description: See 
https://github.com/apache/spark/pull/23117#issuecomment-456083489  (was: from 
https://github.com/apache/spark/pull/23117:

we need to add the {{SPARK_TEST_KEY=1}} env var to both the GHPRB and 
{{spark-master-test-sbt-hadoop-2.7}} builds.

this is done for the PRB, and was manually added to the 
{{spark-master-test-sbt-hadoop-2.7}} build.

i will leave this open until i finish porting the JJB configs in to the main 
spark repo (for the {{spark-master-test-sbt-hadoop-2.7}} build).)

> Set SPARK_TEST_KEY to pull request builder and 
> spark-master-test-sbt-hadoop-2.7
> ---
>
> Key: SPARK-26722
> URL: https://issues.apache.org/jira/browse/SPARK-26722
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
> Attachments: SPARK-26731.doc
>
>
> See https://github.com/apache/spark/pull/23117#issuecomment-456083489



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26722) Set SPARK_TEST_KEY to pull request builder and spark-master-test-sbt-hadoop-2.7

2019-01-27 Thread Hyukjin Kwon (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16753352#comment-16753352
 ] 

Hyukjin Kwon commented on SPARK-26722:
--

We're synced now and I think it's (mostly) fixed. There was a bit of 
miscommunication between Shane and me - sorry, I had to clarify it. 

> Set SPARK_TEST_KEY to pull request builder and 
> spark-master-test-sbt-hadoop-2.7
> ---
>
> Key: SPARK-26722
> URL: https://issues.apache.org/jira/browse/SPARK-26722
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
> Attachments: SPARK-26731.doc
>
>
> See https://github.com/apache/spark/pull/23117#issuecomment-456083489



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-26722) Set SPARK_TEST_KEY to pull request builder and spark-master-test-sbt-hadoop-2.7

2019-01-27 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-26722:
-
Summary: Set SPARK_TEST_KEY to pull request builder and 
spark-master-test-sbt-hadoop-2.7  (was: add SPARK_TEST_KEY=1 to pull request 
builder and spark-master-test-sbt-hadoop-2.7)

> Set SPARK_TEST_KEY to pull request builder and 
> spark-master-test-sbt-hadoop-2.7
> ---
>
> Key: SPARK-26722
> URL: https://issues.apache.org/jira/browse/SPARK-26722
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
> Attachments: SPARK-26731.doc
>
>
> from https://github.com/apache/spark/pull/23117:
> we need to add the {{SPARK_TEST_KEY=1}} env var to both the GHPRB and 
> {{spark-master-test-sbt-hadoop-2.7}} builds.
> this is done for the PRB, and was manually added to the 
> {{spark-master-test-sbt-hadoop-2.7}} build.
> i will leave this open until i finish porting the JJB configs in to the main 
> spark repo (for the {{spark-master-test-sbt-hadoop-2.7}} build).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-26722) add SPARK_TEST_KEY=1 to pull request builder and spark-master-test-sbt-hadoop-2.7

2019-01-27 Thread Hyukjin Kwon (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-26722.
--
Resolution: Fixed

Thank you so much, Shane, for doing this!

> add SPARK_TEST_KEY=1 to pull request builder and 
> spark-master-test-sbt-hadoop-2.7
> -
>
> Key: SPARK-26722
> URL: https://issues.apache.org/jira/browse/SPARK-26722
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: shane knapp
>Assignee: shane knapp
>Priority: Major
> Attachments: SPARK-26731.doc
>
>
> from https://github.com/apache/spark/pull/23117:
> we need to add the {{SPARK_TEST_KEY=1}} env var to both the GHPRB and 
> {{spark-master-test-sbt-hadoop-2.7}} builds.
> this is done for the PRB, and was manually added to the 
> {{spark-master-test-sbt-hadoop-2.7}} build.
> i will leave this open until i finish porting the JJB configs in to the main 
> spark repo (for the {{spark-master-test-sbt-hadoop-2.7}} build).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15368) Spark History Server does not pick up extraClasspath

2019-01-27 Thread Sandeep Nemuri (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16753339#comment-16753339
 ] 

Sandeep Nemuri commented on SPARK-15368:


Just for the record, {{SPARK_DIST_CLASSPATH}} works.

> Spark History Server does not pick up extraClasspath
> 
>
> Key: SPARK-15368
> URL: https://issues.apache.org/jira/browse/SPARK-15368
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1
> Environment: HDP-2.4
> CentOS
>Reporter: Dawson Choong
>Priority: Major
>
> We've encountered a problem where the Spark History Server does not pick up 
> the {{spark.driver.extraClassPath}} parameter set in {{Custom 
> spark-defaults}} inside Ambari. Because the needed JARs are not being picked 
> up, this leads to a {{ClassNotFoundException}}. (Our current workaround is 
> to manually export the JARs in the Spark-env.)
> Log file:
> Spark Command: /usr/java/default/bin/java -Dhdp.version=2.4.0.0-169 -cp 
> /usr/hdp/2.4.0.0-169/spark/sbin/../conf/:/usr/hdp/2.4.0.0-169/spark/lib/spark-assembly-1.6.0.2.4.0.0-169-hadoop2.7.1.2.4.0.0-169.jar:/usr/hdp/2.4.0.0-169/spark/lib/datanucleus-core-3.2.10.jar:/usr/hdp/2.4.0.0-169/spark/lib/datanucleus-rdbms-3.2.9.jar:/usr/hdp/2.4.0.0-169/spark/lib/datanucleus-api-jdo-3.2.6.jar:/usr/hdp/current/hadoop-client/conf/
>  -Xms1g -Xmx1g -XX:MaxPermSize=256m 
> org.apache.spark.deploy.history.HistoryServer
> 
> 16/04/12 12:23:44 INFO HistoryServer: Registered signal handlers for [TERM, 
> HUP, INT]
> 16/04/12 12:23:45 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> 16/04/12 12:23:45 INFO SecurityManager: Changing view acls to: spark
> 16/04/12 12:23:45 INFO SecurityManager: Changing modify acls to: spark
> 16/04/12 12:23:45 INFO SecurityManager: SecurityManager: authentication 
> disabled; ui acls disabled; users with view permissions: Set(spark); users 
> with modify permissions: Set(spark)
> Exception in thread "main" java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>   at 
> org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:235)
>   at 
> org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
> Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: 
> Class com.wandisco.fs.client.FusionHdfs not found
>   at 
> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
>   at 
> org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2638)
>   at 
> org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2651)
>   at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
>   at 
> org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
>   at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
>   at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:362)
>   at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1650)
>   at org.apache.spark.util.Utils$.getHadoopFileSystem(Utils.scala:1657)
>   at 
> org.apache.spark.deploy.history.FsHistoryProvider.<init>(FsHistoryProvider.scala:71)
>   at 
> org.apache.spark.deploy.history.FsHistoryProvider.<init>(FsHistoryProvider.scala:49)
>   ... 6 more
> Caused by: java.lang.ClassNotFoundException: Class 
> com.wandisco.fs.client.FusionHdfs not found
>   at 
> org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
>   at 
> org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2193)
>   ... 17 more



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26675) Error happened during creating avro files

2019-01-27 Thread Gengliang Wang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16753326#comment-16753326
 ] 

Gengliang Wang commented on SPARK-26675:


[~tony0918] Can you provide a sample input file? 

> Error happened during creating avro files
> -
>
> Key: SPARK-26675
> URL: https://issues.apache.org/jira/browse/SPARK-26675
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.0
>Reporter: Tony Mao
>Priority: Major
>
> Run cmd
> {code:java}
> spark-submit --packages org.apache.spark:spark-avro_2.11:2.4.0 
> /nke/reformat.py
> {code}
> code in reformat.py
> {code:java}
> df = spark.read.option("multiline", "true").json("file:///nke/example1.json")
> df.createOrReplaceTempView("traffic")
> a = spark.sql("""SELECT store.*, store.name as store_name, 
> store.dataSupplierId as store_dataSupplierId, trafficSensor.*,
> trafficSensor.name as trafficSensor_name, trafficSensor.dataSupplierId as 
> trafficSensor_dataSupplierId, readings.*
> FROM (SELECT explode(stores) as store, explode(store.trafficSensors) as 
> trafficSensor,
> explode(trafficSensor.trafficSensorReadings) as readings FROM traffic)""")
> b = a.drop("trafficSensors", "trafficSensorReadings", "name", 
> "dataSupplierId")
> b.write.format("avro").save("file:///nke/curated/namesAndFavColors.avro")
> {code}
> Error message:
> {code:java}
> Traceback (most recent call last):
> File "/nke/reformat.py", line 18, in 
> b.select("store_name", 
> "store_dataSupplierId").write.format("avro").save("file:///nke/curated/namesAndFavColors.avro")
> File "/usr/spark-2.4.0/python/lib/pyspark.zip/pyspark/sql/readwriter.py", 
> line 736, in save
> File "/usr/spark-2.4.0/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", 
> line 1257, in __call__
> File "/usr/spark-2.4.0/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, 
> in deco
> File "/usr/spark-2.4.0/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 
> 328, in get_return_value
> py4j.protocol.Py4JJavaError: An error occurred while calling o45.save.
> : java.lang.NoSuchMethodError: 
> org.apache.avro.Schema.createUnion([Lorg/apache/avro/Schema;)Lorg/apache/avro/Schema;
> at 
> org.apache.spark.sql.avro.SchemaConverters$.toAvroType(SchemaConverters.scala:185)
> at 
> org.apache.spark.sql.avro.SchemaConverters$$anonfun$5.apply(SchemaConverters.scala:176)
> at 
> org.apache.spark.sql.avro.SchemaConverters$$anonfun$5.apply(SchemaConverters.scala:174)
> at scala.collection.Iterator$class.foreach(Iterator.scala:891)
> at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
> at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
> at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99)
> at 
> org.apache.spark.sql.avro.SchemaConverters$.toAvroType(SchemaConverters.scala:174)
> at 
> org.apache.spark.sql.avro.AvroFileFormat$$anonfun$7.apply(AvroFileFormat.scala:118)
> at 
> org.apache.spark.sql.avro.AvroFileFormat$$anonfun$7.apply(AvroFileFormat.scala:118)
> at scala.Option.getOrElse(Option.scala:121)
> at 
> org.apache.spark.sql.avro.AvroFileFormat.prepareWrite(AvroFileFormat.scala:118)
> at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:103)
> at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
> at 
> org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
> at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
> at 
> org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
> at 
> org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
> at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfProp