[jira] [Created] (SPARK-31129) IntervalBenchmark and DateTimeBenchmark fails to run
Kent Yao created SPARK-31129: Summary: IntervalBenchmark and DateTimeBenchmark fails to run Key: SPARK-31129 URL: https://issues.apache.org/jira/browse/SPARK-31129 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 3.0.0, 3.1.0 Reporter: Kent Yao [error] Caused by: java.time.format.DateTimeParseException: Text '2019-01-27 11:02:01.0' could not be parsed at index 20 [error] at java.base/java.time.format.DateTimeFormatter.parseResolved0(DateTimeFormatter.java:2046) [error] at java.base/java.time.format.DateTimeFormatter.parse(DateTimeFormatter.java:1874) [error] at org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter.$anonfun$parse$1(TimestampFormatter.scala:71) [error] ... 19 more -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
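The failure above is java.time rejecting a timestamp literal with a trailing fractional second. Below is a minimal Scala sketch of the same class of failure; the benchmark's actual formatter (Iso8601TimestampFormatter in the stack trace) is not shown in the report, so the pattern used here is only an assumption chosen to reproduce the trailing ".0" problem.

{code:scala}
import java.time.LocalDateTime
import java.time.format.{DateTimeFormatter, DateTimeParseException}

object ParseRepro {
  def main(args: Array[String]): Unit = {
    // A pattern with no fractional-second field cannot consume the trailing ".0",
    // so the parse stops before the end of the input and java.time throws.
    val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
    try {
      LocalDateTime.parse("2019-01-27 11:02:01.0", fmt)
    } catch {
      case e: DateTimeParseException =>
        println(s"Failed as expected: ${e.getMessage}")
    }
  }
}
{code}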
[jira] [Commented] (SPARK-30565) Regression in the ORC benchmark
[ https://issues.apache.org/jira/browse/SPARK-30565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057619#comment-17057619 ] Maxim Gekk commented on SPARK-30565: Per [~dongjoon] , default ORC reader doesn't fully cover functionality of Hive ORC reader. Maybe, some users have to use the former one in some cases. > Regression in the ORC benchmark > --- > > Key: SPARK-30565 > URL: https://issues.apache.org/jira/browse/SPARK-30565 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > New benchmark results generated in the PR > [https://github.com/apache/spark/pull/27078] show regression ~3 times. > Before: > {code} > Hive built-in ORC  520  531  8  2.0  495.8  0.6X > {code} > https://github.com/apache/spark/pull/27078/files#diff-42fe5f1ef10d8f9f274fc89b2c8d140dL138 > After: > {code} > Hive built-in ORC  1761  1792  43  0.6  1679.3  0.1X > {code} > https://github.com/apache/spark/pull/27078/files#diff-42fe5f1ef10d8f9f274fc89b2c8d140dR138 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30565) Regression in the ORC benchmark
[ https://issues.apache.org/jira/browse/SPARK-30565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057608#comment-17057608 ] Wenchen Fan commented on SPARK-30565: - Maybe we should report this to the Hive community. In Spark, we use our native orc reader by default so the hive one is not very important. > Regression in the ORC benchmark > --- > > Key: SPARK-30565 > URL: https://issues.apache.org/jira/browse/SPARK-30565 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > New benchmark results generated in the PR > [https://github.com/apache/spark/pull/27078] show regression ~3 times. > Before: > {code} > Hive built-in ORC  520  531  8  2.0  495.8  0.6X > {code} > https://github.com/apache/spark/pull/27078/files#diff-42fe5f1ef10d8f9f274fc89b2c8d140dL138 > After: > {code} > Hive built-in ORC  1761  1792  43  0.6  1679.3  0.1X > {code} > https://github.com/apache/spark/pull/27078/files#diff-42fe5f1ef10d8f9f274fc89b2c8d140dR138 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
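For readers reproducing the comparison, the reader implementation is selected with the spark.sql.orc.impl setting. The snippet below is only a minimal sketch (the input path is a placeholder), not the benchmark harness itself.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("orc-impl-demo").getOrCreate()

// "native" (the default) uses Spark's own ORC reader; "hive" switches to the
// Hive built-in reader whose numbers regressed in the report above.
spark.conf.set("spark.sql.orc.impl", "hive")
spark.read.orc("/tmp/orc/table").show()
{code}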
[jira] [Commented] (SPARK-30931) ML 3.0 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-30931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057592#comment-17057592 ] zhengruifeng commented on SPARK-30931: -- Thanks [~huaxingao] for your work > ML 3.0 QA: API: Python API coverage > --- > > Key: SPARK-30931 > URL: https://issues.apache.org/jira/browse/SPARK-30931 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Major > > For new public APIs added to MLlib ({{spark.ml}} only), we need to check the > generated HTML doc and compare the Scala & Python versions. > * *GOAL*: Audit and create JIRAs to fix in the next release. > * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues. > We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally > either necessary (intentional) or accidental. These must be recorded and > added in the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > *Please use a _separate_ JIRA (linked below as "requires") for this list of > to-do items.* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30931) ML 3.0 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-30931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-30931. -- Resolution: Fixed > ML 3.0 QA: API: Python API coverage > --- > > Key: SPARK-30931 > URL: https://issues.apache.org/jira/browse/SPARK-30931 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Major > > For new public APIs added to MLlib ({{spark.ml}} only), we need to check the > generated HTML doc and compare the Scala & Python versions. > * *GOAL*: Audit and create JIRAs to fix in the next release. > * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues. > We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally > either necessary (intentional) or accidental. These must be recorded and > added in the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > *Please use a _separate_ JIRA (linked below as "requires") for this list of > to-do items.* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30935) Update MLlib, GraphX websites for 3.0
[ https://issues.apache.org/jira/browse/SPARK-30935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057587#comment-17057587 ] zhengruifeng commented on SPARK-30935: -- [~huaxingao] Thanks! > Update MLlib, GraphX websites for 3.0 > - > > Key: SPARK-30935 > URL: https://issues.apache.org/jira/browse/SPARK-30935 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Critical > > Update the sub-projects' websites to include new features in this release. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30935) Update MLlib, GraphX websites for 3.0
[ https://issues.apache.org/jira/browse/SPARK-30935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-30935. -- Resolution: Fixed > Update MLlib, GraphX websites for 3.0 > - > > Key: SPARK-30935 > URL: https://issues.apache.org/jira/browse/SPARK-30935 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Critical > > Update the sub-projects' websites to include new features in this release. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31011) Failed to register signal handler for PWR
[ https://issues.apache.org/jira/browse/SPARK-31011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31011. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27832 [https://github.com/apache/spark/pull/27832] > Failed to register signal handler for PWR > - > > Key: SPARK-31011 > URL: https://issues.apache.org/jira/browse/SPARK-31011 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Assignee: Jungtaek Lim >Priority: Minor > Fix For: 3.1.0 > > > I've just tried to test something on standalone mode but the application > fails. > Environment: > * MacOS Catalina 10.15.3 (19D76) > * Scala 2.12.10 > * Java 1.8.0_241-b07 > Steps to reproduce: > * Compile Spark (mvn -DskipTests clean install -Dskip) > * ./sbin/start-master.sh > * ./sbin/start-slave.sh spark://host:7077 > * submit an empty application > Error: > {code:java} > 20/03/02 14:25:44 INFO SignalUtils: Registering signal handler for PWR > 20/03/02 14:25:44 WARN SignalUtils: Failed to register signal handler for PWR > java.lang.IllegalArgumentException: Unknown signal: PWR > at sun.misc.Signal.(Signal.java:143) > at > org.apache.spark.util.SignalUtils$.$anonfun$register$1(SignalUtils.scala:64) > at scala.collection.mutable.HashMap.getOrElseUpdate(HashMap.scala:86) > at org.apache.spark.util.SignalUtils$.register(SignalUtils.scala:62) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend.onStart(CoarseGrainedExecutorBackend.scala:85) > at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:120) > at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203) > at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) > at > org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) > at > org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
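The stack trace shows the failure happening while registering a handler for SIGPWR, a signal that does not exist on macOS. The sketch below is not Spark's SignalUtils code; it only illustrates the guard pattern of registering a handler and tolerating platforms where the signal is unknown.

{code:scala}
import sun.misc.{Signal, SignalHandler}

def tryRegister(signalName: String)(action: => Unit): Unit = {
  try {
    // new Signal("PWR") throws IllegalArgumentException("Unknown signal: PWR")
    // on platforms such as macOS that have no SIGPWR.
    Signal.handle(new Signal(signalName), new SignalHandler {
      override def handle(sig: Signal): Unit = action
    })
  } catch {
    case e: IllegalArgumentException =>
      println(s"Skipping handler for SIG$signalName: ${e.getMessage}")
  }
}

tryRegister("PWR") { println("decommission signal received") }
{code}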
[jira] [Assigned] (SPARK-31011) Failed to register signal handler for PWR
[ https://issues.apache.org/jira/browse/SPARK-31011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-31011: - Assignee: Jungtaek Lim > Failed to register signal handler for PWR > - > > Key: SPARK-31011 > URL: https://issues.apache.org/jira/browse/SPARK-31011 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Assignee: Jungtaek Lim >Priority: Minor > > I've just tried to test something on standalone mode but the application > fails. > Environment: > * MacOS Catalina 10.15.3 (19D76) > * Scala 2.12.10 > * Java 1.8.0_241-b07 > Steps to reproduce: > * Compile Spark (mvn -DskipTests clean install -Dskip) > * ./sbin/start-master.sh > * ./sbin/start-slave.sh spark://host:7077 > * submit an empty application > Error: > {code:java} > 20/03/02 14:25:44 INFO SignalUtils: Registering signal handler for PWR > 20/03/02 14:25:44 WARN SignalUtils: Failed to register signal handler for PWR > java.lang.IllegalArgumentException: Unknown signal: PWR > at sun.misc.Signal.(Signal.java:143) > at > org.apache.spark.util.SignalUtils$.$anonfun$register$1(SignalUtils.scala:64) > at scala.collection.mutable.HashMap.getOrElseUpdate(HashMap.scala:86) > at org.apache.spark.util.SignalUtils$.register(SignalUtils.scala:62) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend.onStart(CoarseGrainedExecutorBackend.scala:85) > at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:120) > at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203) > at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) > at > org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) > at > org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31032) GMM compute summary and update distributions in one pass
[ https://issues.apache.org/jira/browse/SPARK-31032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng reassigned SPARK-31032: Assignee: zhengruifeng > GMM compute summary and update distributions in one pass > > > Key: SPARK-31032 > URL: https://issues.apache.org/jira/browse/SPARK-31032 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > > In current impl, GMM need to trigger two jobs at one iteration: > 1, one to compute summary; > 2, if {{shouldDistributeGaussians}} ((k - 1.0) / k) * numFeatures > 25.0, trigger another to update distributions; > > shouldDistributeGaussians is almost true in practice, since numFeatures is likely to be greater than 25. > > We can use only one job to impl above computation, -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31032) GMM compute summary and update distributions in one pass
[ https://issues.apache.org/jira/browse/SPARK-31032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-31032. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27784 [https://github.com/apache/spark/pull/27784] > GMM compute summary and update distributions in one pass > > > Key: SPARK-31032 > URL: https://issues.apache.org/jira/browse/SPARK-31032 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > Fix For: 3.1.0 > > > In current impl, GMM need to trigger two jobs at one iteration: > 1, one to compute summary; > 2, if {{shouldDistributeGaussians}} ((k - 1.0) / k) * numFeatures > 25.0, trigger another to update distributions; > > shouldDistributeGaussians is almost true in practice, since numFeatures is likely to be greater than 25. > > We can use only one job to impl above computation, -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
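For context on the description above, here is a sketch of the distribution heuristic it references. It mirrors the formula quoted in the ticket but is not the MLlib source itself.

{code:scala}
// Distribute the Gaussian updates only when the per-iteration work is large enough.
def shouldDistributeGaussians(k: Int, numFeatures: Int): Boolean =
  ((k - 1.0) / k) * numFeatures > 25.0

// With k = 3 clusters the threshold is numFeatures > 37.5, so for most real
// feature counts the distributed path (the second job per iteration) is taken.
println(shouldDistributeGaussians(k = 3, numFeatures = 100)) // true
{code}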
[jira] [Commented] (SPARK-31083) .ClassNotFoundException CoarseGrainedClusterMessages$RetrieveDelegationTokens
[ https://issues.apache.org/jira/browse/SPARK-31083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057555#comment-17057555 ] jiama commented on SPARK-31083: --- Yes, I used Spark 2.4 from the CDH 6.2 suite. This error occurred when I used IDEA to submit the job directly in yarn-client mode. The project is a Maven project that depends on Apache Spark 2.4 and spark-yarn 2.4, and it is compiled with Scala 2.11.12. > .ClassNotFoundException CoarseGrainedClusterMessages$RetrieveDelegationTokens > - > > Key: SPARK-31083 > URL: https://issues.apache.org/jira/browse/SPARK-31083 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 > Environment: spark2.4-cdh6.2 >Reporter: jiama >Priority: Major > > Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: > org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages$RetrieveDelegationTokens$ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31126) Upgrade Kafka to 2.4.1
[ https://issues.apache.org/jira/browse/SPARK-31126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-31126: - Assignee: Dongjoon Hyun > Upgrade Kafka to 2.4.1 > -- > > Key: SPARK-31126 > URL: https://issues.apache.org/jira/browse/SPARK-31126 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31126) Upgrade Kafka to 2.4.1
[ https://issues.apache.org/jira/browse/SPARK-31126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31126. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27881 [https://github.com/apache/spark/pull/27881] > Upgrade Kafka to 2.4.1 > -- > > Key: SPARK-31126 > URL: https://issues.apache.org/jira/browse/SPARK-31126 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30839) Add version information for Spark configuration
[ https://issues.apache.org/jira/browse/SPARK-30839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057524#comment-17057524 ] Hyukjin Kwon commented on SPARK-30839: -- [~beliefer] are there more JIRAs to add from your side? Let me know when you finish, and then I will resolve this ticket. > Add version information for Spark configuration > --- > > Key: SPARK-30839 > URL: https://issues.apache.org/jira/browse/SPARK-30839 > Project: Spark > Issue Type: Improvement > Components: Documentation, DStreams, Kubernetes, Mesos, Spark Core, > SQL, Structured Streaming, YARN >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Priority: Major > > Spark ConfigEntry and ConfigBuilder missing Spark version information of each > configuration at release. This is not good for Spark user when they visiting > the page of spark configuration. > http://spark.apache.org/docs/latest/configuration.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
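The umbrella ticket above threads a version string through Spark's internal ConfigBuilder so the generated docs can show when each setting was introduced. The entry below is hypothetical (there is no spark.example.maxWidgets) and ConfigBuilder is internal API; it only sketches the pattern the subtasks apply.

{code:scala}
import org.apache.spark.internal.config.ConfigBuilder

// Hypothetical config entry: the .version(...) call records the release that
// introduced the setting and is surfaced on the configuration docs page.
val MAX_WIDGETS = ConfigBuilder("spark.example.maxWidgets")
  .doc("Maximum number of widgets to track (illustrative only).")
  .version("3.1.0")
  .intConf
  .createWithDefault(100)
{code}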
[jira] [Assigned] (SPARK-30911) Add version information to the configuration of Status
[ https://issues.apache.org/jira/browse/SPARK-30911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-30911: Assignee: jiaan.geng > Add version information to the configuration of Status > -- > > Key: SPARK-30911 > URL: https://issues.apache.org/jira/browse/SPARK-30911 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > > core/src/main/scala/org/apache/spark/internal/config/Status.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30911) Add version information to the configuration of Status
[ https://issues.apache.org/jira/browse/SPARK-30911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30911. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27848 [https://github.com/apache/spark/pull/27848] > Add version information to the configuration of Status > -- > > Key: SPARK-30911 > URL: https://issues.apache.org/jira/browse/SPARK-30911 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.1.0 > > > core/src/main/scala/org/apache/spark/internal/config/Status.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31109) Add version information to the configuration of Mesos
[ https://issues.apache.org/jira/browse/SPARK-31109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31109. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27863 [https://github.com/apache/spark/pull/27863] > Add version information to the configuration of Mesos > - > > Key: SPARK-31109 > URL: https://issues.apache.org/jira/browse/SPARK-31109 > Project: Spark > Issue Type: Sub-task > Components: Mesos >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.1.0 > > > esource-managers/mesos/src/main/scala/org/apache/spark/deploy/mesos/config.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31109) Add version information to the configuration of Mesos
[ https://issues.apache.org/jira/browse/SPARK-31109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-31109: Assignee: jiaan.geng > Add version information to the configuration of Mesos > - > > Key: SPARK-31109 > URL: https://issues.apache.org/jira/browse/SPARK-31109 > Project: Spark > Issue Type: Sub-task > Components: Mesos >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > > esource-managers/mesos/src/main/scala/org/apache/spark/deploy/mesos/config.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30839) Add version information for Spark configuration
[ https://issues.apache.org/jira/browse/SPARK-30839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-30839: --- Component/s: YARN Structured Streaming Spark Core Mesos Kubernetes DStreams Documentation > Add version information for Spark configuration > --- > > Key: SPARK-30839 > URL: https://issues.apache.org/jira/browse/SPARK-30839 > Project: Spark > Issue Type: Improvement > Components: Documentation, DStreams, Kubernetes, Mesos, Spark Core, > SQL, Structured Streaming, YARN >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Priority: Major > > Spark ConfigEntry and ConfigBuilder missing Spark version information of each > configuration at release. This is not good for Spark user when they visiting > the page of spark configuration. > http://spark.apache.org/docs/latest/configuration.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31128) Fix Uncaught TypeError in streaming statistics page
Gengliang Wang created SPARK-31128: -- Summary: Fix Uncaught TypeError in streaming statistics page Key: SPARK-31128 URL: https://issues.apache.org/jira/browse/SPARK-31128 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 3.0.0 Reporter: Gengliang Wang Assignee: Gengliang Wang There is a minor issue introduced in https://github.com/apache/spark/pull/26201 In the streaming statistics page, the following error appears in the console after clicking the timeline graph: ``` streaming-page.js:211 Uncaught TypeError: Cannot read property 'top' of undefined at SVGCircleElement. (streaming-page.js:211) at SVGCircleElement.__onclick (d3.min.js:1) ``` We should fix it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31074) Avro serializer should not fail when a nullable Spark field is written to a non-null Avro column
[ https://issues.apache.org/jira/browse/SPARK-31074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057505#comment-17057505 ] Kyrill Alyoshin edited comment on SPARK-31074 at 3/12/20, 1:01 AM: --- Here you go. Avro schema: {code:java} { "type": "record", "namespace": "com.domain.em", "name": "PracticeDiff", "fields": [ { "name": "practiceId", "type": "string" }, { "name": "value", "type": "string" }, { "name": "checkedValue", "type": "string" } ] } {code} Java code: {code:java} package com.domain.em; public final class PracticeDiff { private String practiceId; private String value; private String checkedValue; public String getPracticeId() { return practiceId; } public String getValue() { return value; } public String getCheckedValue() { return checkedValue; } } {code} Thank you! was (Author: kyrill007): Here you go. Avro schema: {code:java} { "type": "record", "namespace": "com.domain.em", "name": "PracticeDiff", "fields": [ { "name": "practiceId", "type": "string" }, { "name": "value", "type": "string" }, { "name": "checkedValue", "type": "string" } ] } {code} Java code: {code:java} package com.domain.em; public final class PracticeDiff { private String practiceId; private String value; private String checkedValue; public String getPracticeId() { return practiceId; } public String getValue() { return value; } public String getCheckedValue() { return checkedValue; } } {code} Thank you! > Avro serializer should not fail when a nullable Spark field is written to a > non-null Avro column > > > Key: SPARK-31074 > URL: https://issues.apache.org/jira/browse/SPARK-31074 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: Kyrill Alyoshin >Priority: Major > > Spark StructType schema are strongly biased towards having _nullable_ fields. > In fact, this is what _Encoders.bean()_ does - any non-primitive field is > automatically _nullable_. When we attempt to serialize dataframes into > *user-supplied* Avro schemas where such corresponding fields are marked as > _non-null_ (i.e., they are not of _union_ type) any such attempt will fail > with the following exception > > {code:java} > Caused by: org.apache.avro.AvroRuntimeException: Not a union: "string" > at org.apache.avro.Schema.getTypes(Schema.java:299) > at > org.apache.spark.sql.avro.AvroSerializer.org$apache$spark$sql$avro$AvroSerializer$$resolveNullableType(AvroSerializer.scala:229) > at > org.apache.spark.sql.avro.AvroSerializer$$anonfun$3.apply(AvroSerializer.scala:209) > {code} > This seems as rather draconian. We certainly should be able to write a field > of the same type and with the same name if it is not a null into a > non-nullable Avro column. In fact, the problem is so *severe* that it is not > clear what should be done in such situations when Avro schema is given to you > as part of API communication contract (i.e., it is non-changeable). > This is an important issue. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31074) Avro serializer should not fail when a nullable Spark field is written to a non-null Avro column
[ https://issues.apache.org/jira/browse/SPARK-31074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057505#comment-17057505 ] Kyrill Alyoshin edited comment on SPARK-31074 at 3/12/20, 1:01 AM: --- Here you go. Avro schema: {code:java} { "type": "record", "namespace": "com.domain.em", "name": "PracticeDiff", "fields": [ { "name": "practiceId", "type": "string" }, { "name": "value", "type": "string" }, { "name": "checkedValue", "type": "string" } ] } {code} Java code: {code:java} package com.domain.em; public final class PracticeDiff { private String practiceId; private String value; private String checkedValue; public String getPracticeId() { return practiceId; } public String getValue() { return value; } public String getCheckedValue() { return checkedValue; } } {code} Thank you! was (Author: kyrill007): Here you go. Avro schema: {code:java} { "type": "record", "namespace": "com.domain.em", "name": "PracticeDiff", "fields": [ { "name": "practiceId", "type": "string" }, { "name": "cisValue", "type": "string" }, { "name": "checkedValue", "type": "string" } ] } {code} Java code: {code:java} package com.domain.em; public final class PracticeDiff { private String practiceId; private String value; private String checkedValue; public String getPracticeId() { return practiceId; } public String getValue() { return value; } public String getCheckedValue() { return checkedValue; } } {code} Thank you! > Avro serializer should not fail when a nullable Spark field is written to a > non-null Avro column > > > Key: SPARK-31074 > URL: https://issues.apache.org/jira/browse/SPARK-31074 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: Kyrill Alyoshin >Priority: Major > > Spark StructType schema are strongly biased towards having _nullable_ fields. > In fact, this is what _Encoders.bean()_ does - any non-primitive field is > automatically _nullable_. When we attempt to serialize dataframes into > *user-supplied* Avro schemas where such corresponding fields are marked as > _non-null_ (i.e., they are not of _union_ type) any such attempt will fail > with the following exception > > {code:java} > Caused by: org.apache.avro.AvroRuntimeException: Not a union: "string" > at org.apache.avro.Schema.getTypes(Schema.java:299) > at > org.apache.spark.sql.avro.AvroSerializer.org$apache$spark$sql$avro$AvroSerializer$$resolveNullableType(AvroSerializer.scala:229) > at > org.apache.spark.sql.avro.AvroSerializer$$anonfun$3.apply(AvroSerializer.scala:209) > {code} > This seems as rather draconian. We certainly should be able to write a field > of the same type and with the same name if it is not a null into a > non-nullable Avro column. In fact, the problem is so *severe* that it is not > clear what should be done in such situations when Avro schema is given to you > as part of API communication contract (i.e., it is non-changeable). > This is an important issue. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31074) Avro serializer should not fail when a nullable Spark field is written to a non-null Avro column
[ https://issues.apache.org/jira/browse/SPARK-31074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057505#comment-17057505 ] Kyrill Alyoshin commented on SPARK-31074: - Here you go. Avro schema: {code:java} { "type": "record", "namespace": "com.domain.em", "name": "PracticeDiff", "fields": [ { "name": "practiceId", "type": "string" }, { "name": "cisValue", "type": "string" }, { "name": "checkedValue", "type": "string" } ] } {code} Java code: {code:java} package com.domain.em; public final class PracticeDiff { private String practiceId; private String value; private String checkedValue; public String getPracticeId() { return practiceId; } public String getValue() { return value; } public String getCheckedValue() { return checkedValue; } } {code} Thank you! > Avro serializer should not fail when a nullable Spark field is written to a > non-null Avro column > > > Key: SPARK-31074 > URL: https://issues.apache.org/jira/browse/SPARK-31074 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: Kyrill Alyoshin >Priority: Major > > Spark StructType schema are strongly biased towards having _nullable_ fields. > In fact, this is what _Encoders.bean()_ does - any non-primitive field is > automatically _nullable_. When we attempt to serialize dataframes into > *user-supplied* Avro schemas where such corresponding fields are marked as > _non-null_ (i.e., they are not of _union_ type) any such attempt will fail > with the following exception > > {code:java} > Caused by: org.apache.avro.AvroRuntimeException: Not a union: "string" > at org.apache.avro.Schema.getTypes(Schema.java:299) > at > org.apache.spark.sql.avro.AvroSerializer.org$apache$spark$sql$avro$AvroSerializer$$resolveNullableType(AvroSerializer.scala:229) > at > org.apache.spark.sql.avro.AvroSerializer$$anonfun$3.apply(AvroSerializer.scala:209) > {code} > This seems as rather draconian. We certainly should be able to write a field > of the same type and with the same name if it is not a null into a > non-nullable Avro column. In fact, the problem is so *severe* that it is not > clear what should be done in such situations when Avro schema is given to you > as part of API communication contract (i.e., it is non-changeable). > This is an important issue. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
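A hedged sketch of the write path being described: a DataFrame with nullable string columns written against a user-supplied Avro schema whose fields are plain (non-union) strings. The output path is a placeholder and the spark-avro package is assumed to be on the classpath.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("avro-nullability-demo").getOrCreate()
import spark.implicits._

val avroSchema =
  """{"type":"record","name":"PracticeDiff","namespace":"com.domain.em",
    |"fields":[{"name":"practiceId","type":"string"},
    |{"name":"value","type":"string"},
    |{"name":"checkedValue","type":"string"}]}""".stripMargin

// Tuple-derived string columns are nullable, while the Avro columns above are
// plain (non-union) strings: the mismatch the report says triggers "Not a union".
val df = Seq(("p1", "v1", "c1")).toDF("practiceId", "value", "checkedValue")

df.write
  .format("avro")
  .option("avroSchema", avroSchema)
  .save("/tmp/practice_diff_avro")
{code}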
[jira] [Assigned] (SPARK-31118) Add version information to the configuration of K8S
[ https://issues.apache.org/jira/browse/SPARK-31118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-31118: Assignee: jiaan.geng > Add version information to the configuration of K8S > --- > > Key: SPARK-31118 > URL: https://issues.apache.org/jira/browse/SPARK-31118 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > > resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31118) Add version information to the configuration of K8S
[ https://issues.apache.org/jira/browse/SPARK-31118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31118. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27875 [https://github.com/apache/spark/pull/27875] > Add version information to the configuration of K8S > --- > > Key: SPARK-31118 > URL: https://issues.apache.org/jira/browse/SPARK-31118 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.1.0 > > > resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31092) Add version information to the configuration of Yarn
[ https://issues.apache.org/jira/browse/SPARK-31092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31092. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27856 [https://github.com/apache/spark/pull/27856] > Add version information to the configuration of Yarn > > > Key: SPARK-31092 > URL: https://issues.apache.org/jira/browse/SPARK-31092 > Project: Spark > Issue Type: Sub-task > Components: YARN >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.1.0 > > > resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/config.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31092) Add version information to the configuration of Yarn
[ https://issues.apache.org/jira/browse/SPARK-31092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-31092: Assignee: jiaan.geng > Add version information to the configuration of Yarn > > > Key: SPARK-31092 > URL: https://issues.apache.org/jira/browse/SPARK-31092 > Project: Spark > Issue Type: Sub-task > Components: YARN >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > > resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/config.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31002) Add version information to the configuration of Core
[ https://issues.apache.org/jira/browse/SPARK-31002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-31002: Assignee: jiaan.geng > Add version information to the configuration of Core > > > Key: SPARK-31002 > URL: https://issues.apache.org/jira/browse/SPARK-31002 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.1.0 > > > core/src/main/scala/org/apache/spark/internal/config/package.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31074) Avro serializer should not fail when a nullable Spark field is written to a non-null Avro column
[ https://issues.apache.org/jira/browse/SPARK-31074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057500#comment-17057500 ] Hyukjin Kwon commented on SPARK-31074: -- Thanks, [~kyrill007]. If you already have the codes, can you paste here? So people just can copy and paste to reproduce > Avro serializer should not fail when a nullable Spark field is written to a > non-null Avro column > > > Key: SPARK-31074 > URL: https://issues.apache.org/jira/browse/SPARK-31074 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: Kyrill Alyoshin >Priority: Major > > Spark StructType schema are strongly biased towards having _nullable_ fields. > In fact, this is what _Encoders.bean()_ does - any non-primitive field is > automatically _nullable_. When we attempt to serialize dataframes into > *user-supplied* Avro schemas where such corresponding fields are marked as > _non-null_ (i.e., they are not of _union_ type) any such attempt will fail > with the following exception > > {code:java} > Caused by: org.apache.avro.AvroRuntimeException: Not a union: "string" > at org.apache.avro.Schema.getTypes(Schema.java:299) > at > org.apache.spark.sql.avro.AvroSerializer.org$apache$spark$sql$avro$AvroSerializer$$resolveNullableType(AvroSerializer.scala:229) > at > org.apache.spark.sql.avro.AvroSerializer$$anonfun$3.apply(AvroSerializer.scala:209) > {code} > This seems as rather draconian. We certainly should be able to write a field > of the same type and with the same name if it is not a null into a > non-nullable Avro column. In fact, the problem is so *severe* that it is not > clear what should be done in such situations when Avro schema is given to you > as part of API communication contract (i.e., it is non-changeable). > This is an important issue. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31039) Unable to use vendor specific datatypes with JDBC (MSSQL)
[ https://issues.apache.org/jira/browse/SPARK-31039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057497#comment-17057497 ] Hyukjin Kwon commented on SPARK-31039: -- When you describe an issue, it's best to be specific and explicit. The fix might address the general issue if it is. > Unable to use vendor specific datatypes with JDBC (MSSQL) > - > > Key: SPARK-31039 > URL: https://issues.apache.org/jira/browse/SPARK-31039 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Frank Oosterhuis >Priority: Major > > I'm trying to create a table in MSSQL with a time(7) type. > For this I'm using the createTableColumnTypes option like "CallStartTime > time(7)", with driver > "{color:#212121}com.microsoft.sqlserver.jdbc.SQLServerDriver"{color} > I'm getting an error: > {color:#212121}org.apache.spark.sql.catalyst.parser.ParseException: DataType > time(7) is not supported.(line 1, pos 43){color} > {color:#212121}What is then the point of using this option?{color} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
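A minimal sketch of the failing call described above (connection details and the source path are placeholders): the createTableColumnTypes value is parsed by Spark's own DDL parser, which is why a vendor-specific type like time(7) is rejected before anything reaches SQL Server.

{code:scala}
import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-coltypes-demo").getOrCreate()
val df = spark.read.parquet("/tmp/calls")

val props = new Properties()
props.setProperty("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")

// Throws ParseException: "DataType time(7) is not supported" while the option
// is parsed, before any JDBC connection is made.
df.write
  .option("createTableColumnTypes", "CallStartTime time(7)")
  .jdbc("jdbc:sqlserver://host:1433;databaseName=db", "dbo.Calls", props)
{code}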
[jira] [Resolved] (SPARK-31110) refine sql doc for SELECT
[ https://issues.apache.org/jira/browse/SPARK-31110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31110. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27866 [https://github.com/apache/spark/pull/27866] > refine sql doc for SELECT > - > > Key: SPARK-31110 > URL: https://issues.apache.org/jira/browse/SPARK-31110 > Project: Spark > Issue Type: Documentation > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31127) Add abstract Selector
Huaxin Gao created SPARK-31127: -- Summary: Add abstract Selector Key: SPARK-31127 URL: https://issues.apache.org/jira/browse/SPARK-31127 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 3.1.0 Reporter: Huaxin Gao Add abstract Selector. Put the common code between ChisqSelector and FValueSelector to Selector. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
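The refactoring is only summarized in the ticket above, so the sketch below is purely illustrative of the idea (hoist shared top-k selection into an abstract parent and let each concrete selector supply its own test statistic); the class and method names are hypothetical, not the actual spark.ml hierarchy.

{code:scala}
// Illustrative only: names and signatures here are assumptions.
abstract class Selector {
  /** Per-feature test statistic; each concrete selector defines its own test. */
  protected def testStatistics(labels: Array[Double],
                               features: Array[Array[Double]]): Array[Double]

  /** Shared logic that would otherwise be duplicated in every selector. */
  def selectTopK(labels: Array[Double],
                 features: Array[Array[Double]], k: Int): Array[Int] =
    testStatistics(labels, features).zipWithIndex
      .sortBy { case (stat, _) => -stat }
      .take(k)
      .map(_._2)
}
{code}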
[jira] [Commented] (SPARK-25987) StackOverflowError when executing many operations on a table with many columns
[ https://issues.apache.org/jira/browse/SPARK-25987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057472#comment-17057472 ] Dongjoon Hyun commented on SPARK-25987: --- I believe Janino is required in any way for stability, but I'm not sure what the other addition patches are. > StackOverflowError when executing many operations on a table with many columns > -- > > Key: SPARK-25987 > URL: https://issues.apache.org/jira/browse/SPARK-25987 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.2, 2.4.0, 2.4.5 > Environment: Ubuntu 18.04.1 LTS, openjdk "1.8.0_181" >Reporter: Ivan Tsukanov >Priority: Major > > When I execute > {code:java} > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val columnsCount = 100 > val columns = (1 to columnsCount).map(i => s"col$i") > val initialData = (1 to columnsCount).map(i => s"val$i") > val df = spark.createDataFrame( > rowRDD = spark.sparkContext.makeRDD(Seq(Row.fromSeq(initialData))), > schema = StructType(columns.map(StructField(_, StringType, true))) > ) > val addSuffixUDF = udf( > (str: String) => str + "_added" > ) > implicit class DFOps(df: DataFrame) { > def addSuffix() = { > df.select(columns.map(col => > addSuffixUDF(df(col)).as(col) > ): _*) > } > } > df.addSuffix().addSuffix().addSuffix().show() > {code} > I get > {code:java} > An exception or error caused a run to abort. > java.lang.StackOverflowError > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:385) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:553) > ... > {code} > If I reduce columns number (to 10 for example) or do `addSuffix` only once - > it works fine. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31126) Upgrade Kafka to 2.4.1
Dongjoon Hyun created SPARK-31126: - Summary: Upgrade Kafka to 2.4.1 Key: SPARK-31126 URL: https://issues.apache.org/jira/browse/SPARK-31126 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19335) Spark should support doing an efficient DataFrame Upsert via JDBC
[ https://issues.apache.org/jira/browse/SPARK-19335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057469#comment-17057469 ] John Lonergan commented on SPARK-19335: --- Streaming continous data from a fire hose source into a JDBC store. Need batched commits for throughput, but also need batches size control to keep latency under control. ie Delayed commits but not too delayed. And I want to do this without risk of data loss in the event of losing the infra so the commits and checkpointing need to be aligned. Any examples of this in practice? Have this working in Flink with almost no effort, but would prefer Spark for consistency. > Spark should support doing an efficient DataFrame Upsert via JDBC > - > > Key: SPARK-19335 > URL: https://issues.apache.org/jira/browse/SPARK-19335 > Project: Spark > Issue Type: Improvement >Reporter: Ilya Ganelin >Priority: Minor > > Doing a database update, as opposed to an insert is useful, particularly when > working with streaming applications which may require revisions to previously > stored data. > Spark DataFrames/DataSets do not currently support an Update feature via the > JDBC Writer allowing only Overwrite or Append. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
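On the question above about batched, checkpointed writes to a JDBC store: there is still no built-in upsert, but the closest built-in pattern today is foreachBatch with an append write plus a checkpoint location, with the trigger interval bounding commit latency. The sketch below uses placeholder connection details and the rate source purely for illustration; it appends rather than upserts.

{code:scala}
import java.util.Properties
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("stream-to-jdbc").getOrCreate()

val stream = spark.readStream.format("rate").option("rowsPerSecond", "100").load()

val props = new Properties()
props.setProperty("user", "app")
props.setProperty("password", "secret")

// One append per micro-batch; each partition commits its own JDBC transaction.
// Delivery is at-least-once, so a restart may replay the most recent batch.
val writeBatch: (DataFrame, Long) => Unit = (batch, batchId) =>
  batch.write.mode("append").jdbc("jdbc:postgresql://host/db", "events", props)

stream.writeStream
  .foreachBatch(writeBatch)
  .option("checkpointLocation", "/tmp/checkpoints/stream-to-jdbc")
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .start()
{code}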
[jira] [Resolved] (SPARK-31062) Improve Spark Decommissioning K8s test relability
[ https://issues.apache.org/jira/browse/SPARK-31062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau resolved SPARK-31062. -- Fix Version/s: 3.1.0 Resolution: Fixed > Improve Spark Decommissioning K8s test relability > - > > Key: SPARK-31062 > URL: https://issues.apache.org/jira/browse/SPARK-31062 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Tests >Affects Versions: 3.1.0 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Minor > Fix For: 3.1.0 > > > The test currently flakes more than the other Kubernetes tests. We can remove > some of the timing that is likely to be a source of flakiness. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31125) When processing new K8s state snapshots Spark treats Terminating nodes as terminated.
Holden Karau created SPARK-31125: Summary: When processing new K8s state snapshots Spark treats Terminating nodes as terminated. Key: SPARK-31125 URL: https://issues.apache.org/jira/browse/SPARK-31125 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.1.0 Reporter: Holden Karau Assignee: Holden Karau -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25987) StackOverflowError when executing many operations on a table with many columns
[ https://issues.apache.org/jira/browse/SPARK-25987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057339#comment-17057339 ] L. C. Hsieh commented on SPARK-25987: - Hmm, thanks for checking, so currently we are not sure what patch fixes this issue? > StackOverflowError when executing many operations on a table with many columns > -- > > Key: SPARK-25987 > URL: https://issues.apache.org/jira/browse/SPARK-25987 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.2, 2.4.0, 2.4.5 > Environment: Ubuntu 18.04.1 LTS, openjdk "1.8.0_181" >Reporter: Ivan Tsukanov >Priority: Major > > When I execute > {code:java} > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val columnsCount = 100 > val columns = (1 to columnsCount).map(i => s"col$i") > val initialData = (1 to columnsCount).map(i => s"val$i") > val df = spark.createDataFrame( > rowRDD = spark.sparkContext.makeRDD(Seq(Row.fromSeq(initialData))), > schema = StructType(columns.map(StructField(_, StringType, true))) > ) > val addSuffixUDF = udf( > (str: String) => str + "_added" > ) > implicit class DFOps(df: DataFrame) { > def addSuffix() = { > df.select(columns.map(col => > addSuffixUDF(df(col)).as(col) > ): _*) > } > } > df.addSuffix().addSuffix().addSuffix().show() > {code} > I get > {code:java} > An exception or error caused a run to abort. > java.lang.StackOverflowError > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:385) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:553) > ... > {code} > If I reduce columns number (to 10 for example) or do `addSuffix` only once - > it works fine. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25987) StackOverflowError when executing many operations on a table with many columns
[ https://issues.apache.org/jira/browse/SPARK-25987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057338#comment-17057338 ] Dongjoon Hyun commented on SPARK-25987: --- Yes. I agree that that was misleading. I removed the link. Hi, [~kiszk]. Do you know that fixed this at 3.0.0? > StackOverflowError when executing many operations on a table with many columns > -- > > Key: SPARK-25987 > URL: https://issues.apache.org/jira/browse/SPARK-25987 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.2, 2.4.0, 2.4.5 > Environment: Ubuntu 18.04.1 LTS, openjdk "1.8.0_181" >Reporter: Ivan Tsukanov >Priority: Major > > When I execute > {code:java} > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val columnsCount = 100 > val columns = (1 to columnsCount).map(i => s"col$i") > val initialData = (1 to columnsCount).map(i => s"val$i") > val df = spark.createDataFrame( > rowRDD = spark.sparkContext.makeRDD(Seq(Row.fromSeq(initialData))), > schema = StructType(columns.map(StructField(_, StringType, true))) > ) > val addSuffixUDF = udf( > (str: String) => str + "_added" > ) > implicit class DFOps(df: DataFrame) { > def addSuffix() = { > df.select(columns.map(col => > addSuffixUDF(df(col)).as(col) > ): _*) > } > } > df.addSuffix().addSuffix().addSuffix().show() > {code} > I get > {code:java} > An exception or error caused a run to abort. > java.lang.StackOverflowError > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:385) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:553) > ... > {code} > If I reduce columns number (to 10 for example) or do `addSuffix` only once - > it works fine. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29183) Upgrade JDK 11 Installation to 11.0.6
[ https://issues.apache.org/jira/browse/SPARK-29183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057337#comment-17057337 ] Dongjoon Hyun commented on SPARK-29183: --- Thank you so much, [~shaneknapp]! > Upgrade JDK 11 Installation to 11.0.6 > - > > Key: SPARK-29183 > URL: https://issues.apache.org/jira/browse/SPARK-29183 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Shane Knapp >Priority: Major > > Every JDK 11.0.x releases have many fixes including performance regression > fix. We had better upgrade it to the latest 11.0.4. > - https://bugs.java.com/bugdatabase/view_bug.do?bug_id=JDK-8221760 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-22523) Janino throws StackOverflowError on nested structs with many fields
[ https://issues.apache.org/jira/browse/SPARK-22523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-22523. - > Janino throws StackOverflowError on nested structs with many fields > --- > > Key: SPARK-22523 > URL: https://issues.apache.org/jira/browse/SPARK-22523 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 2.2.0 > Environment: * Linux > * Scala: 2.11.8 > * Spark: 2.2.0 >Reporter: Utku Demir >Priority: Minor > > When running the below application, Janino throws StackOverflow: > {code} > Exception in thread "main" java.lang.StackOverflowError > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:370) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) > {code} > Problematic code: > {code:title=Example.scala|borderStyle=solid} > import org.apache.spark.sql._ > case class Foo( > f1: Int = 0, > f2: Int = 0, > f3: Int = 0, > f4: Int = 0, > f5: Int = 0, > f6: Int = 0, > f7: Int = 0, > f8: Int = 0, > f9: Int = 0, > f10: Int = 0, > f11: Int = 0, > f12: Int = 0, > f13: Int = 0, > f14: Int = 0, > f15: Int = 0, > f16: Int = 0, > f17: Int = 0, > f18: Int = 0, > f19: Int = 0, > f20: Int = 0, > f21: Int = 0, > f22: Int = 0, > f23: Int = 0, > f24: Int = 0 > ) > case class Nest[T]( > a: T, > b: T > ) > object Nest { > def apply[T](t: T): Nest[T] = new Nest(t, t) > } > object Main { > def main(args: Array[String]) { > val spark: SparkSession = > SparkSession.builder().appName("test").master("local[*]").getOrCreate() > import spark.implicits._ > val foo = Foo(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0) > Seq.fill(10)(Nest(Nest(foo))).toDS.groupByKey(identity).count.map(s => > s).collect > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-22761) 64KB JVM bytecode limit problem with GLM
[ https://issues.apache.org/jira/browse/SPARK-22761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-22761. - > 64KB JVM bytecode limit problem with GLM > > > Key: SPARK-22761 > URL: https://issues.apache.org/jira/browse/SPARK-22761 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Alan Lai >Priority: Major > > {code:java} > GLM > {code} (presumably other mllib tools) > can throw an exception due to the 64KB JVM bytecode limit when they use with > a lot of variables/arguments (~ 2k). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22523) Janino throws StackOverflowError on nested structs with many fields
[ https://issues.apache.org/jira/browse/SPARK-22523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-22523: -- Parent: SPARK-22510 Issue Type: Sub-task (was: Bug) > Janino throws StackOverflowError on nested structs with many fields > --- > > Key: SPARK-22523 > URL: https://issues.apache.org/jira/browse/SPARK-22523 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 2.2.0 > Environment: * Linux > * Scala: 2.11.8 > * Spark: 2.2.0 >Reporter: Utku Demir >Priority: Minor > > When running the below application, Janino throws StackOverflow: > {code} > Exception in thread "main" java.lang.StackOverflowError > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:370) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) > {code} > Problematic code: > {code:title=Example.scala|borderStyle=solid} > import org.apache.spark.sql._ > case class Foo( > f1: Int = 0, > f2: Int = 0, > f3: Int = 0, > f4: Int = 0, > f5: Int = 0, > f6: Int = 0, > f7: Int = 0, > f8: Int = 0, > f9: Int = 0, > f10: Int = 0, > f11: Int = 0, > f12: Int = 0, > f13: Int = 0, > f14: Int = 0, > f15: Int = 0, > f16: Int = 0, > f17: Int = 0, > f18: Int = 0, > f19: Int = 0, > f20: Int = 0, > f21: Int = 0, > f22: Int = 0, > f23: Int = 0, > f24: Int = 0 > ) > case class Nest[T]( > a: T, > b: T > ) > object Nest { > def apply[T](t: T): Nest[T] = new Nest(t, t) > } > object Main { > def main(args: Array[String]) { > val spark: SparkSession = > SparkSession.builder().appName("test").master("local[*]").getOrCreate() > import spark.implicits._ > val foo = Foo(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0) > Seq.fill(10)(Nest(Nest(foo))).toDS.groupByKey(identity).count.map(s => > s).collect > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29183) Upgrade JDK 11 Installation to 11.0.6
[ https://issues.apache.org/jira/browse/SPARK-29183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057335#comment-17057335 ] Shane Knapp commented on SPARK-29183: - i'll get to this later this week/early next. > Upgrade JDK 11 Installation to 11.0.6 > - > > Key: SPARK-29183 > URL: https://issues.apache.org/jira/browse/SPARK-29183 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Shane Knapp >Priority: Major > > Every JDK 11.0.x releases have many fixes including performance regression > fix. We had better upgrade it to the latest 11.0.4. > - https://bugs.java.com/bugdatabase/view_bug.do?bug_id=JDK-8221760 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29183) Upgrade JDK 11 Installation to 11.0.6
[ https://issues.apache.org/jira/browse/SPARK-29183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shane Knapp reassigned SPARK-29183: --- Assignee: Shane Knapp > Upgrade JDK 11 Installation to 11.0.6 > - > > Key: SPARK-29183 > URL: https://issues.apache.org/jira/browse/SPARK-29183 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Shane Knapp >Priority: Major > > Every JDK 11.0.x releases have many fixes including performance regression > fix. We had better upgrade it to the latest 11.0.4. > - https://bugs.java.com/bugdatabase/view_bug.do?bug_id=JDK-8221760 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25987) StackOverflowError when executing many operations on a table with many columns
[ https://issues.apache.org/jira/browse/SPARK-25987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057334#comment-17057334 ] Dongjoon Hyun commented on SPARK-25987: --- Unfortunately, it seems that we need more patches from `branch-3.0`. With only Janino 3.0.11 on `branch-2.4`, it fails. > StackOverflowError when executing many operations on a table with many columns > -- > > Key: SPARK-25987 > URL: https://issues.apache.org/jira/browse/SPARK-25987 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.2, 2.4.0, 2.4.5 > Environment: Ubuntu 18.04.1 LTS, openjdk "1.8.0_181" >Reporter: Ivan Tsukanov >Priority: Major > > When I execute > {code:java} > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val columnsCount = 100 > val columns = (1 to columnsCount).map(i => s"col$i") > val initialData = (1 to columnsCount).map(i => s"val$i") > val df = spark.createDataFrame( > rowRDD = spark.sparkContext.makeRDD(Seq(Row.fromSeq(initialData))), > schema = StructType(columns.map(StructField(_, StringType, true))) > ) > val addSuffixUDF = udf( > (str: String) => str + "_added" > ) > implicit class DFOps(df: DataFrame) { > def addSuffix() = { > df.select(columns.map(col => > addSuffixUDF(df(col)).as(col) > ): _*) > } > } > df.addSuffix().addSuffix().addSuffix().show() > {code} > I get > {code:java} > An exception or error caused a run to abort. > java.lang.StackOverflowError > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:385) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:553) > ... > {code} > If I reduce columns number (to 10 for example) or do `addSuffix` only once - > it works fine. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
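A minimal spark-shell sketch for checking this kind of fix: it generalises the repro from the issue description so the number of chained projections can be varied when comparing branch-2.4 (with only Janino 3.0.11) against 3.0.0-preview2. The `stress` helper and the `depth` parameter are illustrative names, not part of the original report.
{code:scala}
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Assumes a spark-shell session, where `spark` is already defined.
val columnsCount = 100
val columns = (1 to columnsCount).map(i => s"col$i")
val initialData = (1 to columnsCount).map(i => s"val$i")

val df = spark.createDataFrame(
  spark.sparkContext.makeRDD(Seq(Row.fromSeq(initialData))),
  StructType(columns.map(StructField(_, StringType, true))))

val addSuffixUDF = udf((s: String) => s + "_added")

// Apply the UDF to every column `depth` times in a row; a larger depth means
// deeper generated code and a higher chance of hitting the StackOverflowError.
def stress(input: DataFrame, depth: Int): DataFrame =
  (1 to depth).foldLeft(input) { (acc, _) =>
    acc.select(columns.map(c => addSuffixUDF(col(c)).as(c)): _*)
  }

stress(df, 3).show()  // per the comments above: fails on 2.4.5, works on 3.0.0-preview2
{code}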
[jira] [Commented] (SPARK-25987) StackOverflowError when executing many operations on a table with many columns
[ https://issues.apache.org/jira/browse/SPARK-25987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057330#comment-17057330 ] L. C. Hsieh commented on SPARK-25987: - Because I'm not sure how this got fixed. I can only see this is superseded by "SPARK-26298 Upgrade Janino version to 3.0.11", so I'm wondering if upgrading Janino can just fix this. > StackOverflowError when executing many operations on a table with many columns > -- > > Key: SPARK-25987 > URL: https://issues.apache.org/jira/browse/SPARK-25987 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.2, 2.4.0, 2.4.5 > Environment: Ubuntu 18.04.1 LTS, openjdk "1.8.0_181" >Reporter: Ivan Tsukanov >Priority: Major > > When I execute > {code:java} > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val columnsCount = 100 > val columns = (1 to columnsCount).map(i => s"col$i") > val initialData = (1 to columnsCount).map(i => s"val$i") > val df = spark.createDataFrame( > rowRDD = spark.sparkContext.makeRDD(Seq(Row.fromSeq(initialData))), > schema = StructType(columns.map(StructField(_, StringType, true))) > ) > val addSuffixUDF = udf( > (str: String) => str + "_added" > ) > implicit class DFOps(df: DataFrame) { > def addSuffix() = { > df.select(columns.map(col => > addSuffixUDF(df(col)).as(col) > ): _*) > } > } > df.addSuffix().addSuffix().addSuffix().show() > {code} > I get > {code:java} > An exception or error caused a run to abort. > java.lang.StackOverflowError > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:385) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:553) > ... > {code} > If I reduce columns number (to 10 for example) or do `addSuffix` only once - > it works fine. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25987) StackOverflowError when executing many operations on a table with many columns
[ https://issues.apache.org/jira/browse/SPARK-25987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057327#comment-17057327 ] Dongjoon Hyun commented on SPARK-25987: --- Do you mean in `branch-2.4`? Let me check that quickly. > StackOverflowError when executing many operations on a table with many columns > -- > > Key: SPARK-25987 > URL: https://issues.apache.org/jira/browse/SPARK-25987 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.2, 2.4.0, 2.4.5 > Environment: Ubuntu 18.04.1 LTS, openjdk "1.8.0_181" >Reporter: Ivan Tsukanov >Priority: Major > > When I execute > {code:java} > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val columnsCount = 100 > val columns = (1 to columnsCount).map(i => s"col$i") > val initialData = (1 to columnsCount).map(i => s"val$i") > val df = spark.createDataFrame( > rowRDD = spark.sparkContext.makeRDD(Seq(Row.fromSeq(initialData))), > schema = StructType(columns.map(StructField(_, StringType, true))) > ) > val addSuffixUDF = udf( > (str: String) => str + "_added" > ) > implicit class DFOps(df: DataFrame) { > def addSuffix() = { > df.select(columns.map(col => > addSuffixUDF(df(col)).as(col) > ): _*) > } > } > df.addSuffix().addSuffix().addSuffix().show() > {code} > I get > {code:java} > An exception or error caused a run to abort. > java.lang.StackOverflowError > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:385) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:553) > ... > {code} > If I reduce columns number (to 10 for example) or do `addSuffix` only once - > it works fine. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31077) Remove ChiSqSelector dependency on mllib.ChiSqSelectorModel
[ https://issues.apache.org/jira/browse/SPARK-31077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-31077: Assignee: Huaxin Gao > Remove ChiSqSelector dependency on mllib.ChiSqSelectorModel > --- > > Key: SPARK-31077 > URL: https://issues.apache.org/jira/browse/SPARK-31077 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > > Currently, ChiSqSelector depends on mllib.ChiSqSelectorModel. Remove this > dependency. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31077) Remove ChiSqSelector dependency on mllib.ChiSqSelectorModel
[ https://issues.apache.org/jira/browse/SPARK-31077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-31077. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27841 [https://github.com/apache/spark/pull/27841] > Remove ChiSqSelector dependency on mllib.ChiSqSelectorModel > --- > > Key: SPARK-31077 > URL: https://issues.apache.org/jira/browse/SPARK-31077 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.1.0 > > > Currently, ChiSqSelector depends on mllib.ChiSqSelectorModel. Remove this > dependency. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31095) Upgrade netty-all to 4.1.47.Final
[ https://issues.apache.org/jira/browse/SPARK-31095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31095: -- Fix Version/s: 2.4.6 > Upgrade netty-all to 4.1.47.Final > - > > Key: SPARK-31095 > URL: https://issues.apache.org/jira/browse/SPARK-31095 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.5, 3.0.0, 3.1.0 >Reporter: Vishwas Vijaya Kumar >Assignee: Dongjoon Hyun >Priority: Major > Labels: security > Fix For: 3.0.0, 2.4.6 > > > Upgrade version of io.netty_netty-all to 4.1.44.Final > [CVE-2019-20445|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-20445] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25987) StackOverflowError when executing many operations on a table with many columns
[ https://issues.apache.org/jira/browse/SPARK-25987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057309#comment-17057309 ] L. C. Hsieh commented on SPARK-25987: - Thanks [~dongjoon]. So upgrading Janino can fix this, right? > StackOverflowError when executing many operations on a table with many columns > -- > > Key: SPARK-25987 > URL: https://issues.apache.org/jira/browse/SPARK-25987 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.2, 2.4.0, 2.4.5 > Environment: Ubuntu 18.04.1 LTS, openjdk "1.8.0_181" >Reporter: Ivan Tsukanov >Priority: Major > > When I execute > {code:java} > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val columnsCount = 100 > val columns = (1 to columnsCount).map(i => s"col$i") > val initialData = (1 to columnsCount).map(i => s"val$i") > val df = spark.createDataFrame( > rowRDD = spark.sparkContext.makeRDD(Seq(Row.fromSeq(initialData))), > schema = StructType(columns.map(StructField(_, StringType, true))) > ) > val addSuffixUDF = udf( > (str: String) => str + "_added" > ) > implicit class DFOps(df: DataFrame) { > def addSuffix() = { > df.select(columns.map(col => > addSuffixUDF(df(col)).as(col) > ): _*) > } > } > df.addSuffix().addSuffix().addSuffix().show() > {code} > I get > {code:java} > An exception or error caused a run to abort. > java.lang.StackOverflowError > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:385) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:553) > ... > {code} > If I reduce columns number (to 10 for example) or do `addSuffix` only once - > it works fine. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25987) StackOverflowError when executing many operations on a table with many columns
[ https://issues.apache.org/jira/browse/SPARK-25987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057301#comment-17057301 ] Dongjoon Hyun edited comment on SPARK-25987 at 3/11/20, 6:29 PM: - Thank you for commenting, [~viirya]. I confirmed that the above example is fixed at 3.0.0-preview2 while 2.4.5 still has this bug. was (Author: dongjoon): Thank you for commenting, [~viirya]. I confirmed that this is fixed at 3.0.0-preview2 while 2.4.5 still has this bug. > StackOverflowError when executing many operations on a table with many columns > -- > > Key: SPARK-25987 > URL: https://issues.apache.org/jira/browse/SPARK-25987 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.2, 2.4.0, 2.4.5 > Environment: Ubuntu 18.04.1 LTS, openjdk "1.8.0_181" >Reporter: Ivan Tsukanov >Priority: Major > > When I execute > {code:java} > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val columnsCount = 100 > val columns = (1 to columnsCount).map(i => s"col$i") > val initialData = (1 to columnsCount).map(i => s"val$i") > val df = spark.createDataFrame( > rowRDD = spark.sparkContext.makeRDD(Seq(Row.fromSeq(initialData))), > schema = StructType(columns.map(StructField(_, StringType, true))) > ) > val addSuffixUDF = udf( > (str: String) => str + "_added" > ) > implicit class DFOps(df: DataFrame) { > def addSuffix() = { > df.select(columns.map(col => > addSuffixUDF(df(col)).as(col) > ): _*) > } > } > df.addSuffix().addSuffix().addSuffix().show() > {code} > I get > {code:java} > An exception or error caused a run to abort. > java.lang.StackOverflowError > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:385) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:553) > ... > {code} > If I reduce columns number (to 10 for example) or do `addSuffix` only once - > it works fine. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25987) StackOverflowError when executing many operations on a table with many columns
[ https://issues.apache.org/jira/browse/SPARK-25987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-25987. --- Resolution: Duplicate > StackOverflowError when executing many operations on a table with many columns > -- > > Key: SPARK-25987 > URL: https://issues.apache.org/jira/browse/SPARK-25987 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.2, 2.4.0, 3.0.0 > Environment: Ubuntu 18.04.1 LTS, openjdk "1.8.0_181" >Reporter: Ivan Tsukanov >Priority: Major > > When I execute > {code:java} > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val columnsCount = 100 > val columns = (1 to columnsCount).map(i => s"col$i") > val initialData = (1 to columnsCount).map(i => s"val$i") > val df = spark.createDataFrame( > rowRDD = spark.sparkContext.makeRDD(Seq(Row.fromSeq(initialData))), > schema = StructType(columns.map(StructField(_, StringType, true))) > ) > val addSuffixUDF = udf( > (str: String) => str + "_added" > ) > implicit class DFOps(df: DataFrame) { > def addSuffix() = { > df.select(columns.map(col => > addSuffixUDF(df(col)).as(col) > ): _*) > } > } > df.addSuffix().addSuffix().addSuffix().show() > {code} > I get > {code:java} > An exception or error caused a run to abort. > java.lang.StackOverflowError > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:385) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:553) > ... > {code} > If I reduce columns number (to 10 for example) or do `addSuffix` only once - > it works fine. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25987) StackOverflowError when executing many operations on a table with many columns
[ https://issues.apache.org/jira/browse/SPARK-25987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25987: -- Affects Version/s: (was: 3.0.0) 2.4.5 > StackOverflowError when executing many operations on a table with many columns > -- > > Key: SPARK-25987 > URL: https://issues.apache.org/jira/browse/SPARK-25987 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.2, 2.4.0, 2.4.5 > Environment: Ubuntu 18.04.1 LTS, openjdk "1.8.0_181" >Reporter: Ivan Tsukanov >Priority: Major > > When I execute > {code:java} > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val columnsCount = 100 > val columns = (1 to columnsCount).map(i => s"col$i") > val initialData = (1 to columnsCount).map(i => s"val$i") > val df = spark.createDataFrame( > rowRDD = spark.sparkContext.makeRDD(Seq(Row.fromSeq(initialData))), > schema = StructType(columns.map(StructField(_, StringType, true))) > ) > val addSuffixUDF = udf( > (str: String) => str + "_added" > ) > implicit class DFOps(df: DataFrame) { > def addSuffix() = { > df.select(columns.map(col => > addSuffixUDF(df(col)).as(col) > ): _*) > } > } > df.addSuffix().addSuffix().addSuffix().show() > {code} > I get > {code:java} > An exception or error caused a run to abort. > java.lang.StackOverflowError > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:385) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:553) > ... > {code} > If I reduce columns number (to 10 for example) or do `addSuffix` only once - > it works fine. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25987) StackOverflowError when executing many operations on a table with many columns
[ https://issues.apache.org/jira/browse/SPARK-25987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057301#comment-17057301 ] Dongjoon Hyun commented on SPARK-25987: --- Thank you for commenting, [~viirya]. I confirmed that this is fixed at 3.0.0-preview2 while 2.4.5 still has this bug. > StackOverflowError when executing many operations on a table with many columns > -- > > Key: SPARK-25987 > URL: https://issues.apache.org/jira/browse/SPARK-25987 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.2, 2.4.0, 3.0.0 > Environment: Ubuntu 18.04.1 LTS, openjdk "1.8.0_181" >Reporter: Ivan Tsukanov >Priority: Major > > When I execute > {code:java} > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val columnsCount = 100 > val columns = (1 to columnsCount).map(i => s"col$i") > val initialData = (1 to columnsCount).map(i => s"val$i") > val df = spark.createDataFrame( > rowRDD = spark.sparkContext.makeRDD(Seq(Row.fromSeq(initialData))), > schema = StructType(columns.map(StructField(_, StringType, true))) > ) > val addSuffixUDF = udf( > (str: String) => str + "_added" > ) > implicit class DFOps(df: DataFrame) { > def addSuffix() = { > df.select(columns.map(col => > addSuffixUDF(df(col)).as(col) > ): _*) > } > } > df.addSuffix().addSuffix().addSuffix().show() > {code} > I get > {code:java} > An exception or error caused a run to abort. > java.lang.StackOverflowError > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:385) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:553) > ... > {code} > If I reduce columns number (to 10 for example) or do `addSuffix` only once - > it works fine. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29295) Duplicate result when dropping partition of an external table and then overwriting
[ https://issues.apache.org/jira/browse/SPARK-29295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057283#comment-17057283 ] Dongjoon Hyun commented on SPARK-29295: --- I confirmed that Apache Spark 2.1.3 and older versions have no problem. > Duplicate result when dropping partition of an external table and then > overwriting > -- > > Key: SPARK-29295 > URL: https://issues.apache.org/jira/browse/SPARK-29295 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.3, 2.3.4, 2.4.4 >Reporter: feiwang >Assignee: L. C. Hsieh >Priority: Major > Labels: correctness > Fix For: 3.0.0 > > > When we drop a partition of a external table and then overwrite it, if we set > CONVERT_METASTORE_PARQUET=true(default value), it will overwrite this > partition. > But when we set CONVERT_METASTORE_PARQUET=false, it will give duplicate > result. > Here is a reproduce code below(you can add it into SQLQuerySuite in hive > module): > {code:java} > test("spark gives duplicate result when dropping a partition of an external > partitioned table" + > " firstly and they overwrite it") { > withTable("test") { > withTempDir { f => > sql("create external table test(id int) partitioned by (name string) > stored as " + > s"parquet location '${f.getAbsolutePath}'") > withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> > false.toString) { > sql("insert overwrite table test partition(name='n1') select 1") > sql("ALTER TABLE test DROP PARTITION(name='n1')") > sql("insert overwrite table test partition(name='n1') select 2") > checkAnswer( sql("select id from test where name = 'n1' order by > id"), > Array(Row(1), Row(2))) > } > withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> true.toString) > { > sql("insert overwrite table test partition(name='n1') select 1") > sql("ALTER TABLE test DROP PARTITION(name='n1')") > sql("insert overwrite table test partition(name='n1') select 2") > checkAnswer( sql("select id from test where name = 'n1' order by > id"), > Array(Row(2))) > } > } > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
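Until a branch-2.4 backport lands, a minimal mitigation sketch for the affected 2.x versions, based only on the repro in the issue description, is to keep the conversion flag at its default of true (or set it explicitly) before the overwrite; the table and partition names below are the ones from the repro and purely illustrative.
{code:scala}
// Keep Parquet conversion enabled (the default) so the INSERT OVERWRITE goes
// through Spark's datasource path, which replaces the partition as expected.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true")

spark.sql("INSERT OVERWRITE TABLE test PARTITION (name='n1') SELECT 2")
spark.sql("SELECT id FROM test WHERE name = 'n1' ORDER BY id").show()  // expect a single row: 2
{code}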
[jira] [Updated] (SPARK-29295) Duplicate result when dropping partition of an external table and then overwriting
[ https://issues.apache.org/jira/browse/SPARK-29295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29295: -- Affects Version/s: (was: 2.4.4) 2.4.5 > Duplicate result when dropping partition of an external table and then > overwriting > -- > > Key: SPARK-29295 > URL: https://issues.apache.org/jira/browse/SPARK-29295 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.3, 2.3.4, 2.4.5 >Reporter: feiwang >Assignee: L. C. Hsieh >Priority: Major > Labels: correctness > Fix For: 3.0.0 > > > When we drop a partition of a external table and then overwrite it, if we set > CONVERT_METASTORE_PARQUET=true(default value), it will overwrite this > partition. > But when we set CONVERT_METASTORE_PARQUET=false, it will give duplicate > result. > Here is a reproduce code below(you can add it into SQLQuerySuite in hive > module): > {code:java} > test("spark gives duplicate result when dropping a partition of an external > partitioned table" + > " firstly and they overwrite it") { > withTable("test") { > withTempDir { f => > sql("create external table test(id int) partitioned by (name string) > stored as " + > s"parquet location '${f.getAbsolutePath}'") > withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> > false.toString) { > sql("insert overwrite table test partition(name='n1') select 1") > sql("ALTER TABLE test DROP PARTITION(name='n1')") > sql("insert overwrite table test partition(name='n1') select 2") > checkAnswer( sql("select id from test where name = 'n1' order by > id"), > Array(Row(1), Row(2))) > } > withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> true.toString) > { > sql("insert overwrite table test partition(name='n1') select 1") > sql("ALTER TABLE test DROP PARTITION(name='n1')") > sql("insert overwrite table test partition(name='n1') select 2") > checkAnswer( sql("select id from test where name = 'n1' order by > id"), > Array(Row(2))) > } > } > } > } > {code} > {code} > create external table test(id int) partitioned by (name string) stored as > parquet location '/tmp/p'; > set spark.sql.hive.convertMetastoreParquet=false; > insert overwrite table test partition(name='n1') select 1; > ALTER TABLE test DROP PARTITION(name='n1'); > insert overwrite table test partition(name='n1') select 2; > select id from test where name = 'n1' order by id; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29295) Duplicate result when dropping partition of an external table and then overwriting
[ https://issues.apache.org/jira/browse/SPARK-29295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29295: -- Description: When we drop a partition of a external table and then overwrite it, if we set CONVERT_METASTORE_PARQUET=true(default value), it will overwrite this partition. But when we set CONVERT_METASTORE_PARQUET=false, it will give duplicate result. Here is a reproduce code below(you can add it into SQLQuerySuite in hive module): {code:java} test("spark gives duplicate result when dropping a partition of an external partitioned table" + " firstly and they overwrite it") { withTable("test") { withTempDir { f => sql("create external table test(id int) partitioned by (name string) stored as " + s"parquet location '${f.getAbsolutePath}'") withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> false.toString) { sql("insert overwrite table test partition(name='n1') select 1") sql("ALTER TABLE test DROP PARTITION(name='n1')") sql("insert overwrite table test partition(name='n1') select 2") checkAnswer( sql("select id from test where name = 'n1' order by id"), Array(Row(1), Row(2))) } withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> true.toString) { sql("insert overwrite table test partition(name='n1') select 1") sql("ALTER TABLE test DROP PARTITION(name='n1')") sql("insert overwrite table test partition(name='n1') select 2") checkAnswer( sql("select id from test where name = 'n1' order by id"), Array(Row(2))) } } } } {code} {code} create external table test(id int) partitioned by (name string) stored as parquet location '/tmp/p'; set spark.sql.hive.convertMetastoreParquet=false; insert overwrite table test partition(name='n1') select 1; ALTER TABLE test DROP PARTITION(name='n1'); insert overwrite table test partition(name='n1') select 2; select id from test where name = 'n1' order by id; {code} was: When we drop a partition of a external table and then overwrite it, if we set CONVERT_METASTORE_PARQUET=true(default value), it will overwrite this partition. But when we set CONVERT_METASTORE_PARQUET=false, it will give duplicate result. Here is a reproduce code below(you can add it into SQLQuerySuite in hive module): {code:java} test("spark gives duplicate result when dropping a partition of an external partitioned table" + " firstly and they overwrite it") { withTable("test") { withTempDir { f => sql("create external table test(id int) partitioned by (name string) stored as " + s"parquet location '${f.getAbsolutePath}'") withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> false.toString) { sql("insert overwrite table test partition(name='n1') select 1") sql("ALTER TABLE test DROP PARTITION(name='n1')") sql("insert overwrite table test partition(name='n1') select 2") checkAnswer( sql("select id from test where name = 'n1' order by id"), Array(Row(1), Row(2))) } withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> true.toString) { sql("insert overwrite table test partition(name='n1') select 1") sql("ALTER TABLE test DROP PARTITION(name='n1')") sql("insert overwrite table test partition(name='n1') select 2") checkAnswer( sql("select id from test where name = 'n1' order by id"), Array(Row(2))) } } } } {code} > Duplicate result when dropping partition of an external table and then > overwriting > -- > > Key: SPARK-29295 > URL: https://issues.apache.org/jira/browse/SPARK-29295 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.3, 2.3.4, 2.4.4 >Reporter: feiwang >Assignee: L. C. 
Hsieh >Priority: Major > Labels: correctness > Fix For: 3.0.0 > > > When we drop a partition of a external table and then overwrite it, if we set > CONVERT_METASTORE_PARQUET=true(default value), it will overwrite this > partition. > But when we set CONVERT_METASTORE_PARQUET=false, it will give duplicate > result. > Here is a reproduce code below(you can add it into SQLQuerySuite in hive > module): > {code:java} > test("spark gives duplicate result when dropping a partition of an external > partitioned table" + > " firstly and they overwrite it") { > withTable("test") { > withTempDir { f => > sql("create external table test(id int) partitioned by (name string) > stored as " + > s"parquet location '${f.getAbsolutePath}'") > withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key ->
[jira] [Updated] (SPARK-29295) Duplicate result when dropping partition of an external table and then overwriting
[ https://issues.apache.org/jira/browse/SPARK-29295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29295: -- Affects Version/s: 2.2.3 > Duplicate result when dropping partition of an external table and then > overwriting > -- > > Key: SPARK-29295 > URL: https://issues.apache.org/jira/browse/SPARK-29295 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.3, 2.3.4, 2.4.4 >Reporter: feiwang >Assignee: L. C. Hsieh >Priority: Major > Labels: correctness > Fix For: 3.0.0 > > > When we drop a partition of a external table and then overwrite it, if we set > CONVERT_METASTORE_PARQUET=true(default value), it will overwrite this > partition. > But when we set CONVERT_METASTORE_PARQUET=false, it will give duplicate > result. > Here is a reproduce code below(you can add it into SQLQuerySuite in hive > module): > {code:java} > test("spark gives duplicate result when dropping a partition of an external > partitioned table" + > " firstly and they overwrite it") { > withTable("test") { > withTempDir { f => > sql("create external table test(id int) partitioned by (name string) > stored as " + > s"parquet location '${f.getAbsolutePath}'") > withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> > false.toString) { > sql("insert overwrite table test partition(name='n1') select 1") > sql("ALTER TABLE test DROP PARTITION(name='n1')") > sql("insert overwrite table test partition(name='n1') select 2") > checkAnswer( sql("select id from test where name = 'n1' order by > id"), > Array(Row(1), Row(2))) > } > withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> true.toString) > { > sql("insert overwrite table test partition(name='n1') select 1") > sql("ALTER TABLE test DROP PARTITION(name='n1')") > sql("insert overwrite table test partition(name='n1') select 2") > checkAnswer( sql("select id from test where name = 'n1' order by > id"), > Array(Row(2))) > } > } > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25987) StackOverflowError when executing many operations on a table with many columns
[ https://issues.apache.org/jira/browse/SPARK-25987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057280#comment-17057280 ] L. C. Hsieh commented on SPARK-25987: - Looks like Janino was upgraded, is this still an issue in 3.0? > StackOverflowError when executing many operations on a table with many columns > -- > > Key: SPARK-25987 > URL: https://issues.apache.org/jira/browse/SPARK-25987 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.2, 2.4.0, 3.0.0 > Environment: Ubuntu 18.04.1 LTS, openjdk "1.8.0_181" >Reporter: Ivan Tsukanov >Priority: Major > > When I execute > {code:java} > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val columnsCount = 100 > val columns = (1 to columnsCount).map(i => s"col$i") > val initialData = (1 to columnsCount).map(i => s"val$i") > val df = spark.createDataFrame( > rowRDD = spark.sparkContext.makeRDD(Seq(Row.fromSeq(initialData))), > schema = StructType(columns.map(StructField(_, StringType, true))) > ) > val addSuffixUDF = udf( > (str: String) => str + "_added" > ) > implicit class DFOps(df: DataFrame) { > def addSuffix() = { > df.select(columns.map(col => > addSuffixUDF(df(col)).as(col) > ): _*) > } > } > df.addSuffix().addSuffix().addSuffix().show() > {code} > I get > {code:java} > An exception or error caused a run to abort. > java.lang.StackOverflowError > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:385) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:553) > ... > {code} > If I reduce columns number (to 10 for example) or do `addSuffix` only once - > it works fine. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29295) Duplicate result when dropping partition of an external table and then overwriting
[ https://issues.apache.org/jira/browse/SPARK-29295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29295: -- Affects Version/s: 2.3.4 > Duplicate result when dropping partition of an external table and then > overwriting > -- > > Key: SPARK-29295 > URL: https://issues.apache.org/jira/browse/SPARK-29295 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.4 >Reporter: feiwang >Assignee: L. C. Hsieh >Priority: Major > Labels: correctness > Fix For: 3.0.0 > > > When we drop a partition of a external table and then overwrite it, if we set > CONVERT_METASTORE_PARQUET=true(default value), it will overwrite this > partition. > But when we set CONVERT_METASTORE_PARQUET=false, it will give duplicate > result. > Here is a reproduce code below(you can add it into SQLQuerySuite in hive > module): > {code:java} > test("spark gives duplicate result when dropping a partition of an external > partitioned table" + > " firstly and they overwrite it") { > withTable("test") { > withTempDir { f => > sql("create external table test(id int) partitioned by (name string) > stored as " + > s"parquet location '${f.getAbsolutePath}'") > withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> > false.toString) { > sql("insert overwrite table test partition(name='n1') select 1") > sql("ALTER TABLE test DROP PARTITION(name='n1')") > sql("insert overwrite table test partition(name='n1') select 2") > checkAnswer( sql("select id from test where name = 'n1' order by > id"), > Array(Row(1), Row(2))) > } > withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> true.toString) > { > sql("insert overwrite table test partition(name='n1') select 1") > sql("ALTER TABLE test DROP PARTITION(name='n1')") > sql("insert overwrite table test partition(name='n1') select 2") > checkAnswer( sql("select id from test where name = 'n1' order by > id"), > Array(Row(2))) > } > } > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29295) Duplicate result when dropping partition of an external table and then overwriting
[ https://issues.apache.org/jira/browse/SPARK-29295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057260#comment-17057260 ] Dongjoon Hyun commented on SPARK-29295: --- I marked this as a `correctness` issue. > Duplicate result when dropping partition of an external table and then > overwriting > -- > > Key: SPARK-29295 > URL: https://issues.apache.org/jira/browse/SPARK-29295 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: feiwang >Assignee: L. C. Hsieh >Priority: Major > Labels: correctness > Fix For: 3.0.0 > > > When we drop a partition of a external table and then overwrite it, if we set > CONVERT_METASTORE_PARQUET=true(default value), it will overwrite this > partition. > But when we set CONVERT_METASTORE_PARQUET=false, it will give duplicate > result. > Here is a reproduce code below(you can add it into SQLQuerySuite in hive > module): > {code:java} > test("spark gives duplicate result when dropping a partition of an external > partitioned table" + > " firstly and they overwrite it") { > withTable("test") { > withTempDir { f => > sql("create external table test(id int) partitioned by (name string) > stored as " + > s"parquet location '${f.getAbsolutePath}'") > withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> > false.toString) { > sql("insert overwrite table test partition(name='n1') select 1") > sql("ALTER TABLE test DROP PARTITION(name='n1')") > sql("insert overwrite table test partition(name='n1') select 2") > checkAnswer( sql("select id from test where name = 'n1' order by > id"), > Array(Row(1), Row(2))) > } > withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> true.toString) > { > sql("insert overwrite table test partition(name='n1') select 1") > sql("ALTER TABLE test DROP PARTITION(name='n1')") > sql("insert overwrite table test partition(name='n1') select 2") > checkAnswer( sql("select id from test where name = 'n1' order by > id"), > Array(Row(2))) > } > } > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29295) Duplicate result when dropping partition of an external table and then overwriting
[ https://issues.apache.org/jira/browse/SPARK-29295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057261#comment-17057261 ] Dongjoon Hyun commented on SPARK-29295: --- Hi, [~viirya]. Could you make a backport against branch-2.4? > Duplicate result when dropping partition of an external table and then > overwriting > -- > > Key: SPARK-29295 > URL: https://issues.apache.org/jira/browse/SPARK-29295 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: feiwang >Assignee: L. C. Hsieh >Priority: Major > Labels: correctness > Fix For: 3.0.0 > > > When we drop a partition of a external table and then overwrite it, if we set > CONVERT_METASTORE_PARQUET=true(default value), it will overwrite this > partition. > But when we set CONVERT_METASTORE_PARQUET=false, it will give duplicate > result. > Here is a reproduce code below(you can add it into SQLQuerySuite in hive > module): > {code:java} > test("spark gives duplicate result when dropping a partition of an external > partitioned table" + > " firstly and they overwrite it") { > withTable("test") { > withTempDir { f => > sql("create external table test(id int) partitioned by (name string) > stored as " + > s"parquet location '${f.getAbsolutePath}'") > withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> > false.toString) { > sql("insert overwrite table test partition(name='n1') select 1") > sql("ALTER TABLE test DROP PARTITION(name='n1')") > sql("insert overwrite table test partition(name='n1') select 2") > checkAnswer( sql("select id from test where name = 'n1' order by > id"), > Array(Row(1), Row(2))) > } > withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> true.toString) > { > sql("insert overwrite table test partition(name='n1') select 1") > sql("ALTER TABLE test DROP PARTITION(name='n1')") > sql("insert overwrite table test partition(name='n1') select 2") > checkAnswer( sql("select id from test where name = 'n1' order by > id"), > Array(Row(2))) > } > } > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29295) Duplicate result when dropping partition of an external table and then overwriting
[ https://issues.apache.org/jira/browse/SPARK-29295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29295: -- Labels: correctness (was: ) > Duplicate result when dropping partition of an external table and then > overwriting > -- > > Key: SPARK-29295 > URL: https://issues.apache.org/jira/browse/SPARK-29295 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: feiwang >Assignee: L. C. Hsieh >Priority: Major > Labels: correctness > Fix For: 3.0.0 > > > When we drop a partition of a external table and then overwrite it, if we set > CONVERT_METASTORE_PARQUET=true(default value), it will overwrite this > partition. > But when we set CONVERT_METASTORE_PARQUET=false, it will give duplicate > result. > Here is a reproduce code below(you can add it into SQLQuerySuite in hive > module): > {code:java} > test("spark gives duplicate result when dropping a partition of an external > partitioned table" + > " firstly and they overwrite it") { > withTable("test") { > withTempDir { f => > sql("create external table test(id int) partitioned by (name string) > stored as " + > s"parquet location '${f.getAbsolutePath}'") > withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> > false.toString) { > sql("insert overwrite table test partition(name='n1') select 1") > sql("ALTER TABLE test DROP PARTITION(name='n1')") > sql("insert overwrite table test partition(name='n1') select 2") > checkAnswer( sql("select id from test where name = 'n1' order by > id"), > Array(Row(1), Row(2))) > } > withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> true.toString) > { > sql("insert overwrite table test partition(name='n1') select 1") > sql("ALTER TABLE test DROP PARTITION(name='n1')") > sql("insert overwrite table test partition(name='n1') select 2") > checkAnswer( sql("select id from test where name = 'n1' order by > id"), > Array(Row(2))) > } > } > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30989) TABLE.COLUMN reference doesn't work with new columns created by UDF
[ https://issues.apache.org/jira/browse/SPARK-30989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057255#comment-17057255 ] hemanth meka commented on SPARK-30989: -- The alias "cat" is defined as a dataframe having 2 columns, "x" and "y". The column "z" is generated from "cat" into a new dataframe "df2", but the code below works, and hence this exception looks like it should be the expected behaviour, shouldn't it? df2.select("z") > TABLE.COLUMN reference doesn't work with new columns created by UDF > --- > > Key: SPARK-30989 > URL: https://issues.apache.org/jira/browse/SPARK-30989 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: Chris Suchanek >Priority: Major > > When a dataframe is created with an alias (`.as("...")`) its columns can be > referred as `TABLE.COLUMN` but it doesn't work for newly created columns with > UDF. > {code:java} > // code placeholder > df1 = sc.parallelize(l).toDF("x","y").as("cat") > val squared = udf((s: Int) => s * s) > val df2 = df1.withColumn("z", squared(col("y"))) > df2.columns //Array[String] = Array(x, y, z) > df2.select("cat.x") // works > df2.select("cat.z") // Doesn't work > // org.apache.spark.sql.AnalysisException: cannot resolve '`cat.z`' given > input > // columns: [cat.x, cat.y, z];; > {code} > Might be related to: https://issues.apache.org/jira/browse/SPARK-30532 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
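A minimal sketch of the behaviour being discussed, assuming a spark-shell where `spark` is in scope: columns that existed when the alias was applied keep the `cat` qualifier, the UDF-generated column resolves only unqualified, and re-aliasing the derived dataframe (the `t` alias here is purely illustrative, not from the issue) should qualify the new column as well.
{code:scala}
import spark.implicits._
import org.apache.spark.sql.functions.{col, udf}

val df1 = Seq((1, 2), (3, 4)).toDF("x", "y").as("cat")
val squared = udf((s: Int) => s * s)
val df2 = df1.withColumn("z", squared(col("y")))

df2.select("cat.x").show()  // works: x came from the aliased relation
df2.select("z").show()      // works: unqualified reference to the new column
// df2.select("cat.z")      // fails with AnalysisException, as reported in this issue

df2.as("t").select("t.z").show()  // re-aliasing the derived dataframe qualifies z as well
{code}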
[jira] [Updated] (SPARK-25193) insert overwrite doesn't throw exception when drop old data fails
[ https://issues.apache.org/jira/browse/SPARK-25193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25193: -- Parent: SPARK-30034 Issue Type: Sub-task (was: Bug) > insert overwrite doesn't throw exception when drop old data fails > - > > Key: SPARK-25193 > URL: https://issues.apache.org/jira/browse/SPARK-25193 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: chen xiao >Priority: Major > Labels: correctness > > dataframe.write.mode(SaveMode.Overwrite).insertInto(s"$databaseName.$tableName") > Insert overwrite mode will drop old data in hive table if there's old data. > But if data deleting fails, no exception will be thrown and the data folder > will be like: > hdfs://uxs_nbp/nba_score/dt=2018-08-15/seq_num=2/part-0 > hdfs://uxs_nbp/nba_score/dt=2018-08-15/seq_num=2/part-01534916642513. > Two copies of data will be kept. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25193) insert overwrite doesn't throw exception when drop old data fails
[ https://issues.apache.org/jira/browse/SPARK-25193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057252#comment-17057252 ] Dongjoon Hyun edited comment on SPARK-25193 at 3/11/20, 5:37 PM: - This is fixed at 3.0.0 via SPARK-30034 after SPARK-23710 was (Author: dongjoon): This is fixed at 3.0.0 via SPARK-23710 > insert overwrite doesn't throw exception when drop old data fails > - > > Key: SPARK-25193 > URL: https://issues.apache.org/jira/browse/SPARK-25193 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: chen xiao >Priority: Major > Labels: correctness > > dataframe.write.mode(SaveMode.Overwrite).insertInto(s"$databaseName.$tableName") > Insert overwrite mode will drop old data in hive table if there's old data. > But if data deleting fails, no exception will be thrown and the data folder > will be like: > hdfs://uxs_nbp/nba_score/dt=2018-08-15/seq_num=2/part-0 > hdfs://uxs_nbp/nba_score/dt=2018-08-15/seq_num=2/part-01534916642513. > Two copies of data will be kept. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25193) insert overwrite doesn't throw exception when drop old data fails
[ https://issues.apache.org/jira/browse/SPARK-25193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-25193. --- Resolution: Duplicate This is fixed at 3.0.0 via SPARK-23710 > insert overwrite doesn't throw exception when drop old data fails > - > > Key: SPARK-25193 > URL: https://issues.apache.org/jira/browse/SPARK-25193 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: chen xiao >Priority: Major > Labels: correctness > > dataframe.write.mode(SaveMode.Overwrite).insertInto(s"$databaseName.$tableName") > Insert overwrite mode will drop old data in hive table if there's old data. > But if data deleting fails, no exception will be thrown and the data folder > will be like: > hdfs://uxs_nbp/nba_score/dt=2018-08-15/seq_num=2/part-0 > hdfs://uxs_nbp/nba_score/dt=2018-08-15/seq_num=2/part-01534916642513. > Two copies of data will be kept. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-25193) insert overwrite doesn't throw exception when drop old data fails
[ https://issues.apache.org/jira/browse/SPARK-25193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reopened SPARK-25193: --- > insert overwrite doesn't throw exception when drop old data fails > - > > Key: SPARK-25193 > URL: https://issues.apache.org/jira/browse/SPARK-25193 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: chen xiao >Priority: Major > Labels: correctness > > dataframe.write.mode(SaveMode.Overwrite).insertInto(s"$databaseName.$tableName") > Insert overwrite mode will drop old data in hive table if there's old data. > But if data deleting fails, no exception will be thrown and the data folder > will be like: > hdfs://uxs_nbp/nba_score/dt=2018-08-15/seq_num=2/part-0 > hdfs://uxs_nbp/nba_score/dt=2018-08-15/seq_num=2/part-01534916642513. > Two copies of data will be kept. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25193) insert overwrite doesn't throw exception when drop old data fails
[ https://issues.apache.org/jira/browse/SPARK-25193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057251#comment-17057251 ] Dongjoon Hyun commented on SPARK-25193: --- I marked this as a correctness issue because the result after insertion will be incorrect due to the old data. > insert overwrite doesn't throw exception when drop old data fails > - > > Key: SPARK-25193 > URL: https://issues.apache.org/jira/browse/SPARK-25193 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: chen xiao >Priority: Major > Labels: correctness > > dataframe.write.mode(SaveMode.Overwrite).insertInto(s"$databaseName.$tableName") > Insert overwrite mode will drop old data in hive table if there's old data. > But if data deleting fails, no exception will be thrown and the data folder > will be like: > hdfs://uxs_nbp/nba_score/dt=2018-08-15/seq_num=2/part-0 > hdfs://uxs_nbp/nba_score/dt=2018-08-15/seq_num=2/part-01534916642513. > Two copies of data will be kept. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25193) insert overwrite doesn't throw exception when drop old data fails
[ https://issues.apache.org/jira/browse/SPARK-25193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25193: -- Labels: correctness (was: bulk-closed) > insert overwrite doesn't throw exception when drop old data fails > - > > Key: SPARK-25193 > URL: https://issues.apache.org/jira/browse/SPARK-25193 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: chen xiao >Priority: Major > Labels: correctness > > dataframe.write.mode(SaveMode.Overwrite).insertInto(s"$databaseName.$tableName") > Insert overwrite mode will drop old data in hive table if there's old data. > But if data deleting fails, no exception will be thrown and the data folder > will be like: > hdfs://uxs_nbp/nba_score/dt=2018-08-15/seq_num=2/part-0 > hdfs://uxs_nbp/nba_score/dt=2018-08-15/seq_num=2/part-01534916642513. > Two copies of data will be kept. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31099) Create migration script for metastore_db
[ https://issues.apache.org/jira/browse/SPARK-31099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31099: -- Parent: SPARK-30034 Issue Type: Sub-task (was: Improvement) > Create migration script for metastore_db > > > Key: SPARK-31099 > URL: https://issues.apache.org/jira/browse/SPARK-31099 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > When an existing Derby database exists (in ./metastore_db) created by Hive > 1.2.x profile, it'll fail to upgrade itself to the Hive 2.3.x profile. > Repro steps: > 1. Build OSS or DBR master with SBT with -Phive-1.2 -Phive > -Phive-thriftserver. Make sure there's no existing ./metastore_db directory > in the repo. > 2. Run bin/spark-shell, and then spark.sql("show databases"). This will > populate the ./metastore_db directory, where the Derby-based metastore > database is hosted. This database is populated from Hive 1.2.x. > 3. Re-build OSS or DBR master with SBT with -Phive -Phive-thriftserver (drops > the Hive 1.2 profile, which makes it use the default Hive 2.3 profile) > 4. Repeat Step (2) above. This will trigger Hive 2.3.x to load the Derby > database created in Step (2), which triggers an upgrade step, and that's > where the following error will be reported. > 5. Delete the ./metastore_db and re-run Step (4). The error is no longer > reported. > {code:java} > 20/03/09 13:57:04 ERROR Datastore: Error thrown executing ALTER TABLE TBLS > ADD IS_REWRITE_ENABLED CHAR(1) NOT NULL CHECK (IS_REWRITE_ENABLED IN > ('Y','N')) : In an ALTER TABLE statement, the column 'IS_REWRITE_ENABLED' has > been specified as NOT NULL and either the DEFAULT clause was not specified or > was specified as DEFAULT NULL. > java.sql.SQLSyntaxErrorException: In an ALTER TABLE statement, the column > 'IS_REWRITE_ENABLED' has been specified as NOT NULL and either the DEFAULT > clause was not specified or was specified as DEFAULT NULL. 
> at > org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source) > at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown > Source) > at > org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown > Source) > at > org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown > Source) > at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown > Source) > at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown > Source) > at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) > at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) > at com.jolbox.bonecp.StatementHandle.execute(StatementHandle.java:254) > at > org.datanucleus.store.rdbms.table.AbstractTable.executeDdlStatement(AbstractTable.java:879) > at > org.datanucleus.store.rdbms.table.AbstractTable.executeDdlStatementList(AbstractTable.java:830) > at > org.datanucleus.store.rdbms.table.TableImpl.validateColumns(TableImpl.java:257) > at > org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.performTablesValidation(RDBMSStoreManager.java:3398) > at > org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.run(RDBMSStoreManager.java:2896) > at > org.datanucleus.store.rdbms.AbstractSchemaTransaction.execute(AbstractSchemaTransaction.java:119) > at > org.datanucleus.store.rdbms.RDBMSStoreManager.manageClasses(RDBMSStoreManager.java:1627) > at > org.datanucleus.store.rdbms.RDBMSStoreManager.getDatastoreClass(RDBMSStoreManager.java:672) > at > org.datanucleus.store.rdbms.query.RDBMSQueryUtils.getStatementForCandidates(RDBMSQueryUtils.java:425) > at > org.datanucleus.store.rdbms.query.JDOQLQuery.compileQueryFull(JDOQLQuery.java:865) > at > org.datanucleus.store.rdbms.query.JDOQLQuery.compileInternal(JDOQLQuery.java:347) > at org.datanucleus.store.query.Query.executeQuery(Query.java:1816) > at org.datanucleus.store.query.Query.executeWithArray(Query.java:1744) > at org.datanucleus.store.query.Query.execute(Query.java:1726) > at org.datanucleus.api.jdo.JDOQuery.executeInternal(JDOQuery.java:374) > at org.datanucleus.api.jdo.JDOQuery.execute(JDOQuery.java:216) > at > org.apache.hadoop.hive.metastore.MetaStoreDirectSql.ensureDbInit(MetaStoreDirectSql.java:184) > at > org.apache.hadoop.hive.metastore.MetaStoreDirectSql.(MetaStoreDirectSql.java:144) > at > org.apache.hadoop.hive.metastore.ObjectStore.initializeHelper(ObjectStore.java:410) > at > org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:342) > at > org.apache.hadoop.hive.metastore.ObjectStore.setC
[jira] [Commented] (SPARK-30565) Regression in the ORC benchmark
[ https://issues.apache.org/jira/browse/SPARK-30565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057224#comment-17057224 ] Peter Toth commented on SPARK-30565: I looked into this and the performance drop is due to the 1.2.1 -> 2.3.6 Hive version change we introduced in Spark 3. I measured that {{org.apache.hadoop.hive.ql.io.orc.ReaderImpl}} in {{hive-exec-2.3.6-core.jar}} is ~3-5 times slower than in {{hive-exec-1.2.1.spark2.jar}}. > Regression in the ORC benchmark > --- > > Key: SPARK-30565 > URL: https://issues.apache.org/jira/browse/SPARK-30565 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > New benchmark results generated in the PR > [https://github.com/apache/spark/pull/27078] show regression ~3 times. > Before: > {code} > Hive built-in ORC 520531 >8 2.0 495.8 0.6X > {code} > https://github.com/apache/spark/pull/27078/files#diff-42fe5f1ef10d8f9f274fc89b2c8d140dL138 > After: > {code} > Hive built-in ORC 1761 1792 > 43 0.61679.3 0.1X > {code} > https://github.com/apache/spark/pull/27078/files#diff-42fe5f1ef10d8f9f274fc89b2c8d140dR138 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
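[Editor's note] For anyone reproducing the comparison outside the benchmark suite, the reader implementation can be switched at runtime through the real config {{spark.sql.orc.impl}}. A rough spark-shell sketch follows; the path and the numeric column {{id}} are placeholders, and the "hive" value needs a Hive-enabled build:
{code}
// Time a full-scan aggregation with each ORC reader implementation.
def timeScan(impl: String, path: String): Long = {
  spark.conf.set("spark.sql.orc.impl", impl)              // "native" (default) or "hive"
  val start = System.nanoTime()
  spark.read.orc(path).selectExpr("sum(id)").collect()    // assumes a numeric column named id
  (System.nanoTime() - start) / 1000000                   // elapsed millis
}

val path = "/tmp/orc-bench-data"                          // placeholder dataset
println(s"native: ${timeScan("native", path)} ms, hive: ${timeScan("hive", path)} ms")
{code}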
[jira] [Commented] (SPARK-31099) Create migration script for metastore_db
[ https://issues.apache.org/jira/browse/SPARK-31099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057220#comment-17057220 ] Dongjoon Hyun commented on SPARK-31099: --- To [~kabhwan]. Yes. I mean that corner cases. It's the same here. We can remove local derby files. To [~rednaxelafx]. For remote HMS, we have `spark.sql.hive.metastore.version`. And, SPARK-27686 was the issue for `Update migration guide for make Hive 2.3 dependency by default`. To [~cloud_fan], yes. This scope of this issue is for local hive metastore. For remote HMS, we should follow up at SPARK-27686. cc [~smilegator] and [~yumwang] since we worked together at SPARK-27686. cc [~rxin] since he is a release manager for 3.0.0. > Create migration script for metastore_db > > > Key: SPARK-31099 > URL: https://issues.apache.org/jira/browse/SPARK-31099 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > When an existing Derby database exists (in ./metastore_db) created by Hive > 1.2.x profile, it'll fail to upgrade itself to the Hive 2.3.x profile. > Repro steps: > 1. Build OSS or DBR master with SBT with -Phive-1.2 -Phive > -Phive-thriftserver. Make sure there's no existing ./metastore_db directory > in the repo. > 2. Run bin/spark-shell, and then spark.sql("show databases"). This will > populate the ./metastore_db directory, where the Derby-based metastore > database is hosted. This database is populated from Hive 1.2.x. > 3. Re-build OSS or DBR master with SBT with -Phive -Phive-thriftserver (drops > the Hive 1.2 profile, which makes it use the default Hive 2.3 profile) > 4. Repeat Step (2) above. This will trigger Hive 2.3.x to load the Derby > database created in Step (2), which triggers an upgrade step, and that's > where the following error will be reported. > 5. Delete the ./metastore_db and re-run Step (4). The error is no longer > reported. > {code:java} > 20/03/09 13:57:04 ERROR Datastore: Error thrown executing ALTER TABLE TBLS > ADD IS_REWRITE_ENABLED CHAR(1) NOT NULL CHECK (IS_REWRITE_ENABLED IN > ('Y','N')) : In an ALTER TABLE statement, the column 'IS_REWRITE_ENABLED' has > been specified as NOT NULL and either the DEFAULT clause was not specified or > was specified as DEFAULT NULL. > java.sql.SQLSyntaxErrorException: In an ALTER TABLE statement, the column > 'IS_REWRITE_ENABLED' has been specified as NOT NULL and either the DEFAULT > clause was not specified or was specified as DEFAULT NULL. 
> at > org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source) > at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown > Source) > at > org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown > Source) > at > org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown > Source) > at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown > Source) > at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown > Source) > at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) > at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) > at com.jolbox.bonecp.StatementHandle.execute(StatementHandle.java:254) > at > org.datanucleus.store.rdbms.table.AbstractTable.executeDdlStatement(AbstractTable.java:879) > at > org.datanucleus.store.rdbms.table.AbstractTable.executeDdlStatementList(AbstractTable.java:830) > at > org.datanucleus.store.rdbms.table.TableImpl.validateColumns(TableImpl.java:257) > at > org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.performTablesValidation(RDBMSStoreManager.java:3398) > at > org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.run(RDBMSStoreManager.java:2896) > at > org.datanucleus.store.rdbms.AbstractSchemaTransaction.execute(AbstractSchemaTransaction.java:119) > at > org.datanucleus.store.rdbms.RDBMSStoreManager.manageClasses(RDBMSStoreManager.java:1627) > at > org.datanucleus.store.rdbms.RDBMSStoreManager.getDatastoreClass(RDBMSStoreManager.java:672) > at > org.datanucleus.store.rdbms.query.RDBMSQueryUtils.getStatementForCandidates(RDBMSQueryUtils.java:425) > at > org.datanucleus.store.rdbms.query.JDOQLQuery.compileQueryFull(JDOQLQuery.java:865) > at > org.datanucleus.store.rdbms.query.JDOQLQuery.compileInternal(JDOQLQuery.java:347) > at org.datanucleus.store.query.Query.executeQuery(Query.java:1816) > at org.datanucleus.store.query.Query.executeWithArray(Query.java:1744) > at org.datanucleus.store.query.Query.execute(Query.java:1726) > at org.datanucleus.api.jdo.JDOQuery.executeInternal(JDOQuery.java:374)
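[Editor's note] For readers following the remote-HMS part of the comment above, the configs mentioned there are real Spark settings; a minimal sketch with placeholder URIs/paths, assuming a fresh application rather than an already-running spark-shell session:
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("remote-hms-example")
  .enableHiveSupport()
  .config("hive.metastore.uris", "thrift://metastore-host:9083")        // remote HMS, not local Derby
  .config("spark.sql.hive.metastore.version", "1.2.1")                  // version of the remote metastore
  .config("spark.sql.hive.metastore.jars", "/path/to/hive-1.2.1/lib/*") // jars matching that version
  .getOrCreate()

spark.sql("show databases").show()
{code}
This only covers the remote-HMS case raised in the comment; upgrading a local ./metastore_db is what the ticket itself is about.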
[jira] [Closed] (SPARK-24640) size(null) returns null
[ https://issues.apache.org/jira/browse/SPARK-24640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-24640. - > size(null) returns null > > > Key: SPARK-24640 > URL: https://issues.apache.org/jira/browse/SPARK-24640 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Xiao Li >Priority: Major > > Size(null) should return null instead of -1 in 3.0 release. This is a > behavior change. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24640) size(null) returns null
[ https://issues.apache.org/jira/browse/SPARK-24640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057207#comment-17057207 ] Dongjoon Hyun edited comment on SPARK-24640 at 3/11/20, 4:57 PM: - This is reverted via https://github.com/apache/spark/pull/27834 was (Author: dongjoon): This is reverted. > size(null) returns null > > > Key: SPARK-24640 > URL: https://issues.apache.org/jira/browse/SPARK-24640 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Xiao Li >Priority: Major > > Size(null) should return null instead of -1 in 3.0 release. This is a > behavior change. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24640) size(null) returns null
[ https://issues.apache.org/jira/browse/SPARK-24640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24640: -- Fix Version/s: (was: 3.0.0) > size(null) returns null > > > Key: SPARK-24640 > URL: https://issues.apache.org/jira/browse/SPARK-24640 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Xiao Li >Priority: Major > > Size(null) should return null instead of -1 in 3.0 release. This is a > behavior change. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24640) size(null) returns null
[ https://issues.apache.org/jira/browse/SPARK-24640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-24640: - Assignee: (was: Maxim Gekk) > size(null) returns null > > > Key: SPARK-24640 > URL: https://issues.apache.org/jira/browse/SPARK-24640 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Xiao Li >Priority: Major > Fix For: 3.0.0 > > > Size(null) should return null instead of -1 in 3.0 release. This is a > behavior change. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24640) size(null) returns null
[ https://issues.apache.org/jira/browse/SPARK-24640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-24640. --- Resolution: Won't Do > size(null) returns null > > > Key: SPARK-24640 > URL: https://issues.apache.org/jira/browse/SPARK-24640 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Xiao Li >Priority: Major > > Size(null) should return null instead of -1 in 3.0 release. This is a > behavior change. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-24640) size(null) returns null
[ https://issues.apache.org/jira/browse/SPARK-24640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reopened SPARK-24640: --- This is reverted. > size(null) returns null > > > Key: SPARK-24640 > URL: https://issues.apache.org/jira/browse/SPARK-24640 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Xiao Li >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > Size(null) should return null instead of -1 in 3.0 release. This is a > behavior change. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31091) Revert SPARK-24640 "Return `NULL` from `size(NULL)` by default"
[ https://issues.apache.org/jira/browse/SPARK-31091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31091. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27834 [https://github.com/apache/spark/pull/27834] > Revert SPARK-24640 "Return `NULL` from `size(NULL)` by default" > --- > > Key: SPARK-31091 > URL: https://issues.apache.org/jira/browse/SPARK-31091 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
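[Editor's note] Net effect of the revert, as far as it can be expressed with the shipped flag {{spark.sql.legacy.sizeOfNull}} (the flag name is real; the exact default follows from the revert above), sketched in spark-shell:
{code}
// With the revert in place, Spark 3.0 keeps the 2.4 default:
spark.sql("SELECT size(NULL)").collect()   // Row(-1)

// Opting in to the behaviour SPARK-24640 had proposed:
spark.conf.set("spark.sql.legacy.sizeOfNull", "false")
spark.sql("SELECT size(NULL)").collect()   // Row(null)
{code}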
[jira] [Updated] (SPARK-31076) Convert Catalyst's DATE/TIMESTAMP to Java Date/Timestamp via local date-time
[ https://issues.apache.org/jira/browse/SPARK-31076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31076: -- Labels: correctness (was: ) > Convert Catalyst's DATE/TIMESTAMP to Java Date/Timestamp via local date-time > > > Key: SPARK-31076 > URL: https://issues.apache.org/jira/browse/SPARK-31076 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Labels: correctness > Fix For: 3.0.0 > > > By default, collect() returns java.sql.Timestamp/Date instances with offsets > derived from internal values of Catalyst's TIMESTAMP/DATE that store > microseconds since the epoch. The conversion from internal values to > java.sql.Timestamp/Date based on Proleptic Gregorian calendar but converting > the resulted values before 1582 year to strings produces timestamp/date > string in Julian calendar. For example: > {code} > scala> sql("select date '1100-10-10'").collect() > res1: Array[org.apache.spark.sql.Row] = Array([1100-10-03]) > {code} > This can be fixed if internal Catalyst's values are converted to local > date-time in Gregorian calendar, and construct local date-time from the > resulted year, month, ..., seconds in Julian calendar. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
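[Editor's note] The "via local date-time" idea in the description can be illustrated for the DATE case with plain Java time APIs; this is a conceptual sketch, not the code of the actual patch:
{code}
import java.time.LocalDate
import java.sql.Date

// Catalyst stores a DATE as days since the epoch in the Proleptic Gregorian calendar.
val daysSinceEpoch = LocalDate.of(1100, 10, 10).toEpochDay

// Going through the local date's year/month/day fields keeps the same calendar date:
// java.sql.Date.valueOf builds the hybrid-calendar Date from those fields directly,
// instead of reinterpreting a raw epoch offset (which is what shifted 1100-10-10 to 1100-10-03).
val d: Date = Date.valueOf(LocalDate.ofEpochDay(daysSinceEpoch))
println(d)   // 1100-10-10
{code}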
[jira] [Commented] (SPARK-30935) Update MLlib, GraphX websites for 3.0
[ https://issues.apache.org/jira/browse/SPARK-30935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057154#comment-17057154 ] Huaxin Gao commented on SPARK-30935: cc [~podongfeng] I think all the docs are OK now. This can be marked as complete. > Update MLlib, GraphX websites for 3.0 > - > > Key: SPARK-30935 > URL: https://issues.apache.org/jira/browse/SPARK-30935 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Critical > > Update the sub-projects' websites to include new features in this release. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30931) ML 3.0 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-30931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057146#comment-17057146 ] Huaxin Gao commented on SPARK-30931: cc [~podongfeng] I didn't see anything else need to be changed. This ticket can be marked as complete. > ML 3.0 QA: API: Python API coverage > --- > > Key: SPARK-30931 > URL: https://issues.apache.org/jira/browse/SPARK-30931 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Major > > For new public APIs added to MLlib ({{spark.ml}} only), we need to check the > generated HTML doc and compare the Scala & Python versions. > * *GOAL*: Audit and create JIRAs to fix in the next release. > * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues. > We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally > either necessary (intentional) or accidental. These must be recorded and > added in the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > *Please use a _separate_ JIRA (linked below as "requires") for this list of > to-do items.* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31117) reduce the test time of DateTimeUtilsSuite
[ https://issues.apache.org/jira/browse/SPARK-31117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31117. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27873 [https://github.com/apache/spark/pull/27873] > reduce the test time of DateTimeUtilsSuite > -- > > Key: SPARK-31117 > URL: https://issues.apache.org/jira/browse/SPARK-31117 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31124) change the default value of minPartitionNum in AQE
Wenchen Fan created SPARK-31124: --- Summary: change the default value of minPartitionNum in AQE Key: SPARK-31124 URL: https://issues.apache.org/jira/browse/SPARK-31124 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
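[Editor's note] The ticket has no description yet; for readers hunting for the knob, the AQE coalescing configs in 3.0 look roughly like the following (config names to the best of my knowledge, the default of the last one being what this ticket proposes to change):
{code}
// Enable AQE and post-shuffle partition coalescing; minPartitionNum is the floor
// below which coalescing will not merge partitions.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.minPartitionNum", "8")   // example value

val joined = spark.range(0, 1000000).join(spark.range(0, 1000), "id")
joined.count()   // post-shuffle partition count will not drop below 8
{code}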
[jira] [Updated] (SPARK-31081) Make the display of stageId/stageAttemptId/taskId of sql metrics configurable in UI
[ https://issues.apache.org/jira/browse/SPARK-31081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-31081: - Summary: Make the display of stageId/stageAttemptId/taskId of sql metrics configurable in UI (was: Make SQLMetrics more readable from UI) > Make the display of stageId/stageAttemptId/taskId of sql metrics configurable > in UI > > > Key: SPARK-31081 > URL: https://issues.apache.org/jira/browse/SPARK-31081 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: wuyi >Priority: Major > > SPARK-30209 makes the metrics harder to read, and users may not be interested > in the extra info ({{stageId/StageAttemptId/taskId}}) when they do not need to debug. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31041) Show Maven errors from within make-distribution.sh
[ https://issues.apache.org/jira/browse/SPARK-31041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-31041. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27800 [https://github.com/apache/spark/pull/27800] > Show Maven errors from within make-distribution.sh > -- > > Key: SPARK-31041 > URL: https://issues.apache.org/jira/browse/SPARK-31041 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas >Priority: Trivial > Fix For: 3.1.0 > > > This works: > {code:java} > ./dev/make-distribution.sh \ > --pip \ > -Phadoop-2.7 -Phive -Phadoop-cloud {code} > > But this doesn't: > {code:java} > ./dev/make-distribution.sh \ > -Phadoop-2.7 -Phive -Phadoop-cloud \ > --pip{code} > > The latter invocation yields the following, confusing output: > {code:java} > + VERSION=' -X,--debug Produce execution debug output'{code} > That's because Maven is accepting {{--pip}} as an option and failing, but > the user doesn't get to see the error from Maven. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31041) Show Maven errors from within make-distribution.sh
[ https://issues.apache.org/jira/browse/SPARK-31041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-31041: Assignee: Nicholas Chammas > Show Maven errors from within make-distribution.sh > -- > > Key: SPARK-31041 > URL: https://issues.apache.org/jira/browse/SPARK-31041 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas >Priority: Trivial > > This works: > {code:java} > ./dev/make-distribution.sh \ > --pip \ > -Phadoop-2.7 -Phive -Phadoop-cloud {code} > > But this doesn't: > {code:java} > ./dev/make-distribution.sh \ > -Phadoop-2.7 -Phive -Phadoop-cloud \ > --pip{code} > > The latter invocation yields the following, confusing output: > {code:java} > + VERSION=' -X,--debug Produce execution debug output'{code} > That's because Maven is accepting {{--pip}} as an option and failing, but > the user doesn't get to see the error from Maven. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31076) Convert Catalyst's DATE/TIMESTAMP to Java Date/Timestamp via local date-time
[ https://issues.apache.org/jira/browse/SPARK-31076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31076: --- Assignee: Maxim Gekk > Convert Catalyst's DATE/TIMESTAMP to Java Date/Timestamp via local date-time > > > Key: SPARK-31076 > URL: https://issues.apache.org/jira/browse/SPARK-31076 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > > By default, collect() returns java.sql.Timestamp/Date instances with offsets > derived from internal values of Catalyst's TIMESTAMP/DATE that store > microseconds since the epoch. The conversion from internal values to > java.sql.Timestamp/Date based on Proleptic Gregorian calendar but converting > the resulted values before 1582 year to strings produces timestamp/date > string in Julian calendar. For example: > {code} > scala> sql("select date '1100-10-10'").collect() > res1: Array[org.apache.spark.sql.Row] = Array([1100-10-03]) > {code} > This can be fixed if internal Catalyst's values are converted to local > date-time in Gregorian calendar, and construct local date-time from the > resulted year, month, ..., seconds in Julian calendar. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31076) Convert Catalyst's DATE/TIMESTAMP to Java Date/Timestamp via local date-time
[ https://issues.apache.org/jira/browse/SPARK-31076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31076. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27807 [https://github.com/apache/spark/pull/27807] > Convert Catalyst's DATE/TIMESTAMP to Java Date/Timestamp via local date-time > > > Key: SPARK-31076 > URL: https://issues.apache.org/jira/browse/SPARK-31076 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > By default, collect() returns java.sql.Timestamp/Date instances with offsets > derived from internal values of Catalyst's TIMESTAMP/DATE that store > microseconds since the epoch. The conversion from internal values to > java.sql.Timestamp/Date based on Proleptic Gregorian calendar but converting > the resulted values before 1582 year to strings produces timestamp/date > string in Julian calendar. For example: > {code} > scala> sql("select date '1100-10-10'").collect() > res1: Array[org.apache.spark.sql.Row] = Array([1100-10-03]) > {code} > This can be fixed if internal Catalyst's values are converted to local > date-time in Gregorian calendar, and construct local date-time from the > resulted year, month, ..., seconds in Julian calendar. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31123) Drop does not work after join with aliases
[ https://issues.apache.org/jira/browse/SPARK-31123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mikel San Vicente updated SPARK-31123: -- Description: Hi, I am seeing a really strange behaviour in drop method after a join with aliases. It doesn't seem to find the column when I reference to it using dataframe("columnName") syntax, but it does work with other combinators like select {code:java} case class Record(a: String, dup: String) case class Record2(b: String, dup: String) val df = Seq(Record("a", "dup")).toDF val joined = df.alias("a").join(df2.alias("b"), df("a") === df2("b")) val dupCol = df("dup") joined.drop(dupCol) // Does not drop anything joined.drop(func.col("a.dup")) // It drops the column joined.select(dupCol) // It selects the column {code} was: Hi, I am seeing a really strange behaviour in drop method after a join with aliases. It doesn't seem to find the column when I reference to it using dataframe("columnName") syntax, but it does work with other combinators like select {code:java} case class Record(a: String, dup: String) case class Record2(b: String, dup: String) val df = Seq(Record("a", "dup")).toDF val joined = df.alias("a").join(df2.alias("b"), df("a") === df2("b")) val dupCol = df("dup") joined.drop(dupCol) // Does not drop anything joined.drop(func.col("a.dup")) // It works! joined.select(dupCol) // It works! {code} > Drop does not work after join with aliases > -- > > Key: SPARK-31123 > URL: https://issues.apache.org/jira/browse/SPARK-31123 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.2 >Reporter: Mikel San Vicente >Priority: Minor > > > Hi, > I am seeing a really strange behaviour in drop method after a join with > aliases. It doesn't seem to find the column when I reference to it using > dataframe("columnName") syntax, but it does work with other combinators like > select > {code:java} > case class Record(a: String, dup: String) > case class Record2(b: String, dup: String) > val df = Seq(Record("a", "dup")).toDF > val joined = df.alias("a").join(df2.alias("b"), df("a") === df2("b")) > val dupCol = df("dup") > joined.drop(dupCol) // Does not drop anything > joined.drop(func.col("a.dup")) // It drops the column > joined.select(dupCol) // It selects the column > {code} > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31123) Drop does not work after join with aliases
Mikel San Vicente created SPARK-31123: - Summary: Drop does not work after join with aliases Key: SPARK-31123 URL: https://issues.apache.org/jira/browse/SPARK-31123 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.2 Reporter: Mikel San Vicente Hi, I am seeing a really strange behaviour in drop method after a join with aliases. It doesn't seem to find the column when I reference to it using dataframe("columnName") syntax, but it does work with other combinators like select {code:java} case class Record(a: String, dup: String) case class Record2(b: String, dup: String) val df = Seq(Record("a", "dup")).toDF val joined = df.alias("a").join(df2.alias("b"), df("a") === df2("b")) val dupCol = df("dup") joined.drop(dupCol) // Does not drop anything joined.drop(func.col("a.dup")) // It works! joined.select(dupCol) // It works! {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
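[Editor's note] The snippet in the report is missing the definition of {{df2}} and the {{func}} import; a self-contained version that runs in spark-shell (where {{spark.implicits._}} is already imported), with {{df2}} assumed symmetric to {{df}}:
{code}
import org.apache.spark.sql.{functions => func}

case class Record(a: String, dup: String)
case class Record2(b: String, dup: String)

val df  = Seq(Record("a", "dup")).toDF
val df2 = Seq(Record2("a", "dup")).toDF      // missing from the report; assumed symmetric to df

val joined = df.alias("a").join(df2.alias("b"), df("a") === df2("b"))
val dupCol = df("dup")

joined.drop(dupCol).columns               // reported: still contains both dup columns
joined.drop(func.col("a.dup")).columns    // drops the left-side dup column
joined.select(dupCol).show()              // resolves the same column without problems
{code}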
[jira] [Commented] (SPARK-31074) Avro serializer should not fail when a nullable Spark field is written to a non-null Avro column
[ https://issues.apache.org/jira/browse/SPARK-31074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056972#comment-17056972 ] Kyrill Alyoshin commented on SPARK-31074: - Yes, # Create a simple Avro schema file with 2 properties in it '*f1*' and '*f2*' - their types can be Strings. # Create a Spark dataframe with two fields in it '*f1*' and '*f2*' of String type that are *nullable*. # Write out this dataframe to a file using the Avro schema create in 1 through '{{avroSchema}}' option. > Avro serializer should not fail when a nullable Spark field is written to a > non-null Avro column > > > Key: SPARK-31074 > URL: https://issues.apache.org/jira/browse/SPARK-31074 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: Kyrill Alyoshin >Priority: Major > > Spark StructType schema are strongly biased towards having _nullable_ fields. > In fact, this is what _Encoders.bean()_ does - any non-primitive field is > automatically _nullable_. When we attempt to serialize dataframes into > *user-supplied* Avro schemas where such corresponding fields are marked as > _non-null_ (i.e., they are not of _union_ type) any such attempt will fail > with the following exception > > {code:java} > Caused by: org.apache.avro.AvroRuntimeException: Not a union: "string" > at org.apache.avro.Schema.getTypes(Schema.java:299) > at > org.apache.spark.sql.avro.AvroSerializer.org$apache$spark$sql$avro$AvroSerializer$$resolveNullableType(AvroSerializer.scala:229) > at > org.apache.spark.sql.avro.AvroSerializer$$anonfun$3.apply(AvroSerializer.scala:209) > {code} > This seems as rather draconian. We certainly should be able to write a field > of the same type and with the same name if it is not a null into a > non-nullable Avro column. In fact, the problem is so *severe* that it is not > clear what should be done in such situations when Avro schema is given to you > as part of API communication contract (i.e., it is non-changeable). > This is an important issue. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
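[Editor's note] The repro steps in the comment translate to roughly the following; the schema literal, record name, and output path are made up, while the {{avroSchema}} option itself is the real spark-avro option (the spark-avro package must be on the classpath):
{code}
// Step 1: a user-supplied Avro schema where f1/f2 are plain (non-union, non-null) strings.
val avroSchema =
  """{"type": "record", "name": "Payload", "fields": [
    |  {"name": "f1", "type": "string"},
    |  {"name": "f2", "type": "string"}
    |]}""".stripMargin

// Step 2: a DataFrame whose String columns are nullable (the default for non-primitive fields).
val df = Seq(("v1", "v2")).toDF("f1", "f2")

// Step 3: write with the user-supplied schema; on 2.4.4 this is where
// "org.apache.avro.AvroRuntimeException: Not a union" is reported to surface.
df.write
  .format("avro")
  .option("avroSchema", avroSchema)
  .save("/tmp/spark-31074-repro")   // placeholder path
{code}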
[jira] [Updated] (SPARK-31030) Backward Compatibility for Parsing and Formatting Datetime
[ https://issues.apache.org/jira/browse/SPARK-31030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuanjian Li updated SPARK-31030: Parent: SPARK-26904 Issue Type: Sub-task (was: Improvement) > Backward Compatibility for Parsing and Formatting Datetime > -- > > Key: SPARK-31030 > URL: https://issues.apache.org/jira/browse/SPARK-31030 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuanjian Li >Assignee: Yuanjian Li >Priority: Major > Fix For: 3.0.0 > > Attachments: image-2020-03-04-10-54-05-208.png, > image-2020-03-04-10-54-13-238.png > > > *Background* > In Spark version 2.4 and earlier, datetime parsing, formatting and conversion > are performed by using the hybrid calendar ([Julian + > Gregorian|https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html]). > > Since the Proleptic Gregorian calendar is de-facto calendar worldwide, as > well as the chosen one in ANSI SQL standard, Spark 3.0 switches to it by > using Java 8 API classes (the java.time packages that are based on [ISO > chronology|https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html] > ). > The switching job is completed in SPARK-26651. > > *Problem* > Switching to Java 8 datetime API breaks the backward compatibility of Spark > 2.4 and earlier when parsing datetime. Spark need its own patters definition > on datetime parsing and formatting. > > *Solution* > To avoid unexpected result changes after the underlying datetime API switch, > we propose the following solution. > * Introduce the fallback mechanism: when the Java 8-based parser fails, we > need to detect these behavior differences by falling back to the legacy > parser, and fail with a user-friendly error message to tell users what gets > changed and how to fix the pattern. > * Document the Spark’s datetime patterns: The date-time formatter of Spark > is decoupled with the Java patterns. The Spark’s patterns are mainly based on > the [Java 7’s > pattern|https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html] > (for better backward compatibility) with the customized logic (caused by the > breaking changes between [Java > 7|https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html] > and [Java > 8|https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html] > pattern string). Below are the customized rules: > ||Pattern||Java 7||Java 8|| Example||Rule|| > |u|Day number of week (1 = Monday, ..., 7 = Sunday)|Year (Different with y, u > accept a negative value to represent BC, while y should be used together with > G to do the same thing.)|!image-2020-03-04-10-54-05-208.png! |Substitute ‘u’ > to ‘e’ and use Java 8 parser to parse the string. If parsable, return the > result; otherwise, fall back to ‘u’, and then use the legacy Java 7 parser to > parse. When it is successfully parsed, throw an exception and ask users to > change the pattern strings or turn on the legacy mode; otherwise, return NULL > as what Spark 2.4 does.| > | z| General time zone which also accepts > [RFC 822 time zones|#rfc822timezone]]|Only accept time-zone name, e.g. > Pacific Standard Time; PST|!image-2020-03-04-10-54-13-238.png! |The > semantics of ‘z’ are different between Java 7 and Java 8. Here, Spark 3.0 > follows the semantics of Java 8. > Use Java 8 to parse the string. If parsable, return the result; otherwise, > use the legacy Java 7 parser to parse. 
When it is successfully parsed, throw > an exception and ask users to change the pattern strings or turn on the > legacy mode; otherwise, return NULL as what Spark 2.4 does.| > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
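[Editor's note] The fallback mechanism described above surfaces to users as a single switch; to the best of my knowledge the shipped config is {{spark.sql.legacy.timeParserPolicy}}, sketched here:
{code}
// EXCEPTION (default): parse with the new Java 8 formatter, but fail with a hint when
// the legacy parser would have accepted the input.
// CORRECTED: always use the new parser.  LEGACY: always use the Spark 2.4 parser.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
spark.sql("SELECT to_timestamp('2020-01-27 20:06:11.847', 'yyyy-MM-dd HH:mm:ss.SSS')").show()
{code}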
[jira] [Deleted] (SPARK-31031) Backward Compatibility for Parsing Datetime
[ https://issues.apache.org/jira/browse/SPARK-31031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan deleted SPARK-31031: > Backward Compatibility for Parsing Datetime > --- > > Key: SPARK-31031 > URL: https://issues.apache.org/jira/browse/SPARK-31031 > Project: Spark > Issue Type: Sub-task >Reporter: Yuanjian Li >Priority: Major > > Mirror issue for SPARK-31030, because of the sub-task can't add sub-task. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31111) Fix interval output issue in ExtractBenchmark
[ https://issues.apache.org/jira/browse/SPARK-31111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31111. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27867 [https://github.com/apache/spark/pull/27867] > Fix interval output issue in ExtractBenchmark > -- > > Key: SPARK-31111 > URL: https://issues.apache.org/jira/browse/SPARK-31111 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31111) Fix interval output issue in ExtractBenchmark
[ https://issues.apache.org/jira/browse/SPARK-31111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31111: --- Assignee: Kent Yao > Fix interval output issue in ExtractBenchmark > -- > > Key: SPARK-31111 > URL: https://issues.apache.org/jira/browse/SPARK-31111 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org