[jira] [Created] (SPARK-31129) IntervalBenchmark and DateTimeBenchmark fails to run
Kent Yao created SPARK-31129: Summary: IntervalBenchmark and DateTimeBenchmark fails to run Key: SPARK-31129 URL: https://issues.apache.org/jira/browse/SPARK-31129 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 3.0.0, 3.1.0 Reporter: Kent Yao [error] Caused by: java.time.format.DateTimeParseException: Text '2019-01-27 11:02:01.0' could not be parsed at index 20 [error] at java.base/java.time.format.DateTimeFormatter.parseResolved0(DateTimeFormatter.java:2046) [error] at java.base/java.time.format.DateTimeFormatter.parse(DateTimeFormatter.java:1874) [error] at org.apache.spark.sql.catalyst.util.Iso8601TimestampFormatter.$anonfun$parse$1(TimestampFormatter.scala:71) [error] ... 19 more -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
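The failure above is java.time rejecting a timestamp literal with a trailing fractional second. Below is a minimal Scala sketch of the same class of failure; the benchmark's actual formatter (Iso8601TimestampFormatter in the stack trace) is not shown in the report, so the pattern used here is only an assumption chosen to reproduce the trailing ".0" problem.

{code:scala}
import java.time.LocalDateTime
import java.time.format.{DateTimeFormatter, DateTimeParseException}

object ParseRepro {
  def main(args: Array[String]): Unit = {
    // A pattern with no fractional-second field cannot consume the trailing ".0",
    // so the parse stops before the end of the input and java.time throws.
    val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
    try {
      LocalDateTime.parse("2019-01-27 11:02:01.0", fmt)
    } catch {
      case e: DateTimeParseException =>
        println(s"Failed as expected: ${e.getMessage}")
    }
  }
}
{code}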
[jira] [Commented] (SPARK-30565) Regression in the ORC benchmark
[ https://issues.apache.org/jira/browse/SPARK-30565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057619#comment-17057619 ] Maxim Gekk commented on SPARK-30565: Per [~dongjoon] , default ORC reader doesn't fully cover functionality of Hive ORC reader. Maybe, some users have to use the former one in some cases. > Regression in the ORC benchmark > --- > > Key: SPARK-30565 > URL: https://issues.apache.org/jira/browse/SPARK-30565 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > New benchmark results generated in the PR > [https://github.com/apache/spark/pull/27078] show regression ~3 times. > Before: > {code} > Hive built-in ORC  520  531  8  2.0  495.8  0.6X > {code} > https://github.com/apache/spark/pull/27078/files#diff-42fe5f1ef10d8f9f274fc89b2c8d140dL138 > After: > {code} > Hive built-in ORC  1761  1792  43  0.6  1679.3  0.1X > {code} > https://github.com/apache/spark/pull/27078/files#diff-42fe5f1ef10d8f9f274fc89b2c8d140dR138 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30565) Regression in the ORC benchmark
[ https://issues.apache.org/jira/browse/SPARK-30565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057608#comment-17057608 ] Wenchen Fan commented on SPARK-30565: - Maybe we should report this to the Hive community. In Spark, we use our native orc reader by default so the hive one is not very important. > Regression in the ORC benchmark > --- > > Key: SPARK-30565 > URL: https://issues.apache.org/jira/browse/SPARK-30565 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > New benchmark results generated in the PR > [https://github.com/apache/spark/pull/27078] show regression ~3 times. > Before: > {code} > Hive built-in ORC  520  531  8  2.0  495.8  0.6X > {code} > https://github.com/apache/spark/pull/27078/files#diff-42fe5f1ef10d8f9f274fc89b2c8d140dL138 > After: > {code} > Hive built-in ORC  1761  1792  43  0.6  1679.3  0.1X > {code} > https://github.com/apache/spark/pull/27078/files#diff-42fe5f1ef10d8f9f274fc89b2c8d140dR138 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
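For readers reproducing the comparison, the reader implementation is selected with the spark.sql.orc.impl setting. The snippet below is only a minimal sketch (the input path is a placeholder), not the benchmark harness itself.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("orc-impl-demo").getOrCreate()

// "native" (the default) uses Spark's own ORC reader; "hive" switches to the
// Hive built-in reader whose numbers regressed in the report above.
spark.conf.set("spark.sql.orc.impl", "hive")
spark.read.orc("/tmp/orc/table").show()
{code}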
[jira] [Commented] (SPARK-30931) ML 3.0 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-30931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057592#comment-17057592 ] zhengruifeng commented on SPARK-30931: -- Thanks [~huaxingao] for your work > ML 3.0 QA: API: Python API coverage > --- > > Key: SPARK-30931 > URL: https://issues.apache.org/jira/browse/SPARK-30931 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Major > > For new public APIs added to MLlib ({{spark.ml}} only), we need to check the > generated HTML doc and compare the Scala & Python versions. > * *GOAL*: Audit and create JIRAs to fix in the next release. > * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues. > We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally > either necessary (intentional) or accidental. These must be recorded and > added in the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > *Please use a _separate_ JIRA (linked below as "requires") for this list of > to-do items.* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30931) ML 3.0 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-30931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-30931. -- Resolution: Fixed > ML 3.0 QA: API: Python API coverage > --- > > Key: SPARK-30931 > URL: https://issues.apache.org/jira/browse/SPARK-30931 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Major > > For new public APIs added to MLlib ({{spark.ml}} only), we need to check the > generated HTML doc and compare the Scala & Python versions. > * *GOAL*: Audit and create JIRAs to fix in the next release. > * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues. > We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally > either necessary (intentional) or accidental. These must be recorded and > added in the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > *Please use a _separate_ JIRA (linked below as "requires") for this list of > to-do items.* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30935) Update MLlib, GraphX websites for 3.0
[ https://issues.apache.org/jira/browse/SPARK-30935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057587#comment-17057587 ] zhengruifeng commented on SPARK-30935: -- [~huaxingao] Thanks! > Update MLlib, GraphX websites for 3.0 > - > > Key: SPARK-30935 > URL: https://issues.apache.org/jira/browse/SPARK-30935 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Critical > > Update the sub-projects' websites to include new features in this release. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30935) Update MLlib, GraphX websites for 3.0
[ https://issues.apache.org/jira/browse/SPARK-30935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-30935. -- Resolution: Fixed > Update MLlib, GraphX websites for 3.0 > - > > Key: SPARK-30935 > URL: https://issues.apache.org/jira/browse/SPARK-30935 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Critical > > Update the sub-projects' websites to include new features in this release. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31011) Failed to register signal handler for PWR
[ https://issues.apache.org/jira/browse/SPARK-31011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31011. --- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27832 [https://github.com/apache/spark/pull/27832] > Failed to register signal handler for PWR > - > > Key: SPARK-31011 > URL: https://issues.apache.org/jira/browse/SPARK-31011 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Assignee: Jungtaek Lim >Priority: Minor > Fix For: 3.1.0 > > > I've just tried to test something on standalone mode but the application > fails. > Environment: > * MacOS Catalina 10.15.3 (19D76) > * Scala 2.12.10 > * Java 1.8.0_241-b07 > Steps to reproduce: > * Compile Spark (mvn -DskipTests clean install -Dskip) > * ./sbin/start-master.sh > * ./sbin/start-slave.sh spark://host:7077 > * submit an empty application > Error: > {code:java} > 20/03/02 14:25:44 INFO SignalUtils: Registering signal handler for PWR > 20/03/02 14:25:44 WARN SignalUtils: Failed to register signal handler for PWR > java.lang.IllegalArgumentException: Unknown signal: PWR > at sun.misc.Signal.(Signal.java:143) > at > org.apache.spark.util.SignalUtils$.$anonfun$register$1(SignalUtils.scala:64) > at scala.collection.mutable.HashMap.getOrElseUpdate(HashMap.scala:86) > at org.apache.spark.util.SignalUtils$.register(SignalUtils.scala:62) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend.onStart(CoarseGrainedExecutorBackend.scala:85) > at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:120) > at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203) > at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) > at > org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) > at > org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
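The stack trace shows the failure happening while registering a handler for SIGPWR, a signal that does not exist on macOS. The sketch below is not Spark's SignalUtils code; it only illustrates the guard pattern of registering a handler and tolerating platforms where the signal is unknown.

{code:scala}
import sun.misc.{Signal, SignalHandler}

def tryRegister(signalName: String)(action: => Unit): Unit = {
  try {
    // new Signal("PWR") throws IllegalArgumentException("Unknown signal: PWR")
    // on platforms such as macOS that have no SIGPWR.
    Signal.handle(new Signal(signalName), new SignalHandler {
      override def handle(sig: Signal): Unit = action
    })
  } catch {
    case e: IllegalArgumentException =>
      println(s"Skipping handler for SIG$signalName: ${e.getMessage}")
  }
}

tryRegister("PWR") { println("decommission signal received") }
{code}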
[jira] [Assigned] (SPARK-31011) Failed to register signal handler for PWR
[ https://issues.apache.org/jira/browse/SPARK-31011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-31011: - Assignee: Jungtaek Lim > Failed to register signal handler for PWR > - > > Key: SPARK-31011 > URL: https://issues.apache.org/jira/browse/SPARK-31011 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Gabor Somogyi >Assignee: Jungtaek Lim >Priority: Minor > > I've just tried to test something on standalone mode but the application > fails. > Environment: > * MacOS Catalina 10.15.3 (19D76) > * Scala 2.12.10 > * Java 1.8.0_241-b07 > Steps to reproduce: > * Compile Spark (mvn -DskipTests clean install -Dskip) > * ./sbin/start-master.sh > * ./sbin/start-slave.sh spark://host:7077 > * submit an empty application > Error: > {code:java} > 20/03/02 14:25:44 INFO SignalUtils: Registering signal handler for PWR > 20/03/02 14:25:44 WARN SignalUtils: Failed to register signal handler for PWR > java.lang.IllegalArgumentException: Unknown signal: PWR > at sun.misc.Signal.(Signal.java:143) > at > org.apache.spark.util.SignalUtils$.$anonfun$register$1(SignalUtils.scala:64) > at scala.collection.mutable.HashMap.getOrElseUpdate(HashMap.scala:86) > at org.apache.spark.util.SignalUtils$.register(SignalUtils.scala:62) > at > org.apache.spark.executor.CoarseGrainedExecutorBackend.onStart(CoarseGrainedExecutorBackend.scala:85) > at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:120) > at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203) > at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100) > at > org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75) > at > org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31032) GMM compute summary and update distributions in one pass
[ https://issues.apache.org/jira/browse/SPARK-31032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng reassigned SPARK-31032: Assignee: zhengruifeng > GMM compute summary and update distributions in one pass > > > Key: SPARK-31032 > URL: https://issues.apache.org/jira/browse/SPARK-31032 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > > In current impl, GMM need to trigger two jobs at one iteration: > 1, one to compute summary; > 2, if {{shouldDistributeGaussians}} ((k - 1.0) / k) * numFeatures > 25.0, trigger another to update distributions; > > shouldDistributeGaussians is almost true in practice, since numFeatures is likely to be greater than 25. > > We can use only one job to impl above computation, -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31032) GMM compute summary and update distributions in one pass
[ https://issues.apache.org/jira/browse/SPARK-31032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-31032. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27784 [https://github.com/apache/spark/pull/27784] > GMM compute summary and update distributions in one pass > > > Key: SPARK-31032 > URL: https://issues.apache.org/jira/browse/SPARK-31032 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > Fix For: 3.1.0 > > > In current impl, GMM need to trigger two jobs at one iteration: > 1, one to compute summary; > 2, if {{shouldDistributeGaussians}} ((k - 1.0) / k) * numFeatures > 25.0, trigger another to update distributions; > > shouldDistributeGaussians is almost true in practice, since numFeatures is likely to be greater than 25. > > We can use only one job to impl above computation, -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
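For context on the description above, here is a sketch of the distribution heuristic it references. It mirrors the formula quoted in the ticket but is not the MLlib source itself.

{code:scala}
// Distribute the Gaussian updates only when the per-iteration work is large enough.
def shouldDistributeGaussians(k: Int, numFeatures: Int): Boolean =
  ((k - 1.0) / k) * numFeatures > 25.0

// With k = 3 clusters the threshold is numFeatures > 37.5, so for most real
// feature counts the distributed path (the second job per iteration) is taken.
println(shouldDistributeGaussians(k = 3, numFeatures = 100)) // true
{code}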
[jira] [Commented] (SPARK-31083) .ClassNotFoundException CoarseGrainedClusterMessages$RetrieveDelegationTokens
[ https://issues.apache.org/jira/browse/SPARK-31083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057555#comment-17057555 ] jiama commented on SPARK-31083: --- Yes, I used Spark 2.4 from the CDH 6.2 suite. This error occurred when I used IDEA to submit the job directly in yarn-client mode. The project is a Maven project that depends on Apache Spark 2.4 and spark-yarn 2.4, and it is compiled with Scala 2.11.12. > .ClassNotFoundException CoarseGrainedClusterMessages$RetrieveDelegationTokens > - > > Key: SPARK-31083 > URL: https://issues.apache.org/jira/browse/SPARK-31083 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.0 > Environment: spark2.4-cdh6.2 >Reporter: jiama >Priority: Major > > Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: > org.apache.spark.scheduler.cluster.CoarseGrainedClusterMessages$RetrieveDelegationTokens$ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31126) Upgrade Kafka to 2.4.1
[ https://issues.apache.org/jira/browse/SPARK-31126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-31126: - Assignee: Dongjoon Hyun > Upgrade Kafka to 2.4.1 > -- > > Key: SPARK-31126 > URL: https://issues.apache.org/jira/browse/SPARK-31126 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31126) Upgrade Kafka to 2.4.1
[ https://issues.apache.org/jira/browse/SPARK-31126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31126. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27881 [https://github.com/apache/spark/pull/27881] > Upgrade Kafka to 2.4.1 > -- > > Key: SPARK-31126 > URL: https://issues.apache.org/jira/browse/SPARK-31126 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30839) Add version information for Spark configuration
[ https://issues.apache.org/jira/browse/SPARK-30839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057524#comment-17057524 ] Hyukjin Kwon commented on SPARK-30839: -- [~beliefer] are there more JIRAs to add from your side? Let me know when you finish, and then I will resolve this ticket. > Add version information for Spark configuration > --- > > Key: SPARK-30839 > URL: https://issues.apache.org/jira/browse/SPARK-30839 > Project: Spark > Issue Type: Improvement > Components: Documentation, DStreams, Kubernetes, Mesos, Spark Core, > SQL, Structured Streaming, YARN >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Priority: Major > > Spark ConfigEntry and ConfigBuilder missing Spark version information of each > configuration at release. This is not good for Spark user when they visiting > the page of spark configuration. > http://spark.apache.org/docs/latest/configuration.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
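The umbrella ticket above threads a version string through Spark's internal ConfigBuilder so the generated docs can show when each setting was introduced. The entry below is hypothetical (there is no spark.example.maxWidgets) and ConfigBuilder is internal API; it only sketches the pattern the subtasks apply.

{code:scala}
import org.apache.spark.internal.config.ConfigBuilder

// Hypothetical config entry: the .version(...) call records the release that
// introduced the setting and is surfaced on the configuration docs page.
val MAX_WIDGETS = ConfigBuilder("spark.example.maxWidgets")
  .doc("Maximum number of widgets to track (illustrative only).")
  .version("3.1.0")
  .intConf
  .createWithDefault(100)
{code}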
[jira] [Assigned] (SPARK-30911) Add version information to the configuration of Status
[ https://issues.apache.org/jira/browse/SPARK-30911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-30911: Assignee: jiaan.geng > Add version information to the configuration of Status > -- > > Key: SPARK-30911 > URL: https://issues.apache.org/jira/browse/SPARK-30911 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > > core/src/main/scala/org/apache/spark/internal/config/Status.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-30911) Add version information to the configuration of Status
[ https://issues.apache.org/jira/browse/SPARK-30911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-30911. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27848 [https://github.com/apache/spark/pull/27848] > Add version information to the configuration of Status > -- > > Key: SPARK-30911 > URL: https://issues.apache.org/jira/browse/SPARK-30911 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.1.0 > > > core/src/main/scala/org/apache/spark/internal/config/Status.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31109) Add version information to the configuration of Mesos
[ https://issues.apache.org/jira/browse/SPARK-31109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31109. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27863 [https://github.com/apache/spark/pull/27863] > Add version information to the configuration of Mesos > - > > Key: SPARK-31109 > URL: https://issues.apache.org/jira/browse/SPARK-31109 > Project: Spark > Issue Type: Sub-task > Components: Mesos >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.1.0 > > > esource-managers/mesos/src/main/scala/org/apache/spark/deploy/mesos/config.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31109) Add version information to the configuration of Mesos
[ https://issues.apache.org/jira/browse/SPARK-31109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-31109: Assignee: jiaan.geng > Add version information to the configuration of Mesos > - > > Key: SPARK-31109 > URL: https://issues.apache.org/jira/browse/SPARK-31109 > Project: Spark > Issue Type: Sub-task > Components: Mesos >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > > esource-managers/mesos/src/main/scala/org/apache/spark/deploy/mesos/config.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30839) Add version information for Spark configuration
[ https://issues.apache.org/jira/browse/SPARK-30839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-30839: --- Component/s: YARN Structured Streaming Spark Core Mesos Kubernetes DStreams Documentation > Add version information for Spark configuration > --- > > Key: SPARK-30839 > URL: https://issues.apache.org/jira/browse/SPARK-30839 > Project: Spark > Issue Type: Improvement > Components: Documentation, DStreams, Kubernetes, Mesos, Spark Core, > SQL, Structured Streaming, YARN >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Priority: Major > > Spark ConfigEntry and ConfigBuilder missing Spark version information of each > configuration at release. This is not good for Spark user when they visiting > the page of spark configuration. > http://spark.apache.org/docs/latest/configuration.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31128) Fix Uncaught TypeError in streaming statistics page
Gengliang Wang created SPARK-31128: -- Summary: Fix Uncaught TypeError in streaming statistics page Key: SPARK-31128 URL: https://issues.apache.org/jira/browse/SPARK-31128 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 3.0.0 Reporter: Gengliang Wang Assignee: Gengliang Wang There is a minor issue introduced in https://github.com/apache/spark/pull/26201 In the streaming statistics page, the following error appears in the console after clicking the timeline graph: ``` streaming-page.js:211 Uncaught TypeError: Cannot read property 'top' of undefined at SVGCircleElement. (streaming-page.js:211) at SVGCircleElement.__onclick (d3.min.js:1) ``` We should fix it. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31074) Avro serializer should not fail when a nullable Spark field is written to a non-null Avro column
[ https://issues.apache.org/jira/browse/SPARK-31074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057505#comment-17057505 ] Kyrill Alyoshin edited comment on SPARK-31074 at 3/12/20, 1:01 AM: --- Here you go. Avro schema: {code:java} { "type": "record", "namespace": "com.domain.em", "name": "PracticeDiff", "fields": [ { "name": "practiceId", "type": "string" }, { "name": "value", "type": "string" }, { "name": "checkedValue", "type": "string" } ] } {code} Java code: {code:java} package com.domain.em; public final class PracticeDiff { private String practiceId; private String value; private String checkedValue; public String getPracticeId() { return practiceId; } public String getValue() { return value; } public String getCheckedValue() { return checkedValue; } } {code} Thank you! was (Author: kyrill007): Here you go. Avro schema: {code:java} { "type": "record", "namespace": "com.domain.em", "name": "PracticeDiff", "fields": [ { "name": "practiceId", "type": "string" }, { "name": "value", "type": "string" }, { "name": "checkedValue", "type": "string" } ] } {code} Java code: {code:java} package com.domain.em; public final class PracticeDiff { private String practiceId; private String value; private String checkedValue; public String getPracticeId() { return practiceId; } public String getValue() { return value; } public String getCheckedValue() { return checkedValue; } } {code} Thank you! > Avro serializer should not fail when a nullable Spark field is written to a > non-null Avro column > > > Key: SPARK-31074 > URL: https://issues.apache.org/jira/browse/SPARK-31074 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: Kyrill Alyoshin >Priority: Major > > Spark StructType schema are strongly biased towards having _nullable_ fields. > In fact, this is what _Encoders.bean()_ does - any non-primitive field is > automatically _nullable_. When we attempt to serialize dataframes into > *user-supplied* Avro schemas where such corresponding fields are marked as > _non-null_ (i.e., they are not of _union_ type) any such attempt will fail > with the following exception > > {code:java} > Caused by: org.apache.avro.AvroRuntimeException: Not a union: "string" > at org.apache.avro.Schema.getTypes(Schema.java:299) > at > org.apache.spark.sql.avro.AvroSerializer.org$apache$spark$sql$avro$AvroSerializer$$resolveNullableType(AvroSerializer.scala:229) > at > org.apache.spark.sql.avro.AvroSerializer$$anonfun$3.apply(AvroSerializer.scala:209) > {code} > This seems as rather draconian. We certainly should be able to write a field > of the same type and with the same name if it is not a null into a > non-nullable Avro column. In fact, the problem is so *severe* that it is not > clear what should be done in such situations when Avro schema is given to you > as part of API communication contract (i.e., it is non-changeable). > This is an important issue. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31074) Avro serializer should not fail when a nullable Spark field is written to a non-null Avro column
[ https://issues.apache.org/jira/browse/SPARK-31074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057505#comment-17057505 ] Kyrill Alyoshin edited comment on SPARK-31074 at 3/12/20, 1:01 AM: --- Here you go. Avro schema: {code:java} { "type": "record", "namespace": "com.domain.em", "name": "PracticeDiff", "fields": [ { "name": "practiceId", "type": "string" }, { "name": "value", "type": "string" }, { "name": "checkedValue", "type": "string" } ] } {code} Java code: {code:java} package com.domain.em; public final class PracticeDiff { private String practiceId; private String value; private String checkedValue; public String getPracticeId() { return practiceId; } public String getValue() { return value; } public String getCheckedValue() { return checkedValue; } } {code} Thank you! was (Author: kyrill007): Here you go. Avro schema: {code:java} { "type": "record", "namespace": "com.domain.em", "name": "PracticeDiff", "fields": [ { "name": "practiceId", "type": "string" }, { "name": "cisValue", "type": "string" }, { "name": "checkedValue", "type": "string" } ] } {code} Java code: {code:java} package com.domain.em; public final class PracticeDiff { private String practiceId; private String value; private String checkedValue; public String getPracticeId() { return practiceId; } public String getValue() { return value; } public String getCheckedValue() { return checkedValue; } } {code} Thank you! > Avro serializer should not fail when a nullable Spark field is written to a > non-null Avro column > > > Key: SPARK-31074 > URL: https://issues.apache.org/jira/browse/SPARK-31074 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: Kyrill Alyoshin >Priority: Major > > Spark StructType schema are strongly biased towards having _nullable_ fields. > In fact, this is what _Encoders.bean()_ does - any non-primitive field is > automatically _nullable_. When we attempt to serialize dataframes into > *user-supplied* Avro schemas where such corresponding fields are marked as > _non-null_ (i.e., they are not of _union_ type) any such attempt will fail > with the following exception > > {code:java} > Caused by: org.apache.avro.AvroRuntimeException: Not a union: "string" > at org.apache.avro.Schema.getTypes(Schema.java:299) > at > org.apache.spark.sql.avro.AvroSerializer.org$apache$spark$sql$avro$AvroSerializer$$resolveNullableType(AvroSerializer.scala:229) > at > org.apache.spark.sql.avro.AvroSerializer$$anonfun$3.apply(AvroSerializer.scala:209) > {code} > This seems as rather draconian. We certainly should be able to write a field > of the same type and with the same name if it is not a null into a > non-nullable Avro column. In fact, the problem is so *severe* that it is not > clear what should be done in such situations when Avro schema is given to you > as part of API communication contract (i.e., it is non-changeable). > This is an important issue. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31074) Avro serializer should not fail when a nullable Spark field is written to a non-null Avro column
[ https://issues.apache.org/jira/browse/SPARK-31074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057505#comment-17057505 ] Kyrill Alyoshin commented on SPARK-31074: - Here you go. Avro schema: {code:java} { "type": "record", "namespace": "com.domain.em", "name": "PracticeDiff", "fields": [ { "name": "practiceId", "type": "string" }, { "name": "cisValue", "type": "string" }, { "name": "checkedValue", "type": "string" } ] } {code} Java code: {code:java} package com.domain.em; public final class PracticeDiff { private String practiceId; private String value; private String checkedValue; public String getPracticeId() { return practiceId; } public String getValue() { return value; } public String getCheckedValue() { return checkedValue; } } {code} Thank you! > Avro serializer should not fail when a nullable Spark field is written to a > non-null Avro column > > > Key: SPARK-31074 > URL: https://issues.apache.org/jira/browse/SPARK-31074 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: Kyrill Alyoshin >Priority: Major > > Spark StructType schema are strongly biased towards having _nullable_ fields. > In fact, this is what _Encoders.bean()_ does - any non-primitive field is > automatically _nullable_. When we attempt to serialize dataframes into > *user-supplied* Avro schemas where such corresponding fields are marked as > _non-null_ (i.e., they are not of _union_ type) any such attempt will fail > with the following exception > > {code:java} > Caused by: org.apache.avro.AvroRuntimeException: Not a union: "string" > at org.apache.avro.Schema.getTypes(Schema.java:299) > at > org.apache.spark.sql.avro.AvroSerializer.org$apache$spark$sql$avro$AvroSerializer$$resolveNullableType(AvroSerializer.scala:229) > at > org.apache.spark.sql.avro.AvroSerializer$$anonfun$3.apply(AvroSerializer.scala:209) > {code} > This seems as rather draconian. We certainly should be able to write a field > of the same type and with the same name if it is not a null into a > non-nullable Avro column. In fact, the problem is so *severe* that it is not > clear what should be done in such situations when Avro schema is given to you > as part of API communication contract (i.e., it is non-changeable). > This is an important issue. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
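A hedged sketch of the write path being described: a DataFrame with nullable string columns written against a user-supplied Avro schema whose fields are plain (non-union) strings. The output path is a placeholder and the spark-avro package is assumed to be on the classpath.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("avro-nullability-demo").getOrCreate()
import spark.implicits._

val avroSchema =
  """{"type":"record","name":"PracticeDiff","namespace":"com.domain.em",
    |"fields":[{"name":"practiceId","type":"string"},
    |{"name":"value","type":"string"},
    |{"name":"checkedValue","type":"string"}]}""".stripMargin

// Tuple-derived string columns are nullable, while the Avro columns above are
// plain (non-union) strings: the mismatch the report says triggers "Not a union".
val df = Seq(("p1", "v1", "c1")).toDF("practiceId", "value", "checkedValue")

df.write
  .format("avro")
  .option("avroSchema", avroSchema)
  .save("/tmp/practice_diff_avro")
{code}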
[jira] [Assigned] (SPARK-31118) Add version information to the configuration of K8S
[ https://issues.apache.org/jira/browse/SPARK-31118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-31118: Assignee: jiaan.geng > Add version information to the configuration of K8S > --- > > Key: SPARK-31118 > URL: https://issues.apache.org/jira/browse/SPARK-31118 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > > resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31118) Add version information to the configuration of K8S
[ https://issues.apache.org/jira/browse/SPARK-31118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31118. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27875 [https://github.com/apache/spark/pull/27875] > Add version information to the configuration of K8S > --- > > Key: SPARK-31118 > URL: https://issues.apache.org/jira/browse/SPARK-31118 > Project: Spark > Issue Type: Sub-task > Components: Kubernetes >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.1.0 > > > resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31092) Add version information to the configuration of Yarn
[ https://issues.apache.org/jira/browse/SPARK-31092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31092. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27856 [https://github.com/apache/spark/pull/27856] > Add version information to the configuration of Yarn > > > Key: SPARK-31092 > URL: https://issues.apache.org/jira/browse/SPARK-31092 > Project: Spark > Issue Type: Sub-task > Components: YARN >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.1.0 > > > resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/config.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31092) Add version information to the configuration of Yarn
[ https://issues.apache.org/jira/browse/SPARK-31092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-31092: Assignee: jiaan.geng > Add version information to the configuration of Yarn > > > Key: SPARK-31092 > URL: https://issues.apache.org/jira/browse/SPARK-31092 > Project: Spark > Issue Type: Sub-task > Components: YARN >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > > resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/config.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31002) Add version information to the configuration of Core
[ https://issues.apache.org/jira/browse/SPARK-31002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-31002: Assignee: jiaan.geng > Add version information to the configuration of Core > > > Key: SPARK-31002 > URL: https://issues.apache.org/jira/browse/SPARK-31002 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Assignee: jiaan.geng >Priority: Major > Fix For: 3.1.0 > > > core/src/main/scala/org/apache/spark/internal/config/package.scala -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31074) Avro serializer should not fail when a nullable Spark field is written to a non-null Avro column
[ https://issues.apache.org/jira/browse/SPARK-31074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057500#comment-17057500 ] Hyukjin Kwon commented on SPARK-31074: -- Thanks, [~kyrill007]. If you already have the codes, can you paste here? So people just can copy and paste to reproduce > Avro serializer should not fail when a nullable Spark field is written to a > non-null Avro column > > > Key: SPARK-31074 > URL: https://issues.apache.org/jira/browse/SPARK-31074 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: Kyrill Alyoshin >Priority: Major > > Spark StructType schema are strongly biased towards having _nullable_ fields. > In fact, this is what _Encoders.bean()_ does - any non-primitive field is > automatically _nullable_. When we attempt to serialize dataframes into > *user-supplied* Avro schemas where such corresponding fields are marked as > _non-null_ (i.e., they are not of _union_ type) any such attempt will fail > with the following exception > > {code:java} > Caused by: org.apache.avro.AvroRuntimeException: Not a union: "string" > at org.apache.avro.Schema.getTypes(Schema.java:299) > at > org.apache.spark.sql.avro.AvroSerializer.org$apache$spark$sql$avro$AvroSerializer$$resolveNullableType(AvroSerializer.scala:229) > at > org.apache.spark.sql.avro.AvroSerializer$$anonfun$3.apply(AvroSerializer.scala:209) > {code} > This seems as rather draconian. We certainly should be able to write a field > of the same type and with the same name if it is not a null into a > non-nullable Avro column. In fact, the problem is so *severe* that it is not > clear what should be done in such situations when Avro schema is given to you > as part of API communication contract (i.e., it is non-changeable). > This is an important issue. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31039) Unable to use vendor specific datatypes with JDBC (MSSQL)
[ https://issues.apache.org/jira/browse/SPARK-31039?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057497#comment-17057497 ] Hyukjin Kwon commented on SPARK-31039: -- When you describe an issue, it's best to be specific and explicit. The fix might address the general issue if it is. > Unable to use vendor specific datatypes with JDBC (MSSQL) > - > > Key: SPARK-31039 > URL: https://issues.apache.org/jira/browse/SPARK-31039 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Frank Oosterhuis >Priority: Major > > I'm trying to create a table in MSSQL with a time(7) type. > For this I'm using the createTableColumnTypes option like "CallStartTime > time(7)", with driver > "{color:#212121}com.microsoft.sqlserver.jdbc.SQLServerDriver"{color} > I'm getting an error: > {color:#212121}org.apache.spark.sql.catalyst.parser.ParseException: DataType > time(7) is not supported.(line 1, pos 43){color} > {color:#212121}What is then the point of using this option?{color} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
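A minimal sketch of the failing call described above (connection details and the source path are placeholders): the createTableColumnTypes value is parsed by Spark's own DDL parser, which is why a vendor-specific type like time(7) is rejected before anything reaches SQL Server.

{code:scala}
import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("jdbc-coltypes-demo").getOrCreate()
val df = spark.read.parquet("/tmp/calls")

val props = new Properties()
props.setProperty("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")

// Throws ParseException: "DataType time(7) is not supported" while the option
// is parsed, before any JDBC connection is made.
df.write
  .option("createTableColumnTypes", "CallStartTime time(7)")
  .jdbc("jdbc:sqlserver://host:1433;databaseName=db", "dbo.Calls", props)
{code}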
[jira] [Resolved] (SPARK-31110) refine sql doc for SELECT
[ https://issues.apache.org/jira/browse/SPARK-31110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31110. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27866 [https://github.com/apache/spark/pull/27866] > refine sql doc for SELECT > - > > Key: SPARK-31110 > URL: https://issues.apache.org/jira/browse/SPARK-31110 > Project: Spark > Issue Type: Documentation > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31127) Add abstract Selector
Huaxin Gao created SPARK-31127: -- Summary: Add abstract Selector Key: SPARK-31127 URL: https://issues.apache.org/jira/browse/SPARK-31127 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 3.1.0 Reporter: Huaxin Gao Add abstract Selector. Put the common code between ChisqSelector and FValueSelector to Selector. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
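The refactoring is only summarized in the ticket above, so the sketch below is purely illustrative of the idea (hoist shared top-k selection into an abstract parent and let each concrete selector supply its own test statistic); the class and method names are hypothetical, not the actual spark.ml hierarchy.

{code:scala}
// Illustrative only: names and signatures here are assumptions.
abstract class Selector {
  /** Per-feature test statistic; each concrete selector defines its own test. */
  protected def testStatistics(labels: Array[Double],
                               features: Array[Array[Double]]): Array[Double]

  /** Shared logic that would otherwise be duplicated in every selector. */
  def selectTopK(labels: Array[Double],
                 features: Array[Array[Double]], k: Int): Array[Int] =
    testStatistics(labels, features).zipWithIndex
      .sortBy { case (stat, _) => -stat }
      .take(k)
      .map(_._2)
}
{code}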
[jira] [Commented] (SPARK-25987) StackOverflowError when executing many operations on a table with many columns
[ https://issues.apache.org/jira/browse/SPARK-25987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057472#comment-17057472 ] Dongjoon Hyun commented on SPARK-25987: --- I believe Janino is required in any way for stability, but I'm not sure what the other addition patches are. > StackOverflowError when executing many operations on a table with many columns > -- > > Key: SPARK-25987 > URL: https://issues.apache.org/jira/browse/SPARK-25987 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.2, 2.4.0, 2.4.5 > Environment: Ubuntu 18.04.1 LTS, openjdk "1.8.0_181" >Reporter: Ivan Tsukanov >Priority: Major > > When I execute > {code:java} > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val columnsCount = 100 > val columns = (1 to columnsCount).map(i => s"col$i") > val initialData = (1 to columnsCount).map(i => s"val$i") > val df = spark.createDataFrame( > rowRDD = spark.sparkContext.makeRDD(Seq(Row.fromSeq(initialData))), > schema = StructType(columns.map(StructField(_, StringType, true))) > ) > val addSuffixUDF = udf( > (str: String) => str + "_added" > ) > implicit class DFOps(df: DataFrame) { > def addSuffix() = { > df.select(columns.map(col => > addSuffixUDF(df(col)).as(col) > ): _*) > } > } > df.addSuffix().addSuffix().addSuffix().show() > {code} > I get > {code:java} > An exception or error caused a run to abort. > java.lang.StackOverflowError > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:385) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:553) > ... > {code} > If I reduce columns number (to 10 for example) or do `addSuffix` only once - > it works fine. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31126) Upgrade Kafka to 2.4.1
Dongjoon Hyun created SPARK-31126: - Summary: Upgrade Kafka to 2.4.1 Key: SPARK-31126 URL: https://issues.apache.org/jira/browse/SPARK-31126 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 3.0.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19335) Spark should support doing an efficient DataFrame Upsert via JDBC
[ https://issues.apache.org/jira/browse/SPARK-19335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057469#comment-17057469 ] John Lonergan commented on SPARK-19335: --- Streaming continous data from a fire hose source into a JDBC store. Need batched commits for throughput, but also need batches size control to keep latency under control. ie Delayed commits but not too delayed. And I want to do this without risk of data loss in the event of losing the infra so the commits and checkpointing need to be aligned. Any examples of this in practice? Have this working in Flink with almost no effort, but would prefer Spark for consistency. > Spark should support doing an efficient DataFrame Upsert via JDBC > - > > Key: SPARK-19335 > URL: https://issues.apache.org/jira/browse/SPARK-19335 > Project: Spark > Issue Type: Improvement >Reporter: Ilya Ganelin >Priority: Minor > > Doing a database update, as opposed to an insert is useful, particularly when > working with streaming applications which may require revisions to previously > stored data. > Spark DataFrames/DataSets do not currently support an Update feature via the > JDBC Writer allowing only Overwrite or Append. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
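On the question above about batched, checkpointed writes to a JDBC store: there is still no built-in upsert, but the closest built-in pattern today is foreachBatch with an append write plus a checkpoint location, with the trigger interval bounding commit latency. The sketch below uses placeholder connection details and the rate source purely for illustration; it appends rather than upserts.

{code:scala}
import java.util.Properties
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("stream-to-jdbc").getOrCreate()

val stream = spark.readStream.format("rate").option("rowsPerSecond", "100").load()

val props = new Properties()
props.setProperty("user", "app")
props.setProperty("password", "secret")

// One append per micro-batch; each partition commits its own JDBC transaction.
// Delivery is at-least-once, so a restart may replay the most recent batch.
val writeBatch: (DataFrame, Long) => Unit = (batch, batchId) =>
  batch.write.mode("append").jdbc("jdbc:postgresql://host/db", "events", props)

stream.writeStream
  .foreachBatch(writeBatch)
  .option("checkpointLocation", "/tmp/checkpoints/stream-to-jdbc")
  .trigger(Trigger.ProcessingTime("5 seconds"))
  .start()
{code}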
[jira] [Resolved] (SPARK-31062) Improve Spark Decommissioning K8s test relability
[ https://issues.apache.org/jira/browse/SPARK-31062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau resolved SPARK-31062. -- Fix Version/s: 3.1.0 Resolution: Fixed > Improve Spark Decommissioning K8s test relability > - > > Key: SPARK-31062 > URL: https://issues.apache.org/jira/browse/SPARK-31062 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Tests >Affects Versions: 3.1.0 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Minor > Fix For: 3.1.0 > > > The test currently flakes more than the other Kubernetes tests. We can remove > some of the timing that is likely to be a source of flakiness. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31125) When processing new K8s state snapshots Spark treats Terminating nodes as terminated.
Holden Karau created SPARK-31125: Summary: When processing new K8s state snapshots Spark treats Terminating nodes as terminated. Key: SPARK-31125 URL: https://issues.apache.org/jira/browse/SPARK-31125 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.1.0 Reporter: Holden Karau Assignee: Holden Karau -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25987) StackOverflowError when executing many operations on a table with many columns
[ https://issues.apache.org/jira/browse/SPARK-25987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057339#comment-17057339 ] L. C. Hsieh commented on SPARK-25987: - Hmm, thanks for checking, so currently we are not sure what patch fixes this issue? > StackOverflowError when executing many operations on a table with many columns > -- > > Key: SPARK-25987 > URL: https://issues.apache.org/jira/browse/SPARK-25987 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.2, 2.4.0, 2.4.5 > Environment: Ubuntu 18.04.1 LTS, openjdk "1.8.0_181" >Reporter: Ivan Tsukanov >Priority: Major > > When I execute > {code:java} > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val columnsCount = 100 > val columns = (1 to columnsCount).map(i => s"col$i") > val initialData = (1 to columnsCount).map(i => s"val$i") > val df = spark.createDataFrame( > rowRDD = spark.sparkContext.makeRDD(Seq(Row.fromSeq(initialData))), > schema = StructType(columns.map(StructField(_, StringType, true))) > ) > val addSuffixUDF = udf( > (str: String) => str + "_added" > ) > implicit class DFOps(df: DataFrame) { > def addSuffix() = { > df.select(columns.map(col => > addSuffixUDF(df(col)).as(col) > ): _*) > } > } > df.addSuffix().addSuffix().addSuffix().show() > {code} > I get > {code:java} > An exception or error caused a run to abort. > java.lang.StackOverflowError > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:385) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:553) > ... > {code} > If I reduce columns number (to 10 for example) or do `addSuffix` only once - > it works fine. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25987) StackOverflowError when executing many operations on a table with many columns
[ https://issues.apache.org/jira/browse/SPARK-25987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057338#comment-17057338 ] Dongjoon Hyun commented on SPARK-25987: --- Yes. I agree that that was misleading. I removed the link. Hi, [~kiszk]. Do you know that fixed this at 3.0.0? > StackOverflowError when executing many operations on a table with many columns > -- > > Key: SPARK-25987 > URL: https://issues.apache.org/jira/browse/SPARK-25987 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.2, 2.4.0, 2.4.5 > Environment: Ubuntu 18.04.1 LTS, openjdk "1.8.0_181" >Reporter: Ivan Tsukanov >Priority: Major > > When I execute > {code:java} > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val columnsCount = 100 > val columns = (1 to columnsCount).map(i => s"col$i") > val initialData = (1 to columnsCount).map(i => s"val$i") > val df = spark.createDataFrame( > rowRDD = spark.sparkContext.makeRDD(Seq(Row.fromSeq(initialData))), > schema = StructType(columns.map(StructField(_, StringType, true))) > ) > val addSuffixUDF = udf( > (str: String) => str + "_added" > ) > implicit class DFOps(df: DataFrame) { > def addSuffix() = { > df.select(columns.map(col => > addSuffixUDF(df(col)).as(col) > ): _*) > } > } > df.addSuffix().addSuffix().addSuffix().show() > {code} > I get > {code:java} > An exception or error caused a run to abort. > java.lang.StackOverflowError > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:385) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:553) > ... > {code} > If I reduce columns number (to 10 for example) or do `addSuffix` only once - > it works fine. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29183) Upgrade JDK 11 Installation to 11.0.6
[ https://issues.apache.org/jira/browse/SPARK-29183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057337#comment-17057337 ] Dongjoon Hyun commented on SPARK-29183: --- Thank you so much, [~shaneknapp]! > Upgrade JDK 11 Installation to 11.0.6 > - > > Key: SPARK-29183 > URL: https://issues.apache.org/jira/browse/SPARK-29183 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Shane Knapp >Priority: Major > > Every JDK 11.0.x releases have many fixes including performance regression > fix. We had better upgrade it to the latest 11.0.4. > - https://bugs.java.com/bugdatabase/view_bug.do?bug_id=JDK-8221760 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-22523) Janino throws StackOverflowError on nested structs with many fields
[ https://issues.apache.org/jira/browse/SPARK-22523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-22523. - > Janino throws StackOverflowError on nested structs with many fields > --- > > Key: SPARK-22523 > URL: https://issues.apache.org/jira/browse/SPARK-22523 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 2.2.0 > Environment: * Linux > * Scala: 2.11.8 > * Spark: 2.2.0 >Reporter: Utku Demir >Priority: Minor > > When running the below application, Janino throws StackOverflow: > {code} > Exception in thread "main" java.lang.StackOverflowError > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:370) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) > {code} > Problematic code: > {code:title=Example.scala|borderStyle=solid} > import org.apache.spark.sql._ > case class Foo( > f1: Int = 0, > f2: Int = 0, > f3: Int = 0, > f4: Int = 0, > f5: Int = 0, > f6: Int = 0, > f7: Int = 0, > f8: Int = 0, > f9: Int = 0, > f10: Int = 0, > f11: Int = 0, > f12: Int = 0, > f13: Int = 0, > f14: Int = 0, > f15: Int = 0, > f16: Int = 0, > f17: Int = 0, > f18: Int = 0, > f19: Int = 0, > f20: Int = 0, > f21: Int = 0, > f22: Int = 0, > f23: Int = 0, > f24: Int = 0 > ) > case class Nest[T]( > a: T, > b: T > ) > object Nest { > def apply[T](t: T): Nest[T] = new Nest(t, t) > } > object Main { > def main(args: Array[String]) { > val spark: SparkSession = > SparkSession.builder().appName("test").master("local[*]").getOrCreate() > import spark.implicits._ > val foo = Foo(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0) > Seq.fill(10)(Nest(Nest(foo))).toDS.groupByKey(identity).count.map(s => > s).collect > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-22761) 64KB JVM bytecode limit problem with GLM
[ https://issues.apache.org/jira/browse/SPARK-22761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-22761. - > 64KB JVM bytecode limit problem with GLM > > > Key: SPARK-22761 > URL: https://issues.apache.org/jira/browse/SPARK-22761 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: Alan Lai >Priority: Major > > {code:java} > GLM > {code} (presumably other mllib tools) > can throw an exception due to the 64KB JVM bytecode limit when they use with > a lot of variables/arguments (~ 2k). -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22523) Janino throws StackOverflowError on nested structs with many fields
[ https://issues.apache.org/jira/browse/SPARK-22523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-22523: -- Parent: SPARK-22510 Issue Type: Sub-task (was: Bug) > Janino throws StackOverflowError on nested structs with many fields > --- > > Key: SPARK-22523 > URL: https://issues.apache.org/jira/browse/SPARK-22523 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 2.2.0 > Environment: * Linux > * Scala: 2.11.8 > * Spark: 2.2.0 >Reporter: Utku Demir >Priority: Minor > > When running the below application, Janino throws StackOverflow: > {code} > Exception in thread "main" java.lang.StackOverflowError > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:370) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:541) > {code} > Problematic code: > {code:title=Example.scala|borderStyle=solid} > import org.apache.spark.sql._ > case class Foo( > f1: Int = 0, > f2: Int = 0, > f3: Int = 0, > f4: Int = 0, > f5: Int = 0, > f6: Int = 0, > f7: Int = 0, > f8: Int = 0, > f9: Int = 0, > f10: Int = 0, > f11: Int = 0, > f12: Int = 0, > f13: Int = 0, > f14: Int = 0, > f15: Int = 0, > f16: Int = 0, > f17: Int = 0, > f18: Int = 0, > f19: Int = 0, > f20: Int = 0, > f21: Int = 0, > f22: Int = 0, > f23: Int = 0, > f24: Int = 0 > ) > case class Nest[T]( > a: T, > b: T > ) > object Nest { > def apply[T](t: T): Nest[T] = new Nest(t, t) > } > object Main { > def main(args: Array[String]) { > val spark: SparkSession = > SparkSession.builder().appName("test").master("local[*]").getOrCreate() > import spark.implicits._ > val foo = Foo(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > 0, 0, 0, 0) > Seq.fill(10)(Nest(Nest(foo))).toDS.groupByKey(identity).count.map(s => > s).collect > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29183) Upgrade JDK 11 Installation to 11.0.6
[ https://issues.apache.org/jira/browse/SPARK-29183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057335#comment-17057335 ] Shane Knapp commented on SPARK-29183: - i'll get to this later this week/early next. > Upgrade JDK 11 Installation to 11.0.6 > - > > Key: SPARK-29183 > URL: https://issues.apache.org/jira/browse/SPARK-29183 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Shane Knapp >Priority: Major > > Every JDK 11.0.x releases have many fixes including performance regression > fix. We had better upgrade it to the latest 11.0.4. > - https://bugs.java.com/bugdatabase/view_bug.do?bug_id=JDK-8221760 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29183) Upgrade JDK 11 Installation to 11.0.6
[ https://issues.apache.org/jira/browse/SPARK-29183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shane Knapp reassigned SPARK-29183: --- Assignee: Shane Knapp > Upgrade JDK 11 Installation to 11.0.6 > - > > Key: SPARK-29183 > URL: https://issues.apache.org/jira/browse/SPARK-29183 > Project: Spark > Issue Type: Improvement > Components: Project Infra >Affects Versions: 3.0.0 >Reporter: Dongjoon Hyun >Assignee: Shane Knapp >Priority: Major > > Every JDK 11.0.x releases have many fixes including performance regression > fix. We had better upgrade it to the latest 11.0.4. > - https://bugs.java.com/bugdatabase/view_bug.do?bug_id=JDK-8221760 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25987) StackOverflowError when executing many operations on a table with many columns
[ https://issues.apache.org/jira/browse/SPARK-25987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057334#comment-17057334 ] Dongjoon Hyun commented on SPARK-25987: --- Unfortunately, it seems that we need more patches from `branch-3.0`. With only Janino 3.0.11 on `branch-2.4`, it fails. > StackOverflowError when executing many operations on a table with many columns > -- > > Key: SPARK-25987 > URL: https://issues.apache.org/jira/browse/SPARK-25987 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.2, 2.4.0, 2.4.5 > Environment: Ubuntu 18.04.1 LTS, openjdk "1.8.0_181" >Reporter: Ivan Tsukanov >Priority: Major > > When I execute > {code:java} > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val columnsCount = 100 > val columns = (1 to columnsCount).map(i => s"col$i") > val initialData = (1 to columnsCount).map(i => s"val$i") > val df = spark.createDataFrame( > rowRDD = spark.sparkContext.makeRDD(Seq(Row.fromSeq(initialData))), > schema = StructType(columns.map(StructField(_, StringType, true))) > ) > val addSuffixUDF = udf( > (str: String) => str + "_added" > ) > implicit class DFOps(df: DataFrame) { > def addSuffix() = { > df.select(columns.map(col => > addSuffixUDF(df(col)).as(col) > ): _*) > } > } > df.addSuffix().addSuffix().addSuffix().show() > {code} > I get > {code:java} > An exception or error caused a run to abort. > java.lang.StackOverflowError > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:385) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:553) > ... > {code} > If I reduce columns number (to 10 for example) or do `addSuffix` only once - > it works fine. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
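A minimal spark-shell sketch for checking this kind of fix: it generalises the repro from the issue description so the number of chained projections can be varied when comparing branch-2.4 (with only Janino 3.0.11) against 3.0.0-preview2. The `stress` helper and the `depth` parameter are illustrative names, not part of the original report.
{code:scala}
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Assumes a spark-shell session, where `spark` is already defined.
val columnsCount = 100
val columns = (1 to columnsCount).map(i => s"col$i")
val initialData = (1 to columnsCount).map(i => s"val$i")

val df = spark.createDataFrame(
  spark.sparkContext.makeRDD(Seq(Row.fromSeq(initialData))),
  StructType(columns.map(StructField(_, StringType, true))))

val addSuffixUDF = udf((s: String) => s + "_added")

// Apply the UDF to every column `depth` times in a row; a larger depth means
// deeper generated code and a higher chance of hitting the StackOverflowError.
def stress(input: DataFrame, depth: Int): DataFrame =
  (1 to depth).foldLeft(input) { (acc, _) =>
    acc.select(columns.map(c => addSuffixUDF(col(c)).as(c)): _*)
  }

stress(df, 3).show()  // per the comments above: fails on 2.4.5, works on 3.0.0-preview2
{code}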
[jira] [Commented] (SPARK-25987) StackOverflowError when executing many operations on a table with many columns
[ https://issues.apache.org/jira/browse/SPARK-25987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057330#comment-17057330 ] L. C. Hsieh commented on SPARK-25987: - Because I'm not sure how this got fixed. I can only see this is superseded by "SPARK-26298 Upgrade Janino version to 3.0.11", so I'm wondering if upgrading Janino can just fix this. > StackOverflowError when executing many operations on a table with many columns > -- > > Key: SPARK-25987 > URL: https://issues.apache.org/jira/browse/SPARK-25987 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.2, 2.4.0, 2.4.5 > Environment: Ubuntu 18.04.1 LTS, openjdk "1.8.0_181" >Reporter: Ivan Tsukanov >Priority: Major > > When I execute > {code:java} > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val columnsCount = 100 > val columns = (1 to columnsCount).map(i => s"col$i") > val initialData = (1 to columnsCount).map(i => s"val$i") > val df = spark.createDataFrame( > rowRDD = spark.sparkContext.makeRDD(Seq(Row.fromSeq(initialData))), > schema = StructType(columns.map(StructField(_, StringType, true))) > ) > val addSuffixUDF = udf( > (str: String) => str + "_added" > ) > implicit class DFOps(df: DataFrame) { > def addSuffix() = { > df.select(columns.map(col => > addSuffixUDF(df(col)).as(col) > ): _*) > } > } > df.addSuffix().addSuffix().addSuffix().show() > {code} > I get > {code:java} > An exception or error caused a run to abort. > java.lang.StackOverflowError > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:385) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:553) > ... > {code} > If I reduce columns number (to 10 for example) or do `addSuffix` only once - > it works fine. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25987) StackOverflowError when executing many operations on a table with many columns
[ https://issues.apache.org/jira/browse/SPARK-25987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057327#comment-17057327 ] Dongjoon Hyun commented on SPARK-25987: --- Do you mean in `branch-2.4`? Let me check that quickly. > StackOverflowError when executing many operations on a table with many columns > -- > > Key: SPARK-25987 > URL: https://issues.apache.org/jira/browse/SPARK-25987 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.2, 2.4.0, 2.4.5 > Environment: Ubuntu 18.04.1 LTS, openjdk "1.8.0_181" >Reporter: Ivan Tsukanov >Priority: Major > > When I execute > {code:java} > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val columnsCount = 100 > val columns = (1 to columnsCount).map(i => s"col$i") > val initialData = (1 to columnsCount).map(i => s"val$i") > val df = spark.createDataFrame( > rowRDD = spark.sparkContext.makeRDD(Seq(Row.fromSeq(initialData))), > schema = StructType(columns.map(StructField(_, StringType, true))) > ) > val addSuffixUDF = udf( > (str: String) => str + "_added" > ) > implicit class DFOps(df: DataFrame) { > def addSuffix() = { > df.select(columns.map(col => > addSuffixUDF(df(col)).as(col) > ): _*) > } > } > df.addSuffix().addSuffix().addSuffix().show() > {code} > I get > {code:java} > An exception or error caused a run to abort. > java.lang.StackOverflowError > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:385) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:553) > ... > {code} > If I reduce columns number (to 10 for example) or do `addSuffix` only once - > it works fine. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31077) Remove ChiSqSelector dependency on mllib.ChiSqSelectorModel
[ https://issues.apache.org/jira/browse/SPARK-31077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-31077: Assignee: Huaxin Gao > Remove ChiSqSelector dependency on mllib.ChiSqSelectorModel > --- > > Key: SPARK-31077 > URL: https://issues.apache.org/jira/browse/SPARK-31077 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > > Currently, ChiSqSelector depends on mllib.ChiSqSelectorModel. Remove this > dependency. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31077) Remove ChiSqSelector dependency on mllib.ChiSqSelectorModel
[ https://issues.apache.org/jira/browse/SPARK-31077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-31077. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27841 [https://github.com/apache/spark/pull/27841] > Remove ChiSqSelector dependency on mllib.ChiSqSelectorModel > --- > > Key: SPARK-31077 > URL: https://issues.apache.org/jira/browse/SPARK-31077 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.1.0 > > > Currently, ChiSqSelector depends on mllib.ChiSqSelectorModel. Remove this > dependency. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31095) Upgrade netty-all to 4.1.47.Final
[ https://issues.apache.org/jira/browse/SPARK-31095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31095: -- Fix Version/s: 2.4.6 > Upgrade netty-all to 4.1.47.Final > - > > Key: SPARK-31095 > URL: https://issues.apache.org/jira/browse/SPARK-31095 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.5, 3.0.0, 3.1.0 >Reporter: Vishwas Vijaya Kumar >Assignee: Dongjoon Hyun >Priority: Major > Labels: security > Fix For: 3.0.0, 2.4.6 > > > Upgrade version of io.netty_netty-all to 4.1.44.Final > [CVE-2019-20445|https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2019-20445] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25987) StackOverflowError when executing many operations on a table with many columns
[ https://issues.apache.org/jira/browse/SPARK-25987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057309#comment-17057309 ] L. C. Hsieh commented on SPARK-25987: - Thanks [~dongjoon]. So upgrading Janino can fix this, right? > StackOverflowError when executing many operations on a table with many columns > -- > > Key: SPARK-25987 > URL: https://issues.apache.org/jira/browse/SPARK-25987 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.2, 2.4.0, 2.4.5 > Environment: Ubuntu 18.04.1 LTS, openjdk "1.8.0_181" >Reporter: Ivan Tsukanov >Priority: Major > > When I execute > {code:java} > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val columnsCount = 100 > val columns = (1 to columnsCount).map(i => s"col$i") > val initialData = (1 to columnsCount).map(i => s"val$i") > val df = spark.createDataFrame( > rowRDD = spark.sparkContext.makeRDD(Seq(Row.fromSeq(initialData))), > schema = StructType(columns.map(StructField(_, StringType, true))) > ) > val addSuffixUDF = udf( > (str: String) => str + "_added" > ) > implicit class DFOps(df: DataFrame) { > def addSuffix() = { > df.select(columns.map(col => > addSuffixUDF(df(col)).as(col) > ): _*) > } > } > df.addSuffix().addSuffix().addSuffix().show() > {code} > I get > {code:java} > An exception or error caused a run to abort. > java.lang.StackOverflowError > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:385) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:553) > ... > {code} > If I reduce columns number (to 10 for example) or do `addSuffix` only once - > it works fine. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25987) StackOverflowError when executing many operations on a table with many columns
[ https://issues.apache.org/jira/browse/SPARK-25987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057301#comment-17057301 ] Dongjoon Hyun edited comment on SPARK-25987 at 3/11/20, 6:29 PM: - Thank you for commenting, [~viirya]. I confirmed that the above example is fixed at 3.0.0-preview2 while 2.4.5 still has this bug. was (Author: dongjoon): Thank you for commenting, [~viirya]. I confirmed that this is fixed at 3.0.0-preview2 while 2.4.5 still has this bug. > StackOverflowError when executing many operations on a table with many columns > -- > > Key: SPARK-25987 > URL: https://issues.apache.org/jira/browse/SPARK-25987 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.2, 2.4.0, 2.4.5 > Environment: Ubuntu 18.04.1 LTS, openjdk "1.8.0_181" >Reporter: Ivan Tsukanov >Priority: Major > > When I execute > {code:java} > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val columnsCount = 100 > val columns = (1 to columnsCount).map(i => s"col$i") > val initialData = (1 to columnsCount).map(i => s"val$i") > val df = spark.createDataFrame( > rowRDD = spark.sparkContext.makeRDD(Seq(Row.fromSeq(initialData))), > schema = StructType(columns.map(StructField(_, StringType, true))) > ) > val addSuffixUDF = udf( > (str: String) => str + "_added" > ) > implicit class DFOps(df: DataFrame) { > def addSuffix() = { > df.select(columns.map(col => > addSuffixUDF(df(col)).as(col) > ): _*) > } > } > df.addSuffix().addSuffix().addSuffix().show() > {code} > I get > {code:java} > An exception or error caused a run to abort. > java.lang.StackOverflowError > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:385) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:553) > ... > {code} > If I reduce columns number (to 10 for example) or do `addSuffix` only once - > it works fine. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25987) StackOverflowError when executing many operations on a table with many columns
[ https://issues.apache.org/jira/browse/SPARK-25987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-25987. --- Resolution: Duplicate > StackOverflowError when executing many operations on a table with many columns > -- > > Key: SPARK-25987 > URL: https://issues.apache.org/jira/browse/SPARK-25987 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.2, 2.4.0, 3.0.0 > Environment: Ubuntu 18.04.1 LTS, openjdk "1.8.0_181" >Reporter: Ivan Tsukanov >Priority: Major > > When I execute > {code:java} > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val columnsCount = 100 > val columns = (1 to columnsCount).map(i => s"col$i") > val initialData = (1 to columnsCount).map(i => s"val$i") > val df = spark.createDataFrame( > rowRDD = spark.sparkContext.makeRDD(Seq(Row.fromSeq(initialData))), > schema = StructType(columns.map(StructField(_, StringType, true))) > ) > val addSuffixUDF = udf( > (str: String) => str + "_added" > ) > implicit class DFOps(df: DataFrame) { > def addSuffix() = { > df.select(columns.map(col => > addSuffixUDF(df(col)).as(col) > ): _*) > } > } > df.addSuffix().addSuffix().addSuffix().show() > {code} > I get > {code:java} > An exception or error caused a run to abort. > java.lang.StackOverflowError > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:385) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:553) > ... > {code} > If I reduce columns number (to 10 for example) or do `addSuffix` only once - > it works fine. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25987) StackOverflowError when executing many operations on a table with many columns
[ https://issues.apache.org/jira/browse/SPARK-25987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25987: -- Affects Version/s: (was: 3.0.0) 2.4.5 > StackOverflowError when executing many operations on a table with many columns > -- > > Key: SPARK-25987 > URL: https://issues.apache.org/jira/browse/SPARK-25987 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.2, 2.4.0, 2.4.5 > Environment: Ubuntu 18.04.1 LTS, openjdk "1.8.0_181" >Reporter: Ivan Tsukanov >Priority: Major > > When I execute > {code:java} > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val columnsCount = 100 > val columns = (1 to columnsCount).map(i => s"col$i") > val initialData = (1 to columnsCount).map(i => s"val$i") > val df = spark.createDataFrame( > rowRDD = spark.sparkContext.makeRDD(Seq(Row.fromSeq(initialData))), > schema = StructType(columns.map(StructField(_, StringType, true))) > ) > val addSuffixUDF = udf( > (str: String) => str + "_added" > ) > implicit class DFOps(df: DataFrame) { > def addSuffix() = { > df.select(columns.map(col => > addSuffixUDF(df(col)).as(col) > ): _*) > } > } > df.addSuffix().addSuffix().addSuffix().show() > {code} > I get > {code:java} > An exception or error caused a run to abort. > java.lang.StackOverflowError > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:385) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:553) > ... > {code} > If I reduce columns number (to 10 for example) or do `addSuffix` only once - > it works fine. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25987) StackOverflowError when executing many operations on a table with many columns
[ https://issues.apache.org/jira/browse/SPARK-25987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057301#comment-17057301 ] Dongjoon Hyun commented on SPARK-25987: --- Thank you for commenting, [~viirya]. I confirmed that this is fixed at 3.0.0-preview2 while 2.4.5 still has this bug. > StackOverflowError when executing many operations on a table with many columns > -- > > Key: SPARK-25987 > URL: https://issues.apache.org/jira/browse/SPARK-25987 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.2, 2.4.0, 3.0.0 > Environment: Ubuntu 18.04.1 LTS, openjdk "1.8.0_181" >Reporter: Ivan Tsukanov >Priority: Major > > When I execute > {code:java} > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val columnsCount = 100 > val columns = (1 to columnsCount).map(i => s"col$i") > val initialData = (1 to columnsCount).map(i => s"val$i") > val df = spark.createDataFrame( > rowRDD = spark.sparkContext.makeRDD(Seq(Row.fromSeq(initialData))), > schema = StructType(columns.map(StructField(_, StringType, true))) > ) > val addSuffixUDF = udf( > (str: String) => str + "_added" > ) > implicit class DFOps(df: DataFrame) { > def addSuffix() = { > df.select(columns.map(col => > addSuffixUDF(df(col)).as(col) > ): _*) > } > } > df.addSuffix().addSuffix().addSuffix().show() > {code} > I get > {code:java} > An exception or error caused a run to abort. > java.lang.StackOverflowError > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:385) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:553) > ... > {code} > If I reduce columns number (to 10 for example) or do `addSuffix` only once - > it works fine. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29295) Duplicate result when dropping partition of an external table and then overwriting
[ https://issues.apache.org/jira/browse/SPARK-29295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057283#comment-17057283 ] Dongjoon Hyun commented on SPARK-29295: --- I confirmed that Apache Spark 2.1.3 and older versions have no problem. > Duplicate result when dropping partition of an external table and then > overwriting > -- > > Key: SPARK-29295 > URL: https://issues.apache.org/jira/browse/SPARK-29295 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.3, 2.3.4, 2.4.4 >Reporter: feiwang >Assignee: L. C. Hsieh >Priority: Major > Labels: correctness > Fix For: 3.0.0 > > > When we drop a partition of a external table and then overwrite it, if we set > CONVERT_METASTORE_PARQUET=true(default value), it will overwrite this > partition. > But when we set CONVERT_METASTORE_PARQUET=false, it will give duplicate > result. > Here is a reproduce code below(you can add it into SQLQuerySuite in hive > module): > {code:java} > test("spark gives duplicate result when dropping a partition of an external > partitioned table" + > " firstly and they overwrite it") { > withTable("test") { > withTempDir { f => > sql("create external table test(id int) partitioned by (name string) > stored as " + > s"parquet location '${f.getAbsolutePath}'") > withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> > false.toString) { > sql("insert overwrite table test partition(name='n1') select 1") > sql("ALTER TABLE test DROP PARTITION(name='n1')") > sql("insert overwrite table test partition(name='n1') select 2") > checkAnswer( sql("select id from test where name = 'n1' order by > id"), > Array(Row(1), Row(2))) > } > withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> true.toString) > { > sql("insert overwrite table test partition(name='n1') select 1") > sql("ALTER TABLE test DROP PARTITION(name='n1')") > sql("insert overwrite table test partition(name='n1') select 2") > checkAnswer( sql("select id from test where name = 'n1' order by > id"), > Array(Row(2))) > } > } > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
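Until a branch-2.4 backport lands, a minimal mitigation sketch for the affected 2.x versions, based only on the repro in the issue description, is to keep the conversion flag at its default of true (or set it explicitly) before the overwrite; the table and partition names below are the ones from the repro and purely illustrative.
{code:scala}
// Keep Parquet conversion enabled (the default) so the INSERT OVERWRITE goes
// through Spark's datasource path, which replaces the partition as expected.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "true")

spark.sql("INSERT OVERWRITE TABLE test PARTITION (name='n1') SELECT 2")
spark.sql("SELECT id FROM test WHERE name = 'n1' ORDER BY id").show()  // expect a single row: 2
{code}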
[jira] [Updated] (SPARK-29295) Duplicate result when dropping partition of an external table and then overwriting
[ https://issues.apache.org/jira/browse/SPARK-29295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29295: -- Affects Version/s: (was: 2.4.4) 2.4.5 > Duplicate result when dropping partition of an external table and then > overwriting > -- > > Key: SPARK-29295 > URL: https://issues.apache.org/jira/browse/SPARK-29295 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.3, 2.3.4, 2.4.5 >Reporter: feiwang >Assignee: L. C. Hsieh >Priority: Major > Labels: correctness > Fix For: 3.0.0 > > > When we drop a partition of a external table and then overwrite it, if we set > CONVERT_METASTORE_PARQUET=true(default value), it will overwrite this > partition. > But when we set CONVERT_METASTORE_PARQUET=false, it will give duplicate > result. > Here is a reproduce code below(you can add it into SQLQuerySuite in hive > module): > {code:java} > test("spark gives duplicate result when dropping a partition of an external > partitioned table" + > " firstly and they overwrite it") { > withTable("test") { > withTempDir { f => > sql("create external table test(id int) partitioned by (name string) > stored as " + > s"parquet location '${f.getAbsolutePath}'") > withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> > false.toString) { > sql("insert overwrite table test partition(name='n1') select 1") > sql("ALTER TABLE test DROP PARTITION(name='n1')") > sql("insert overwrite table test partition(name='n1') select 2") > checkAnswer( sql("select id from test where name = 'n1' order by > id"), > Array(Row(1), Row(2))) > } > withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> true.toString) > { > sql("insert overwrite table test partition(name='n1') select 1") > sql("ALTER TABLE test DROP PARTITION(name='n1')") > sql("insert overwrite table test partition(name='n1') select 2") > checkAnswer( sql("select id from test where name = 'n1' order by > id"), > Array(Row(2))) > } > } > } > } > {code} > {code} > create external table test(id int) partitioned by (name string) stored as > parquet location '/tmp/p'; > set spark.sql.hive.convertMetastoreParquet=false; > insert overwrite table test partition(name='n1') select 1; > ALTER TABLE test DROP PARTITION(name='n1'); > insert overwrite table test partition(name='n1') select 2; > select id from test where name = 'n1' order by id; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29295) Duplicate result when dropping partition of an external table and then overwriting
[ https://issues.apache.org/jira/browse/SPARK-29295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29295: -- Description: When we drop a partition of a external table and then overwrite it, if we set CONVERT_METASTORE_PARQUET=true(default value), it will overwrite this partition. But when we set CONVERT_METASTORE_PARQUET=false, it will give duplicate result. Here is a reproduce code below(you can add it into SQLQuerySuite in hive module): {code:java} test("spark gives duplicate result when dropping a partition of an external partitioned table" + " firstly and they overwrite it") { withTable("test") { withTempDir { f => sql("create external table test(id int) partitioned by (name string) stored as " + s"parquet location '${f.getAbsolutePath}'") withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> false.toString) { sql("insert overwrite table test partition(name='n1') select 1") sql("ALTER TABLE test DROP PARTITION(name='n1')") sql("insert overwrite table test partition(name='n1') select 2") checkAnswer( sql("select id from test where name = 'n1' order by id"), Array(Row(1), Row(2))) } withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> true.toString) { sql("insert overwrite table test partition(name='n1') select 1") sql("ALTER TABLE test DROP PARTITION(name='n1')") sql("insert overwrite table test partition(name='n1') select 2") checkAnswer( sql("select id from test where name = 'n1' order by id"), Array(Row(2))) } } } } {code} {code} create external table test(id int) partitioned by (name string) stored as parquet location '/tmp/p'; set spark.sql.hive.convertMetastoreParquet=false; insert overwrite table test partition(name='n1') select 1; ALTER TABLE test DROP PARTITION(name='n1'); insert overwrite table test partition(name='n1') select 2; select id from test where name = 'n1' order by id; {code} was: When we drop a partition of a external table and then overwrite it, if we set CONVERT_METASTORE_PARQUET=true(default value), it will overwrite this partition. But when we set CONVERT_METASTORE_PARQUET=false, it will give duplicate result. Here is a reproduce code below(you can add it into SQLQuerySuite in hive module): {code:java} test("spark gives duplicate result when dropping a partition of an external partitioned table" + " firstly and they overwrite it") { withTable("test") { withTempDir { f => sql("create external table test(id int) partitioned by (name string) stored as " + s"parquet location '${f.getAbsolutePath}'") withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> false.toString) { sql("insert overwrite table test partition(name='n1') select 1") sql("ALTER TABLE test DROP PARTITION(name='n1')") sql("insert overwrite table test partition(name='n1') select 2") checkAnswer( sql("select id from test where name = 'n1' order by id"), Array(Row(1), Row(2))) } withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> true.toString) { sql("insert overwrite table test partition(name='n1') select 1") sql("ALTER TABLE test DROP PARTITION(name='n1')") sql("insert overwrite table test partition(name='n1') select 2") checkAnswer( sql("select id from test where name = 'n1' order by id"), Array(Row(2))) } } } } {code} > Duplicate result when dropping partition of an external table and then > overwriting > -- > > Key: SPARK-29295 > URL: https://issues.apache.org/jira/browse/SPARK-29295 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.3, 2.3.4, 2.4.4 >Reporter: feiwang >Assignee: L. C. 
Hsieh >Priority: Major > Labels: correctness > Fix For: 3.0.0 > > > When we drop a partition of a external table and then overwrite it, if we set > CONVERT_METASTORE_PARQUET=true(default value), it will overwrite this > partition. > But when we set CONVERT_METASTORE_PARQUET=false, it will give duplicate > result. > Here is a reproduce code below(you can add it into SQLQuerySuite in hive > module): > {code:java} > test("spark gives duplicate result when dropping a partition of an external > partitioned table" + > " firstly and they overwrite it") { > withTable("test") { > withTempDir { f => > sql("create external table test(id int) partitioned by (name string) > stored as " + > s"parquet location '${f.getAbsolutePath}'") > withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key ->
[jira] [Updated] (SPARK-29295) Duplicate result when dropping partition of an external table and then overwriting
[ https://issues.apache.org/jira/browse/SPARK-29295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29295: -- Affects Version/s: 2.2.3 > Duplicate result when dropping partition of an external table and then > overwriting > -- > > Key: SPARK-29295 > URL: https://issues.apache.org/jira/browse/SPARK-29295 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.3, 2.3.4, 2.4.4 >Reporter: feiwang >Assignee: L. C. Hsieh >Priority: Major > Labels: correctness > Fix For: 3.0.0 > > > When we drop a partition of a external table and then overwrite it, if we set > CONVERT_METASTORE_PARQUET=true(default value), it will overwrite this > partition. > But when we set CONVERT_METASTORE_PARQUET=false, it will give duplicate > result. > Here is a reproduce code below(you can add it into SQLQuerySuite in hive > module): > {code:java} > test("spark gives duplicate result when dropping a partition of an external > partitioned table" + > " firstly and they overwrite it") { > withTable("test") { > withTempDir { f => > sql("create external table test(id int) partitioned by (name string) > stored as " + > s"parquet location '${f.getAbsolutePath}'") > withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> > false.toString) { > sql("insert overwrite table test partition(name='n1') select 1") > sql("ALTER TABLE test DROP PARTITION(name='n1')") > sql("insert overwrite table test partition(name='n1') select 2") > checkAnswer( sql("select id from test where name = 'n1' order by > id"), > Array(Row(1), Row(2))) > } > withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> true.toString) > { > sql("insert overwrite table test partition(name='n1') select 1") > sql("ALTER TABLE test DROP PARTITION(name='n1')") > sql("insert overwrite table test partition(name='n1') select 2") > checkAnswer( sql("select id from test where name = 'n1' order by > id"), > Array(Row(2))) > } > } > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25987) StackOverflowError when executing many operations on a table with many columns
[ https://issues.apache.org/jira/browse/SPARK-25987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057280#comment-17057280 ] L. C. Hsieh commented on SPARK-25987: - Looks like Janino was upgraded, is this still an issue in 3.0? > StackOverflowError when executing many operations on a table with many columns > -- > > Key: SPARK-25987 > URL: https://issues.apache.org/jira/browse/SPARK-25987 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.2.2, 2.3.0, 2.3.2, 2.4.0, 3.0.0 > Environment: Ubuntu 18.04.1 LTS, openjdk "1.8.0_181" >Reporter: Ivan Tsukanov >Priority: Major > > When I execute > {code:java} > import org.apache.spark.sql._ > import org.apache.spark.sql.types._ > val columnsCount = 100 > val columns = (1 to columnsCount).map(i => s"col$i") > val initialData = (1 to columnsCount).map(i => s"val$i") > val df = spark.createDataFrame( > rowRDD = spark.sparkContext.makeRDD(Seq(Row.fromSeq(initialData))), > schema = StructType(columns.map(StructField(_, StringType, true))) > ) > val addSuffixUDF = udf( > (str: String) => str + "_added" > ) > implicit class DFOps(df: DataFrame) { > def addSuffix() = { > df.select(columns.map(col => > addSuffixUDF(df(col)).as(col) > ): _*) > } > } > df.addSuffix().addSuffix().addSuffix().show() > {code} > I get > {code:java} > An exception or error caused a run to abort. > java.lang.StackOverflowError > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:385) > at org.codehaus.janino.CodeContext.flowAnalysis(CodeContext.java:553) > ... > {code} > If I reduce columns number (to 10 for example) or do `addSuffix` only once - > it works fine. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29295) Duplicate result when dropping partition of an external table and then overwriting
[ https://issues.apache.org/jira/browse/SPARK-29295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29295: -- Affects Version/s: 2.3.4 > Duplicate result when dropping partition of an external table and then > overwriting > -- > > Key: SPARK-29295 > URL: https://issues.apache.org/jira/browse/SPARK-29295 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.4 >Reporter: feiwang >Assignee: L. C. Hsieh >Priority: Major > Labels: correctness > Fix For: 3.0.0 > > > When we drop a partition of a external table and then overwrite it, if we set > CONVERT_METASTORE_PARQUET=true(default value), it will overwrite this > partition. > But when we set CONVERT_METASTORE_PARQUET=false, it will give duplicate > result. > Here is a reproduce code below(you can add it into SQLQuerySuite in hive > module): > {code:java} > test("spark gives duplicate result when dropping a partition of an external > partitioned table" + > " firstly and they overwrite it") { > withTable("test") { > withTempDir { f => > sql("create external table test(id int) partitioned by (name string) > stored as " + > s"parquet location '${f.getAbsolutePath}'") > withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> > false.toString) { > sql("insert overwrite table test partition(name='n1') select 1") > sql("ALTER TABLE test DROP PARTITION(name='n1')") > sql("insert overwrite table test partition(name='n1') select 2") > checkAnswer( sql("select id from test where name = 'n1' order by > id"), > Array(Row(1), Row(2))) > } > withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> true.toString) > { > sql("insert overwrite table test partition(name='n1') select 1") > sql("ALTER TABLE test DROP PARTITION(name='n1')") > sql("insert overwrite table test partition(name='n1') select 2") > checkAnswer( sql("select id from test where name = 'n1' order by > id"), > Array(Row(2))) > } > } > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29295) Duplicate result when dropping partition of an external table and then overwriting
[ https://issues.apache.org/jira/browse/SPARK-29295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057260#comment-17057260 ] Dongjoon Hyun commented on SPARK-29295: --- I marked this as a `correctness` issue. > Duplicate result when dropping partition of an external table and then > overwriting > -- > > Key: SPARK-29295 > URL: https://issues.apache.org/jira/browse/SPARK-29295 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: feiwang >Assignee: L. C. Hsieh >Priority: Major > Labels: correctness > Fix For: 3.0.0 > > > When we drop a partition of a external table and then overwrite it, if we set > CONVERT_METASTORE_PARQUET=true(default value), it will overwrite this > partition. > But when we set CONVERT_METASTORE_PARQUET=false, it will give duplicate > result. > Here is a reproduce code below(you can add it into SQLQuerySuite in hive > module): > {code:java} > test("spark gives duplicate result when dropping a partition of an external > partitioned table" + > " firstly and they overwrite it") { > withTable("test") { > withTempDir { f => > sql("create external table test(id int) partitioned by (name string) > stored as " + > s"parquet location '${f.getAbsolutePath}'") > withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> > false.toString) { > sql("insert overwrite table test partition(name='n1') select 1") > sql("ALTER TABLE test DROP PARTITION(name='n1')") > sql("insert overwrite table test partition(name='n1') select 2") > checkAnswer( sql("select id from test where name = 'n1' order by > id"), > Array(Row(1), Row(2))) > } > withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> true.toString) > { > sql("insert overwrite table test partition(name='n1') select 1") > sql("ALTER TABLE test DROP PARTITION(name='n1')") > sql("insert overwrite table test partition(name='n1') select 2") > checkAnswer( sql("select id from test where name = 'n1' order by > id"), > Array(Row(2))) > } > } > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29295) Duplicate result when dropping partition of an external table and then overwriting
[ https://issues.apache.org/jira/browse/SPARK-29295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057261#comment-17057261 ] Dongjoon Hyun commented on SPARK-29295: --- Hi, [~viirya]. Could you make a backport against branch-2.4? > Duplicate result when dropping partition of an external table and then > overwriting > -- > > Key: SPARK-29295 > URL: https://issues.apache.org/jira/browse/SPARK-29295 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: feiwang >Assignee: L. C. Hsieh >Priority: Major > Labels: correctness > Fix For: 3.0.0 > > > When we drop a partition of a external table and then overwrite it, if we set > CONVERT_METASTORE_PARQUET=true(default value), it will overwrite this > partition. > But when we set CONVERT_METASTORE_PARQUET=false, it will give duplicate > result. > Here is a reproduce code below(you can add it into SQLQuerySuite in hive > module): > {code:java} > test("spark gives duplicate result when dropping a partition of an external > partitioned table" + > " firstly and they overwrite it") { > withTable("test") { > withTempDir { f => > sql("create external table test(id int) partitioned by (name string) > stored as " + > s"parquet location '${f.getAbsolutePath}'") > withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> > false.toString) { > sql("insert overwrite table test partition(name='n1') select 1") > sql("ALTER TABLE test DROP PARTITION(name='n1')") > sql("insert overwrite table test partition(name='n1') select 2") > checkAnswer( sql("select id from test where name = 'n1' order by > id"), > Array(Row(1), Row(2))) > } > withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> true.toString) > { > sql("insert overwrite table test partition(name='n1') select 1") > sql("ALTER TABLE test DROP PARTITION(name='n1')") > sql("insert overwrite table test partition(name='n1') select 2") > checkAnswer( sql("select id from test where name = 'n1' order by > id"), > Array(Row(2))) > } > } > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29295) Duplicate result when dropping partition of an external table and then overwriting
[ https://issues.apache.org/jira/browse/SPARK-29295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29295: -- Labels: correctness (was: ) > Duplicate result when dropping partition of an external table and then > overwriting > -- > > Key: SPARK-29295 > URL: https://issues.apache.org/jira/browse/SPARK-29295 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: feiwang >Assignee: L. C. Hsieh >Priority: Major > Labels: correctness > Fix For: 3.0.0 > > > When we drop a partition of a external table and then overwrite it, if we set > CONVERT_METASTORE_PARQUET=true(default value), it will overwrite this > partition. > But when we set CONVERT_METASTORE_PARQUET=false, it will give duplicate > result. > Here is a reproduce code below(you can add it into SQLQuerySuite in hive > module): > {code:java} > test("spark gives duplicate result when dropping a partition of an external > partitioned table" + > " firstly and they overwrite it") { > withTable("test") { > withTempDir { f => > sql("create external table test(id int) partitioned by (name string) > stored as " + > s"parquet location '${f.getAbsolutePath}'") > withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> > false.toString) { > sql("insert overwrite table test partition(name='n1') select 1") > sql("ALTER TABLE test DROP PARTITION(name='n1')") > sql("insert overwrite table test partition(name='n1') select 2") > checkAnswer( sql("select id from test where name = 'n1' order by > id"), > Array(Row(1), Row(2))) > } > withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> true.toString) > { > sql("insert overwrite table test partition(name='n1') select 1") > sql("ALTER TABLE test DROP PARTITION(name='n1')") > sql("insert overwrite table test partition(name='n1') select 2") > checkAnswer( sql("select id from test where name = 'n1' order by > id"), > Array(Row(2))) > } > } > } > } > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30989) TABLE.COLUMN reference doesn't work with new columns created by UDF
[ https://issues.apache.org/jira/browse/SPARK-30989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057255#comment-17057255 ] hemanth meka commented on SPARK-30989: -- The alias "cat" is defined as a dataframe having 2 columns, "x" and "y". The column "z" is generated from "cat" into a new dataframe "df2", but the code below works, and hence this exception looks like it should be the expected behaviour, shouldn't it? df2.select("z") > TABLE.COLUMN reference doesn't work with new columns created by UDF > --- > > Key: SPARK-30989 > URL: https://issues.apache.org/jira/browse/SPARK-30989 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: Chris Suchanek >Priority: Major > > When a dataframe is created with an alias (`.as("...")`) its columns can be > referred as `TABLE.COLUMN` but it doesn't work for newly created columns with > UDF. > {code:java} > // code placeholder > df1 = sc.parallelize(l).toDF("x","y").as("cat") > val squared = udf((s: Int) => s * s) > val df2 = df1.withColumn("z", squared(col("y"))) > df2.columns //Array[String] = Array(x, y, z) > df2.select("cat.x") // works > df2.select("cat.z") // Doesn't work > // org.apache.spark.sql.AnalysisException: cannot resolve '`cat.z`' given > input > // columns: [cat.x, cat.y, z];; > {code} > Might be related to: https://issues.apache.org/jira/browse/SPARK-30532 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
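A minimal sketch of the behaviour being discussed, assuming a spark-shell where `spark` is in scope: columns that existed when the alias was applied keep the `cat` qualifier, the UDF-generated column resolves only unqualified, and re-aliasing the derived dataframe (the `t` alias here is purely illustrative, not from the issue) should qualify the new column as well.
{code:scala}
import spark.implicits._
import org.apache.spark.sql.functions.{col, udf}

val df1 = Seq((1, 2), (3, 4)).toDF("x", "y").as("cat")
val squared = udf((s: Int) => s * s)
val df2 = df1.withColumn("z", squared(col("y")))

df2.select("cat.x").show()  // works: x came from the aliased relation
df2.select("z").show()      // works: unqualified reference to the new column
// df2.select("cat.z")      // fails with AnalysisException, as reported in this issue

df2.as("t").select("t.z").show()  // re-aliasing the derived dataframe qualifies z as well
{code}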
[jira] [Updated] (SPARK-25193) insert overwrite doesn't throw exception when drop old data fails
[ https://issues.apache.org/jira/browse/SPARK-25193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25193: -- Parent: SPARK-30034 Issue Type: Sub-task (was: Bug) > insert overwrite doesn't throw exception when drop old data fails > - > > Key: SPARK-25193 > URL: https://issues.apache.org/jira/browse/SPARK-25193 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.2.0 >Reporter: chen xiao >Priority: Major > Labels: correctness > > dataframe.write.mode(SaveMode.Overwrite).insertInto(s"$databaseName.$tableName") > Insert overwrite mode will drop old data in hive table if there's old data. > But if data deleting fails, no exception will be thrown and the data folder > will be like: > hdfs://uxs_nbp/nba_score/dt=2018-08-15/seq_num=2/part-0 > hdfs://uxs_nbp/nba_score/dt=2018-08-15/seq_num=2/part-01534916642513. > Two copies of data will be kept. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25193) insert overwrite doesn't throw exception when drop old data fails
[ https://issues.apache.org/jira/browse/SPARK-25193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057252#comment-17057252 ] Dongjoon Hyun edited comment on SPARK-25193 at 3/11/20, 5:37 PM: - This is fixed at 3.0.0 via SPARK-30034 after SPARK-23710 was (Author: dongjoon): This is fixed at 3.0.0 via SPARK-23710 > insert overwrite doesn't throw exception when drop old data fails > - > > Key: SPARK-25193 > URL: https://issues.apache.org/jira/browse/SPARK-25193 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: chen xiao >Priority: Major > Labels: correctness > > dataframe.write.mode(SaveMode.Overwrite).insertInto(s"$databaseName.$tableName") > Insert overwrite mode will drop old data in hive table if there's old data. > But if data deleting fails, no exception will be thrown and the data folder > will be like: > hdfs://uxs_nbp/nba_score/dt=2018-08-15/seq_num=2/part-0 > hdfs://uxs_nbp/nba_score/dt=2018-08-15/seq_num=2/part-01534916642513. > Two copies of data will be kept. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-25193) insert overwrite doesn't throw exception when drop old data fails
[ https://issues.apache.org/jira/browse/SPARK-25193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-25193. --- Resolution: Duplicate This is fixed at 3.0.0 via SPARK-23710 > insert overwrite doesn't throw exception when drop old data fails > - > > Key: SPARK-25193 > URL: https://issues.apache.org/jira/browse/SPARK-25193 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: chen xiao >Priority: Major > Labels: correctness > > dataframe.write.mode(SaveMode.Overwrite).insertInto(s"$databaseName.$tableName") > Insert overwrite mode will drop old data in hive table if there's old data. > But if data deleting fails, no exception will be thrown and the data folder > will be like: > hdfs://uxs_nbp/nba_score/dt=2018-08-15/seq_num=2/part-0 > hdfs://uxs_nbp/nba_score/dt=2018-08-15/seq_num=2/part-01534916642513. > Two copies of data will be kept. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-25193) insert overwrite doesn't throw exception when drop old data fails
[ https://issues.apache.org/jira/browse/SPARK-25193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reopened SPARK-25193: --- > insert overwrite doesn't throw exception when drop old data fails > - > > Key: SPARK-25193 > URL: https://issues.apache.org/jira/browse/SPARK-25193 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: chen xiao >Priority: Major > Labels: correctness > > dataframe.write.mode(SaveMode.Overwrite).insertInto(s"$databaseName.$tableName") > Insert overwrite mode will drop old data in hive table if there's old data. > But if data deleting fails, no exception will be thrown and the data folder > will be like: > hdfs://uxs_nbp/nba_score/dt=2018-08-15/seq_num=2/part-0 > hdfs://uxs_nbp/nba_score/dt=2018-08-15/seq_num=2/part-01534916642513. > Two copies of data will be kept. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25193) insert overwrite doesn't throw exception when drop old data fails
[ https://issues.apache.org/jira/browse/SPARK-25193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057251#comment-17057251 ] Dongjoon Hyun commented on SPARK-25193: --- I marked this as a correctness issue because the result after insertion will be incorrect due to the old data. > insert overwrite doesn't throw exception when drop old data fails > - > > Key: SPARK-25193 > URL: https://issues.apache.org/jira/browse/SPARK-25193 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: chen xiao >Priority: Major > Labels: correctness > > dataframe.write.mode(SaveMode.Overwrite).insertInto(s"$databaseName.$tableName") > Insert overwrite mode will drop old data in hive table if there's old data. > But if data deleting fails, no exception will be thrown and the data folder > will be like: > hdfs://uxs_nbp/nba_score/dt=2018-08-15/seq_num=2/part-0 > hdfs://uxs_nbp/nba_score/dt=2018-08-15/seq_num=2/part-01534916642513. > Two copies of data will be kept. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25193) insert overwrite doesn't throw exception when drop old data fails
[ https://issues.apache.org/jira/browse/SPARK-25193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25193: -- Labels: correctness (was: bulk-closed) > insert overwrite doesn't throw exception when drop old data fails > - > > Key: SPARK-25193 > URL: https://issues.apache.org/jira/browse/SPARK-25193 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: chen xiao >Priority: Major > Labels: correctness > > dataframe.write.mode(SaveMode.Overwrite).insertInto(s"$databaseName.$tableName") > Insert overwrite mode will drop old data in hive table if there's old data. > But if data deleting fails, no exception will be thrown and the data folder > will be like: > hdfs://uxs_nbp/nba_score/dt=2018-08-15/seq_num=2/part-0 > hdfs://uxs_nbp/nba_score/dt=2018-08-15/seq_num=2/part-01534916642513. > Two copies of data will be kept. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31099) Create migration script for metastore_db
[ https://issues.apache.org/jira/browse/SPARK-31099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31099: -- Parent: SPARK-30034 Issue Type: Sub-task (was: Improvement) > Create migration script for metastore_db > > > Key: SPARK-31099 > URL: https://issues.apache.org/jira/browse/SPARK-31099 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > When an existing Derby database exists (in ./metastore_db) created by Hive > 1.2.x profile, it'll fail to upgrade itself to the Hive 2.3.x profile. > Repro steps: > 1. Build OSS or DBR master with SBT with -Phive-1.2 -Phive > -Phive-thriftserver. Make sure there's no existing ./metastore_db directory > in the repo. > 2. Run bin/spark-shell, and then spark.sql("show databases"). This will > populate the ./metastore_db directory, where the Derby-based metastore > database is hosted. This database is populated from Hive 1.2.x. > 3. Re-build OSS or DBR master with SBT with -Phive -Phive-thriftserver (drops > the Hive 1.2 profile, which makes it use the default Hive 2.3 profile) > 4. Repeat Step (2) above. This will trigger Hive 2.3.x to load the Derby > database created in Step (2), which triggers an upgrade step, and that's > where the following error will be reported. > 5. Delete the ./metastore_db and re-run Step (4). The error is no longer > reported. > {code:java} > 20/03/09 13:57:04 ERROR Datastore: Error thrown executing ALTER TABLE TBLS > ADD IS_REWRITE_ENABLED CHAR(1) NOT NULL CHECK (IS_REWRITE_ENABLED IN > ('Y','N')) : In an ALTER TABLE statement, the column 'IS_REWRITE_ENABLED' has > been specified as NOT NULL and either the DEFAULT clause was not specified or > was specified as DEFAULT NULL. > java.sql.SQLSyntaxErrorException: In an ALTER TABLE statement, the column > 'IS_REWRITE_ENABLED' has been specified as NOT NULL and either the DEFAULT > clause was not specified or was specified as DEFAULT NULL. 
> at > org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source) > at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown > Source) > at > org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown > Source) > at > org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown > Source) > at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown > Source) > at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown > Source) > at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) > at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) > at com.jolbox.bonecp.StatementHandle.execute(StatementHandle.java:254) > at > org.datanucleus.store.rdbms.table.AbstractTable.executeDdlStatement(AbstractTable.java:879) > at > org.datanucleus.store.rdbms.table.AbstractTable.executeDdlStatementList(AbstractTable.java:830) > at > org.datanucleus.store.rdbms.table.TableImpl.validateColumns(TableImpl.java:257) > at > org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.performTablesValidation(RDBMSStoreManager.java:3398) > at > org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.run(RDBMSStoreManager.java:2896) > at > org.datanucleus.store.rdbms.AbstractSchemaTransaction.execute(AbstractSchemaTransaction.java:119) > at > org.datanucleus.store.rdbms.RDBMSStoreManager.manageClasses(RDBMSStoreManager.java:1627) > at > org.datanucleus.store.rdbms.RDBMSStoreManager.getDatastoreClass(RDBMSStoreManager.java:672) > at > org.datanucleus.store.rdbms.query.RDBMSQueryUtils.getStatementForCandidates(RDBMSQueryUtils.java:425) > at > org.datanucleus.store.rdbms.query.JDOQLQuery.compileQueryFull(JDOQLQuery.java:865) > at > org.datanucleus.store.rdbms.query.JDOQLQuery.compileInternal(JDOQLQuery.java:347) > at org.datanucleus.store.query.Query.executeQuery(Query.java:1816) > at org.datanucleus.store.query.Query.executeWithArray(Query.java:1744) > at org.datanucleus.store.query.Query.execute(Query.java:1726) > at org.datanucleus.api.jdo.JDOQuery.executeInternal(JDOQuery.java:374) > at org.datanucleus.api.jdo.JDOQuery.execute(JDOQuery.java:216) > at > org.apache.hadoop.hive.metastore.MetaStoreDirectSql.ensureDbInit(MetaStoreDirectSql.java:184) > at > org.apache.hadoop.hive.metastore.MetaStoreDirectSql.(MetaStoreDirectSql.java:144) > at > org.apache.hadoop.hive.metastore.ObjectStore.initializeHelper(ObjectStore.java:410) > at > org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:342) > at > org.apache.hadoop.hive.metastore.ObjectStore.setC
[jira] [Commented] (SPARK-30565) Regression in the ORC benchmark
[ https://issues.apache.org/jira/browse/SPARK-30565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057224#comment-17057224 ] Peter Toth commented on SPARK-30565: I looked into this and the performance drop is due to the 1.2.1 -> 2.3.6 Hive version change we introduced in Spark 3. I measured that {{org.apache.hadoop.hive.ql.io.orc.ReaderImpl}} in {{hive-exec-2.3.6-core.jar}} is ~3-5 times slower than in {{hive-exec-1.2.1.spark2.jar}}. > Regression in the ORC benchmark > --- > > Key: SPARK-30565 > URL: https://issues.apache.org/jira/browse/SPARK-30565 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > New benchmark results generated in the PR > [https://github.com/apache/spark/pull/27078] show regression ~3 times. > Before: > {code} > Hive built-in ORC 520531 >8 2.0 495.8 0.6X > {code} > https://github.com/apache/spark/pull/27078/files#diff-42fe5f1ef10d8f9f274fc89b2c8d140dL138 > After: > {code} > Hive built-in ORC 1761 1792 > 43 0.61679.3 0.1X > {code} > https://github.com/apache/spark/pull/27078/files#diff-42fe5f1ef10d8f9f274fc89b2c8d140dR138 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
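[Editor's note] For anyone reproducing the comparison outside the benchmark suite, the reader implementation can be switched at runtime through the real config {{spark.sql.orc.impl}}. A rough spark-shell sketch follows; the path and the numeric column {{id}} are placeholders, and the "hive" value needs a Hive-enabled build:
{code}
// Time a full-scan aggregation with each ORC reader implementation.
def timeScan(impl: String, path: String): Long = {
  spark.conf.set("spark.sql.orc.impl", impl)              // "native" (default) or "hive"
  val start = System.nanoTime()
  spark.read.orc(path).selectExpr("sum(id)").collect()    // assumes a numeric column named id
  (System.nanoTime() - start) / 1000000                   // elapsed millis
}

val path = "/tmp/orc-bench-data"                          // placeholder dataset
println(s"native: ${timeScan("native", path)} ms, hive: ${timeScan("hive", path)} ms")
{code}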
[jira] [Commented] (SPARK-31099) Create migration script for metastore_db
[ https://issues.apache.org/jira/browse/SPARK-31099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057220#comment-17057220 ] Dongjoon Hyun commented on SPARK-31099: --- To [~kabhwan]. Yes. I mean that corner cases. It's the same here. We can remove local derby files. To [~rednaxelafx]. For remote HMS, we have `spark.sql.hive.metastore.version`. And, SPARK-27686 was the issue for `Update migration guide for make Hive 2.3 dependency by default`. To [~cloud_fan], yes. This scope of this issue is for local hive metastore. For remote HMS, we should follow up at SPARK-27686. cc [~smilegator] and [~yumwang] since we worked together at SPARK-27686. cc [~rxin] since he is a release manager for 3.0.0. > Create migration script for metastore_db > > > Key: SPARK-31099 > URL: https://issues.apache.org/jira/browse/SPARK-31099 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > > When an existing Derby database exists (in ./metastore_db) created by Hive > 1.2.x profile, it'll fail to upgrade itself to the Hive 2.3.x profile. > Repro steps: > 1. Build OSS or DBR master with SBT with -Phive-1.2 -Phive > -Phive-thriftserver. Make sure there's no existing ./metastore_db directory > in the repo. > 2. Run bin/spark-shell, and then spark.sql("show databases"). This will > populate the ./metastore_db directory, where the Derby-based metastore > database is hosted. This database is populated from Hive 1.2.x. > 3. Re-build OSS or DBR master with SBT with -Phive -Phive-thriftserver (drops > the Hive 1.2 profile, which makes it use the default Hive 2.3 profile) > 4. Repeat Step (2) above. This will trigger Hive 2.3.x to load the Derby > database created in Step (2), which triggers an upgrade step, and that's > where the following error will be reported. > 5. Delete the ./metastore_db and re-run Step (4). The error is no longer > reported. > {code:java} > 20/03/09 13:57:04 ERROR Datastore: Error thrown executing ALTER TABLE TBLS > ADD IS_REWRITE_ENABLED CHAR(1) NOT NULL CHECK (IS_REWRITE_ENABLED IN > ('Y','N')) : In an ALTER TABLE statement, the column 'IS_REWRITE_ENABLED' has > been specified as NOT NULL and either the DEFAULT clause was not specified or > was specified as DEFAULT NULL. > java.sql.SQLSyntaxErrorException: In an ALTER TABLE statement, the column > 'IS_REWRITE_ENABLED' has been specified as NOT NULL and either the DEFAULT > clause was not specified or was specified as DEFAULT NULL. 
> at > org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source) > at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown > Source) > at > org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown > Source) > at > org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown > Source) > at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown > Source) > at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown > Source) > at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) > at org.apache.derby.impl.jdbc.EmbedStatement.execute(Unknown Source) > at com.jolbox.bonecp.StatementHandle.execute(StatementHandle.java:254) > at > org.datanucleus.store.rdbms.table.AbstractTable.executeDdlStatement(AbstractTable.java:879) > at > org.datanucleus.store.rdbms.table.AbstractTable.executeDdlStatementList(AbstractTable.java:830) > at > org.datanucleus.store.rdbms.table.TableImpl.validateColumns(TableImpl.java:257) > at > org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.performTablesValidation(RDBMSStoreManager.java:3398) > at > org.datanucleus.store.rdbms.RDBMSStoreManager$ClassAdder.run(RDBMSStoreManager.java:2896) > at > org.datanucleus.store.rdbms.AbstractSchemaTransaction.execute(AbstractSchemaTransaction.java:119) > at > org.datanucleus.store.rdbms.RDBMSStoreManager.manageClasses(RDBMSStoreManager.java:1627) > at > org.datanucleus.store.rdbms.RDBMSStoreManager.getDatastoreClass(RDBMSStoreManager.java:672) > at > org.datanucleus.store.rdbms.query.RDBMSQueryUtils.getStatementForCandidates(RDBMSQueryUtils.java:425) > at > org.datanucleus.store.rdbms.query.JDOQLQuery.compileQueryFull(JDOQLQuery.java:865) > at > org.datanucleus.store.rdbms.query.JDOQLQuery.compileInternal(JDOQLQuery.java:347) > at org.datanucleus.store.query.Query.executeQuery(Query.java:1816) > at org.datanucleus.store.query.Query.executeWithArray(Query.java:1744) > at org.datanucleus.store.query.Query.execute(Query.java:1726) > at org.datanucleus.api.jdo.JDOQuery.executeInternal(JDOQuery.java:374)
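[Editor's note] For readers following the remote-HMS part of the comment above, the configs mentioned there are real Spark settings; a minimal sketch with placeholder URIs/paths, assuming a fresh application rather than an already-running spark-shell session:
{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("remote-hms-example")
  .enableHiveSupport()
  .config("hive.metastore.uris", "thrift://metastore-host:9083")        // remote HMS, not local Derby
  .config("spark.sql.hive.metastore.version", "1.2.1")                  // version of the remote metastore
  .config("spark.sql.hive.metastore.jars", "/path/to/hive-1.2.1/lib/*") // jars matching that version
  .getOrCreate()

spark.sql("show databases").show()
{code}
This only covers the remote-HMS case raised in the comment; upgrading a local ./metastore_db is what the ticket itself is about.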
[jira] [Closed] (SPARK-24640) size(null) returns null
[ https://issues.apache.org/jira/browse/SPARK-24640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun closed SPARK-24640. - > size(null) returns null > > > Key: SPARK-24640 > URL: https://issues.apache.org/jira/browse/SPARK-24640 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Xiao Li >Priority: Major > > Size(null) should return null instead of -1 in 3.0 release. This is a > behavior change. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-24640) size(null) returns null
[ https://issues.apache.org/jira/browse/SPARK-24640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057207#comment-17057207 ] Dongjoon Hyun edited comment on SPARK-24640 at 3/11/20, 4:57 PM: - This is reverted via https://github.com/apache/spark/pull/27834 was (Author: dongjoon): This is reverted. > size(null) returns null > > > Key: SPARK-24640 > URL: https://issues.apache.org/jira/browse/SPARK-24640 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Xiao Li >Priority: Major > > Size(null) should return null instead of -1 in 3.0 release. This is a > behavior change. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24640) size(null) returns null
[ https://issues.apache.org/jira/browse/SPARK-24640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24640: -- Fix Version/s: (was: 3.0.0) > size(null) returns null > > > Key: SPARK-24640 > URL: https://issues.apache.org/jira/browse/SPARK-24640 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Xiao Li >Priority: Major > > Size(null) should return null instead of -1 in 3.0 release. This is a > behavior change. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24640) size(null) returns null
[ https://issues.apache.org/jira/browse/SPARK-24640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-24640: - Assignee: (was: Maxim Gekk) > size(null) returns null > > > Key: SPARK-24640 > URL: https://issues.apache.org/jira/browse/SPARK-24640 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Xiao Li >Priority: Major > Fix For: 3.0.0 > > > Size(null) should return null instead of -1 in 3.0 release. This is a > behavior change. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24640) size(null) returns null
[ https://issues.apache.org/jira/browse/SPARK-24640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-24640. --- Resolution: Won't Do > size(null) returns null > > > Key: SPARK-24640 > URL: https://issues.apache.org/jira/browse/SPARK-24640 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Xiao Li >Priority: Major > > Size(null) should return null instead of -1 in 3.0 release. This is a > behavior change. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-24640) size(null) returns null
[ https://issues.apache.org/jira/browse/SPARK-24640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reopened SPARK-24640: --- This is reverted. > size(null) returns null > > > Key: SPARK-24640 > URL: https://issues.apache.org/jira/browse/SPARK-24640 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Xiao Li >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > Size(null) should return null instead of -1 in 3.0 release. This is a > behavior change. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31091) Revert SPARK-24640 "Return `NULL` from `size(NULL)` by default"
[ https://issues.apache.org/jira/browse/SPARK-31091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31091. --- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27834 [https://github.com/apache/spark/pull/27834] > Revert SPARK-24640 "Return `NULL` from `size(NULL)` by default" > --- > > Key: SPARK-31091 > URL: https://issues.apache.org/jira/browse/SPARK-31091 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
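[Editor's note] Net effect of the revert, as far as it can be expressed with the shipped flag {{spark.sql.legacy.sizeOfNull}} (the flag name is real; the exact default follows from the revert above), sketched in spark-shell:
{code}
// With the revert in place, Spark 3.0 keeps the 2.4 default:
spark.sql("SELECT size(NULL)").collect()   // Row(-1)

// Opting in to the behaviour SPARK-24640 had proposed:
spark.conf.set("spark.sql.legacy.sizeOfNull", "false")
spark.sql("SELECT size(NULL)").collect()   // Row(null)
{code}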
[jira] [Updated] (SPARK-31076) Convert Catalyst's DATE/TIMESTAMP to Java Date/Timestamp via local date-time
[ https://issues.apache.org/jira/browse/SPARK-31076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31076: -- Labels: correctness (was: ) > Convert Catalyst's DATE/TIMESTAMP to Java Date/Timestamp via local date-time > > > Key: SPARK-31076 > URL: https://issues.apache.org/jira/browse/SPARK-31076 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Labels: correctness > Fix For: 3.0.0 > > > By default, collect() returns java.sql.Timestamp/Date instances with offsets > derived from internal values of Catalyst's TIMESTAMP/DATE that store > microseconds since the epoch. The conversion from internal values to > java.sql.Timestamp/Date based on Proleptic Gregorian calendar but converting > the resulted values before 1582 year to strings produces timestamp/date > string in Julian calendar. For example: > {code} > scala> sql("select date '1100-10-10'").collect() > res1: Array[org.apache.spark.sql.Row] = Array([1100-10-03]) > {code} > This can be fixed if internal Catalyst's values are converted to local > date-time in Gregorian calendar, and construct local date-time from the > resulted year, month, ..., seconds in Julian calendar. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
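[Editor's note] The "via local date-time" idea in the description can be illustrated for the DATE case with plain Java time APIs; this is a conceptual sketch, not the code of the actual patch:
{code}
import java.time.LocalDate
import java.sql.Date

// Catalyst stores a DATE as days since the epoch in the Proleptic Gregorian calendar.
val daysSinceEpoch = LocalDate.of(1100, 10, 10).toEpochDay

// Going through the local date's year/month/day fields keeps the same calendar date:
// java.sql.Date.valueOf builds the hybrid-calendar Date from those fields directly,
// instead of reinterpreting a raw epoch offset (which is what shifted 1100-10-10 to 1100-10-03).
val d: Date = Date.valueOf(LocalDate.ofEpochDay(daysSinceEpoch))
println(d)   // 1100-10-10
{code}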
[jira] [Commented] (SPARK-30935) Update MLlib, GraphX websites for 3.0
[ https://issues.apache.org/jira/browse/SPARK-30935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057154#comment-17057154 ] Huaxin Gao commented on SPARK-30935: cc [~podongfeng] I think all the docs are OK now. This can be marked as complete. > Update MLlib, GraphX websites for 3.0 > - > > Key: SPARK-30935 > URL: https://issues.apache.org/jira/browse/SPARK-30935 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Critical > > Update the sub-projects' websites to include new features in this release. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-30931) ML 3.0 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-30931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057146#comment-17057146 ] Huaxin Gao commented on SPARK-30931: cc [~podongfeng] I didn't see anything else need to be changed. This ticket can be marked as complete. > ML 3.0 QA: API: Python API coverage > --- > > Key: SPARK-30931 > URL: https://issues.apache.org/jira/browse/SPARK-30931 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Priority: Major > > For new public APIs added to MLlib ({{spark.ml}} only), we need to check the > generated HTML doc and compare the Scala & Python versions. > * *GOAL*: Audit and create JIRAs to fix in the next release. > * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues. > We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally > either necessary (intentional) or accidental. These must be recorded and > added in the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > *Please use a _separate_ JIRA (linked below as "requires") for this list of > to-do items.* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31117) reduce the test time of DateTimeUtilsSuite
[ https://issues.apache.org/jira/browse/SPARK-31117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31117. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27873 [https://github.com/apache/spark/pull/27873] > reduce the test time of DateTimeUtilsSuite > -- > > Key: SPARK-31117 > URL: https://issues.apache.org/jira/browse/SPARK-31117 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31124) change the default value of minPartitionNum in AQE
Wenchen Fan created SPARK-31124: --- Summary: change the default value of minPartitionNum in AQE Key: SPARK-31124 URL: https://issues.apache.org/jira/browse/SPARK-31124 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
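[Editor's note] The ticket has no description yet; for readers hunting for the knob, the AQE coalescing configs in 3.0 look roughly like the following (config names to the best of my knowledge, the default of the last one being what this ticket proposes to change):
{code}
// Enable AQE and post-shuffle partition coalescing; minPartitionNum is the floor
// below which coalescing will not merge partitions.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.minPartitionNum", "8")   // example value

val joined = spark.range(0, 1000000).join(spark.range(0, 1000), "id")
joined.count()   // post-shuffle partition count will not drop below 8
{code}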
[jira] [Updated] (SPARK-31081) Make the display of stageId/stageAttemptId/taskId of sql metrics configurable in UI
[ https://issues.apache.org/jira/browse/SPARK-31081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-31081: - Summary: Make the display of stageId/stageAttemptId/taskId of sql metrics configurable in UI (was: Make SQLMetrics more readable from UI) > Make the display of stageId/stageAttemptId/taskId of sql metrics configurable > in UI > > > Key: SPARK-31081 > URL: https://issues.apache.org/jira/browse/SPARK-31081 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: wuyi >Priority: Major > > SPARK-30209 makes the metrics harder to read, and users may not be interested > in the extra info ({{stageId/StageAttemptId/taskId}}) when they do not need to debug. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31041) Show Maven errors from within make-distribution.sh
[ https://issues.apache.org/jira/browse/SPARK-31041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-31041. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27800 [https://github.com/apache/spark/pull/27800] > Show Maven errors from within make-distribution.sh > -- > > Key: SPARK-31041 > URL: https://issues.apache.org/jira/browse/SPARK-31041 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas >Priority: Trivial > Fix For: 3.1.0 > > > This works: > {code:java} > ./dev/make-distribution.sh \ > --pip \ > -Phadoop-2.7 -Phive -Phadoop-cloud {code} > > But this doesn't: > {code:java} > ./dev/make-distribution.sh \ > -Phadoop-2.7 -Phive -Phadoop-cloud \ > --pip{code} > > The latter invocation yields the following, confusing output: > {code:java} > + VERSION=' -X,--debug Produce execution debug output'{code} > That's because Maven is accepting {{--pip}} as an option and failing, but > the user doesn't get to see the error from Maven. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31041) Show Maven errors from within make-distribution.sh
[ https://issues.apache.org/jira/browse/SPARK-31041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-31041: Assignee: Nicholas Chammas > Show Maven errors from within make-distribution.sh > -- > > Key: SPARK-31041 > URL: https://issues.apache.org/jira/browse/SPARK-31041 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas >Priority: Trivial > > This works: > {code:java} > ./dev/make-distribution.sh \ > --pip \ > -Phadoop-2.7 -Phive -Phadoop-cloud {code} > > But this doesn't: > {code:java} > ./dev/make-distribution.sh \ > -Phadoop-2.7 -Phive -Phadoop-cloud \ > --pip{code} > > The latter invocation yields the following, confusing output: > {code:java} > + VERSION=' -X,--debug Produce execution debug output'{code} > That's because Maven is accepting {{--pip}} as an option and failing, but > the user doesn't get to see the error from Maven. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31076) Convert Catalyst's DATE/TIMESTAMP to Java Date/Timestamp via local date-time
[ https://issues.apache.org/jira/browse/SPARK-31076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31076: --- Assignee: Maxim Gekk > Convert Catalyst's DATE/TIMESTAMP to Java Date/Timestamp via local date-time > > > Key: SPARK-31076 > URL: https://issues.apache.org/jira/browse/SPARK-31076 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > > By default, collect() returns java.sql.Timestamp/Date instances with offsets > derived from internal values of Catalyst's TIMESTAMP/DATE that store > microseconds since the epoch. The conversion from internal values to > java.sql.Timestamp/Date based on Proleptic Gregorian calendar but converting > the resulted values before 1582 year to strings produces timestamp/date > string in Julian calendar. For example: > {code} > scala> sql("select date '1100-10-10'").collect() > res1: Array[org.apache.spark.sql.Row] = Array([1100-10-03]) > {code} > This can be fixed if internal Catalyst's values are converted to local > date-time in Gregorian calendar, and construct local date-time from the > resulted year, month, ..., seconds in Julian calendar. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31076) Convert Catalyst's DATE/TIMESTAMP to Java Date/Timestamp via local date-time
[ https://issues.apache.org/jira/browse/SPARK-31076?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31076. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27807 [https://github.com/apache/spark/pull/27807] > Convert Catalyst's DATE/TIMESTAMP to Java Date/Timestamp via local date-time > > > Key: SPARK-31076 > URL: https://issues.apache.org/jira/browse/SPARK-31076 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > By default, collect() returns java.sql.Timestamp/Date instances with offsets > derived from internal values of Catalyst's TIMESTAMP/DATE that store > microseconds since the epoch. The conversion from internal values to > java.sql.Timestamp/Date based on Proleptic Gregorian calendar but converting > the resulted values before 1582 year to strings produces timestamp/date > string in Julian calendar. For example: > {code} > scala> sql("select date '1100-10-10'").collect() > res1: Array[org.apache.spark.sql.Row] = Array([1100-10-03]) > {code} > This can be fixed if internal Catalyst's values are converted to local > date-time in Gregorian calendar, and construct local date-time from the > resulted year, month, ..., seconds in Julian calendar. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31123) Drop does not work after join with aliases
[ https://issues.apache.org/jira/browse/SPARK-31123?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mikel San Vicente updated SPARK-31123: -- Description: Hi, I am seeing a really strange behaviour in drop method after a join with aliases. It doesn't seem to find the column when I reference to it using dataframe("columnName") syntax, but it does work with other combinators like select {code:java} case class Record(a: String, dup: String) case class Record2(b: String, dup: String) val df = Seq(Record("a", "dup")).toDF val joined = df.alias("a").join(df2.alias("b"), df("a") === df2("b")) val dupCol = df("dup") joined.drop(dupCol) // Does not drop anything joined.drop(func.col("a.dup")) // It drops the column joined.select(dupCol) // It selects the column {code} was: Hi, I am seeing a really strange behaviour in drop method after a join with aliases. It doesn't seem to find the column when I reference to it using dataframe("columnName") syntax, but it does work with other combinators like select {code:java} case class Record(a: String, dup: String) case class Record2(b: String, dup: String) val df = Seq(Record("a", "dup")).toDF val joined = df.alias("a").join(df2.alias("b"), df("a") === df2("b")) val dupCol = df("dup") joined.drop(dupCol) // Does not drop anything joined.drop(func.col("a.dup")) // It works! joined.select(dupCol) // It works! {code} > Drop does not work after join with aliases > -- > > Key: SPARK-31123 > URL: https://issues.apache.org/jira/browse/SPARK-31123 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.2 >Reporter: Mikel San Vicente >Priority: Minor > > > Hi, > I am seeing a really strange behaviour in drop method after a join with > aliases. It doesn't seem to find the column when I reference to it using > dataframe("columnName") syntax, but it does work with other combinators like > select > {code:java} > case class Record(a: String, dup: String) > case class Record2(b: String, dup: String) > val df = Seq(Record("a", "dup")).toDF > val joined = df.alias("a").join(df2.alias("b"), df("a") === df2("b")) > val dupCol = df("dup") > joined.drop(dupCol) // Does not drop anything > joined.drop(func.col("a.dup")) // It drops the column > joined.select(dupCol) // It selects the column > {code} > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31123) Drop does not work after join with aliases
Mikel San Vicente created SPARK-31123: - Summary: Drop does not work after join with aliases Key: SPARK-31123 URL: https://issues.apache.org/jira/browse/SPARK-31123 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.2 Reporter: Mikel San Vicente Hi, I am seeing a really strange behaviour in drop method after a join with aliases. It doesn't seem to find the column when I reference to it using dataframe("columnName") syntax, but it does work with other combinators like select {code:java} case class Record(a: String, dup: String) case class Record2(b: String, dup: String) val df = Seq(Record("a", "dup")).toDF val joined = df.alias("a").join(df2.alias("b"), df("a") === df2("b")) val dupCol = df("dup") joined.drop(dupCol) // Does not drop anything joined.drop(func.col("a.dup")) // It works! joined.select(dupCol) // It works! {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
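[Editor's note] The snippet in the report is missing the definition of {{df2}} and the {{func}} import; a self-contained version that runs in spark-shell (where {{spark.implicits._}} is already imported), with {{df2}} assumed symmetric to {{df}}:
{code}
import org.apache.spark.sql.{functions => func}

case class Record(a: String, dup: String)
case class Record2(b: String, dup: String)

val df  = Seq(Record("a", "dup")).toDF
val df2 = Seq(Record2("a", "dup")).toDF      // missing from the report; assumed symmetric to df

val joined = df.alias("a").join(df2.alias("b"), df("a") === df2("b"))
val dupCol = df("dup")

joined.drop(dupCol).columns               // reported: still contains both dup columns
joined.drop(func.col("a.dup")).columns    // drops the left-side dup column
joined.select(dupCol).show()              // resolves the same column without problems
{code}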
[jira] [Commented] (SPARK-31074) Avro serializer should not fail when a nullable Spark field is written to a non-null Avro column
[ https://issues.apache.org/jira/browse/SPARK-31074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056972#comment-17056972 ] Kyrill Alyoshin commented on SPARK-31074: - Yes, # Create a simple Avro schema file with 2 properties in it '*f1*' and '*f2*' - their types can be Strings. # Create a Spark dataframe with two fields in it '*f1*' and '*f2*' of String type that are *nullable*. # Write out this dataframe to a file using the Avro schema create in 1 through '{{avroSchema}}' option. > Avro serializer should not fail when a nullable Spark field is written to a > non-null Avro column > > > Key: SPARK-31074 > URL: https://issues.apache.org/jira/browse/SPARK-31074 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: Kyrill Alyoshin >Priority: Major > > Spark StructType schema are strongly biased towards having _nullable_ fields. > In fact, this is what _Encoders.bean()_ does - any non-primitive field is > automatically _nullable_. When we attempt to serialize dataframes into > *user-supplied* Avro schemas where such corresponding fields are marked as > _non-null_ (i.e., they are not of _union_ type) any such attempt will fail > with the following exception > > {code:java} > Caused by: org.apache.avro.AvroRuntimeException: Not a union: "string" > at org.apache.avro.Schema.getTypes(Schema.java:299) > at > org.apache.spark.sql.avro.AvroSerializer.org$apache$spark$sql$avro$AvroSerializer$$resolveNullableType(AvroSerializer.scala:229) > at > org.apache.spark.sql.avro.AvroSerializer$$anonfun$3.apply(AvroSerializer.scala:209) > {code} > This seems as rather draconian. We certainly should be able to write a field > of the same type and with the same name if it is not a null into a > non-nullable Avro column. In fact, the problem is so *severe* that it is not > clear what should be done in such situations when Avro schema is given to you > as part of API communication contract (i.e., it is non-changeable). > This is an important issue. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
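[Editor's note] The repro steps in the comment translate to roughly the following; the schema literal, record name, and output path are made up, while the {{avroSchema}} option itself is the real spark-avro option (the spark-avro package must be on the classpath):
{code}
// Step 1: a user-supplied Avro schema where f1/f2 are plain (non-union, non-null) strings.
val avroSchema =
  """{"type": "record", "name": "Payload", "fields": [
    |  {"name": "f1", "type": "string"},
    |  {"name": "f2", "type": "string"}
    |]}""".stripMargin

// Step 2: a DataFrame whose String columns are nullable (the default for non-primitive fields).
val df = Seq(("v1", "v2")).toDF("f1", "f2")

// Step 3: write with the user-supplied schema; on 2.4.4 this is where
// "org.apache.avro.AvroRuntimeException: Not a union" is reported to surface.
df.write
  .format("avro")
  .option("avroSchema", avroSchema)
  .save("/tmp/spark-31074-repro")   // placeholder path
{code}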
[jira] [Updated] (SPARK-31030) Backward Compatibility for Parsing and Formatting Datetime
[ https://issues.apache.org/jira/browse/SPARK-31030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuanjian Li updated SPARK-31030: Parent: SPARK-26904 Issue Type: Sub-task (was: Improvement) > Backward Compatibility for Parsing and Formatting Datetime > -- > > Key: SPARK-31030 > URL: https://issues.apache.org/jira/browse/SPARK-31030 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuanjian Li >Assignee: Yuanjian Li >Priority: Major > Fix For: 3.0.0 > > Attachments: image-2020-03-04-10-54-05-208.png, > image-2020-03-04-10-54-13-238.png > > > *Background* > In Spark version 2.4 and earlier, datetime parsing, formatting and conversion > are performed by using the hybrid calendar ([Julian + > Gregorian|https://docs.oracle.com/javase/7/docs/api/java/util/GregorianCalendar.html]). > > Since the Proleptic Gregorian calendar is de-facto calendar worldwide, as > well as the chosen one in ANSI SQL standard, Spark 3.0 switches to it by > using Java 8 API classes (the java.time packages that are based on [ISO > chronology|https://docs.oracle.com/javase/8/docs/api/java/time/chrono/IsoChronology.html] > ). > The switching job is completed in SPARK-26651. > > *Problem* > Switching to Java 8 datetime API breaks the backward compatibility of Spark > 2.4 and earlier when parsing datetime. Spark need its own patters definition > on datetime parsing and formatting. > > *Solution* > To avoid unexpected result changes after the underlying datetime API switch, > we propose the following solution. > * Introduce the fallback mechanism: when the Java 8-based parser fails, we > need to detect these behavior differences by falling back to the legacy > parser, and fail with a user-friendly error message to tell users what gets > changed and how to fix the pattern. > * Document the Spark’s datetime patterns: The date-time formatter of Spark > is decoupled with the Java patterns. The Spark’s patterns are mainly based on > the [Java 7’s > pattern|https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html] > (for better backward compatibility) with the customized logic (caused by the > breaking changes between [Java > 7|https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html] > and [Java > 8|https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html] > pattern string). Below are the customized rules: > ||Pattern||Java 7||Java 8|| Example||Rule|| > |u|Day number of week (1 = Monday, ..., 7 = Sunday)|Year (Different with y, u > accept a negative value to represent BC, while y should be used together with > G to do the same thing.)|!image-2020-03-04-10-54-05-208.png! |Substitute ‘u’ > to ‘e’ and use Java 8 parser to parse the string. If parsable, return the > result; otherwise, fall back to ‘u’, and then use the legacy Java 7 parser to > parse. When it is successfully parsed, throw an exception and ask users to > change the pattern strings or turn on the legacy mode; otherwise, return NULL > as what Spark 2.4 does.| > | z| General time zone which also accepts > [RFC 822 time zones|#rfc822timezone]]|Only accept time-zone name, e.g. > Pacific Standard Time; PST|!image-2020-03-04-10-54-13-238.png! |The > semantics of ‘z’ are different between Java 7 and Java 8. Here, Spark 3.0 > follows the semantics of Java 8. > Use Java 8 to parse the string. If parsable, return the result; otherwise, > use the legacy Java 7 parser to parse. 
When it is successfully parsed, throw > an exception and ask users to change the pattern strings or turn on the > legacy mode; otherwise, return NULL as what Spark 2.4 does.| > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
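[Editor's note] The fallback mechanism described above surfaces to users as a single switch; to the best of my knowledge the shipped config is {{spark.sql.legacy.timeParserPolicy}}, sketched here:
{code}
// EXCEPTION (default): parse with the new Java 8 formatter, but fail with a hint when
// the legacy parser would have accepted the input.
// CORRECTED: always use the new parser.  LEGACY: always use the Spark 2.4 parser.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
spark.sql("SELECT to_timestamp('2020-01-27 20:06:11.847', 'yyyy-MM-dd HH:mm:ss.SSS')").show()
{code}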
[jira] [Deleted] (SPARK-31031) Backward Compatibility for Parsing Datetime
[ https://issues.apache.org/jira/browse/SPARK-31031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan deleted SPARK-31031: > Backward Compatibility for Parsing Datetime > --- > > Key: SPARK-31031 > URL: https://issues.apache.org/jira/browse/SPARK-31031 > Project: Spark > Issue Type: Sub-task >Reporter: Yuanjian Li >Priority: Major > > Mirror issue for SPARK-31030, because of the sub-task can't add sub-task. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31111) Fix interval output issue in ExtractBenchmark
[ https://issues.apache.org/jira/browse/SPARK-31111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31111. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27867 [https://github.com/apache/spark/pull/27867] > Fix interval output issue in ExtractBenchmark > -- > > Key: SPARK-31111 > URL: https://issues.apache.org/jira/browse/SPARK-31111 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31111) Fix interval output issue in ExtractBenchmark
[ https://issues.apache.org/jira/browse/SPARK-31111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31111: --- Assignee: Kent Yao > Fix interval output issue in ExtractBenchmark > -- > > Key: SPARK-31111 > URL: https://issues.apache.org/jira/browse/SPARK-31111 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org