[jira] [Updated] (SPARK-31756) Add real headless browser support for UI test
[ https://issues.apache.org/jira/browse/SPARK-31756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-31756: --- Description: In the current master, there are two problems with UI tests. 1. Lots of tests, especially JavaScript-related ones, are done manually. Appearance is better confirmed by our eyes, but logic should ideally be tested by test cases. 2. Compared to real web browsers, HtmlUnit doesn't seem to support JavaScript well enough. I previously added a JavaScript-related test for SPARK-31534 using HtmlUnit, which is a simple library-based headless browser for testing. The test I added works somehow, but a JavaScript-related error is shown in unit-tests.log. {code:java} === EXCEPTION START Exception class=[net.sourceforge.htmlunit.corejs.javascript.JavaScriptException] com.gargoylesoftware.htmlunit.ScriptException: Error: TOOLTIP: Option "sanitizeFn" provided type "window" but expected type "(null|function)". (http://192.168.1.209:60724/static/jquery-3.4.1.min.js#2) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:904) at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:628) at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:515) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.callFunction(JavaScriptEngine.java:835) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.callFunction(JavaScriptEngine.java:807) at com.gargoylesoftware.htmlunit.InteractivePage.executeJavaScriptFunctionIfPossible(InteractivePage.java:216) at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptFunctionJob.runJavaScript(JavaScriptFunctionJob.java:52) at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptExecutionJob.run(JavaScriptExecutionJob.java:102) at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptJobManagerImpl.runSingleJob(JavaScriptJobManagerImpl.java:426) at com.gargoylesoftware.htmlunit.javascript.background.DefaultJavaScriptExecutor.run(DefaultJavaScriptExecutor.java:157) at java.lang.Thread.run(Thread.java:748) Caused by: net.sourceforge.htmlunit.corejs.javascript.JavaScriptException: Error: TOOLTIP: Option "sanitizeFn" provided type "window" but expected type "(null|function)". (http://192.168.1.209:60724/static/jquery-3.4.1.min.js#2) at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1009) at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:800) at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:105) at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.doTopCall(ContextFactory.java:413) at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.doTopCall(HtmlUnitContextFactory.java:252) at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:3264) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$4.doRun(JavaScriptEngine.java:828) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:889) ... 10 more JavaScriptException value = Error: TOOLTIP: Option "sanitizeFn" provided type "window" but expected type "(null|function)". 
== CALLING JAVASCRIPT == function () { throw e; } === EXCEPTION END {code} I tried upgrading HtmlUnit to 2.40.0, but what is worse, the test stopped working even though it runs without error on real browsers like Chrome, Safari, and Firefox. {code:java} [info] UISeleniumSuite: [info] - SPARK-31534: text for tooltip should be escaped *** FAILED *** (17 seconds, 745 milliseconds) [info] The code passed to eventually never returned normally. Attempted 2 times over 12.910785232 seconds. Last failure message: com.gargoylesoftware.htmlunit.ScriptException: ReferenceError: Assignment to undefined "regeneratorRuntime" in strict mode (http://192.168.1.209:62132/static/vis-timeline-graph2d.min.js#52(Function)#1){code} To resolve these problems, it's better to support a real headless browser for UI tests.
[jira] [Created] (SPARK-31756) Add real headless browser support for UI test
Kousuke Saruta created SPARK-31756: -- Summary: Add real headless browser support for UI test Key: SPARK-31756 URL: https://issues.apache.org/jira/browse/SPARK-31756 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 3.1.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta In the current master, there are two problems with UI tests. 1. Lots of tests, especially JavaScript-related ones, are done manually. Appearance is better confirmed by our eyes, but logic should ideally be tested by test cases. 2. Compared to real web browsers, HtmlUnit doesn't seem to support JavaScript well enough. I previously added a JavaScript-related test for SPARK-31534 using HtmlUnit, which is a simple library-based headless browser for testing. The test I added works somehow, but a JavaScript-related error is shown in unit-tests.log. {code:java} === EXCEPTION START Exception class=[net.sourceforge.htmlunit.corejs.javascript.JavaScriptException] com.gargoylesoftware.htmlunit.ScriptException: Error: TOOLTIP: Option "sanitizeFn" provided type "window" but expected type "(null|function)". (http://192.168.1.209:60724/static/jquery-3.4.1.min.js#2) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:904) at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:628) at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:515) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.callFunction(JavaScriptEngine.java:835) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.callFunction(JavaScriptEngine.java:807) at com.gargoylesoftware.htmlunit.InteractivePage.executeJavaScriptFunctionIfPossible(InteractivePage.java:216) at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptFunctionJob.runJavaScript(JavaScriptFunctionJob.java:52) at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptExecutionJob.run(JavaScriptExecutionJob.java:102) at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptJobManagerImpl.runSingleJob(JavaScriptJobManagerImpl.java:426) at com.gargoylesoftware.htmlunit.javascript.background.DefaultJavaScriptExecutor.run(DefaultJavaScriptExecutor.java:157) at java.lang.Thread.run(Thread.java:748) Caused by: net.sourceforge.htmlunit.corejs.javascript.JavaScriptException: Error: TOOLTIP: Option "sanitizeFn" provided type "window" but expected type "(null|function)". (http://192.168.1.209:60724/static/jquery-3.4.1.min.js#2) at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1009) at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:800) at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:105) at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.doTopCall(ContextFactory.java:413) at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.doTopCall(HtmlUnitContextFactory.java:252) at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:3264) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$4.doRun(JavaScriptEngine.java:828) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:889) ... 10 more JavaScriptException value = Error: TOOLTIP: Option "sanitizeFn" provided type "window" but expected type "(null|function)". 
== CALLING JAVASCRIPT == function () { throw e; } === EXCEPTION END {code} I tried upgrading HtmlUnit to 2.40.0, but what is worse, the test stopped working even though it runs without error on real browsers like Chrome, Safari, and Firefox. {code:java} [info] UISeleniumSuite: [info] - SPARK-31534: text for tooltip should be escaped *** FAILED *** (17 seconds, 745 milliseconds) [info] The code passed to eventually never returned normally. Attempted 2 times over 12.910785232 seconds. Last failure message: com.gargoylesoftware.htmlunit.ScriptException: ReferenceError: Assignment to undefined "regeneratorRuntime" in strict mode (http://192.168.1.209:62132/static/vis-timeline-graph2d.min.js#52(Function)#1){code} To resolve these problems, it's better to support a real headless browser for UI tests. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
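Since the proposal is to drive UI tests with a real headless browser instead of HtmlUnit, the following is a minimal sketch of what such a test could look like. It assumes Selenium's Java bindings and a ChromeDriver binary are available on the test classpath and that a Spark UI is already running on the default port; the object name and checks are illustrative, not the actual test suite.

{code:scala}
// Hypothetical sketch: exercising the Spark UI with headless Chrome instead of HtmlUnit.
import org.openqa.selenium.WebDriver
import org.openqa.selenium.chrome.{ChromeDriver, ChromeOptions}

object HeadlessChromeUICheck {
  def main(args: Array[String]): Unit = {
    val options = new ChromeOptions()
    options.addArguments("--headless")    // run Chrome without a visible window
    options.addArguments("--disable-gpu") // often needed in headless environments

    val driver: WebDriver = new ChromeDriver(options)
    try {
      // http://localhost:4040 is the default Spark UI address; adjust as needed.
      driver.get("http://localhost:4040/jobs/")
      // A real browser engine executes the page's JavaScript (jQuery, Bootstrap tooltips,
      // vis-timeline), so the test covers the same code paths as Chrome/Firefox/Safari.
      assert(driver.getTitle.contains("Spark"))
    } finally {
      driver.quit()
    }
  }
}
{code}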
[jira] [Resolved] (SPARK-31440) Improve SQL Rest API
[ https://issues.apache.org/jira/browse/SPARK-31440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-31440. Resolution: Fixed This issue is resolved in https://github.com/apache/spark/pull/28208 > Improve SQL Rest API > > > Key: SPARK-31440 > URL: https://issues.apache.org/jira/browse/SPARK-31440 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Eren Avsarogullari >Assignee: Eren Avsarogullari >Priority: Major > Attachments: current_version.json, improved_version.json, > improved_version_May1th.json > > > SQL Rest API exposes query execution metrics as Public API. This Jira aims to > apply following improvements on SQL Rest API by aligning Spark-UI. > *Proposed Improvements:* > 1- Support Physical Operations and group metrics per physical operation by > aligning Spark UI. > 2- Support *wholeStageCodegenId* for Physical Operations > 3- *nodeId* can be useful for grouping metrics and sorting physical > operations (according to execution order) to differentiate same operators (if > used multiple times during the same query execution) and their metrics. > 4- Filter *empty* metrics by aligning with Spark UI - SQL Tab. Currently, > Spark UI does not show empty metrics. > 5- Remove line breakers(*\n*) from *metricValue*. > 6- *planDescription* can be *optional* Http parameter to avoid network cost > where there is specially complex jobs creating big-plans. > 7- *metrics* attribute needs to be exposed at the bottom order as > *metricDetails*. Specially, this can be useful for the user where > *metricDetails* array size is high. > 8- Reverse order on *metricDetails* aims to match with Spark UI by supporting > Physical Operators' execution order. > *Attachments:* > Please find both *current* and *improved* versions of the results as > attached for following SQL Rest Endpoint: > {code:java} > curl -X GET > http://localhost:4040/api/v1/applications/$appId/sql/$executionId?details=true{code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31440) Improve SQL Rest API
[ https://issues.apache.org/jira/browse/SPARK-31440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang reassigned SPARK-31440: -- Assignee: Eren Avsarogullari > Improve SQL Rest API > > > Key: SPARK-31440 > URL: https://issues.apache.org/jira/browse/SPARK-31440 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Eren Avsarogullari >Assignee: Eren Avsarogullari >Priority: Major > Attachments: current_version.json, improved_version.json, > improved_version_May1th.json > > > SQL Rest API exposes query execution metrics as Public API. This Jira aims to > apply following improvements on SQL Rest API by aligning Spark-UI. > *Proposed Improvements:* > 1- Support Physical Operations and group metrics per physical operation by > aligning Spark UI. > 2- Support *wholeStageCodegenId* for Physical Operations > 3- *nodeId* can be useful for grouping metrics and sorting physical > operations (according to execution order) to differentiate same operators (if > used multiple times during the same query execution) and their metrics. > 4- Filter *empty* metrics by aligning with Spark UI - SQL Tab. Currently, > Spark UI does not show empty metrics. > 5- Remove line breakers(*\n*) from *metricValue*. > 6- *planDescription* can be *optional* Http parameter to avoid network cost > where there is specially complex jobs creating big-plans. > 7- *metrics* attribute needs to be exposed at the bottom order as > *metricDetails*. Specially, this can be useful for the user where > *metricDetails* array size is high. > 8- Reverse order on *metricDetails* aims to match with Spark UI by supporting > Physical Operators' execution order. > *Attachments:* > Please find both *current* and *improved* versions of the results as > attached for following SQL Rest Endpoint: > {code:java} > curl -X GET > http://localhost:4040/api/v1/applications/$appId/sql/$executionId?details=true{code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31399) Closure cleaner broken in Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-31399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110888#comment-17110888 ] Apache Spark commented on SPARK-31399: -- User 'rednaxelafx' has created a pull request for this issue: https://github.com/apache/spark/pull/28577 > Closure cleaner broken in Scala 2.12 > > > Key: SPARK-31399 > URL: https://issues.apache.org/jira/browse/SPARK-31399 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.5, 3.0.0 >Reporter: Wenchen Fan >Assignee: Kris Mok >Priority: Blocker > Fix For: 3.0.0 > > > The `ClosureCleaner` only support Scala functions and it uses the following > check to catch closures > {code} > // Check whether a class represents a Scala closure > private def isClosure(cls: Class[_]): Boolean = { > cls.getName.contains("$anonfun$") > } > {code} > This doesn't work in 3.0 any more as we upgrade to Scala 2.12 and most Scala > functions become Java lambdas. > As an example, the following code works well in Spark 2.4 Spark Shell: > {code} > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. > import org.apache.spark.sql.functions.lit > defined class Foo > col: org.apache.spark.sql.Column = 123 > df: org.apache.spark.rdd.RDD[Foo] = MapPartitionsRDD[5] at map at :20 > {code} > But fails in 3.0 > {code} > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. > org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:396) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:386) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2371) > at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:422) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:414) > at org.apache.spark.rdd.RDD.map(RDD.scala:421) > ... 
39 elided > Caused by: java.io.NotSerializableException: org.apache.spark.sql.Column > Serialization stack: > - object not serializable (class: org.apache.spark.sql.Column, value: > 123) > - field (class: $iw, name: col, type: class org.apache.spark.sql.Column) > - object (class $iw, $iw@2d87ac2b) > - element of array (index: 0) > - array (class [Ljava.lang.Object;, size 1) > - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, > type: class [Ljava.lang.Object;) > - object (class java.lang.invoke.SerializedLambda, > SerializedLambda[capturingClass=class $iw, > functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;, > implementation=invokeStatic > $anonfun$df$1$adapted:(L$iw;Ljava/lang/Object;)LFoo;, > instantiatedMethodType=(Ljava/lang/Object;)LFoo;, numCaptured=1]) > - writeReplace data (class: java.lang.invoke.SerializedLambda) > - object (class $Lambda$2438/170049100, $Lambda$2438/170049100@d6b8c43) > at > org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47) > at > org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101) > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:393) > ... 47 more > {code} > **Apache Spark 2.4.5 with Scala 2.12** > {code} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.4.5 > /_/ > Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242) > Type in expressions to have them evaluated. > Type :help for more information. > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. > org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:403) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.sc
[jira] [Commented] (SPARK-31399) Closure cleaner broken in Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-31399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110889#comment-17110889 ] Apache Spark commented on SPARK-31399: -- User 'rednaxelafx' has created a pull request for this issue: https://github.com/apache/spark/pull/28577 > Closure cleaner broken in Scala 2.12 > > > Key: SPARK-31399 > URL: https://issues.apache.org/jira/browse/SPARK-31399 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.5, 3.0.0 >Reporter: Wenchen Fan >Assignee: Kris Mok >Priority: Blocker > Fix For: 3.0.0 > > > The `ClosureCleaner` only support Scala functions and it uses the following > check to catch closures > {code} > // Check whether a class represents a Scala closure > private def isClosure(cls: Class[_]): Boolean = { > cls.getName.contains("$anonfun$") > } > {code} > This doesn't work in 3.0 any more as we upgrade to Scala 2.12 and most Scala > functions become Java lambdas. > As an example, the following code works well in Spark 2.4 Spark Shell: > {code} > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. > import org.apache.spark.sql.functions.lit > defined class Foo > col: org.apache.spark.sql.Column = 123 > df: org.apache.spark.rdd.RDD[Foo] = MapPartitionsRDD[5] at map at :20 > {code} > But fails in 3.0 > {code} > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. > org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:396) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:386) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2371) > at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:422) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:414) > at org.apache.spark.rdd.RDD.map(RDD.scala:421) > ... 
39 elided > Caused by: java.io.NotSerializableException: org.apache.spark.sql.Column > Serialization stack: > - object not serializable (class: org.apache.spark.sql.Column, value: > 123) > - field (class: $iw, name: col, type: class org.apache.spark.sql.Column) > - object (class $iw, $iw@2d87ac2b) > - element of array (index: 0) > - array (class [Ljava.lang.Object;, size 1) > - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, > type: class [Ljava.lang.Object;) > - object (class java.lang.invoke.SerializedLambda, > SerializedLambda[capturingClass=class $iw, > functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;, > implementation=invokeStatic > $anonfun$df$1$adapted:(L$iw;Ljava/lang/Object;)LFoo;, > instantiatedMethodType=(Ljava/lang/Object;)LFoo;, numCaptured=1]) > - writeReplace data (class: java.lang.invoke.SerializedLambda) > - object (class $Lambda$2438/170049100, $Lambda$2438/170049100@d6b8c43) > at > org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47) > at > org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101) > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:393) > ... 47 more > {code} > **Apache Spark 2.4.5 with Scala 2.12** > {code} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.4.5 > /_/ > Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242) > Type in expressions to have them evaluated. > Type :help for more information. > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. > org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:403) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.sc
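For background on why the name-based check quoted above stops working: on Scala 2.12 a closure usually compiles to a Java lambda whose class name contains "$Lambda$" rather than "$anonfun$", and whose serialized form is a java.lang.invoke.SerializedLambda. The sketch below shows one way to recognize that shape; it is an illustration of the idea, not the ClosureCleaner's actual fix.

{code:scala}
// Illustrative only: detect a closure compiled as a serializable Java lambda (Scala 2.12).
object LambdaCheck {
  def looksLikeJavaLambda(closure: AnyRef): Boolean = {
    val cls = closure.getClass
    cls.getName.contains("$Lambda$") && {
      try {
        // Serializable lambdas expose a writeReplace method returning SerializedLambda.
        val m = cls.getDeclaredMethod("writeReplace")
        m.setAccessible(true)
        m.invoke(closure).isInstanceOf[java.lang.invoke.SerializedLambda]
      } catch {
        case _: NoSuchMethodException => false
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val f: Int => Int = x => x + 1  // compiled via LambdaMetafactory on 2.12
    println(looksLikeJavaLambda(f)) // expected: true on Scala 2.12
  }
}
{code}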
[jira] [Assigned] (SPARK-31755) allow missing year/hour when parsing date/timestamp
[ https://issues.apache.org/jira/browse/SPARK-31755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31755: Assignee: Apache Spark (was: Wenchen Fan) > allow missing year/hour when parsing date/timestamp > --- > > Key: SPARK-31755 > URL: https://issues.apache.org/jira/browse/SPARK-31755 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31755) allow missing year/hour when parsing date/timestamp
[ https://issues.apache.org/jira/browse/SPARK-31755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31755: Assignee: Wenchen Fan (was: Apache Spark) > allow missing year/hour when parsing date/timestamp > --- > > Key: SPARK-31755 > URL: https://issues.apache.org/jira/browse/SPARK-31755 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31755) allow missing year/hour when parsing date/timestamp
[ https://issues.apache.org/jira/browse/SPARK-31755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110886#comment-17110886 ] Apache Spark commented on SPARK-31755: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/28576 > allow missing year/hour when parsing date/timestamp > --- > > Key: SPARK-31755 > URL: https://issues.apache.org/jira/browse/SPARK-31755 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31755) allow missing year/hour when parsing date/timestamp
Wenchen Fan created SPARK-31755: --- Summary: allow missing year/hour when parsing date/timestamp Key: SPARK-31755 URL: https://issues.apache.org/jira/browse/SPARK-31755 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
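The ticket carries no description, but the title suggests making the datetime parser tolerate patterns that omit fields such as the year or the hour, presumably by falling back to defaults instead of failing. The snippet below only illustrates the kind of query affected; the exact defaulting rules are defined by the linked pull request, not here.

{code:scala}
// Illustration of queries whose patterns omit fields (behaviour here is assumed, not confirmed).
import org.apache.spark.sql.SparkSession

object MissingFieldParsing {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("missing-fields").getOrCreate()

    // Pattern without a year: the missing year would presumably default (e.g. to 1970).
    spark.sql("SELECT to_date('05-18', 'MM-dd')").show()

    // Pattern without hour/minute/second: the missing time fields would presumably default to 0.
    spark.sql("SELECT to_timestamp('2020-05-18', 'yyyy-MM-dd')").show()

    spark.stop()
  }
}
{code}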
[jira] [Commented] (SPARK-30768) Constraints inferred from inequality attributes
[ https://issues.apache.org/jira/browse/SPARK-30768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110877#comment-17110877 ] Yuming Wang commented on SPARK-30768: - Teradata support this: https://docs.teradata.com/reader/Ws7YT1jvRK2vEr1LpVURug/V~FCwD9BL7gY4ac3WwHInw?section=xcg1472241575102__application_of_transitive_closure_section > Constraints inferred from inequality attributes > --- > > Key: SPARK-30768 > URL: https://issues.apache.org/jira/browse/SPARK-30768 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > How to reproduce: > {code:sql} > create table SPARK_30768_1(c1 int, c2 int); > create table SPARK_30768_2(c1 int, c2 int); > {code} > *Spark SQL*: > {noformat} > spark-sql> explain select t1.* from SPARK_30768_1 t1 join SPARK_30768_2 t2 on > (t1.c1 > t2.c1) where t1.c1 = 3; > == Physical Plan == > *(3) Project [c1#5, c2#6] > +- BroadcastNestedLoopJoin BuildRight, Inner, (c1#5 > c1#7) >:- *(1) Project [c1#5, c2#6] >: +- *(1) Filter (isnotnull(c1#5) AND (c1#5 = 3)) >: +- *(1) ColumnarToRow >:+- FileScan parquet default.spark_30768_1[c1#5,c2#6] Batched: > true, DataFilters: [isnotnull(c1#5), (c1#5 = 3)], Format: Parquet, Location: > InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous..., > PartitionFilters: [], PushedFilters: [IsNotNull(c1), EqualTo(c1,3)], > ReadSchema: struct >+- BroadcastExchange IdentityBroadcastMode, [id=#60] > +- *(2) Project [c1#7] > +- *(2) Filter isnotnull(c1#7) > +- *(2) ColumnarToRow >+- FileScan parquet default.spark_30768_2[c1#7] Batched: true, > DataFilters: [isnotnull(c1#7)], Format: Parquet, Location: > InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous..., > PartitionFilters: [], PushedFilters: [IsNotNull(c1)], ReadSchema: > struct > {noformat} > *Hive* support this feature: > {noformat} > hive> explain select t1.* from SPARK_30768_1 t1 join SPARK_30768_2 t2 on > (t1.c1 > t2.c1) where t1.c1 = 3; > Warning: Map Join MAPJOIN[13][bigTable=?] 
in task 'Stage-3:MAPRED' is a cross > product > OK > STAGE DEPENDENCIES: > Stage-4 is a root stage > Stage-3 depends on stages: Stage-4 > Stage-0 depends on stages: Stage-3 > STAGE PLANS: > Stage: Stage-4 > Map Reduce Local Work > Alias -> Map Local Tables: > $hdt$_0:t1 > Fetch Operator > limit: -1 > Alias -> Map Local Operator Tree: > $hdt$_0:t1 > TableScan > alias: t1 > filterExpr: (c1 = 3) (type: boolean) > Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column > stats: NONE > Filter Operator > predicate: (c1 = 3) (type: boolean) > Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL > Column stats: NONE > Select Operator > expressions: c2 (type: int) > outputColumnNames: _col1 > Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL > Column stats: NONE > HashTable Sink Operator > keys: > 0 > 1 > Stage: Stage-3 > Map Reduce > Map Operator Tree: > TableScan > alias: t2 > filterExpr: (c1 < 3) (type: boolean) > Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column > stats: NONE > Filter Operator > predicate: (c1 < 3) (type: boolean) > Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL > Column stats: NONE > Select Operator > Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL > Column stats: NONE > Map Join Operator > condition map: >Inner Join 0 to 1 > keys: > 0 > 1 > outputColumnNames: _col1 > Statistics: Num rows: 1 Data size: 1 Basic stats: PARTIAL > Column stats: NONE > Select Operator > expressions: 3 (type: int), _col1 (type: int) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 1 Data size: 1 Basic stats: PARTIAL > Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 1 Data size: 1 Basic stats: > PARTIAL Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.Sequenc
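The essence of the request above: from the filter t1.c1 = 3 and the join condition t1.c1 > t2.c1, the optimizer could infer t2.c1 < 3 and push it down to the scan of SPARK_30768_2, as the Hive plan does. The toy sketch below models that transitive inference over a made-up predicate type; it is not Catalyst's API.

{code:scala}
// Toy model: given a = k and a > b over the same attribute a, derive b < k.
object InequalityInference {
  sealed trait Pred
  case class EqualTo(attr: String, value: Int) extends Pred
  case class GreaterThan(left: String, right: String) extends Pred
  case class LessThan(attr: String, value: Int) extends Pred

  def infer(preds: Seq[Pred]): Seq[Pred] =
    for {
      EqualTo(a, k)     <- preds
      GreaterThan(l, r) <- preds
      if l == a
    } yield LessThan(r, k) // a = k and a > r implies r < k

  def main(args: Array[String]): Unit = {
    val conds = Seq(EqualTo("t1.c1", 3), GreaterThan("t1.c1", "t2.c1"))
    println(infer(conds)) // expected: List(LessThan(t2.c1,3))
  }
}
{code}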
[jira] [Commented] (SPARK-31754) Spark Structured Streaming: NullPointerException in Stream Stream join
[ https://issues.apache.org/jira/browse/SPARK-31754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110873#comment-17110873 ] Jungtaek Lim commented on SPARK-31754: -- [~puviarasu] Given that the error comes from "generated code", you may want to turn on DEBUG logging for the class below, retrieve the generated code, and paste that code as well. org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator Btw, priorities higher than Major should go through a committer's decision. I'll lower the priority and wait for a committer's decision. (Personally it looks like an edge case, not meant to be a blocker.) > Spark Structured Streaming: NullPointerException in Stream Stream join > -- > > Key: SPARK-31754 > URL: https://issues.apache.org/jira/browse/SPARK-31754 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 > Environment: Spark Version : 2.4.0 > Hadoop Version : 3.0.0 >Reporter: Puviarasu >Priority: Major > Labels: structured-streaming > > When joining 2 streams with watermarking and windowing we are getting > NullPointer Exception after running for few minutes. > After failure we analyzed the checkpoint offsets/sources and found the files > for which the application failed. These files are not having any null values > in the join columns. > We even started the job with the files and the application ran. From this we > concluded that the exception is not because of the data from the streams. > *Code:* > > {code:java} > val optionsMap1 = Map[String, String]("Path" -> "/path/to/source1", > "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" > ->"false", "checkpointLocation" -> "/path/to/checkpoint1", "rowsPerSecond" -> > "1" ) > val optionsMap2 = Map[String, String]("Path" -> "/path/to/source2", > "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" > ->"false", "checkpointLocation" -> "/path/to/checkpoint2", "rowsPerSecond" -> > "1" ) > > spark.readStream.format("parquet").options(optionsMap1).load().createTempView("source1") > > spark.readStream.format("parquet").options(optionsMap2).load().createTempView("source2") > spark.sql("select * from source1 where eventTime1 is not null and col1 is > not null").withWatermark("eventTime1", "30 > minutes").createTempView("viewNotNull1") > spark.sql("select * from source2 where eventTime2 is not null and col2 is > not null").withWatermark("eventTime2", "30 > minutes").createTempView("viewNotNull2") > spark.sql("select * from viewNotNull1 a join viewNotNull2 b on a.col1 = > b.col2 and a.eventTime1 >= b.eventTime2 and a.eventTime1 <= b.eventTime2 + > interval 2 hours").createTempView("join") > val optionsMap3 = Map[String, String]("compression" -> "snappy","path" -> > "/path/to/sink", "checkpointLocation" -> "/path/to/checkpoint3") > spark.sql("select * from > join").writeStream.outputMode("append").trigger(Trigger.ProcessingTime("5 > seconds")).format("parquet").options(optionsMap3).start() > {code} > > *Exception:* > > {code:java} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Aborting TaskSet 4.0 because task 0 (partition 0) > cannot run anywhere due to node and executor blacklist. 
> Most recent failure: > Lost task 0.2 in stage 4.0 (TID 6, executor 3): java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown > Source) > at > org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$OneSideHashJoiner$$anonfun$26.apply(StreamingSymmetricHashJoinExec.scala:412) > at > org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$OneSideHashJoiner$$anonfun$26.apply(StreamingSymmetricHashJoinExec.scala:412) > at > org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.findNextValueForIndex(SymmetricHashJoinStateManager.scala:197) > at > org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.getNext(SymmetricHashJoinStateManager.scala:221) > at > org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.getNext(SymmetricHashJoinStateManager.scala:157) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:212) > at > org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$onOutputCompletion$1$1.apply$mcV$spala:338) > at > org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$org$apache$spark$sql$
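To act on the suggestion above and capture the generated SpecificPredicate source, one hedged option in a Spark 2.4 application (which ships log4j 1.x) is to raise the log level for just that class programmatically; adding the equivalent logger line to conf/log4j.properties works as well.

{code:scala}
// Sketch: enable DEBUG only for the codegen class named in the comment above, so the
// generated code is written to the logs for inspection.
import org.apache.log4j.{Level, Logger}

object EnableCodegenDebugLog {
  def main(args: Array[String]): Unit = {
    Logger
      .getLogger("org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator")
      .setLevel(Level.DEBUG)
    // ...then start the streaming query; the generated predicate code appears at DEBUG level.
  }
}
{code}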
[jira] [Updated] (SPARK-31754) Spark Structured Streaming: NullPointerException in Stream Stream join
[ https://issues.apache.org/jira/browse/SPARK-31754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-31754: - Priority: Major (was: Blocker) > Spark Structured Streaming: NullPointerException in Stream Stream join > -- > > Key: SPARK-31754 > URL: https://issues.apache.org/jira/browse/SPARK-31754 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 > Environment: Spark Version : 2.4.0 > Hadoop Version : 3.0.0 >Reporter: Puviarasu >Priority: Major > Labels: structured-streaming > > When joining 2 streams with watermarking and windowing we are getting > NullPointer Exception after running for few minutes. > After failure we analyzed the checkpoint offsets/sources and found the files > for which the application failed. These files are not having any null values > in the join columns. > We even started the job with the files and the application ran. From this we > concluded that the exception is not because of the data from the streams. > *Code:* > > {code:java} > val optionsMap1 = Map[String, String]("Path" -> "/path/to/source1", > "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" > ->"false", "checkpointLocation" -> "/path/to/checkpoint1", "rowsPerSecond" -> > "1" ) > val optionsMap2 = Map[String, String]("Path" -> "/path/to/source2", > "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" > ->"false", "checkpointLocation" -> "/path/to/checkpoint2", "rowsPerSecond" -> > "1" ) > > spark.readStream.format("parquet").options(optionsMap1).load().createTempView("source1") > > spark.readStream.format("parquet").options(optionsMap2).load().createTempView("source2") > spark.sql("select * from source1 where eventTime1 is not null and col1 is > not null").withWatermark("eventTime1", "30 > minutes").createTempView("viewNotNull1") > spark.sql("select * from source2 where eventTime2 is not null and col2 is > not null").withWatermark("eventTime2", "30 > minutes").createTempView("viewNotNull2") > spark.sql("select * from viewNotNull1 a join viewNotNull2 b on a.col1 = > b.col2 and a.eventTime1 >= b.eventTime2 and a.eventTime1 <= b.eventTime2 + > interval 2 hours").createTempView("join") > val optionsMap3 = Map[String, String]("compression" -> "snappy","path" -> > "/path/to/sink", "checkpointLocation" -> "/path/to/checkpoint3") > spark.sql("select * from > join").writeStream.outputMode("append").trigger(Trigger.ProcessingTime("5 > seconds")).format("parquet").options(optionsMap3).start() > {code} > > *Exception:* > > {code:java} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Aborting TaskSet 4.0 because task 0 (partition 0) > cannot run anywhere due to node and executor blacklist. 
> Most recent failure: > Lost task 0.2 in stage 4.0 (TID 6, executor 3): java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown > Source) > at > org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$OneSideHashJoiner$$anonfun$26.apply(StreamingSymmetricHashJoinExec.scala:412) > at > org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$OneSideHashJoiner$$anonfun$26.apply(StreamingSymmetricHashJoinExec.scala:412) > at > org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.findNextValueForIndex(SymmetricHashJoinStateManager.scala:197) > at > org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.getNext(SymmetricHashJoinStateManager.scala:221) > at > org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.getNext(SymmetricHashJoinStateManager.scala:157) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:212) > at > org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$onOutputCompletion$1$1.apply$mcV$spala:338) > at > org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$onOutputCompletion$1$1.apply(Stream) > at > org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$onOutputCompletion$1$1.apply(Stream) > at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:583) > at > org.apache.spark.sql.execution.streaming.StateStoreWriter$class.timeTakenMs(sta
[jira] [Updated] (SPARK-31706) add back the support of streaming update mode
[ https://issues.apache.org/jira/browse/SPARK-31706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-31706: Priority: Blocker (was: Major) > add back the support of streaming update mode > - > > Key: SPARK-31706 > URL: https://issues.apache.org/jira/browse/SPARK-31706 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Blocker > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31754) Spark Structured Streaming: NullPointerException in Stream Stream join
[ https://issues.apache.org/jira/browse/SPARK-31754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Puviarasu updated SPARK-31754: -- Description: When joining 2 streams with watermarking and windowing, we are getting a NullPointerException after running for a few minutes. After the failure we analyzed the checkpoint offsets/sources and found the files for which the application failed. These files do not have any null values in the join columns. We even restarted the job with those files and the application ran. From this we concluded that the exception is not caused by the data from the streams. *Code:* {code:java} val optionsMap1 = Map[String, String]("Path" -> "/path/to/source1", "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" ->"false", "checkpointLocation" -> "/path/to/checkpoint1", "rowsPerSecond" -> "1" ) val optionsMap2 = Map[String, String]("Path" -> "/path/to/source2", "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" ->"false", "checkpointLocation" -> "/path/to/checkpoint2", "rowsPerSecond" -> "1" ) spark.readStream.format("parquet").options(optionsMap1).load().createTempView("source1") spark.readStream.format("parquet").options(optionsMap2).load().createTempView("source2") spark.sql("select * from source1 where eventTime1 is not null and col1 is not null").withWatermark("eventTime1", "30 minutes").createTempView("viewNotNull1") spark.sql("select * from source2 where eventTime2 is not null and col2 is not null").withWatermark("eventTime2", "30 minutes").createTempView("viewNotNull2") spark.sql("select * from viewNotNull1 a join viewNotNull2 b on a.col1 = b.col2 and a.eventTime1 >= b.eventTime2 and a.eventTime1 <= b.eventTime2 + interval 2 hours").createTempView("join") val optionsMap3 = Map[String, String]("compression" -> "snappy","path" -> "/path/to/sink", "checkpointLocation" -> "/path/to/checkpoint3") spark.sql("select * from join").writeStream.outputMode("append").trigger(Trigger.ProcessingTime("5 seconds")).format("parquet").options(optionsMap3).start() {code} *Exception:* {code:java} Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Aborting TaskSet 4.0 because task 0 (partition 0) cannot run anywhere due to node and executor blacklist. 
Most recent failure: Lost task 0.2 in stage 4.0 (TID 6, executor 3): java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source) at org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$OneSideHashJoiner$$anonfun$26.apply(StreamingSymmetricHashJoinExec.scala:412) at org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$OneSideHashJoiner$$anonfun$26.apply(StreamingSymmetricHashJoinExec.scala:412) at org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.findNextValueForIndex(SymmetricHashJoinStateManager.scala:197) at org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.getNext(SymmetricHashJoinStateManager.scala:221) at org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.getNext(SymmetricHashJoinStateManager.scala:157) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:212) at org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$onOutputCompletion$1$1.apply$mcV$spala:338) at org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$onOutputCompletion$1$1.apply(Stream) at org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$onOutputCompletion$1$1.apply(Stream) at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:583) at org.apache.spark.sql.execution.streaming.StateStoreWriter$class.timeTakenMs(statefulOperators.scala:108) at org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec.timeTakenMs(StreamingSymmetricHashJoinExec.scala:126) at org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec.org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$onOutputCompletion$1(StreamingSymmetricHashJ at org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$processPartitions$1.apply$mcV$sp(St:361) at org.apache.spark.util.CompletionIterator$$anon$1.completion(CompletionIterator.scala:44) at org.apache.spark.util.CompletionIterat
[jira] [Created] (SPARK-31754) Spark Structured Streaming: NullPointerException in Stream Stream join
Puviarasu created SPARK-31754: - Summary: Spark Structured Streaming: NullPointerException in Stream Stream join Key: SPARK-31754 URL: https://issues.apache.org/jira/browse/SPARK-31754 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.4.0 Environment: Spark Version : 2.4.0 Hadoop Version : 3.0.0 Reporter: Puviarasu When joining 2 streams with watermarking and windowing, we are getting a NullPointerException after running for a few minutes. After the failure we analyzed the checkpoint offsets/sources and found the files for which the application failed. These files do not have any null values in the join columns. We even restarted the job with those files and the application ran. From this we concluded that the exception is not caused by the data from the streams. *Code:* {code:java} val optionsMap1 = Map[String, String]("Path" -> "/path/to/source1", "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" ->"false", "checkpointLocation" -> "/path/to/checkpoint1", "rowsPerSecond" -> "1" ) val optionsMap2 = Map[String, String]("Path" -> "/path/to/source2", "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" ->"false", "checkpointLocation" -> "/path/to/checkpoint2", "rowsPerSecond" -> "1" ) spark.readStream.format("parquet").options(optionsMap1).load().createTempView("source1") spark.readStream.format("parquet").options(optionsMap2).load().createTempView("source2") spark.sql("select * from source1 where eventTime1 is not null and col1 is not null").withWatermark("eventTime1", "30 minutes").createTempView("viewNotNull1") spark.sql("select * from source2 where eventTime2 is not null and col2 is not null").withWatermark("eventTime2", "30 minutes").createTempView("viewNotNull2") spark.sql("select * from viewNotNull1 a join viewNotNull2 b on a.col1 = b.col2 and a.eventTime1 >= b.eventTime2 and a.eventTime1 <= b.eventTime2 + interval 2 hours").createTempView("join") val optionsMap3 = Map[String, String]("compression" -> "snappy","path" -> "/path/to/sink", "checkpointLocation" -> "/path/to/checkpoint3") spark.sql("select * from join").writeStream.outputMode("append").trigger(Trigger.ProcessingTime("5 seconds")).format("parquet").options(optionsMap3).start() {code} *Exception:* {code:java} Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Aborting TaskSet 4.0 because task 0 (partition 0) cannot run anywhere due to node and executor blacklist. 
Most recent failure: Lost task 0.2 in stage 4.0 (TID 6, executor 3): java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source) at org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$OneSideHashJoiner$$anonfun$26.apply(StreamingSymmetricHashJoinExec.scala:412) at org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$OneSideHashJoiner$$anonfun$26.apply(StreamingSymmetricHashJoinExec.scala:412) at org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.findNextValueForIndex(SymmetricHashJoinStateManager.scala:197) at org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.getNext(SymmetricHashJoinStateManager.scala:221) at org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.getNext(SymmetricHashJoinStateManager.scala:157) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:212) at org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$onOutputCompletion$1$1.apply$mcV$spala:338) at org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$onOutputCompletion$1$1.apply(Stream) at org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$onOutputCompletion$1$1.apply(Stream) at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:583) at org.apache.spark.sql.execution.streaming.StateStoreWriter$class.timeTakenMs(statefulOperators.scala:108) at org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec.timeTakenMs(StreamingSymmetricHashJoinExec.scala:126) at org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec.org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$onOutputCompletion$1(StreamingSymmetricHashJ at org.apache.spark.sql.execution.streaming.StreamingSymmetricHa
[jira] [Commented] (SPARK-31705) Rewrite join condition to conjunctive normal form
[ https://issues.apache.org/jira/browse/SPARK-31705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110836#comment-17110836 ] Apache Spark commented on SPARK-31705: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/28575 > Rewrite join condition to conjunctive normal form > - > > Key: SPARK-31705 > URL: https://issues.apache.org/jira/browse/SPARK-31705 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > Rewrite join condition to [conjunctive normal > form|https://en.wikipedia.org/wiki/Conjunctive_normal_form] to push more > conditions to filter. > PostgreSQL: > {code:sql} > CREATE TABLE lineitem (l_orderkey BIGINT, l_partkey BIGINT, l_suppkey BIGINT, > > l_linenumber INT, l_quantity DECIMAL(10,0), l_extendedprice DECIMAL(10,0), > > l_discount DECIMAL(10,0), l_tax DECIMAL(10,0), l_returnflag varchar(255), > > l_linestatus varchar(255), l_shipdate DATE, l_commitdate DATE, l_receiptdate > DATE, > l_shipinstruct varchar(255), l_shipmode varchar(255), l_comment varchar(255)); > > CREATE TABLE orders ( > o_orderkey BIGINT, o_custkey BIGINT, o_orderstatus varchar(255), > o_totalprice DECIMAL(10,0), o_orderdate DATE, o_orderpriority varchar(255), > o_clerk varchar(255), o_shippriority INT, o_comment varchar(255)); > EXPLAIN > SELECT Count(*) > FROM lineitem, >orders > WHERE l_orderkey = o_orderkey >AND ( ( l_suppkey > 3 >AND o_custkey > 13 ) > OR ( l_suppkey > 1 >AND o_custkey > 11 ) ) >AND l_partkey > 19; > EXPLAIN > SELECT Count(*) > FROM lineitem >JOIN orders > ON l_orderkey = o_orderkey > AND ( ( l_suppkey > 3 > AND o_custkey > 13 ) >OR ( l_suppkey > 1 > AND o_custkey > 11 ) ) > AND l_partkey > 19; > EXPLAIN > SELECT Count(*) > FROM lineitem, >orders > WHERE l_orderkey = o_orderkey >AND NOT ( ( l_suppkey > 3 >AND ( l_suppkey > 2 > OR o_custkey > 13 ) ) > OR ( l_suppkey > 1 >AND o_custkey > 11 ) ) >AND l_partkey > 19; > {code} > {noformat} > postgres=# EXPLAIN > postgres-# SELECT Count(*) > postgres-# FROM lineitem, > postgres-#orders > postgres-# WHERE l_orderkey = o_orderkey > postgres-#AND ( ( l_suppkey > 3 > postgres(#AND o_custkey > 13 ) > postgres(# OR ( l_suppkey > 1 > postgres(#AND o_custkey > 11 ) ) > postgres-#AND l_partkey > 19; >QUERY PLAN > - > Aggregate (cost=21.18..21.19 rows=1 width=8) >-> Hash Join (cost=10.60..21.17 rows=2 width=0) > Hash Cond: (orders.o_orderkey = lineitem.l_orderkey) > Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) > OR ((lineitem.l_suppkey > 1) AND (orders.o_custkey > 11))) > -> Seq Scan on orders (cost=0.00..10.45 rows=17 width=16) >Filter: ((o_custkey > 13) OR (o_custkey > 11)) > -> Hash (cost=10.53..10.53 rows=6 width=16) >-> Seq Scan on lineitem (cost=0.00..10.53 rows=6 width=16) > Filter: ((l_partkey > 19) AND ((l_suppkey > 3) OR > (l_suppkey > 1))) > (9 rows) > postgres=# EXPLAIN > postgres-# SELECT Count(*) > postgres-# FROM lineitem > postgres-#JOIN orders > postgres-# ON l_orderkey = o_orderkey > postgres-# AND ( ( l_suppkey > 3 > postgres(# AND o_custkey > 13 ) > postgres(#OR ( l_suppkey > 1 > postgres(# AND o_custkey > 11 ) ) > postgres-# AND l_partkey > 19; >QUERY PLAN > - > Aggregate (cost=21.18..21.19 rows=1 width=8) >-> Hash Join (cost=10.60..21.17 rows=2 width=0) > Hash Cond: (orders.o_orderkey = lineitem.l_orderkey) > Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) > OR ((lineitem.l_suppkey > 1) AND 
(orders.o_custkey > 11))) > -> Seq Scan on orders (cost=0.00..10.45 rows=17 width=16) >Filter: ((o_custk
[jira] [Assigned] (SPARK-31705) Rewrite join condition to conjunctive normal form
[ https://issues.apache.org/jira/browse/SPARK-31705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31705: Assignee: Apache Spark (was: Yuming Wang) > Rewrite join condition to conjunctive normal form > - > > Key: SPARK-31705 > URL: https://issues.apache.org/jira/browse/SPARK-31705 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > Rewrite join condition to [conjunctive normal > form|https://en.wikipedia.org/wiki/Conjunctive_normal_form] to push more > conditions to filter. > PostgreSQL: > {code:sql} > CREATE TABLE lineitem (l_orderkey BIGINT, l_partkey BIGINT, l_suppkey BIGINT, > > l_linenumber INT, l_quantity DECIMAL(10,0), l_extendedprice DECIMAL(10,0), > > l_discount DECIMAL(10,0), l_tax DECIMAL(10,0), l_returnflag varchar(255), > > l_linestatus varchar(255), l_shipdate DATE, l_commitdate DATE, l_receiptdate > DATE, > l_shipinstruct varchar(255), l_shipmode varchar(255), l_comment varchar(255)); > > CREATE TABLE orders ( > o_orderkey BIGINT, o_custkey BIGINT, o_orderstatus varchar(255), > o_totalprice DECIMAL(10,0), o_orderdate DATE, o_orderpriority varchar(255), > o_clerk varchar(255), o_shippriority INT, o_comment varchar(255)); > EXPLAIN > SELECT Count(*) > FROM lineitem, >orders > WHERE l_orderkey = o_orderkey >AND ( ( l_suppkey > 3 >AND o_custkey > 13 ) > OR ( l_suppkey > 1 >AND o_custkey > 11 ) ) >AND l_partkey > 19; > EXPLAIN > SELECT Count(*) > FROM lineitem >JOIN orders > ON l_orderkey = o_orderkey > AND ( ( l_suppkey > 3 > AND o_custkey > 13 ) >OR ( l_suppkey > 1 > AND o_custkey > 11 ) ) > AND l_partkey > 19; > EXPLAIN > SELECT Count(*) > FROM lineitem, >orders > WHERE l_orderkey = o_orderkey >AND NOT ( ( l_suppkey > 3 >AND ( l_suppkey > 2 > OR o_custkey > 13 ) ) > OR ( l_suppkey > 1 >AND o_custkey > 11 ) ) >AND l_partkey > 19; > {code} > {noformat} > postgres=# EXPLAIN > postgres-# SELECT Count(*) > postgres-# FROM lineitem, > postgres-#orders > postgres-# WHERE l_orderkey = o_orderkey > postgres-#AND ( ( l_suppkey > 3 > postgres(#AND o_custkey > 13 ) > postgres(# OR ( l_suppkey > 1 > postgres(#AND o_custkey > 11 ) ) > postgres-#AND l_partkey > 19; >QUERY PLAN > - > Aggregate (cost=21.18..21.19 rows=1 width=8) >-> Hash Join (cost=10.60..21.17 rows=2 width=0) > Hash Cond: (orders.o_orderkey = lineitem.l_orderkey) > Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) > OR ((lineitem.l_suppkey > 1) AND (orders.o_custkey > 11))) > -> Seq Scan on orders (cost=0.00..10.45 rows=17 width=16) >Filter: ((o_custkey > 13) OR (o_custkey > 11)) > -> Hash (cost=10.53..10.53 rows=6 width=16) >-> Seq Scan on lineitem (cost=0.00..10.53 rows=6 width=16) > Filter: ((l_partkey > 19) AND ((l_suppkey > 3) OR > (l_suppkey > 1))) > (9 rows) > postgres=# EXPLAIN > postgres-# SELECT Count(*) > postgres-# FROM lineitem > postgres-#JOIN orders > postgres-# ON l_orderkey = o_orderkey > postgres-# AND ( ( l_suppkey > 3 > postgres(# AND o_custkey > 13 ) > postgres(#OR ( l_suppkey > 1 > postgres(# AND o_custkey > 11 ) ) > postgres-# AND l_partkey > 19; >QUERY PLAN > - > Aggregate (cost=21.18..21.19 rows=1 width=8) >-> Hash Join (cost=10.60..21.17 rows=2 width=0) > Hash Cond: (orders.o_orderkey = lineitem.l_orderkey) > Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) > OR ((lineitem.l_suppkey > 1) AND (orders.o_custkey > 11))) > -> Seq Scan on orders (cost=0.00..10.45 rows=17 width=16) >Filter: ((o_custkey > 13) OR 
(o_custkey > 11)) > -> Hash (cost=10.53..10.53 rows=6 width=16) >
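For readers unfamiliar with the rewrite the ticket proposes, the sketch below is a minimal, self-contained illustration of conjunctive-normal-form conversion. It uses a toy predicate ADT, not Spark's Catalyst expressions, and only shows why distributing OR over AND yields single-table conjuncts such as (l_suppkey > 3 OR l_suppkey > 1) that an optimizer can push below the join, matching the Seq Scan filters in the PostgreSQL plans quoted above.

{code:scala}
// Toy predicate tree (illustration only, not Spark's implementation).
sealed trait Pred
case class Leaf(s: String) extends Pred
case class And(l: Pred, r: Pred) extends Pred
case class Or(l: Pred, r: Pred) extends Pred

// Convert to CNF by distributing OR over AND.
def toCnf(p: Pred): Pred = p match {
  case And(l, r) => And(toCnf(l), toCnf(r))
  case Or(l, r) => (toCnf(l), toCnf(r)) match {
    // (a AND b) OR c  =>  (a OR c) AND (b OR c)
    case (And(a, b), c) => And(toCnf(Or(a, c)), toCnf(Or(b, c)))
    case (a, And(b, c)) => And(toCnf(Or(a, b)), toCnf(Or(a, c)))
    case (a, b)         => Or(a, b)
  }
  case leaf => leaf
}

// (l_suppkey > 3 AND o_custkey > 13) OR (l_suppkey > 1 AND o_custkey > 11)
val cond = Or(
  And(Leaf("l_suppkey > 3"), Leaf("o_custkey > 13")),
  And(Leaf("l_suppkey > 1"), Leaf("o_custkey > 11")))

// The CNF result contains the conjuncts (l_suppkey > 3 OR l_suppkey > 1) and
// (o_custkey > 13 OR o_custkey > 11); each references a single table, which is
// exactly what PostgreSQL pushes into the per-table scan filters above.
println(toCnf(cond))
{code}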
[jira] [Commented] (SPARK-31705) Rewrite join condition to conjunctive normal form
[ https://issues.apache.org/jira/browse/SPARK-31705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110835#comment-17110835 ] Apache Spark commented on SPARK-31705: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/28575 > Rewrite join condition to conjunctive normal form > - > > Key: SPARK-31705 > URL: https://issues.apache.org/jira/browse/SPARK-31705 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > Rewrite join condition to [conjunctive normal > form|https://en.wikipedia.org/wiki/Conjunctive_normal_form] to push more > conditions to filter. > PostgreSQL: > {code:sql} > CREATE TABLE lineitem (l_orderkey BIGINT, l_partkey BIGINT, l_suppkey BIGINT, > > l_linenumber INT, l_quantity DECIMAL(10,0), l_extendedprice DECIMAL(10,0), > > l_discount DECIMAL(10,0), l_tax DECIMAL(10,0), l_returnflag varchar(255), > > l_linestatus varchar(255), l_shipdate DATE, l_commitdate DATE, l_receiptdate > DATE, > l_shipinstruct varchar(255), l_shipmode varchar(255), l_comment varchar(255)); > > CREATE TABLE orders ( > o_orderkey BIGINT, o_custkey BIGINT, o_orderstatus varchar(255), > o_totalprice DECIMAL(10,0), o_orderdate DATE, o_orderpriority varchar(255), > o_clerk varchar(255), o_shippriority INT, o_comment varchar(255)); > EXPLAIN > SELECT Count(*) > FROM lineitem, >orders > WHERE l_orderkey = o_orderkey >AND ( ( l_suppkey > 3 >AND o_custkey > 13 ) > OR ( l_suppkey > 1 >AND o_custkey > 11 ) ) >AND l_partkey > 19; > EXPLAIN > SELECT Count(*) > FROM lineitem >JOIN orders > ON l_orderkey = o_orderkey > AND ( ( l_suppkey > 3 > AND o_custkey > 13 ) >OR ( l_suppkey > 1 > AND o_custkey > 11 ) ) > AND l_partkey > 19; > EXPLAIN > SELECT Count(*) > FROM lineitem, >orders > WHERE l_orderkey = o_orderkey >AND NOT ( ( l_suppkey > 3 >AND ( l_suppkey > 2 > OR o_custkey > 13 ) ) > OR ( l_suppkey > 1 >AND o_custkey > 11 ) ) >AND l_partkey > 19; > {code} > {noformat} > postgres=# EXPLAIN > postgres-# SELECT Count(*) > postgres-# FROM lineitem, > postgres-#orders > postgres-# WHERE l_orderkey = o_orderkey > postgres-#AND ( ( l_suppkey > 3 > postgres(#AND o_custkey > 13 ) > postgres(# OR ( l_suppkey > 1 > postgres(#AND o_custkey > 11 ) ) > postgres-#AND l_partkey > 19; >QUERY PLAN > - > Aggregate (cost=21.18..21.19 rows=1 width=8) >-> Hash Join (cost=10.60..21.17 rows=2 width=0) > Hash Cond: (orders.o_orderkey = lineitem.l_orderkey) > Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) > OR ((lineitem.l_suppkey > 1) AND (orders.o_custkey > 11))) > -> Seq Scan on orders (cost=0.00..10.45 rows=17 width=16) >Filter: ((o_custkey > 13) OR (o_custkey > 11)) > -> Hash (cost=10.53..10.53 rows=6 width=16) >-> Seq Scan on lineitem (cost=0.00..10.53 rows=6 width=16) > Filter: ((l_partkey > 19) AND ((l_suppkey > 3) OR > (l_suppkey > 1))) > (9 rows) > postgres=# EXPLAIN > postgres-# SELECT Count(*) > postgres-# FROM lineitem > postgres-#JOIN orders > postgres-# ON l_orderkey = o_orderkey > postgres-# AND ( ( l_suppkey > 3 > postgres(# AND o_custkey > 13 ) > postgres(#OR ( l_suppkey > 1 > postgres(# AND o_custkey > 11 ) ) > postgres-# AND l_partkey > 19; >QUERY PLAN > - > Aggregate (cost=21.18..21.19 rows=1 width=8) >-> Hash Join (cost=10.60..21.17 rows=2 width=0) > Hash Cond: (orders.o_orderkey = lineitem.l_orderkey) > Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) > OR ((lineitem.l_suppkey > 1) AND 
(orders.o_custkey > 11))) > -> Seq Scan on orders (cost=0.00..10.45 rows=17 width=16) >Filter: ((o_custk
[jira] [Assigned] (SPARK-31705) Rewrite join condition to conjunctive normal form
[ https://issues.apache.org/jira/browse/SPARK-31705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31705: Assignee: Yuming Wang (was: Apache Spark) > Rewrite join condition to conjunctive normal form > - > > Key: SPARK-31705 > URL: https://issues.apache.org/jira/browse/SPARK-31705 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > Rewrite join condition to [conjunctive normal > form|https://en.wikipedia.org/wiki/Conjunctive_normal_form] to push more > conditions to filter. > PostgreSQL: > {code:sql} > CREATE TABLE lineitem (l_orderkey BIGINT, l_partkey BIGINT, l_suppkey BIGINT, > > l_linenumber INT, l_quantity DECIMAL(10,0), l_extendedprice DECIMAL(10,0), > > l_discount DECIMAL(10,0), l_tax DECIMAL(10,0), l_returnflag varchar(255), > > l_linestatus varchar(255), l_shipdate DATE, l_commitdate DATE, l_receiptdate > DATE, > l_shipinstruct varchar(255), l_shipmode varchar(255), l_comment varchar(255)); > > CREATE TABLE orders ( > o_orderkey BIGINT, o_custkey BIGINT, o_orderstatus varchar(255), > o_totalprice DECIMAL(10,0), o_orderdate DATE, o_orderpriority varchar(255), > o_clerk varchar(255), o_shippriority INT, o_comment varchar(255)); > EXPLAIN > SELECT Count(*) > FROM lineitem, >orders > WHERE l_orderkey = o_orderkey >AND ( ( l_suppkey > 3 >AND o_custkey > 13 ) > OR ( l_suppkey > 1 >AND o_custkey > 11 ) ) >AND l_partkey > 19; > EXPLAIN > SELECT Count(*) > FROM lineitem >JOIN orders > ON l_orderkey = o_orderkey > AND ( ( l_suppkey > 3 > AND o_custkey > 13 ) >OR ( l_suppkey > 1 > AND o_custkey > 11 ) ) > AND l_partkey > 19; > EXPLAIN > SELECT Count(*) > FROM lineitem, >orders > WHERE l_orderkey = o_orderkey >AND NOT ( ( l_suppkey > 3 >AND ( l_suppkey > 2 > OR o_custkey > 13 ) ) > OR ( l_suppkey > 1 >AND o_custkey > 11 ) ) >AND l_partkey > 19; > {code} > {noformat} > postgres=# EXPLAIN > postgres-# SELECT Count(*) > postgres-# FROM lineitem, > postgres-#orders > postgres-# WHERE l_orderkey = o_orderkey > postgres-#AND ( ( l_suppkey > 3 > postgres(#AND o_custkey > 13 ) > postgres(# OR ( l_suppkey > 1 > postgres(#AND o_custkey > 11 ) ) > postgres-#AND l_partkey > 19; >QUERY PLAN > - > Aggregate (cost=21.18..21.19 rows=1 width=8) >-> Hash Join (cost=10.60..21.17 rows=2 width=0) > Hash Cond: (orders.o_orderkey = lineitem.l_orderkey) > Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) > OR ((lineitem.l_suppkey > 1) AND (orders.o_custkey > 11))) > -> Seq Scan on orders (cost=0.00..10.45 rows=17 width=16) >Filter: ((o_custkey > 13) OR (o_custkey > 11)) > -> Hash (cost=10.53..10.53 rows=6 width=16) >-> Seq Scan on lineitem (cost=0.00..10.53 rows=6 width=16) > Filter: ((l_partkey > 19) AND ((l_suppkey > 3) OR > (l_suppkey > 1))) > (9 rows) > postgres=# EXPLAIN > postgres-# SELECT Count(*) > postgres-# FROM lineitem > postgres-#JOIN orders > postgres-# ON l_orderkey = o_orderkey > postgres-# AND ( ( l_suppkey > 3 > postgres(# AND o_custkey > 13 ) > postgres(#OR ( l_suppkey > 1 > postgres(# AND o_custkey > 11 ) ) > postgres-# AND l_partkey > 19; >QUERY PLAN > - > Aggregate (cost=21.18..21.19 rows=1 width=8) >-> Hash Join (cost=10.60..21.17 rows=2 width=0) > Hash Cond: (orders.o_orderkey = lineitem.l_orderkey) > Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) > OR ((lineitem.l_suppkey > 1) AND (orders.o_custkey > 11))) > -> Seq Scan on orders (cost=0.00..10.45 rows=17 width=16) >Filter: ((o_custkey > 13) OR 
(o_custkey > 11)) > -> Hash (cost=10.53..10.53 rows=6 width=16) >-
[jira] [Commented] (SPARK-31752) Add sql doc for interval type
[ https://issues.apache.org/jira/browse/SPARK-31752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110813#comment-17110813 ] Apache Spark commented on SPARK-31752: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/28574 > Add sql doc for interval type > - > > Key: SPARK-31752 > URL: https://issues.apache.org/jira/browse/SPARK-31752 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31752) Add sql doc for interval type
[ https://issues.apache.org/jira/browse/SPARK-31752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110812#comment-17110812 ] Apache Spark commented on SPARK-31752: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/28574 > Add sql doc for interval type > - > > Key: SPARK-31752 > URL: https://issues.apache.org/jira/browse/SPARK-31752 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31752) Add sql doc for interval type
[ https://issues.apache.org/jira/browse/SPARK-31752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31752: Assignee: Apache Spark > Add sql doc for interval type > - > > Key: SPARK-31752 > URL: https://issues.apache.org/jira/browse/SPARK-31752 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31752) Add sql doc for interval type
[ https://issues.apache.org/jira/browse/SPARK-31752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31752: Assignee: (was: Apache Spark) > Add sql doc for interval type > - > > Key: SPARK-31752 > URL: https://issues.apache.org/jira/browse/SPARK-31752 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29458) Document scalar functions usage in APIs in SQL getting started.
[ https://issues.apache.org/jira/browse/SPARK-29458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110808#comment-17110808 ] Apache Spark commented on SPARK-29458: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/28290 > Document scalar functions usage in APIs in SQL getting started. > --- > > Key: SPARK-29458 > URL: https://issues.apache.org/jira/browse/SPARK-29458 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Dilip Biswal >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31753) Add missing keywords in the SQL documents
Takeshi Yamamuro created SPARK-31753: Summary: Add missing keywords in the SQL documents Key: SPARK-31753 URL: https://issues.apache.org/jira/browse/SPARK-31753 Project: Spark Issue Type: Improvement Components: Documentation, SQL Affects Versions: 3.1.0 Reporter: Takeshi Yamamuro Some keywords are missing in the SQL documents and a list of them is as follows. [https://github.com/apache/spark/pull/28290#issuecomment-619321301] {code:java} AFTER CASE/ELSE WHEN/THEN IGNORE NULLS LATERAL VIEW (OUTER)? MAP KEYS TERMINATED BY NULL DEFINED AS LINES TERMINATED BY ESCAPED BY COLLECTION ITEMS TERMINATED BY EXPLAIN LOGICAL PIVOT {code} They should be documented there, too. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31752) Add sql doc for interval type
Kent Yao created SPARK-31752: Summary: Add sql doc for interval type Key: SPARK-31752 URL: https://issues.apache.org/jira/browse/SPARK-31752 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0, 3.1.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31102) spark-sql fails to parse when contains comment
[ https://issues.apache.org/jira/browse/SPARK-31102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-31102. -- Fix Version/s: 3.0.0 Assignee: Javier Fuentes Resolution: Fixed Resolved by [https://github.com/apache/spark/pull/27920] > spark-sql fails to parse when contains comment > -- > > Key: SPARK-31102 > URL: https://issues.apache.org/jira/browse/SPARK-31102 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Javier Fuentes >Priority: Major > Fix For: 3.0.0 > > > {code:sql} > select > 1, > -- two > 2; > {code} > {noformat} > spark-sql> select > > 1, > > -- two > > 2; > Error in query: > mismatched input '' expecting {'(', 'ADD', 'AFTER', 'ALL', 'ALTER', > 'ANALYZE', 'AND', 'ANTI', 'ANY', 'ARCHIVE', 'ARRAY', 'AS', 'ASC', 'AT', > 'AUTHORIZATION', 'BETWEEN', 'BOTH', 'BUCKET', 'BUCKETS', 'BY', 'CACHE', > 'CASCADE', 'CASE', 'CAST', 'CHANGE', 'CHECK', 'CLEAR', 'CLUSTER', > 'CLUSTERED', 'CODEGEN', 'COLLATE', 'COLLECTION', 'COLUMN', 'COLUMNS', > 'COMMENT', 'COMMIT', 'COMPACT', 'COMPACTIONS', 'COMPUTE', 'CONCATENATE', > 'CONSTRAINT', 'COST', 'CREATE', 'CROSS', 'CUBE', 'CURRENT', 'CURRENT_DATE', > 'CURRENT_TIME', 'CURRENT_TIMESTAMP', 'CURRENT_USER', 'DATA', 'DATABASE', > DATABASES, 'DAY', 'DBPROPERTIES', 'DEFINED', 'DELETE', 'DELIMITED', 'DESC', > 'DESCRIBE', 'DFS', 'DIRECTORIES', 'DIRECTORY', 'DISTINCT', 'DISTRIBUTE', > 'DROP', 'ELSE', 'END', 'ESCAPE', 'ESCAPED', 'EXCEPT', 'EXCHANGE', 'EXISTS', > 'EXPLAIN', 'EXPORT', 'EXTENDED', 'EXTERNAL', 'EXTRACT', 'FALSE', 'FETCH', > 'FIELDS', 'FILTER', 'FILEFORMAT', 'FIRST', 'FOLLOWING', 'FOR', 'FOREIGN', > 'FORMAT', 'FORMATTED', 'FROM', 'FULL', 'FUNCTION', 'FUNCTIONS', 'GLOBAL', > 'GRANT', 'GROUP', 'GROUPING', 'HAVING', 'HOUR', 'IF', 'IGNORE', 'IMPORT', > 'IN', 'INDEX', 'INDEXES', 'INNER', 'INPATH', 'INPUTFORMAT', 'INSERT', > 'INTERSECT', 'INTERVAL', 'INTO', 'IS', 'ITEMS', 'JOIN', 'KEYS', 'LAST', > 'LATERAL', 'LAZY', 'LEADING', 'LEFT', 'LIKE', 'LIMIT', 'LINES', 'LIST', > 'LOAD', 'LOCAL', 'LOCATION', 'LOCK', 'LOCKS', 'LOGICAL', 'MACRO', 'MAP', > 'MATCHED', 'MERGE', 'MINUTE', 'MONTH', 'MSCK', 'NAMESPACE', 'NAMESPACES', > 'NATURAL', 'NO', NOT, 'NULL', 'NULLS', 'OF', 'ON', 'ONLY', 'OPTION', > 'OPTIONS', 'OR', 'ORDER', 'OUT', 'OUTER', 'OUTPUTFORMAT', 'OVER', 'OVERLAPS', > 'OVERLAY', 'OVERWRITE', 'PARTITION', 'PARTITIONED', 'PARTITIONS', 'PERCENT', > 'PIVOT', 'PLACING', 'POSITION', 'PRECEDING', 'PRIMARY', 'PRINCIPALS', > 'PROPERTIES', 'PURGE', 'QUERY', 'RANGE', 'RECORDREADER', 'RECORDWRITER', > 'RECOVER', 'REDUCE', 'REFERENCES', 'REFRESH', 'RENAME', 'REPAIR', 'REPLACE', > 'RESET', 'RESTRICT', 'REVOKE', 'RIGHT', RLIKE, 'ROLE', 'ROLES', 'ROLLBACK', > 'ROLLUP', 'ROW', 'ROWS', 'SCHEMA', 'SECOND', 'SELECT', 'SEMI', 'SEPARATED', > 'SERDE', 'SERDEPROPERTIES', 'SESSION_USER', 'SET', 'MINUS', 'SETS', 'SHOW', > 'SKEWED', 'SOME', 'SORT', 'SORTED', 'START', 'STATISTICS', 'STORED', > 'STRATIFY', 'STRUCT', 'SUBSTR', 'SUBSTRING', 'TABLE', 'TABLES', > 'TABLESAMPLE', 'TBLPROPERTIES', TEMPORARY, 'TERMINATED', 'THEN', 'TO', > 'TOUCH', 'TRAILING', 'TRANSACTION', 'TRANSACTIONS', 'TRANSFORM', 'TRIM', > 'TRUE', 'TRUNCATE', 'TYPE', 'UNARCHIVE', 'UNBOUNDED', 'UNCACHE', 'UNION', > 'UNIQUE', 'UNKNOWN', 'UNLOCK', 'UNSET', 'UPDATE', 'USE', 'USER', 'USING', > 'VALUES', 'VIEW', 'WHEN', 'WHERE', 'WINDOW', 'WITH', 'YEAR', '+', '-', '*', > 'DIV', '~', STRING, BIGINT_LITERAL, SMALLINT_LITERAL, TINYINT_LITERAL, > INTEGER_VALUE, EXPONENT_VALUE, DECIMAL_VALUE, DOUBLE_LITERAL, > 
BIGDECIMAL_LITERAL, IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 3, pos 2) > == SQL == > select > 1, > --^^^ > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31739) Docstring syntax issues prevent proper compilation of documentation
[ https://issues.apache.org/jira/browse/SPARK-31739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31739: - Fix Version/s: 3.0.0 > Docstring syntax issues prevent proper compilation of documentation > --- > > Key: SPARK-31739 > URL: https://issues.apache.org/jira/browse/SPARK-31739 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.4.5 >Reporter: David Toneian >Assignee: David Toneian >Priority: Trivial > Fix For: 3.0.0, 3.1.0 > > > Some docstrings contain mistakes, like missing or spurious spaces, which > prevent the documentation from being rendered as intended. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31555) Improve cache block migration
[ https://issues.apache.org/jira/browse/SPARK-31555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110732#comment-17110732 ] Dale Richardson commented on SPARK-31555: - Hi [~holden], happy to have a go at this. > Improve cache block migration > - > > Key: SPARK-31555 > URL: https://issues.apache.org/jira/browse/SPARK-31555 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Holden Karau >Priority: Major > > We should explore the following improvements to cache block migration: > 1) Peer selection (right now may overbalance on certain peers) > 2) Do we need to configure the number of blocks to be migrated at the same > time > 3) Are there any blocks we don't need to replicate (e.g. they are already > stored on the desired number of executors even once we remove the executors > slated for decommissioning). > 4) Do we want to prioritize migrating blocks with no replicas > 5) Log the attempt number for debugging > 6) Clarify the logic for determining the number of replicas > 7) Consider using TestUtils.waitUntilExecutorsUp in tests rather than count > to wait for the executors to come up. imho this is the least important. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31579) Replace floorDiv by / in localRebaseGregorianToJulianDays()
[ https://issues.apache.org/jira/browse/SPARK-31579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110729#comment-17110729 ] Sudharshann D. commented on SPARK-31579: Please see my proof of concept [https://github.com/Sudhar287/spark/pull/1/files|http://example.com] > Replace floorDiv by / in localRebaseGregorianToJulianDays() > --- > > Key: SPARK-31579 > URL: https://issues.apache.org/jira/browse/SPARK-31579 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Minor > Labels: starter > > Most likely utcCal.getTimeInMillis % MILLIS_PER_DAY == 0 but need to check > that for all available time zones in the range of [0001, 2100] years with the > step of 1 hour or maybe smaller. If this hypothesis is confirmed, floorDiv > can be replaced by /, and this should improve performance of > RebaseDateTime.localRebaseGregorianToJulianDays. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
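As a reminder of why the hypothesis matters, the sketch below is plain Scala rather than the RebaseDateTime code path: Math.floorDiv and / agree whenever the dividend is non-negative or an exact multiple of MILLIS_PER_DAY, and differ only for negative, non-multiple values. That is the case the proposed time-zone sweep over [0001, 2100] needs to rule out before the replacement is safe.

{code:scala}
// Standalone illustration of the floorDiv vs / difference (not Spark code).
val MILLIS_PER_DAY = 24L * 60 * 60 * 1000

def daysFloorDiv(millis: Long): Long = Math.floorDiv(millis, MILLIS_PER_DAY)
def daysPlainDiv(millis: Long): Long = millis / MILLIS_PER_DAY

// Exact multiple of a day before the epoch: both operators agree.
assert(daysFloorDiv(-3 * MILLIS_PER_DAY) == daysPlainDiv(-3 * MILLIS_PER_DAY))

// Negative and not a multiple: floorDiv rounds toward negative infinity,
// while / truncates toward zero, so the results differ by one day.
println(daysFloorDiv(-1L)) // -1
println(daysPlainDiv(-1L)) // 0
{code}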
[jira] [Updated] (SPARK-31692) Hadoop confs passed via spark config are not set in URLStream Handler Factory
[ https://issues.apache.org/jira/browse/SPARK-31692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31692: -- Affects Version/s: 2.3.4 2.4.5 > Hadoop confs passed via spark config are not set in URLStream Handler Factory > - > > Key: SPARK-31692 > URL: https://issues.apache.org/jira/browse/SPARK-31692 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.5, 3.0.0 >Reporter: Karuppayya >Assignee: Karuppayya >Priority: Major > Fix For: 3.0.0 > > > Hadoop conf passed via spark config(as "spark.hadoop.*") are not set in > URLStreamHandlerFactory -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25694) URL.setURLStreamHandlerFactory causing incompatible HttpURLConnection issue
[ https://issues.apache.org/jira/browse/SPARK-25694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25694: -- Fix Version/s: 2.4.7 > URL.setURLStreamHandlerFactory causing incompatible HttpURLConnection issue > --- > > Key: SPARK-25694 > URL: https://issues.apache.org/jira/browse/SPARK-25694 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.4, 3.0.0 >Reporter: Bo Yang >Assignee: Zhou Jiang >Priority: Minor > Fix For: 3.0.0, 2.4.7 > > > URL.setURLStreamHandlerFactory() in SharedState causes URL.openConnection() > returns FsUrlConnection object, which is not compatible with > HttpURLConnection. This will cause exception when using some third party http > library (e.g. scalaj.http). > The following code in Spark 2.3.0 introduced the issue: > sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala: > {code} > object SharedState extends Logging { ... > URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory()) ... > } > {code} > Here is the example exception when using scalaj.http in Spark: > {code} > StackTrace: scala.MatchError: > org.apache.hadoop.fs.FsUrlConnection:[http://.example.com|http://.example.com/] > (of class org.apache.hadoop.fs.FsUrlConnection) > at > scalaj.http.HttpRequest.scalaj$http$HttpRequest$$doConnection(Http.scala:343) > at scalaj.http.HttpRequest.exec(Http.scala:335) > at scalaj.http.HttpRequest.asString(Http.scala:455) > {code} > > One option to fix the issue is to return null in > URLStreamHandlerFactory.createURLStreamHandler when the protocol is > http/https, so it will use the default behavior and be compatible with > scalaj.http. Following is the code example: > {code} > class SparkUrlStreamHandlerFactory extends URLStreamHandlerFactory with > Logging { > private val fsUrlStreamHandlerFactory = new FsUrlStreamHandlerFactory() > override def createURLStreamHandler(protocol: String): URLStreamHandler = { > val handler = fsUrlStreamHandlerFactory.createURLStreamHandler(protocol) > if (handler == null) { > return null > } > if (protocol != null && > (protocol.equalsIgnoreCase("http") > || protocol.equalsIgnoreCase("https"))) { > // return null to use system default URLStreamHandler > null > } else { > handler > } > } > } > {code} > I would like to get some discussion here before submitting a pull request. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31692) Hadoop confs passed via spark config are not set in URLStream Handler Factory
[ https://issues.apache.org/jira/browse/SPARK-31692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31692: -- Fix Version/s: 2.4.7 > Hadoop confs passed via spark config are not set in URLStream Handler Factory > - > > Key: SPARK-31692 > URL: https://issues.apache.org/jira/browse/SPARK-31692 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.5, 3.0.0 >Reporter: Karuppayya >Assignee: Karuppayya >Priority: Major > Fix For: 3.0.0, 2.4.7 > > > Hadoop conf passed via spark config(as "spark.hadoop.*") are not set in > URLStreamHandlerFactory -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31579) Replace floorDiv by / in localRebaseGregorianToJulianDays()
[ https://issues.apache.org/jira/browse/SPARK-31579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110726#comment-17110726 ] Apache Spark commented on SPARK-31579: -- User 'Sudhar287' has created a pull request for this issue: https://github.com/apache/spark/pull/28573 > Replace floorDiv by / in localRebaseGregorianToJulianDays() > --- > > Key: SPARK-31579 > URL: https://issues.apache.org/jira/browse/SPARK-31579 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Minor > Labels: starter > > Most likely utcCal.getTimeInMillis % MILLIS_PER_DAY == 0 but need to check > that for all available time zones in the range of [0001, 2100] years with the > step of 1 hour or maybe smaller. If this hypothesis is confirmed, floorDiv > can be replaced by /, and this should improve performance of > RebaseDateTime.localRebaseGregorianToJulianDays. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31579) Replace floorDiv by / in localRebaseGregorianToJulianDays()
[ https://issues.apache.org/jira/browse/SPARK-31579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31579: Assignee: (was: Apache Spark) > Replace floorDiv by / in localRebaseGregorianToJulianDays() > --- > > Key: SPARK-31579 > URL: https://issues.apache.org/jira/browse/SPARK-31579 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Minor > Labels: starter > > Most likely utcCal.getTimeInMillis % MILLIS_PER_DAY == 0 but need to check > that for all available time zones in the range of [0001, 2100] years with the > step of 1 hour or maybe smaller. If this hypothesis is confirmed, floorDiv > can be replaced by /, and this should improve performance of > RebaseDateTime.localRebaseGregorianToJulianDays. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31579) Replace floorDiv by / in localRebaseGregorianToJulianDays()
[ https://issues.apache.org/jira/browse/SPARK-31579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110725#comment-17110725 ] Apache Spark commented on SPARK-31579: -- User 'Sudhar287' has created a pull request for this issue: https://github.com/apache/spark/pull/28573 > Replace floorDiv by / in localRebaseGregorianToJulianDays() > --- > > Key: SPARK-31579 > URL: https://issues.apache.org/jira/browse/SPARK-31579 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Minor > Labels: starter > > Most likely utcCal.getTimeInMillis % MILLIS_PER_DAY == 0 but need to check > that for all available time zones in the range of [0001, 2100] years with the > step of 1 hour or maybe smaller. If this hypothesis is confirmed, floorDiv > can be replaced by /, and this should improve performance of > RebaseDateTime.localRebaseGregorianToJulianDays. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31579) Replace floorDiv by / in localRebaseGregorianToJulianDays()
[ https://issues.apache.org/jira/browse/SPARK-31579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31579: Assignee: Apache Spark > Replace floorDiv by / in localRebaseGregorianToJulianDays() > --- > > Key: SPARK-31579 > URL: https://issues.apache.org/jira/browse/SPARK-31579 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Minor > Labels: starter > > Most likely utcCal.getTimeInMillis % MILLIS_PER_DAY == 0 but need to check > that for all available time zones in the range of [0001, 2100] years with the > step of 1 hour or maybe smaller. If this hypothesis is confirmed, floorDiv > can be replaced by /, and this should improve performance of > RebaseDateTime.localRebaseGregorianToJulianDays. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31257) Unify create table syntax to fix ambiguous two different CREATE TABLE syntaxes
[ https://issues.apache.org/jira/browse/SPARK-31257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-31257: - Summary: Unify create table syntax to fix ambiguous two different CREATE TABLE syntaxes (was: Fix ambiguous two different CREATE TABLE syntaxes) > Unify create table syntax to fix ambiguous two different CREATE TABLE syntaxes > -- > > Key: SPARK-31257 > URL: https://issues.apache.org/jira/browse/SPARK-31257 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Jungtaek Lim >Priority: Major > > There's a discussion in dev@ mailing list to point out ambiguous syntaxes for > CREATE TABLE DDL. This issue tracks the efforts to resolve the root issue via > unifying the create table syntax. > https://lists.apache.org/thread.html/rf1acfaaa3de2d3129575199c28e7d529d38f2783e7d3c5be2ac8923d%40%3Cdev.spark.apache.org%3E > We should ensure the new "single" create table syntax is very deterministic > to both devs and end users. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31257) Fix ambiguous two different CREATE TABLE syntaxes
[ https://issues.apache.org/jira/browse/SPARK-31257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-31257: - Affects Version/s: (was: 3.0.0) 3.1.0 Description: There's a discussion in dev@ mailing list to point out ambiguous syntaxes for CREATE TABLE DDL. This issue tracks the efforts to resolve the root issue via unifying the create table syntax. https://lists.apache.org/thread.html/rf1acfaaa3de2d3129575199c28e7d529d38f2783e7d3c5be2ac8923d%40%3Cdev.spark.apache.org%3E We should ensure the new "single" create table syntax is very deterministic to both devs and end users. was: There's a discussion in dev@ mailing list to point out ambiguous syntaxes for CREATE TABLE DDL. This issue tracks the efforts to resolve the problem. https://lists.apache.org/thread.html/rf1acfaaa3de2d3129575199c28e7d529d38f2783e7d3c5be2ac8923d%40%3Cdev.spark.apache.org%3E Note that the priority of this issue is set to blocker as the ambiguity is brought by SPARK-30098 which will be shipped in Spark 3.0.0; before we ship SPARK-30098 we should fix the syntax and ensure the syntax is very deterministic to both devs and end users. Issue Type: Improvement (was: Bug) Priority: Major (was: Blocker) > Fix ambiguous two different CREATE TABLE syntaxes > - > > Key: SPARK-31257 > URL: https://issues.apache.org/jira/browse/SPARK-31257 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Jungtaek Lim >Priority: Major > > There's a discussion in dev@ mailing list to point out ambiguous syntaxes for > CREATE TABLE DDL. This issue tracks the efforts to resolve the root issue via > unifying the create table syntax. > https://lists.apache.org/thread.html/rf1acfaaa3de2d3129575199c28e7d529d38f2783e7d3c5be2ac8923d%40%3Cdev.spark.apache.org%3E > We should ensure the new "single" create table syntax is very deterministic > to both devs and end users. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31718) DataSourceV2 unexpected behavior with partition data distribution
[ https://issues.apache.org/jira/browse/SPARK-31718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-31718: - Fix Version/s: (was: 2.4.0) Target Version/s: (was: 2.4.0) Don't set Fix/Target Version > DataSourceV2 unexpected behavior with partition data distribution > -- > > Key: SPARK-31718 > URL: https://issues.apache.org/jira/browse/SPARK-31718 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.4.0 >Reporter: Serhii >Priority: Major > > Hi team, > > We are using DataSourceV2. > > We have a question regarding the interface > org.apache.spark.sql.sources.v2.writer.DataWriter. > > We have run into the following unexpected behavior. > When we repartition a dataframe, we expect Spark to create a new instance of the > DataWriter interface for each partition and send that partition's data to the > corresponding instance, but sometimes we observe that Spark sends data from > different partitions to the same DataWriter instance. > This behavior sometimes occurs on a YARN cluster. > > If we run the Spark job locally, Spark does create a new DataWriter instance for > each partition after the repartition and publishes the repartitioned data to the > appropriate instances. > > Is there possibly a Spark limit on the number of DataWriter instances? > Can you explain whether this is a bug or expected behavior? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-16423) Inconsistent settings on the first day of a week
[ https://issues.apache.org/jira/browse/SPARK-16423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110577#comment-17110577 ] Eric Sun edited comment on SPARK-16423 at 5/18/20, 8:12 PM: Just to clarify the issue again: {code:scala} scala> spark.sql("SELECT dayofweek('2020-05-18')").show() +---+ |dayofweek(CAST(2020-05-18 AS DATE))| +---+ | 2| +---+ scala> spark.sql("SELECT date_trunc('week', from_unixtime(1589829639, '-MM-dd'))").show() +--+ |date_trunc(week, CAST(from_unixtime(CAST(1589829639 AS BIGINT), -MM-dd) AS TIMESTAMP))| +--+ | 2020-05-18 00:00:00| +--+ {code} * 2020-05-18 is Monday: the dayofweek() => 2, therefore the 1st day of this week is *Sunday*, right? * date_trunc('week', '2020-05-18') => Monday, the 1st of of this week is *Monday*, see a tiny discrepancy? https://dev.mysql.com/doc/refman/8.0/en/date-and-time-functions.html#function_date-format {code:sql} -- MySql behavior SELECT ds, DAYOFWEEK(ds), DATE_FORMAT(ds, '%YW%U'), DATE_FORMAT(ds, '%YW%u') FROM (SELECT FROM_UNIXTIME(1589829639) as ds) x; ds |DAYOFWEEK(ds)|DATE_FORMAT(ds, '%YW%U')|DATE_FORMAT(ds, '%YW%u')|DATE_FORMAT(ds, '%YW%V')| ---|-|||| 2020-05-18 19:20:39|2|2020W20 |2020W21 |2020W20 | {code} The request for this JIRA is: should we allow date_trunc() and dayofweek() support different 1st day of week option (SUNDAY or MONDAY)? was (Author: ericsun2): Just to clarify the issue again: {code:scala} scala> spark.sql("SELECT dayofweek('2020-05-18')").show() +---+ |dayofweek(CAST(2020-05-18 AS DATE))| +---+ | 2| +---+ scala> spark.sql("SELECT date_trunc('week', from_unixtime(1589829639, '-MM-dd'))").show() +--+ |date_trunc(week, CAST(from_unixtime(CAST(1589829639 AS BIGINT), -MM-dd) AS TIMESTAMP))| +--+ | 2020-05-18 00:00:00| +--+ {code} * 2020-05-18 is Monday: the dayofweek() => 2, therefore the 1st day of this week is *Sunday*, right? * date_trunc('week', '2020-05-18') => Monday, the 1st of of this week is *Monday*, see a tiny discrepancy? {code:sql} -- MySql behavior SELECT ds, DAYOFWEEK(ds), DATE_FORMAT(ds, '%YW%U'), DATE_FORMAT(ds, '%YW%u') FROM (SELECT FROM_UNIXTIME(1589829639) as ds) x; ds |DAYOFWEEK(ds)|DATE_FORMAT(ds, '%YW%U')|DATE_FORMAT(ds, '%YW%u')|DATE_FORMAT(ds, '%YW%V')| ---|-|||| 2020-05-18 19:20:39|2|2020W20 |2020W21 |2020W20 | {code} The request for this JIRA is: should we allow date_trunc() and dayofweek() support different 1st day of week option (SUNDAY or MONDAY)? > Inconsistent settings on the first day of a week > > > Key: SPARK-16423 > URL: https://issues.apache.org/jira/browse/SPARK-16423 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Major > Labels: bulk-closed > > For the function {{WeekOfYear}}, we explicitly set the first day of the week > to {{Calendar.MONDAY}}. However, {{FromUnixTime}} does not explicitly set it. > So, we are using the default first day of the week based on the locale > setting (see > https://docs.oracle.com/javase/8/docs/api/java/util/Calendar.html#setFirstDayOfWeek-int-). > > Let's do a survey on what other databases do and make the setting consistent. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16423) Inconsistent settings on the first day of a week
[ https://issues.apache.org/jira/browse/SPARK-16423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110577#comment-17110577 ] Eric Sun edited comment on SPARK-16423 at 5/18/20, 8:11 PM: Just to clarify the issue again: {code:scala} scala> spark.sql("SELECT dayofweek('2020-05-18')").show() +---+ |dayofweek(CAST(2020-05-18 AS DATE))| +---+ | 2| +---+ scala> spark.sql("SELECT date_trunc('week', from_unixtime(1589829639, '-MM-dd'))").show() +--+ |date_trunc(week, CAST(from_unixtime(CAST(1589829639 AS BIGINT), -MM-dd) AS TIMESTAMP))| +--+ | 2020-05-18 00:00:00| +--+ {code} * 2020-05-18 is Monday: the dayofweek() => 2, therefore the 1st day of this week is *Sunday*, right? * date_trunc('week', '2020-05-18') => Monday, the 1st of of this week is *Monday*, see a tiny discrepancy? {code:sql} -- MySql behavior SELECT ds, DAYOFWEEK(ds), DATE_FORMAT(ds, '%YW%U'), DATE_FORMAT(ds, '%YW%u') FROM (SELECT FROM_UNIXTIME(1589829639) as ds) x; ds |DAYOFWEEK(ds)|DATE_FORMAT(ds, '%YW%U')|DATE_FORMAT(ds, '%YW%u')|DATE_FORMAT(ds, '%YW%V')| ---|-|||| 2020-05-18 19:20:39|2|2020W20 |2020W21 |2020W20 | {code} The request for this JIRA is: should we allow date_trunc() and dayofweek() support different 1st day of week option (SUNDAY or MONDAY)? was (Author: ericsun2): Just to clarify the issue again: {code:scala} scala> spark.sql("SELECT dayofweek('2020-05-18')").show() +---+ |dayofweek(CAST(2020-05-18 AS DATE))| +---+ | 2| +---+ scala> spark.sql("SELECT date_trunc('week', from_unixtime(1589829639, '-MM-dd'))").show() +--+ |date_trunc(week, CAST(from_unixtime(CAST(1589829639 AS BIGINT), -MM-dd) AS TIMESTAMP))| +--+ | 2020-05-18 00:00:00| +--+ {code} * 2020-05-18 is Monday: the dayofweek() => 2, therefore the 1st day of this week is Sunday, right? * date_trunc('week', '2020-05-18') => Monday, the 1st of of this week if Monday, see a tiny discrepancy? {code:sql} -- MySql behavior SELECT ds, DAYOFWEEK(ds), DATE_FORMAT(ds, '%YW%U'), DATE_FORMAT(ds, '%YW%u') FROM (SELECT FROM_UNIXTIME(1589829639) as ds) x; ds |DAYOFWEEK(ds)|DATE_FORMAT(ds, '%YW%U')|DATE_FORMAT(ds, '%YW%u')|DATE_FORMAT(ds, '%YW%V')| ---|-|||| 2020-05-18 19:20:39|2|2020W20 |2020W21 |2020W20 | {code} The request for this JIRA is: should we allow date_trunc() and dayofweek() support different 1st day of week option (SUNDAY or MONDAY)? > Inconsistent settings on the first day of a week > > > Key: SPARK-16423 > URL: https://issues.apache.org/jira/browse/SPARK-16423 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Major > Labels: bulk-closed > > For the function {{WeekOfYear}}, we explicitly set the first day of the week > to {{Calendar.MONDAY}}. However, {{FromUnixTime}} does not explicitly set it. > So, we are using the default first day of the week based on the locale > setting (see > https://docs.oracle.com/javase/8/docs/api/java/util/Calendar.html#setFirstDayOfWeek-int-). > > Let's do a survey on what other databases do and make the setting consistent. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16423) Inconsistent settings on the first day of a week
[ https://issues.apache.org/jira/browse/SPARK-16423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110577#comment-17110577 ] Eric Sun commented on SPARK-16423: -- Just to clarify the issue again: {code:scala} scala> spark.sql("SELECT dayofweek('2020-05-18')").show() +---+ |dayofweek(CAST(2020-05-18 AS DATE))| +---+ | 2| +---+ scala> spark.sql("SELECT date_trunc('week', from_unixtime(1589829639, '-MM-dd'))").show() +--+ |date_trunc(week, CAST(from_unixtime(CAST(1589829639 AS BIGINT), -MM-dd) AS TIMESTAMP))| +--+ | 2020-05-18 00:00:00| +--+ {code} * 2020-05-18 is Monday: the dayofweek() => 2, therefore the 1st day of this week is Sunday, right? * date_trunc('week', '2020-05-18') => Monday, the 1st of of this week if Monday, see a tiny discrepancy? {code:sql} -- MySql behavior SELECT ds, DAYOFWEEK(ds), DATE_FORMAT(ds, '%YW%U'), DATE_FORMAT(ds, '%YW%u') FROM (SELECT FROM_UNIXTIME(1589829639) as ds) x; ds |DAYOFWEEK(ds)|DATE_FORMAT(ds, '%YW%U')|DATE_FORMAT(ds, '%YW%u')|DATE_FORMAT(ds, '%YW%V')| ---|-|||| 2020-05-18 19:20:39|2|2020W20 |2020W21 |2020W20 | {code} The request for this JIRA is: should we allow date_trunc() and dayofweek() support different 1st day of week option (SUNDAY or MONDAY)? > Inconsistent settings on the first day of a week > > > Key: SPARK-16423 > URL: https://issues.apache.org/jira/browse/SPARK-16423 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Major > Labels: bulk-closed > > For the function {{WeekOfYear}}, we explicitly set the first day of the week > to {{Calendar.MONDAY}}. However, {{FromUnixTime}} does not explicitly set it. > So, we are using the default first day of the week based on the locale > setting (see > https://docs.oracle.com/javase/8/docs/api/java/util/Calendar.html#setFirstDayOfWeek-int-). > > Let's do a survey on what other databases do and make the setting consistent. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
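For anyone hit by the discrepancy described in the comment above, the snippet below is a possible workaround sketch. It relies only on the dayofweek and date_sub semantics already shown in the comment (no new option or API is assumed) and derives a Sunday-based week start explicitly instead of using the Monday-based date_trunc('week', ...).

{code:scala}
// Workaround sketch: derive a Sunday-based week start explicitly.
// dayofweek returns 1 for Sunday .. 7 for Saturday, so subtracting
// (dayofweek - 1) days lands on the preceding (or same) Sunday.
spark.sql(
  "SELECT date_sub('2020-05-18', dayofweek('2020-05-18') - 1) AS sunday_week_start"
).show()
// +-----------------+
// |sunday_week_start|
// +-----------------+
// |       2020-05-17|
// +-----------------+
{code}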
[jira] [Resolved] (SPARK-20732) Copy cache data when node is being shut down
[ https://issues.apache.org/jira/browse/SPARK-20732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau resolved SPARK-20732. -- Fix Version/s: 3.1.0 Resolution: Fixed > Copy cache data when node is being shut down > > > Key: SPARK-20732 > URL: https://issues.apache.org/jira/browse/SPARK-20732 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Holden Karau >Assignee: Prakhar Jain >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31555) Improve cache block migration
[ https://issues.apache.org/jira/browse/SPARK-31555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau updated SPARK-31555: - Description: We should explore the following improvements to cache block migration: 1) Peer selection (right now may overbalance on certain peers) 2) Do we need to configure the number of blocks to be migrated at the same time 3) Are there any blocks we don't need to replicate (e.g. they are already stored on the desired number of executors even once we remove the executors slated for decommissioning). 4) Do we want to prioritize migrating blocks with no replicas 5) Log the attempt number for debugging 6) Clarify the logic for determining the number of replicas 7) Consider using TestUtils.waitUntilExecutorsUp in tests rather than count to wait for the executors to come up. imho this is the least important. was: We should explore the following improvements to cache block migration: 1) Peer selection (right now may overbalance on certain peers) 2) Do we need to configure the number of blocks to be migrated at the same time 3) Are there any blocks we don't need to replicate (e.g. they are already stored on the desired number of executors even once we remove the executors slated for decommissioning). 4) Do we want to prioritize migrating blocks with no replicas 5) Log the attempt number for debugging 6) Clarify the logic for determining the number of replicas > Improve cache block migration > - > > Key: SPARK-31555 > URL: https://issues.apache.org/jira/browse/SPARK-31555 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Holden Karau >Priority: Major > > We should explore the following improvements to cache block migration: > 1) Peer selection (right now may overbalance on certain peers) > 2) Do we need to configure the number of blocks to be migrated at the same > time > 3) Are there any blocks we don't need to replicate (e.g. they are already > stored on the desired number of executors even once we remove the executors > slated for decommissioning). > 4) Do we want to prioritize migrating blocks with no replicas > 5) Log the attempt number for debugging > 6) Clarify the logic for determining the number of replicas > 7) Consider using TestUtils.waitUntilExecutorsUp in tests rather than count > to wait for the executors to come up. imho this is the least important. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31555) Improve cache block migration
[ https://issues.apache.org/jira/browse/SPARK-31555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau updated SPARK-31555: - Description: We should explore the following improvements to cache block migration: 1) Peer selection (right now may overbalance on certain peers) 2) Do we need to configure the number of blocks to be migrated at the same time 3) Are there any blocks we don't need to replicate (e.g. they are already stored on the desired number of executors even once we remove the executors slated for decommissioning). 4) Do we want to prioritize migrating blocks with no replicas 5) Log the attempt number for debugging 6) Clarify the logic for determining the number of replicas was: We should explore the following improvements to cache block migration: 1) Peer selection (right now may overbalance on certain peers) 2) Do we need to configure the number of blocks to be migrated at the same time 3) Are there any blocks we don't need to replicate (e.g. they are already stored on the desired number of executors even once we remove the executors slated for decommissioning). 4) Do we want to prioritize migrating blocks with no replicas 5) Log the attempt number for debugging > Improve cache block migration > - > > Key: SPARK-31555 > URL: https://issues.apache.org/jira/browse/SPARK-31555 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Holden Karau >Priority: Major > > We should explore the following improvements to cache block migration: > 1) Peer selection (right now may overbalance on certain peers) > 2) Do we need to configure the number of blocks to be migrated at the same > time > 3) Are there any blocks we don't need to replicate (e.g. they are already > stored on the desired number of executors even once we remove the executors > slated for decommissioning). > 4) Do we want to prioritize migrating blocks with no replicas > 5) Log the attempt number for debugging > 6) Clarify the logic for determining the number of replicas -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31555) Improve cache block migration
[ https://issues.apache.org/jira/browse/SPARK-31555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau updated SPARK-31555: - Description: We should explore the following improvements to cache block migration: 1) Peer selection (right now may overbalance on certain peers) 2) Do we need to configure the number of blocks to be migrated at the same time 3) Are there any blocks we don't need to replicate (e.g. they are already stored on the desired number of executors even once we remove the executors slated for decommissioning). 4) Do we want to prioritize migrating blocks with no replicas 5) Log the attempt number for debugging was: We should explore the following improvements to cache block migration: 1) Peer selection (right now may overbalance on certain peers) 2) Do we need to configure the number of blocks to be migrated at the same time 3) Are there any blocks we don't need to replicate (e.g. they are already stored on the desired number of executors even once we remove the executors slated for decommissioning). 4) Do we want to prioritize migrating blocks with no replicas > Improve cache block migration > - > > Key: SPARK-31555 > URL: https://issues.apache.org/jira/browse/SPARK-31555 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Holden Karau >Priority: Major > > We should explore the following improvements to cache block migration: > 1) Peer selection (right now may overbalance on certain peers) > 2) Do we need to configure the number of blocks to be migrated at the same > time > 3) Are there any blocks we don't need to replicate (e.g. they are already > stored on the desired number of executors even once we remove the executors > slated for decommissioning). > 4) Do we want to prioritize migrating blocks with no replicas > 5) Log the attempt number for debugging -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31555) Improve cache block migration
[ https://issues.apache.org/jira/browse/SPARK-31555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau updated SPARK-31555: - Description: We should explore the following improvements to cache block migration: 1) Peer selection (right now may overbalance on certain peers) 2) Do we need to configure the number of blocks to be migrated at the same time 3) Are there any blocks we don't need to replicate (e.g. they are already stored on the desired number of executors even once we remove the executors slated for decommissioning). 4) Do we want to prioritize migrating blocks with no replicas was: We should explore the following improvements to cache block migration: 1) Peer selection (right now may overbalance on certain peers) 2) Do we need to configure the number of blocks to be migrated at the same time 3) Do we want to prioritize migrating blocks with no replicas > Improve cache block migration > - > > Key: SPARK-31555 > URL: https://issues.apache.org/jira/browse/SPARK-31555 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Holden Karau >Priority: Major > > We should explore the following improvements to cache block migration: > 1) Peer selection (right now may overbalance on certain peers) > 2) Do we need to configure the number of blocks to be migrated at the same > time > 3) Are there any blocks we don't need to replicate (e.g. they are already > stored on the desired number of executors even once we remove the executors > slated for decommissioning). > 4) Do we want to prioritize migrating blocks with no replicas > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30267) avro deserializer: ArrayList cannot be cast to GenericData$Array
[ https://issues.apache.org/jira/browse/SPARK-30267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-30267: Affects Version/s: 3.0.0 > avro deserializer: ArrayList cannot be cast to GenericData$Array > > > Key: SPARK-30267 > URL: https://issues.apache.org/jira/browse/SPARK-30267 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.0.0 >Reporter: Steven Aerts >Assignee: Steven Aerts >Priority: Major > > On some more complex avro objects, the Avro Deserializer fails with the > following stack trace: > {code} > java.lang.ClassCastException: java.util.ArrayList cannot be cast to > org.apache.avro.generic.GenericData$Array > at > org.apache.spark.sql.avro.AvroDeserializer.$anonfun$newWriter$19(AvroDeserializer.scala:170) > at > org.apache.spark.sql.avro.AvroDeserializer.$anonfun$newWriter$19$adapted(AvroDeserializer.scala:169) > at > org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$1(AvroDeserializer.scala:314) > at > org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$1$adapted(AvroDeserializer.scala:310) > at > org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$2(AvroDeserializer.scala:332) > at > org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$2$adapted(AvroDeserializer.scala:329) > at > org.apache.spark.sql.avro.AvroDeserializer.$anonfun$converter$3(AvroDeserializer.scala:56) > at > org.apache.spark.sql.avro.AvroDeserializer.deserialize(AvroDeserializer.scala:70) > {code} > This is because the Deserializer assumes that an array is always of the very > specific {{org.apache.avro.generic.GenericData$Array}} which is not always > the case. > Making it a normal list works. > A github PR is coming up to fix this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30267) avro deserializer: ArrayList cannot be cast to GenericData$Array
[ https://issues.apache.org/jira/browse/SPARK-30267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-30267: Fix Version/s: (was: 3.0.0) > avro deserializer: ArrayList cannot be cast to GenericData$Array > > > Key: SPARK-30267 > URL: https://issues.apache.org/jira/browse/SPARK-30267 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: Steven Aerts >Assignee: Steven Aerts >Priority: Major > > On some more complex avro objects, the Avro Deserializer fails with the > following stack trace: > {code} > java.lang.ClassCastException: java.util.ArrayList cannot be cast to > org.apache.avro.generic.GenericData$Array > at > org.apache.spark.sql.avro.AvroDeserializer.$anonfun$newWriter$19(AvroDeserializer.scala:170) > at > org.apache.spark.sql.avro.AvroDeserializer.$anonfun$newWriter$19$adapted(AvroDeserializer.scala:169) > at > org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$1(AvroDeserializer.scala:314) > at > org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$1$adapted(AvroDeserializer.scala:310) > at > org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$2(AvroDeserializer.scala:332) > at > org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$2$adapted(AvroDeserializer.scala:329) > at > org.apache.spark.sql.avro.AvroDeserializer.$anonfun$converter$3(AvroDeserializer.scala:56) > at > org.apache.spark.sql.avro.AvroDeserializer.deserialize(AvroDeserializer.scala:70) > {code} > This is because the Deserializer assumes that an array is always of the very > specific {{org.apache.avro.generic.GenericData$Array}} which is not always > the case. > Making it a normal list works. > A github PR is coming up to fix this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28554) implement basic catalog functionalities
[ https://issues.apache.org/jira/browse/SPARK-28554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-28554: Affects Version/s: (was: 3.0.0) 3.1.0 > implement basic catalog functionalities > --- > > Key: SPARK-28554 > URL: https://issues.apache.org/jira/browse/SPARK-28554 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28554) implement basic catalog functionalities
[ https://issues.apache.org/jira/browse/SPARK-28554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-28554: Fix Version/s: (was: 3.0.0) > implement basic catalog functionalities > --- > > Key: SPARK-28554 > URL: https://issues.apache.org/jira/browse/SPARK-28554 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29441) Unable to Alter table column type in spark.
[ https://issues.apache.org/jira/browse/SPARK-29441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110436#comment-17110436 ] krishnendu mukherjee commented on SPARK-29441: -- also column name is not being altered > Unable to Alter table column type in spark. > --- > > Key: SPARK-29441 > URL: https://issues.apache.org/jira/browse/SPARK-29441 > Project: Spark > Issue Type: Improvement > Components: Spark Shell >Affects Versions: 2.3.1 > Environment: spark -2.3 > hadoop -2.4 >Reporter: prudhviraj >Priority: Major > > Unable to alter table column type in spark. > scala> spark.sql("""alter table tablename change col1 col1 string""") > org.apache.spark.sql.AnalysisException: ALTER TABLE CHANGE COLUMN is not > supported for changing column 'col1' with type 'LongType' to 'col1' with type > 'StringType'; -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
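As a hedged workaround sketch (not a recommendation taken from this ticket), the usual way around the unsupported in-place type change in Spark 2.x is to rewrite the table with the column cast to the desired type; the table and column names below are simply the ones from the example.
{code:java}
// Hedged workaround sketch: Spark 2.x cannot change a column's type in place,
// so materialize a copy of the table with the column cast to the new type.
import org.apache.spark.sql.functions.col

spark.table("tablename")
  .withColumn("col1", col("col1").cast("string"))
  .write
  .mode("overwrite")
  .saveAsTable("tablename_casted")   // then swap names once the copy is verified
{code}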
[jira] [Created] (SPARK-31751) spark serde property path overwrites table property location
Nithin created SPARK-31751: -- Summary: spark serde property path overwrites table property location Key: SPARK-31751 URL: https://issues.apache.org/jira/browse/SPARK-31751 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.1 Reporter: Nithin This issue has caused us many data errors. 1) Using Spark (with Hive context enabled): df = spark.createDataFrame([\{"a": "x", "b": "y", "c": "3"}]) df.write.format("orc").option("compression", "ZLIB").mode("overwrite").saveAsTable('test_spark'); 2) From Hive: alter table test_spark rename to test_spark2 3) From spark-sql on the command line (note: not pyspark or spark-shell): select * from test_spark2 will give output NULL NULL NULL Time taken: 0.334 seconds, Fetched 1 row(s) This returns NULL because the PySpark write API adds a serde property called path to the Hive metastore. When Hive renames the table, it does not understand this serde property and keeps it as it is. When spark-sql later reads the table, it honors the serde property first and tries to read from the now non-existent HDFS location. If it raised an error, that would still be acceptable, but returning NULL causes applications to fail badly. Spark claims to support Hive tables, so it should respect the Hive metastore location property rather than the Spark serde property when reading a table. This cannot be classified as expected behaviour. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
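A hedged repair sketch, assuming the table has already been renamed on the Hive side: point the Spark-specific {{path}} serde property back at the table's actual location (the HDFS path below is a placeholder), then verify what Spark will read.
{code:java}
// Hedged workaround sketch: re-align the stale Spark serde property "path" with the
// table's real location after a Hive-side rename. The warehouse path is a placeholder.
spark.sql(
  """ALTER TABLE test_spark2
    |SET SERDEPROPERTIES ('path' = 'hdfs://namenode:8020/warehouse/test_spark2')""".stripMargin)

// Confirm the location and serde properties Spark now sees.
spark.sql("DESCRIBE FORMATTED test_spark2").show(100, truncate = false)
{code}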
[jira] [Commented] (SPARK-31693) Investigate AmpLab Jenkins server network issue
[ https://issues.apache.org/jira/browse/SPARK-31693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110409#comment-17110409 ] Shane Knapp commented on SPARK-31693: - apache.org blacklisted us: ''' That IP was banned 5 days ago for more than 1,000 download page views per 24 hours (1020 >= limit of 1000). Typically this is due to some misconfigured CI system hitting our systems to download packages instead of using a local cache. ''' I asked them to un-ban us while I investigate the root cause. > Investigate AmpLab Jenkins server network issue > --- > > Key: SPARK-31693 > URL: https://issues.apache.org/jira/browse/SPARK-31693 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Critical > > Given the series of failures in the Spark packaging Jenkins job, it seems that > there is a network issue in the AmpLab Jenkins cluster. > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/ > - The node failed to talk to GitBox. (SPARK-31687) -> GitHub is okay. > - The node failed to download the maven mirror. (SPARK-31691) -> The primary > host is okay. > - The node failed to communicate with repository.apache.org. (Current master > branch Jenkins job failure) > {code} > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-deploy-plugin:3.0.0-M1:deploy (default-deploy) > on project spark-parent_2.12: ArtifactDeployerException: Failed to retrieve > remote metadata > org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT/maven-metadata.xml: Could > not transfer metadata > org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT/maven-metadata.xml from/to > apache.snapshots.https > (https://repository.apache.org/content/repositories/snapshots): Transfer > failed for > https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-parent_2.12/3.1.0-SNAPSHOT/maven-metadata.xml: > Connect to repository.apache.org:443 [repository.apache.org/207.244.88.140] > failed: Connection timed out (Connection timed out) -> [Help 1] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31750) Eliminate UpCast if child's dataType is DecimalType
[ https://issues.apache.org/jira/browse/SPARK-31750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31750: Assignee: (was: Apache Spark) > Eliminate UpCast if child's dataType is DecimalType > --- > > Key: SPARK-31750 > URL: https://issues.apache.org/jira/browse/SPARK-31750 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > > > {code:java} > sql("select cast(11 as decimal(38, 0)) as > d") > .write.mode("overwrite") > .parquet(f.getAbsolutePath) > spark.read.parquet(f.getAbsolutePath).as[BigDecimal] > {code} > {code:java} > [info] org.apache.spark.sql.AnalysisException: Cannot up cast `d` from > decimal(38,0) to decimal(38,18). > [info] The type path of the target object is: > [info] - root class: "scala.math.BigDecimal" > [info] You can either add an explicit cast to the input data or choose a > higher precision type of the field in the target object; > [info] at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:3060) > [info] at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$33$$anonfun$applyOrElse$174.applyOrElse(Analyzer.scala:3087) > [info] at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$33$$anonfun$applyOrElse$174.applyOrElse(Analyzer.scala:3071) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309) > [info] at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31750) Eliminate UpCast if child's dataType is DecimalType
[ https://issues.apache.org/jira/browse/SPARK-31750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110367#comment-17110367 ] Apache Spark commented on SPARK-31750: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/28572 > Eliminate UpCast if child's dataType is DecimalType > --- > > Key: SPARK-31750 > URL: https://issues.apache.org/jira/browse/SPARK-31750 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > > > {code:java} > sql("select cast(11 as decimal(38, 0)) as > d") > .write.mode("overwrite") > .parquet(f.getAbsolutePath) > spark.read.parquet(f.getAbsolutePath).as[BigDecimal] > {code} > {code:java} > [info] org.apache.spark.sql.AnalysisException: Cannot up cast `d` from > decimal(38,0) to decimal(38,18). > [info] The type path of the target object is: > [info] - root class: "scala.math.BigDecimal" > [info] You can either add an explicit cast to the input data or choose a > higher precision type of the field in the target object; > [info] at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:3060) > [info] at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$33$$anonfun$applyOrElse$174.applyOrElse(Analyzer.scala:3087) > [info] at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$33$$anonfun$applyOrElse$174.applyOrElse(Analyzer.scala:3071) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309) > [info] at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31750) Eliminate UpCast if child's dataType is DecimalType
[ https://issues.apache.org/jira/browse/SPARK-31750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31750: Assignee: Apache Spark > Eliminate UpCast if child's dataType is DecimalType > --- > > Key: SPARK-31750 > URL: https://issues.apache.org/jira/browse/SPARK-31750 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: wuyi >Assignee: Apache Spark >Priority: Major > > > {code:java} > sql("select cast(11 as decimal(38, 0)) as > d") > .write.mode("overwrite") > .parquet(f.getAbsolutePath) > spark.read.parquet(f.getAbsolutePath).as[BigDecimal] > {code} > {code:java} > [info] org.apache.spark.sql.AnalysisException: Cannot up cast `d` from > decimal(38,0) to decimal(38,18). > [info] The type path of the target object is: > [info] - root class: "scala.math.BigDecimal" > [info] You can either add an explicit cast to the input data or choose a > higher precision type of the field in the target object; > [info] at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:3060) > [info] at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$33$$anonfun$applyOrElse$174.applyOrElse(Analyzer.scala:3087) > [info] at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$33$$anonfun$applyOrElse$174.applyOrElse(Analyzer.scala:3071) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309) > [info] at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31750) Eliminate UpCast if child's dataType is DecimalType
[ https://issues.apache.org/jira/browse/SPARK-31750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-31750: - Summary: Eliminate UpCast if child's dataType is DecimalType (was: Eliminate UpCast if chid's dataType is DecimalType) > Eliminate UpCast if child's dataType is DecimalType > --- > > Key: SPARK-31750 > URL: https://issues.apache.org/jira/browse/SPARK-31750 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > > > {code:java} > sql("select cast(11 as decimal(38, 0)) as > d") > .write.mode("overwrite") > .parquet(f.getAbsolutePath) > spark.read.parquet(f.getAbsolutePath).as[BigDecimal] > {code} > {code:java} > [info] org.apache.spark.sql.AnalysisException: Cannot up cast `d` from > decimal(38,0) to decimal(38,18). > [info] The type path of the target object is: > [info] - root class: "scala.math.BigDecimal" > [info] You can either add an explicit cast to the input data or choose a > higher precision type of the field in the target object; > [info] at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:3060) > [info] at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$33$$anonfun$applyOrElse$174.applyOrElse(Analyzer.scala:3087) > [info] at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$33$$anonfun$applyOrElse$174.applyOrElse(Analyzer.scala:3071) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309) > [info] at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31750) Eliminate UpCast if chid's dataType is DecimalType
wuyi created SPARK-31750: Summary: Eliminate UpCast if chid's dataType is DecimalType Key: SPARK-31750 URL: https://issues.apache.org/jira/browse/SPARK-31750 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: wuyi {code:java} sql("select cast(11 as decimal(38, 0)) as d") .write.mode("overwrite") .parquet(f.getAbsolutePath) spark.read.parquet(f.getAbsolutePath).as[BigDecimal] {code} {code:java} [info] org.apache.spark.sql.AnalysisException: Cannot up cast `d` from decimal(38,0) to decimal(38,18). [info] The type path of the target object is: [info] - root class: "scala.math.BigDecimal" [info] You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object; [info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:3060) [info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$33$$anonfun$applyOrElse$174.applyOrElse(Analyzer.scala:3087) [info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$33$$anonfun$applyOrElse$174.applyOrElse(Analyzer.scala:3071) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309) [info] at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
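A hedged workaround sketch for the failure above, following the hint in the error message itself: add an explicit cast so the column already carries the {{decimal(38,18)}} type the {{BigDecimal}} encoder expects. The path is a placeholder for {{f.getAbsolutePath}} in the report, and the cast can lose precision or produce nulls for values with more than 20 integer digits.
{code:java}
// Hedged workaround sketch: add the explicit cast suggested by the error message instead of
// relying on the UpCast inserted by the BigDecimal encoder (which targets decimal(38,18)).
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType
import spark.implicits._

val path = "/tmp/spark-31750-demo"   // placeholder for f.getAbsolutePath in the report
spark.sql("select cast(11 as decimal(38, 0)) as d")
  .write.mode("overwrite").parquet(path)

val ds = spark.read.parquet(path)
  .select(col("d").cast(DecimalType(38, 18)).as("d"))  // matches the encoder's target type
  .as[BigDecimal]
ds.show()
{code}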
[jira] [Resolved] (SPARK-31738) Describe 'L' and 'M' month pattern letters
[ https://issues.apache.org/jira/browse/SPARK-31738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31738. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28558 [https://github.com/apache/spark/pull/28558] > Describe 'L' and 'M' month pattern letters > -- > > Key: SPARK-31738 > URL: https://issues.apache.org/jira/browse/SPARK-31738 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > Fix For: 3.0.0 > > > # Describe difference between 'M' and 'L' pattern letters > # Add examples -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
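A hedged illustration of the distinction this documentation change is about (not taken from the PR): Spark 3.0's datetime patterns delegate to {{java.time.format.DateTimeFormatter}}, where 'M' is the context-sensitive month form and 'L' the stand-alone form. English renders both identically, but locales such as Russian do not; the exact strings depend on the JDK's locale data.
{code:java}
// Hedged illustration: 'M' (format/context form) vs 'L' (stand-alone form) month letters.
// Output depends on the JDK/CLDR locale data in use.
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.Locale

val date = LocalDate.of(2020, 5, 18)
val ru = Locale.forLanguageTag("ru")
println(DateTimeFormatter.ofPattern("MMMM", ru).format(date)) // e.g. "мая" (form used inside a date)
println(DateTimeFormatter.ofPattern("LLLL", ru).format(date)) // e.g. "май" (month named on its own)
{code}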
[jira] [Assigned] (SPARK-31739) Docstring syntax issues prevent proper compilation of documentation
[ https://issues.apache.org/jira/browse/SPARK-31739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-31739: Assignee: David Toneian > Docstring syntax issues prevent proper compilation of documentation > --- > > Key: SPARK-31739 > URL: https://issues.apache.org/jira/browse/SPARK-31739 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.4.5 >Reporter: David Toneian >Assignee: David Toneian >Priority: Trivial > > Some docstrings contain mistakes, like missing or spurious spaces, which > prevent the documentation from being rendered as intended. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31739) Docstring syntax issues prevent proper compilation of documentation
[ https://issues.apache.org/jira/browse/SPARK-31739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31739. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28559 [https://github.com/apache/spark/pull/28559] > Docstring syntax issues prevent proper compilation of documentation > --- > > Key: SPARK-31739 > URL: https://issues.apache.org/jira/browse/SPARK-31739 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.4.5 >Reporter: David Toneian >Assignee: David Toneian >Priority: Trivial > Fix For: 3.1.0 > > > Some docstrings contain mistakes, like missing or spurious spaces, which > prevent the documentation from being rendered as intended. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31749) Allow to set owner reference for the driver pod (cluster mode)
Tamas Jambor created SPARK-31749: Summary: Allow to set owner reference for the driver pod (cluster mode) Key: SPARK-31749 URL: https://issues.apache.org/jira/browse/SPARK-31749 Project: Spark Issue Type: New Feature Components: Kubernetes Affects Versions: 2.4.5 Reporter: Tamas Jambor Currently there is no way to pass ownerReferences to the driver pod in cluster mode. This makes it difficult for the upstream process to clean up pods after they completed. Something like this would be useful: spark.kubernetes.driver.ownerReferences.[Name] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
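For illustration only, a sketch of how the proposed configuration might be supplied. The key pattern comes from the ticket itself; the value encoding and names are purely hypothetical, since this setting does not exist in Spark 2.4.5.
{code:java}
// Hypothetical sketch: this key is only *proposed* in SPARK-31749 and is not a real
// Spark 2.4.5 configuration. The value format and names are made up for illustration.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.kubernetes.driver.ownerReferences.parent-job",
       "apiVersion=batch/v1,kind=Job,name=parent-job,uid=123e4567-e89b-12d3-a456-426614174000")
{code}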
[jira] [Commented] (SPARK-21784) Add ALTER TABLE ADD CONSTRANT DDL to support defining primary key and foreign keys
[ https://issues.apache.org/jira/browse/SPARK-21784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110140#comment-17110140 ] krishnendu mukherjee commented on SPARK-21784: -- has this been addded to spark latest release? > Add ALTER TABLE ADD CONSTRANT DDL to support defining primary key and foreign > keys > -- > > Key: SPARK-21784 > URL: https://issues.apache.org/jira/browse/SPARK-21784 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Suresh Thalamati >Priority: Major > > Currently Spark SQL does not have DDL support to define primary key , and > foreign key constraints. This Jira is to add DDL support to define primary > key and foreign key informational constraint using ALTER TABLE syntax. These > constraints will be used in query optimization and you can find more details > about this in the spec in SPARK-19842 > *Syntax :* > {code} > ALTER TABLE [db_name.]table_name ADD [CONSTRAINT constraintName] > (PRIMARY KEY (col_names) | > FOREIGN KEY (col_names) REFERENCES [db_name.]table_name [(col_names)]) > [VALIDATE | NOVALIDATE] [RELY | NORELY] > {code} > Examples : > {code:sql} > ALTER TABLE employee _ADD CONSTRANT pk_ PRIMARY KEY(empno) VALIDATE RELY > ALTER TABLE department _ADD CONSTRAINT emp_fk_ FOREIGN KEY (mgrno) REFERENCES > employee(empno) NOVALIDATE NORELY > {code} > *Constraint name generated by the system:* > {code:sql} > ALTER TABLE department ADD PRIMARY KEY(deptno) VALIDATE RELY > ALTER TABLE employee ADD FOREIGN KEY (workdept) REFERENCES department(deptno) > VALIDATE RELY; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31710) result is the not the same when query and execute jobs
[ https://issues.apache.org/jira/browse/SPARK-31710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110103#comment-17110103 ] Apache Spark commented on SPARK-31710: -- User 'GuoPhilipse' has created a pull request for this issue: https://github.com/apache/spark/pull/28570 > result is the not the same when query and execute jobs > -- > > Key: SPARK-31710 > URL: https://issues.apache.org/jira/browse/SPARK-31710 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 > Environment: hdp:2.7.7 > spark:2.4.5 >Reporter: philipse >Priority: Major > > Hi Team > Steps to reproduce. > {code:java} > create table test(id bigint); > insert into test select 1586318188000; > create table test1(id bigint) partitioned by (year string); > insert overwrite table test1 partition(year) select 234,cast(id as TIMESTAMP) > from test; > {code} > let's check the result. > Case 1: > *select * from test1;* > 234 | 52238-06-04 13:06:400.0 > --the result is wrong > Case 2: > *select 234,cast(id as TIMESTAMP) from test;* > > java.lang.IllegalArgumentException: Timestamp format must be -mm-dd > hh:mm:ss[.f] > at java.sql.Timestamp.valueOf(Timestamp.java:237) > at > org.apache.hive.jdbc.HiveBaseResultSet.evaluate(HiveBaseResultSet.java:441) > at > org.apache.hive.jdbc.HiveBaseResultSet.getColumnValue(HiveBaseResultSet.java:421) > at > org.apache.hive.jdbc.HiveBaseResultSet.getString(HiveBaseResultSet.java:530) > at org.apache.hive.beeline.Rows$Row.(Rows.java:166) > at org.apache.hive.beeline.BufferedRows.(BufferedRows.java:43) > at org.apache.hive.beeline.BeeLine.print(BeeLine.java:1756) > at org.apache.hive.beeline.Commands.execute(Commands.java:826) > at org.apache.hive.beeline.Commands.sql(Commands.java:670) > at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:974) > at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:810) > at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:767) > at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:480) > at org.apache.hive.beeline.BeeLine.main(BeeLine.java:463) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at org.apache.hadoop.util.RunJar.run(RunJar.java:226) > at org.apache.hadoop.util.RunJar.main(RunJar.java:141) > Error: Unrecognized column type:TIMESTAMP_TYPE (state=,code=0) > > I try hive,it works well,and the convert is fine and correct > {code:java} > select 234,cast(id as TIMESTAMP) from test; > 234 2020-04-08 11:56:28 > {code} > Two questions: > q1: > if we forbid this convert,should we keep all cases the same? > q2: > if we allow the convert in some cases, should we decide the long length, for > the code seems to force to convert to ns with times*100 nomatter how long > the data is,if it convert to timestamp with incorrect length, we can raise > the error. > {code:java} > // // converting seconds to us > private[this] def longToTimestamp(t: Long): Long = t * 100L{code} > > Thanks! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
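A hedged workaround sketch for the report above: Spark's {{CAST(bigint AS TIMESTAMP)}} interprets the value as seconds since the epoch, while {{1586318188000}} is in milliseconds, which is why the result lands in year 52238. Dividing by 1000 before casting yields the intended timestamp; this assumes the {{test}} table from the reproduction steps.
{code:java}
// Hedged workaround sketch: treat the stored bigint as epoch *milliseconds* explicitly
// instead of letting CAST interpret it as epoch seconds. Assumes the `test` table above.
spark.sql("""
  SELECT 234,
         CAST(id / 1000 AS TIMESTAMP) AS ts   -- id holds epoch milliseconds
  FROM test
""").show(truncate = false)
{code}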
[jira] [Commented] (SPARK-31748) Document resource module in PySpark doc and rename/move classes
[ https://issues.apache.org/jira/browse/SPARK-31748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110064#comment-17110064 ] Apache Spark commented on SPARK-31748: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/28569 > Document resource module in PySpark doc and rename/move classes > --- > > Key: SPARK-31748 > URL: https://issues.apache.org/jira/browse/SPARK-31748 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > SPARK-29641 and SPARK-28234 added new pyspark.resrouce module. > It should be documented as it's an API. > Also, the current structure is as follows: > {code} > pyspark > ├── resource > │ ├── executorrequests.py > │ │ ├── class ExecutorResourceRequest > │ │ └── class ExecutorResourceRequests > │ ├── taskrequests.py > │ │ ├── class TaskResourceRequest > │ │ └── class TaskResourceRequests > │ ├── resourceprofilebuilder.py > │ │ └── class ResourceProfileBuilder > │ ├── resourceprofile.py > │ │ └── class ResourceProfile > └── resourceinformation > └── class ResourceInformation > {code} > Might better put into fewer and simpler modules. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31748) Document resource module in PySpark doc and rename/move classes
[ https://issues.apache.org/jira/browse/SPARK-31748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31748: Assignee: (was: Apache Spark) > Document resource module in PySpark doc and rename/move classes > --- > > Key: SPARK-31748 > URL: https://issues.apache.org/jira/browse/SPARK-31748 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > SPARK-29641 and SPARK-28234 added new pyspark.resrouce module. > It should be documented as it's an API. > Also, the current structure is as follows: > {code} > pyspark > ├── resource > │ ├── executorrequests.py > │ │ ├── class ExecutorResourceRequest > │ │ └── class ExecutorResourceRequests > │ ├── taskrequests.py > │ │ ├── class TaskResourceRequest > │ │ └── class TaskResourceRequests > │ ├── resourceprofilebuilder.py > │ │ └── class ResourceProfileBuilder > │ ├── resourceprofile.py > │ │ └── class ResourceProfile > └── resourceinformation > └── class ResourceInformation > {code} > Might better put into fewer and simpler modules. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31748) Document resource module in PySpark doc and rename/move classes
[ https://issues.apache.org/jira/browse/SPARK-31748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31748: Assignee: Apache Spark > Document resource module in PySpark doc and rename/move classes > --- > > Key: SPARK-31748 > URL: https://issues.apache.org/jira/browse/SPARK-31748 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > SPARK-29641 and SPARK-28234 added new pyspark.resrouce module. > It should be documented as it's an API. > Also, the current structure is as follows: > {code} > pyspark > ├── resource > │ ├── executorrequests.py > │ │ ├── class ExecutorResourceRequest > │ │ └── class ExecutorResourceRequests > │ ├── taskrequests.py > │ │ ├── class TaskResourceRequest > │ │ └── class TaskResourceRequests > │ ├── resourceprofilebuilder.py > │ │ └── class ResourceProfileBuilder > │ ├── resourceprofile.py > │ │ └── class ResourceProfile > └── resourceinformation > └── class ResourceInformation > {code} > Might better put into fewer and simpler modules. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31748) Document resource module in PySpark doc and rename/move classes
Hyukjin Kwon created SPARK-31748: Summary: Document resource module in PySpark doc and rename/move classes Key: SPARK-31748 URL: https://issues.apache.org/jira/browse/SPARK-31748 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core Affects Versions: 3.0.0 Reporter: Hyukjin Kwon SPARK-29641 and SPARK-28234 added the new pyspark.resource module. It should be documented as it's an API. Also, the current structure is as follows: {code} pyspark ├── resource │ ├── executorrequests.py │ │ ├── class ExecutorResourceRequest │ │ └── class ExecutorResourceRequests │ ├── taskrequests.py │ │ ├── class TaskResourceRequest │ │ └── class TaskResourceRequests │ ├── resourceprofilebuilder.py │ │ └── class ResourceProfileBuilder │ ├── resourceprofile.py │ │ └── class ResourceProfile └── resourceinformation └── class ResourceInformation {code} It might be better to put these into fewer, simpler modules. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31710) result is the not the same when query and execute jobs
[ https://issues.apache.org/jira/browse/SPARK-31710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110054#comment-17110054 ] Apache Spark commented on SPARK-31710: -- User 'GuoPhilipse' has created a pull request for this issue: https://github.com/apache/spark/pull/28568 > result is the not the same when query and execute jobs > -- > > Key: SPARK-31710 > URL: https://issues.apache.org/jira/browse/SPARK-31710 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 > Environment: hdp:2.7.7 > spark:2.4.5 >Reporter: philipse >Priority: Major > > Hi Team > Steps to reproduce. > {code:java} > create table test(id bigint); > insert into test select 1586318188000; > create table test1(id bigint) partitioned by (year string); > insert overwrite table test1 partition(year) select 234,cast(id as TIMESTAMP) > from test; > {code} > let's check the result. > Case 1: > *select * from test1;* > 234 | 52238-06-04 13:06:400.0 > --the result is wrong > Case 2: > *select 234,cast(id as TIMESTAMP) from test;* > > java.lang.IllegalArgumentException: Timestamp format must be -mm-dd > hh:mm:ss[.f] > at java.sql.Timestamp.valueOf(Timestamp.java:237) > at > org.apache.hive.jdbc.HiveBaseResultSet.evaluate(HiveBaseResultSet.java:441) > at > org.apache.hive.jdbc.HiveBaseResultSet.getColumnValue(HiveBaseResultSet.java:421) > at > org.apache.hive.jdbc.HiveBaseResultSet.getString(HiveBaseResultSet.java:530) > at org.apache.hive.beeline.Rows$Row.(Rows.java:166) > at org.apache.hive.beeline.BufferedRows.(BufferedRows.java:43) > at org.apache.hive.beeline.BeeLine.print(BeeLine.java:1756) > at org.apache.hive.beeline.Commands.execute(Commands.java:826) > at org.apache.hive.beeline.Commands.sql(Commands.java:670) > at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:974) > at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:810) > at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:767) > at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:480) > at org.apache.hive.beeline.BeeLine.main(BeeLine.java:463) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at org.apache.hadoop.util.RunJar.run(RunJar.java:226) > at org.apache.hadoop.util.RunJar.main(RunJar.java:141) > Error: Unrecognized column type:TIMESTAMP_TYPE (state=,code=0) > > I try hive,it works well,and the convert is fine and correct > {code:java} > select 234,cast(id as TIMESTAMP) from test; > 234 2020-04-08 11:56:28 > {code} > Two questions: > q1: > if we forbid this convert,should we keep all cases the same? > q2: > if we allow the convert in some cases, should we decide the long length, for > the code seems to force to convert to ns with times*100 nomatter how long > the data is,if it convert to timestamp with incorrect length, we can raise > the error. > {code:java} > // // converting seconds to us > private[this] def longToTimestamp(t: Long): Long = t * 100L{code} > > Thanks! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31710) result is the not the same when query and execute jobs
[ https://issues.apache.org/jira/browse/SPARK-31710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110043#comment-17110043 ] Apache Spark commented on SPARK-31710: -- User 'GuoPhilipse' has created a pull request for this issue: https://github.com/apache/spark/pull/28567 > result is the not the same when query and execute jobs > -- > > Key: SPARK-31710 > URL: https://issues.apache.org/jira/browse/SPARK-31710 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 > Environment: hdp:2.7.7 > spark:2.4.5 >Reporter: philipse >Priority: Major > > Hi Team > Steps to reproduce. > {code:java} > create table test(id bigint); > insert into test select 1586318188000; > create table test1(id bigint) partitioned by (year string); > insert overwrite table test1 partition(year) select 234,cast(id as TIMESTAMP) > from test; > {code} > let's check the result. > Case 1: > *select * from test1;* > 234 | 52238-06-04 13:06:400.0 > --the result is wrong > Case 2: > *select 234,cast(id as TIMESTAMP) from test;* > > java.lang.IllegalArgumentException: Timestamp format must be -mm-dd > hh:mm:ss[.f] > at java.sql.Timestamp.valueOf(Timestamp.java:237) > at > org.apache.hive.jdbc.HiveBaseResultSet.evaluate(HiveBaseResultSet.java:441) > at > org.apache.hive.jdbc.HiveBaseResultSet.getColumnValue(HiveBaseResultSet.java:421) > at > org.apache.hive.jdbc.HiveBaseResultSet.getString(HiveBaseResultSet.java:530) > at org.apache.hive.beeline.Rows$Row.(Rows.java:166) > at org.apache.hive.beeline.BufferedRows.(BufferedRows.java:43) > at org.apache.hive.beeline.BeeLine.print(BeeLine.java:1756) > at org.apache.hive.beeline.Commands.execute(Commands.java:826) > at org.apache.hive.beeline.Commands.sql(Commands.java:670) > at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:974) > at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:810) > at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:767) > at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:480) > at org.apache.hive.beeline.BeeLine.main(BeeLine.java:463) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at org.apache.hadoop.util.RunJar.run(RunJar.java:226) > at org.apache.hadoop.util.RunJar.main(RunJar.java:141) > Error: Unrecognized column type:TIMESTAMP_TYPE (state=,code=0) > > I try hive,it works well,and the convert is fine and correct > {code:java} > select 234,cast(id as TIMESTAMP) from test; > 234 2020-04-08 11:56:28 > {code} > Two questions: > q1: > if we forbid this convert,should we keep all cases the same? > q2: > if we allow the convert in some cases, should we decide the long length, for > the code seems to force to convert to ns with times*100 nomatter how long > the data is,if it convert to timestamp with incorrect length, we can raise > the error. > {code:java} > // // converting seconds to us > private[this] def longToTimestamp(t: Long): Long = t * 100L{code} > > Thanks! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20427) Issue with Spark interpreting Oracle datatype NUMBER
[ https://issues.apache.org/jira/browse/SPARK-20427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17107376#comment-17107376 ] Rafael edited comment on SPARK-20427 at 5/18/20, 7:27 AM: -- Hey guys, I encountered an issue related to precision issues. Now the code expects the Decimal type we need to have in JDBC metadata precision and scale. [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L402-L414] I found out that in the OracleDB it is valid to have Decimal without these data. When I do a query read metadata for such column I'm getting DATA_PRECISION = Null, and DATA_SCALE = Null. Then when I run the `spark-sql` I'm getting such error: {code:java} java.lang.IllegalArgumentException: requirement failed: Decimal precision 45 exceeds max precision 38 at scala.Predef$.require(Predef.scala:224) at org.apache.spark.sql.types.Decimal.set(Decimal.scala:114) at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$3$$anonfun$12.apply(JdbcUtils.scala:407) {code} Do you have a work around how spark-sql can work with such cases? UPDATE: Solved with the custom scheme. was (Author: kyrdan): Hey guys, I encountered an issue related to precision issues. Now the code expects the Decimal type we need to have in JDBC metadata precision and scale. [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L402-L414] I found out that in the OracleDB it is valid to have Decimal without these data. When I do a query read metadata for such column I'm getting DATA_PRECISION = Null, and DATA_SCALE = Null. Then when I run the `spark-sql` I'm getting such error: {code:java} java.lang.IllegalArgumentException: requirement failed: Decimal precision 45 exceeds max precision 38 at scala.Predef$.require(Predef.scala:224) at org.apache.spark.sql.types.Decimal.set(Decimal.scala:114) at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$3$$anonfun$12.apply(JdbcUtils.scala:407) {code} Do you have a work around how spark-sql can work with such cases? > Issue with Spark interpreting Oracle datatype NUMBER > > > Key: SPARK-20427 > URL: https://issues.apache.org/jira/browse/SPARK-20427 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Alexander Andrushenko >Assignee: Yuming Wang >Priority: Major > Fix For: 2.3.0 > > > In Oracle exists data type NUMBER. When defining a filed in a table of type > NUMBER the field has two components, precision and scale. > For example, NUMBER(p,s) has precision p and scale s. > Precision can range from 1 to 38. > Scale can range from -84 to 127. > When reading such a filed Spark can create numbers with precision exceeding > 38. In our case it has created fields with precision 44, > calculated as sum of the precision (in our case 34 digits) and the scale (10): > "...java.lang.IllegalArgumentException: requirement failed: Decimal precision > 44 exceeds max precision 38...". > The result was, that a data frame was read from a table on one schema but > could not be inserted in the identical table on other schema. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
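The "custom scheme" the commenter mentions presumably refers to the JDBC reader's {{customSchema}} option; below is a hedged sketch of that workaround, with placeholder connection details, table, and column names.
{code:java}
// Hedged sketch of the customSchema workaround: tell the JDBC reader which Spark types to
// use for Oracle NUMBER columns declared without precision/scale, instead of letting the
// inferred precision exceed Spark's maximum of 38. All identifiers below are placeholders.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/SERVICE")
  .option("dbtable", "SOME_SCHEMA.SOME_TABLE")
  .option("user", "scott")
  .option("password", "tiger")
  .option("customSchema", "AMOUNT DECIMAL(38, 10), ID STRING")
  .load()
{code}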