[jira] [Updated] (SPARK-31756) Add real headless browser support for UI test
[ https://issues.apache.org/jira/browse/SPARK-31756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-31756: --- Description: In the current master, there are two problems with UI tests. 1. Lots of tests, especially JavaScript-related ones, are done manually. Appearance is better confirmed by our eyes, but logic should ideally be tested by test cases. 2. Compared to real web browsers, HtmlUnit doesn't seem to support JavaScript well enough. I previously added a JavaScript-related test for SPARK-31534 using HtmlUnit, which is a simple library-based headless browser for testing. The test I added works somehow, but a JavaScript-related error is shown in unit-tests.log. {code:java} === EXCEPTION START Exception class=[net.sourceforge.htmlunit.corejs.javascript.JavaScriptException] com.gargoylesoftware.htmlunit.ScriptException: Error: TOOLTIP: Option "sanitizeFn" provided type "window" but expected type "(null|function)". (http://192.168.1.209:60724/static/jquery-3.4.1.min.js#2) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:904) at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:628) at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:515) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.callFunction(JavaScriptEngine.java:835) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.callFunction(JavaScriptEngine.java:807) at com.gargoylesoftware.htmlunit.InteractivePage.executeJavaScriptFunctionIfPossible(InteractivePage.java:216) at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptFunctionJob.runJavaScript(JavaScriptFunctionJob.java:52) at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptExecutionJob.run(JavaScriptExecutionJob.java:102) at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptJobManagerImpl.runSingleJob(JavaScriptJobManagerImpl.java:426) at com.gargoylesoftware.htmlunit.javascript.background.DefaultJavaScriptExecutor.run(DefaultJavaScriptExecutor.java:157) at java.lang.Thread.run(Thread.java:748) Caused by: net.sourceforge.htmlunit.corejs.javascript.JavaScriptException: Error: TOOLTIP: Option "sanitizeFn" provided type "window" but expected type "(null|function)". (http://192.168.1.209:60724/static/jquery-3.4.1.min.js#2) at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1009) at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:800) at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:105) at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.doTopCall(ContextFactory.java:413) at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.doTopCall(HtmlUnitContextFactory.java:252) at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:3264) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$4.doRun(JavaScriptEngine.java:828) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:889) ... 10 more JavaScriptException value = Error: TOOLTIP: Option "sanitizeFn" provided type "window" but expected type "(null|function)". 
== CALLING JAVASCRIPT == function () { throw e; } === EXCEPTION END {code} I tried upgrading HtmlUnit to 2.40.0, but what is worse, the test stopped working even though it runs without error on real browsers like Chrome, Safari, and Firefox. {code:java} [info] UISeleniumSuite: [info] - SPARK-31534: text for tooltip should be escaped *** FAILED *** (17 seconds, 745 milliseconds) [info] The code passed to eventually never returned normally. Attempted 2 times over 12.910785232 seconds. Last failure message: com.gargoylesoftware.htmlunit.ScriptException: ReferenceError: Assignment to undefined "regeneratorRuntime" in strict mode (http://192.168.1.209:62132/static/vis-timeline-graph2d.min.js#52(Function)#1){code} To resolve these problems, it's better to support a real headless browser for UI tests.
[jira] [Created] (SPARK-31756) Add real headless browser support for UI test
Kousuke Saruta created SPARK-31756: -- Summary: Add real headless browser support for UI test Key: SPARK-31756 URL: https://issues.apache.org/jira/browse/SPARK-31756 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 3.1.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta In the current master, there are two problems with UI tests. 1. Lots of tests, especially JavaScript-related ones, are done manually. Appearance is better confirmed by our eyes, but logic should ideally be tested by test cases. 2. Compared to real web browsers, HtmlUnit doesn't seem to support JavaScript well enough. I previously added a JavaScript-related test for SPARK-31534 using HtmlUnit, which is a simple library-based headless browser for testing. The test I added works somehow, but a JavaScript-related error is shown in unit-tests.log. {code:java} === EXCEPTION START Exception class=[net.sourceforge.htmlunit.corejs.javascript.JavaScriptException] com.gargoylesoftware.htmlunit.ScriptException: Error: TOOLTIP: Option "sanitizeFn" provided type "window" but expected type "(null|function)". (http://192.168.1.209:60724/static/jquery-3.4.1.min.js#2) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:904) at net.sourceforge.htmlunit.corejs.javascript.Context.call(Context.java:628) at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.call(ContextFactory.java:515) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.callFunction(JavaScriptEngine.java:835) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.callFunction(JavaScriptEngine.java:807) at com.gargoylesoftware.htmlunit.InteractivePage.executeJavaScriptFunctionIfPossible(InteractivePage.java:216) at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptFunctionJob.runJavaScript(JavaScriptFunctionJob.java:52) at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptExecutionJob.run(JavaScriptExecutionJob.java:102) at com.gargoylesoftware.htmlunit.javascript.background.JavaScriptJobManagerImpl.runSingleJob(JavaScriptJobManagerImpl.java:426) at com.gargoylesoftware.htmlunit.javascript.background.DefaultJavaScriptExecutor.run(DefaultJavaScriptExecutor.java:157) at java.lang.Thread.run(Thread.java:748) Caused by: net.sourceforge.htmlunit.corejs.javascript.JavaScriptException: Error: TOOLTIP: Option "sanitizeFn" provided type "window" but expected type "(null|function)". (http://192.168.1.209:60724/static/jquery-3.4.1.min.js#2) at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpretLoop(Interpreter.java:1009) at net.sourceforge.htmlunit.corejs.javascript.Interpreter.interpret(Interpreter.java:800) at net.sourceforge.htmlunit.corejs.javascript.InterpretedFunction.call(InterpretedFunction.java:105) at net.sourceforge.htmlunit.corejs.javascript.ContextFactory.doTopCall(ContextFactory.java:413) at com.gargoylesoftware.htmlunit.javascript.HtmlUnitContextFactory.doTopCall(HtmlUnitContextFactory.java:252) at net.sourceforge.htmlunit.corejs.javascript.ScriptRuntime.doTopCall(ScriptRuntime.java:3264) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$4.doRun(JavaScriptEngine.java:828) at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:889) ... 10 more JavaScriptException value = Error: TOOLTIP: Option "sanitizeFn" provided type "window" but expected type "(null|function)". 
== CALLING JAVASCRIPT == function () { throw e; } === EXCEPTION END {code} I tried upgrading HtmlUnit to 2.40.0, but what is worse, the test stopped working even though it runs without error on real browsers like Chrome, Safari, and Firefox. {code:java} [info] UISeleniumSuite: [info] - SPARK-31534: text for tooltip should be escaped *** FAILED *** (17 seconds, 745 milliseconds) [info] The code passed to eventually never returned normally. Attempted 2 times over 12.910785232 seconds. Last failure message: com.gargoylesoftware.htmlunit.ScriptException: ReferenceError: Assignment to undefined "regeneratorRuntime" in strict mode (http://192.168.1.209:62132/static/vis-timeline-graph2d.min.js#52(Function)#1){code} To resolve these problems, it's better to support a real headless browser for UI tests. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
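Since the proposal is to drive UI tests with a real headless browser instead of HtmlUnit, the following is a minimal sketch of what such a test could look like. It assumes Selenium's Java bindings and a ChromeDriver binary are available on the test classpath and that a Spark UI is already running on the default port; the object name and checks are illustrative, not the actual test suite.

{code:scala}
// Hypothetical sketch: exercising the Spark UI with headless Chrome instead of HtmlUnit.
import org.openqa.selenium.WebDriver
import org.openqa.selenium.chrome.{ChromeDriver, ChromeOptions}

object HeadlessChromeUICheck {
  def main(args: Array[String]): Unit = {
    val options = new ChromeOptions()
    options.addArguments("--headless")    // run Chrome without a visible window
    options.addArguments("--disable-gpu") // often needed in headless environments

    val driver: WebDriver = new ChromeDriver(options)
    try {
      // http://localhost:4040 is the default Spark UI address; adjust as needed.
      driver.get("http://localhost:4040/jobs/")
      // A real browser engine executes the page's JavaScript (jQuery, Bootstrap tooltips,
      // vis-timeline), so the test covers the same code paths as Chrome/Firefox/Safari.
      assert(driver.getTitle.contains("Spark"))
    } finally {
      driver.quit()
    }
  }
}
{code}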
[jira] [Resolved] (SPARK-31440) Improve SQL Rest API
[ https://issues.apache.org/jira/browse/SPARK-31440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-31440. Resolution: Fixed This issue is resolved in https://github.com/apache/spark/pull/28208 > Improve SQL Rest API > > > Key: SPARK-31440 > URL: https://issues.apache.org/jira/browse/SPARK-31440 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Eren Avsarogullari >Assignee: Eren Avsarogullari >Priority: Major > Attachments: current_version.json, improved_version.json, > improved_version_May1th.json > > > SQL Rest API exposes query execution metrics as Public API. This Jira aims to > apply following improvements on SQL Rest API by aligning Spark-UI. > *Proposed Improvements:* > 1- Support Physical Operations and group metrics per physical operation by > aligning Spark UI. > 2- Support *wholeStageCodegenId* for Physical Operations > 3- *nodeId* can be useful for grouping metrics and sorting physical > operations (according to execution order) to differentiate same operators (if > used multiple times during the same query execution) and their metrics. > 4- Filter *empty* metrics by aligning with Spark UI - SQL Tab. Currently, > Spark UI does not show empty metrics. > 5- Remove line breakers(*\n*) from *metricValue*. > 6- *planDescription* can be *optional* Http parameter to avoid network cost > where there is specially complex jobs creating big-plans. > 7- *metrics* attribute needs to be exposed at the bottom order as > *metricDetails*. Specially, this can be useful for the user where > *metricDetails* array size is high. > 8- Reverse order on *metricDetails* aims to match with Spark UI by supporting > Physical Operators' execution order. > *Attachments:* > Please find both *current* and *improved* versions of the results as > attached for following SQL Rest Endpoint: > {code:java} > curl -X GET > http://localhost:4040/api/v1/applications/$appId/sql/$executionId?details=true{code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31440) Improve SQL Rest API
[ https://issues.apache.org/jira/browse/SPARK-31440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang reassigned SPARK-31440: -- Assignee: Eren Avsarogullari > Improve SQL Rest API > > > Key: SPARK-31440 > URL: https://issues.apache.org/jira/browse/SPARK-31440 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Eren Avsarogullari >Assignee: Eren Avsarogullari >Priority: Major > Attachments: current_version.json, improved_version.json, > improved_version_May1th.json > > > SQL Rest API exposes query execution metrics as Public API. This Jira aims to > apply following improvements on SQL Rest API by aligning Spark-UI. > *Proposed Improvements:* > 1- Support Physical Operations and group metrics per physical operation by > aligning Spark UI. > 2- Support *wholeStageCodegenId* for Physical Operations > 3- *nodeId* can be useful for grouping metrics and sorting physical > operations (according to execution order) to differentiate same operators (if > used multiple times during the same query execution) and their metrics. > 4- Filter *empty* metrics by aligning with Spark UI - SQL Tab. Currently, > Spark UI does not show empty metrics. > 5- Remove line breakers(*\n*) from *metricValue*. > 6- *planDescription* can be *optional* Http parameter to avoid network cost > where there is specially complex jobs creating big-plans. > 7- *metrics* attribute needs to be exposed at the bottom order as > *metricDetails*. Specially, this can be useful for the user where > *metricDetails* array size is high. > 8- Reverse order on *metricDetails* aims to match with Spark UI by supporting > Physical Operators' execution order. > *Attachments:* > Please find both *current* and *improved* versions of the results as > attached for following SQL Rest Endpoint: > {code:java} > curl -X GET > http://localhost:4040/api/v1/applications/$appId/sql/$executionId?details=true{code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31399) Closure cleaner broken in Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-31399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110888#comment-17110888 ] Apache Spark commented on SPARK-31399: -- User 'rednaxelafx' has created a pull request for this issue: https://github.com/apache/spark/pull/28577 > Closure cleaner broken in Scala 2.12 > > > Key: SPARK-31399 > URL: https://issues.apache.org/jira/browse/SPARK-31399 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.5, 3.0.0 >Reporter: Wenchen Fan >Assignee: Kris Mok >Priority: Blocker > Fix For: 3.0.0 > > > The `ClosureCleaner` only support Scala functions and it uses the following > check to catch closures > {code} > // Check whether a class represents a Scala closure > private def isClosure(cls: Class[_]): Boolean = { > cls.getName.contains("$anonfun$") > } > {code} > This doesn't work in 3.0 any more as we upgrade to Scala 2.12 and most Scala > functions become Java lambdas. > As an example, the following code works well in Spark 2.4 Spark Shell: > {code} > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. > import org.apache.spark.sql.functions.lit > defined class Foo > col: org.apache.spark.sql.Column = 123 > df: org.apache.spark.rdd.RDD[Foo] = MapPartitionsRDD[5] at map at :20 > {code} > But fails in 3.0 > {code} > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. > org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:396) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:386) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2371) > at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:422) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:414) > at org.apache.spark.rdd.RDD.map(RDD.scala:421) > ... 
39 elided > Caused by: java.io.NotSerializableException: org.apache.spark.sql.Column > Serialization stack: > - object not serializable (class: org.apache.spark.sql.Column, value: > 123) > - field (class: $iw, name: col, type: class org.apache.spark.sql.Column) > - object (class $iw, $iw@2d87ac2b) > - element of array (index: 0) > - array (class [Ljava.lang.Object;, size 1) > - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, > type: class [Ljava.lang.Object;) > - object (class java.lang.invoke.SerializedLambda, > SerializedLambda[capturingClass=class $iw, > functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;, > implementation=invokeStatic > $anonfun$df$1$adapted:(L$iw;Ljava/lang/Object;)LFoo;, > instantiatedMethodType=(Ljava/lang/Object;)LFoo;, numCaptured=1]) > - writeReplace data (class: java.lang.invoke.SerializedLambda) > - object (class $Lambda$2438/170049100, $Lambda$2438/170049100@d6b8c43) > at > org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47) > at > org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101) > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:393) > ... 47 more > {code} > **Apache Spark 2.4.5 with Scala 2.12** > {code} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.4.5 > /_/ > Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242) > Type in expressions to have them evaluated. > Type :help for more information. > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. > org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:403) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.sc
[jira] [Commented] (SPARK-31399) Closure cleaner broken in Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-31399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110889#comment-17110889 ] Apache Spark commented on SPARK-31399: -- User 'rednaxelafx' has created a pull request for this issue: https://github.com/apache/spark/pull/28577 > Closure cleaner broken in Scala 2.12 > > > Key: SPARK-31399 > URL: https://issues.apache.org/jira/browse/SPARK-31399 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.5, 3.0.0 >Reporter: Wenchen Fan >Assignee: Kris Mok >Priority: Blocker > Fix For: 3.0.0 > > > The `ClosureCleaner` only support Scala functions and it uses the following > check to catch closures > {code} > // Check whether a class represents a Scala closure > private def isClosure(cls: Class[_]): Boolean = { > cls.getName.contains("$anonfun$") > } > {code} > This doesn't work in 3.0 any more as we upgrade to Scala 2.12 and most Scala > functions become Java lambdas. > As an example, the following code works well in Spark 2.4 Spark Shell: > {code} > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. > import org.apache.spark.sql.functions.lit > defined class Foo > col: org.apache.spark.sql.Column = 123 > df: org.apache.spark.rdd.RDD[Foo] = MapPartitionsRDD[5] at map at :20 > {code} > But fails in 3.0 > {code} > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. > org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:396) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:386) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2371) > at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:422) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:414) > at org.apache.spark.rdd.RDD.map(RDD.scala:421) > ... 
39 elided > Caused by: java.io.NotSerializableException: org.apache.spark.sql.Column > Serialization stack: > - object not serializable (class: org.apache.spark.sql.Column, value: > 123) > - field (class: $iw, name: col, type: class org.apache.spark.sql.Column) > - object (class $iw, $iw@2d87ac2b) > - element of array (index: 0) > - array (class [Ljava.lang.Object;, size 1) > - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, > type: class [Ljava.lang.Object;) > - object (class java.lang.invoke.SerializedLambda, > SerializedLambda[capturingClass=class $iw, > functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;, > implementation=invokeStatic > $anonfun$df$1$adapted:(L$iw;Ljava/lang/Object;)LFoo;, > instantiatedMethodType=(Ljava/lang/Object;)LFoo;, numCaptured=1]) > - writeReplace data (class: java.lang.invoke.SerializedLambda) > - object (class $Lambda$2438/170049100, $Lambda$2438/170049100@d6b8c43) > at > org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47) > at > org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101) > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:393) > ... 47 more > {code} > **Apache Spark 2.4.5 with Scala 2.12** > {code} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.4.5 > /_/ > Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242) > Type in expressions to have them evaluated. > Type :help for more information. > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. > org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:403) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.sc
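For background on why the name-based check quoted above stops working: on Scala 2.12 a closure usually compiles to a Java lambda whose class name contains "$Lambda$" rather than "$anonfun$", and whose serialized form is a java.lang.invoke.SerializedLambda. The sketch below shows one way to recognize that shape; it is an illustration of the idea, not the ClosureCleaner's actual fix.

{code:scala}
// Illustrative only: detect a closure compiled as a serializable Java lambda (Scala 2.12).
object LambdaCheck {
  def looksLikeJavaLambda(closure: AnyRef): Boolean = {
    val cls = closure.getClass
    cls.getName.contains("$Lambda$") && {
      try {
        // Serializable lambdas expose a writeReplace method returning SerializedLambda.
        val m = cls.getDeclaredMethod("writeReplace")
        m.setAccessible(true)
        m.invoke(closure).isInstanceOf[java.lang.invoke.SerializedLambda]
      } catch {
        case _: NoSuchMethodException => false
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val f: Int => Int = x => x + 1  // compiled via LambdaMetafactory on 2.12
    println(looksLikeJavaLambda(f)) // expected: true on Scala 2.12
  }
}
{code}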
[jira] [Assigned] (SPARK-31755) allow missing year/hour when parsing date/timestamp
[ https://issues.apache.org/jira/browse/SPARK-31755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31755: Assignee: Apache Spark (was: Wenchen Fan) > allow missing year/hour when parsing date/timestamp > --- > > Key: SPARK-31755 > URL: https://issues.apache.org/jira/browse/SPARK-31755 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31755) allow missing year/hour when parsing date/timestamp
[ https://issues.apache.org/jira/browse/SPARK-31755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31755: Assignee: Wenchen Fan (was: Apache Spark) > allow missing year/hour when parsing date/timestamp > --- > > Key: SPARK-31755 > URL: https://issues.apache.org/jira/browse/SPARK-31755 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31755) allow missing year/hour when parsing date/timestamp
[ https://issues.apache.org/jira/browse/SPARK-31755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110886#comment-17110886 ] Apache Spark commented on SPARK-31755: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/28576 > allow missing year/hour when parsing date/timestamp > --- > > Key: SPARK-31755 > URL: https://issues.apache.org/jira/browse/SPARK-31755 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31755) allow missing year/hour when parsing date/timestamp
Wenchen Fan created SPARK-31755: --- Summary: allow missing year/hour when parsing date/timestamp Key: SPARK-31755 URL: https://issues.apache.org/jira/browse/SPARK-31755 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
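The ticket carries no description, but the title suggests making the datetime parser tolerate patterns that omit fields such as the year or the hour, presumably by falling back to defaults instead of failing. The snippet below only illustrates the kind of query affected; the exact defaulting rules are defined by the linked pull request, not here.

{code:scala}
// Illustration of queries whose patterns omit fields (behaviour here is assumed, not confirmed).
import org.apache.spark.sql.SparkSession

object MissingFieldParsing {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("missing-fields").getOrCreate()

    // Pattern without a year: the missing year would presumably default (e.g. to 1970).
    spark.sql("SELECT to_date('05-18', 'MM-dd')").show()

    // Pattern without hour/minute/second: the missing time fields would presumably default to 0.
    spark.sql("SELECT to_timestamp('2020-05-18', 'yyyy-MM-dd')").show()

    spark.stop()
  }
}
{code}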
[jira] [Commented] (SPARK-30768) Constraints inferred from inequality attributes
[ https://issues.apache.org/jira/browse/SPARK-30768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110877#comment-17110877 ] Yuming Wang commented on SPARK-30768: - Teradata support this: https://docs.teradata.com/reader/Ws7YT1jvRK2vEr1LpVURug/V~FCwD9BL7gY4ac3WwHInw?section=xcg1472241575102__application_of_transitive_closure_section > Constraints inferred from inequality attributes > --- > > Key: SPARK-30768 > URL: https://issues.apache.org/jira/browse/SPARK-30768 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > How to reproduce: > {code:sql} > create table SPARK_30768_1(c1 int, c2 int); > create table SPARK_30768_2(c1 int, c2 int); > {code} > *Spark SQL*: > {noformat} > spark-sql> explain select t1.* from SPARK_30768_1 t1 join SPARK_30768_2 t2 on > (t1.c1 > t2.c1) where t1.c1 = 3; > == Physical Plan == > *(3) Project [c1#5, c2#6] > +- BroadcastNestedLoopJoin BuildRight, Inner, (c1#5 > c1#7) >:- *(1) Project [c1#5, c2#6] >: +- *(1) Filter (isnotnull(c1#5) AND (c1#5 = 3)) >: +- *(1) ColumnarToRow >:+- FileScan parquet default.spark_30768_1[c1#5,c2#6] Batched: > true, DataFilters: [isnotnull(c1#5), (c1#5 = 3)], Format: Parquet, Location: > InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous..., > PartitionFilters: [], PushedFilters: [IsNotNull(c1), EqualTo(c1,3)], > ReadSchema: struct >+- BroadcastExchange IdentityBroadcastMode, [id=#60] > +- *(2) Project [c1#7] > +- *(2) Filter isnotnull(c1#7) > +- *(2) ColumnarToRow >+- FileScan parquet default.spark_30768_2[c1#7] Batched: true, > DataFilters: [isnotnull(c1#7)], Format: Parquet, Location: > InMemoryFileIndex[file:/Users/yumwang/Downloads/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehous..., > PartitionFilters: [], PushedFilters: [IsNotNull(c1)], ReadSchema: > struct > {noformat} > *Hive* support this feature: > {noformat} > hive> explain select t1.* from SPARK_30768_1 t1 join SPARK_30768_2 t2 on > (t1.c1 > t2.c1) where t1.c1 = 3; > Warning: Map Join MAPJOIN[13][bigTable=?] 
in task 'Stage-3:MAPRED' is a cross > product > OK > STAGE DEPENDENCIES: > Stage-4 is a root stage > Stage-3 depends on stages: Stage-4 > Stage-0 depends on stages: Stage-3 > STAGE PLANS: > Stage: Stage-4 > Map Reduce Local Work > Alias -> Map Local Tables: > $hdt$_0:t1 > Fetch Operator > limit: -1 > Alias -> Map Local Operator Tree: > $hdt$_0:t1 > TableScan > alias: t1 > filterExpr: (c1 = 3) (type: boolean) > Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column > stats: NONE > Filter Operator > predicate: (c1 = 3) (type: boolean) > Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL > Column stats: NONE > Select Operator > expressions: c2 (type: int) > outputColumnNames: _col1 > Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL > Column stats: NONE > HashTable Sink Operator > keys: > 0 > 1 > Stage: Stage-3 > Map Reduce > Map Operator Tree: > TableScan > alias: t2 > filterExpr: (c1 < 3) (type: boolean) > Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL Column > stats: NONE > Filter Operator > predicate: (c1 < 3) (type: boolean) > Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL > Column stats: NONE > Select Operator > Statistics: Num rows: 1 Data size: 0 Basic stats: PARTIAL > Column stats: NONE > Map Join Operator > condition map: >Inner Join 0 to 1 > keys: > 0 > 1 > outputColumnNames: _col1 > Statistics: Num rows: 1 Data size: 1 Basic stats: PARTIAL > Column stats: NONE > Select Operator > expressions: 3 (type: int), _col1 (type: int) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 1 Data size: 1 Basic stats: PARTIAL > Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 1 Data size: 1 Basic stats: > PARTIAL Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.Sequenc
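The essence of the request above: from the filter t1.c1 = 3 and the join condition t1.c1 > t2.c1, the optimizer could infer t2.c1 < 3 and push it down to the scan of SPARK_30768_2, as the Hive plan does. The toy sketch below models that transitive inference over a made-up predicate type; it is not Catalyst's API.

{code:scala}
// Toy model: given a = k and a > b over the same attribute a, derive b < k.
object InequalityInference {
  sealed trait Pred
  case class EqualTo(attr: String, value: Int) extends Pred
  case class GreaterThan(left: String, right: String) extends Pred
  case class LessThan(attr: String, value: Int) extends Pred

  def infer(preds: Seq[Pred]): Seq[Pred] =
    for {
      EqualTo(a, k)     <- preds
      GreaterThan(l, r) <- preds
      if l == a
    } yield LessThan(r, k) // a = k and a > r implies r < k

  def main(args: Array[String]): Unit = {
    val conds = Seq(EqualTo("t1.c1", 3), GreaterThan("t1.c1", "t2.c1"))
    println(infer(conds)) // expected: List(LessThan(t2.c1,3))
  }
}
{code}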
[jira] [Commented] (SPARK-31754) Spark Structured Streaming: NullPointerException in Stream Stream join
[ https://issues.apache.org/jira/browse/SPARK-31754?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110873#comment-17110873 ] Jungtaek Lim commented on SPARK-31754: -- [~puviarasu] Given that the error comes from "generated code", you may want to turn on DEBUG logging for the class below, retrieve the generated code, and paste that code as well. org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator Btw, priorities higher than Major should go through a committer's decision. I'll lower the priority and wait for a committer's decision. (Personally it looks like an edge case, not meant to be a blocker.) > Spark Structured Streaming: NullPointerException in Stream Stream join > -- > > Key: SPARK-31754 > URL: https://issues.apache.org/jira/browse/SPARK-31754 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 > Environment: Spark Version : 2.4.0 > Hadoop Version : 3.0.0 >Reporter: Puviarasu >Priority: Major > Labels: structured-streaming > > When joining 2 streams with watermarking and windowing we are getting > NullPointer Exception after running for few minutes. > After failure we analyzed the checkpoint offsets/sources and found the files > for which the application failed. These files are not having any null values > in the join columns. > We even started the job with the files and the application ran. From this we > concluded that the exception is not because of the data from the streams. > *Code:* > > {code:java} > val optionsMap1 = Map[String, String]("Path" -> "/path/to/source1", > "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" > ->"false", "checkpointLocation" -> "/path/to/checkpoint1", "rowsPerSecond" -> > "1" ) > val optionsMap2 = Map[String, String]("Path" -> "/path/to/source2", > "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" > ->"false", "checkpointLocation" -> "/path/to/checkpoint2", "rowsPerSecond" -> > "1" ) > > spark.readStream.format("parquet").options(optionsMap1).load().createTempView("source1") > > spark.readStream.format("parquet").options(optionsMap2).load().createTempView("source2") > spark.sql("select * from source1 where eventTime1 is not null and col1 is > not null").withWatermark("eventTime1", "30 > minutes").createTempView("viewNotNull1") > spark.sql("select * from source2 where eventTime2 is not null and col2 is > not null").withWatermark("eventTime2", "30 > minutes").createTempView("viewNotNull2") > spark.sql("select * from viewNotNull1 a join viewNotNull2 b on a.col1 = > b.col2 and a.eventTime1 >= b.eventTime2 and a.eventTime1 <= b.eventTime2 + > interval 2 hours").createTempView("join") > val optionsMap3 = Map[String, String]("compression" -> "snappy","path" -> > "/path/to/sink", "checkpointLocation" -> "/path/to/checkpoint3") > spark.sql("select * from > join").writeStream.outputMode("append").trigger(Trigger.ProcessingTime("5 > seconds")).format("parquet").options(optionsMap3).start() > {code} > > *Exception:* > > {code:java} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Aborting TaskSet 4.0 because task 0 (partition 0) > cannot run anywhere due to node and executor blacklist. 
> Most recent failure: > Lost task 0.2 in stage 4.0 (TID 6, executor 3): java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown > Source) > at > org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$OneSideHashJoiner$$anonfun$26.apply(StreamingSymmetricHashJoinExec.scala:412) > at > org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$OneSideHashJoiner$$anonfun$26.apply(StreamingSymmetricHashJoinExec.scala:412) > at > org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.findNextValueForIndex(SymmetricHashJoinStateManager.scala:197) > at > org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.getNext(SymmetricHashJoinStateManager.scala:221) > at > org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.getNext(SymmetricHashJoinStateManager.scala:157) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:212) > at > org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$onOutputCompletion$1$1.apply$mcV$spala:338) > at > org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$org$apache$spark$sql$
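To act on the suggestion above and capture the generated SpecificPredicate source, one hedged option in a Spark 2.4 application (which ships log4j 1.x) is to raise the log level for just that class programmatically; adding the equivalent logger line to conf/log4j.properties works as well.

{code:scala}
// Sketch: enable DEBUG only for the codegen class named in the comment above, so the
// generated code is written to the logs for inspection.
import org.apache.log4j.{Level, Logger}

object EnableCodegenDebugLog {
  def main(args: Array[String]): Unit = {
    Logger
      .getLogger("org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator")
      .setLevel(Level.DEBUG)
    // ...then start the streaming query; the generated predicate code appears at DEBUG level.
  }
}
{code}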
[jira] [Updated] (SPARK-31754) Spark Structured Streaming: NullPointerException in Stream Stream join
[ https://issues.apache.org/jira/browse/SPARK-31754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-31754: - Priority: Major (was: Blocker) > Spark Structured Streaming: NullPointerException in Stream Stream join > -- > > Key: SPARK-31754 > URL: https://issues.apache.org/jira/browse/SPARK-31754 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 > Environment: Spark Version : 2.4.0 > Hadoop Version : 3.0.0 >Reporter: Puviarasu >Priority: Major > Labels: structured-streaming > > When joining 2 streams with watermarking and windowing we are getting > NullPointer Exception after running for few minutes. > After failure we analyzed the checkpoint offsets/sources and found the files > for which the application failed. These files are not having any null values > in the join columns. > We even started the job with the files and the application ran. From this we > concluded that the exception is not because of the data from the streams. > *Code:* > > {code:java} > val optionsMap1 = Map[String, String]("Path" -> "/path/to/source1", > "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" > ->"false", "checkpointLocation" -> "/path/to/checkpoint1", "rowsPerSecond" -> > "1" ) > val optionsMap2 = Map[String, String]("Path" -> "/path/to/source2", > "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" > ->"false", "checkpointLocation" -> "/path/to/checkpoint2", "rowsPerSecond" -> > "1" ) > > spark.readStream.format("parquet").options(optionsMap1).load().createTempView("source1") > > spark.readStream.format("parquet").options(optionsMap2).load().createTempView("source2") > spark.sql("select * from source1 where eventTime1 is not null and col1 is > not null").withWatermark("eventTime1", "30 > minutes").createTempView("viewNotNull1") > spark.sql("select * from source2 where eventTime2 is not null and col2 is > not null").withWatermark("eventTime2", "30 > minutes").createTempView("viewNotNull2") > spark.sql("select * from viewNotNull1 a join viewNotNull2 b on a.col1 = > b.col2 and a.eventTime1 >= b.eventTime2 and a.eventTime1 <= b.eventTime2 + > interval 2 hours").createTempView("join") > val optionsMap3 = Map[String, String]("compression" -> "snappy","path" -> > "/path/to/sink", "checkpointLocation" -> "/path/to/checkpoint3") > spark.sql("select * from > join").writeStream.outputMode("append").trigger(Trigger.ProcessingTime("5 > seconds")).format("parquet").options(optionsMap3).start() > {code} > > *Exception:* > > {code:java} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Aborting TaskSet 4.0 because task 0 (partition 0) > cannot run anywhere due to node and executor blacklist. 
> Most recent failure: > Lost task 0.2 in stage 4.0 (TID 6, executor 3): java.lang.NullPointerException > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown > Source) > at > org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$OneSideHashJoiner$$anonfun$26.apply(StreamingSymmetricHashJoinExec.scala:412) > at > org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$OneSideHashJoiner$$anonfun$26.apply(StreamingSymmetricHashJoinExec.scala:412) > at > org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.findNextValueForIndex(SymmetricHashJoinStateManager.scala:197) > at > org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.getNext(SymmetricHashJoinStateManager.scala:221) > at > org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.getNext(SymmetricHashJoinStateManager.scala:157) > at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) > at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:212) > at > org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$onOutputCompletion$1$1.apply$mcV$spala:338) > at > org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$onOutputCompletion$1$1.apply(Stream) > at > org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$onOutputCompletion$1$1.apply(Stream) > at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:583) > at > org.apache.spark.sql.execution.streaming.StateStoreWriter$class.timeTakenMs(sta
[jira] [Updated] (SPARK-31706) add back the support of streaming update mode
[ https://issues.apache.org/jira/browse/SPARK-31706?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-31706: Priority: Blocker (was: Major) > add back the support of streaming update mode > - > > Key: SPARK-31706 > URL: https://issues.apache.org/jira/browse/SPARK-31706 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Blocker > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31754) Spark Structured Streaming: NullPointerException in Stream Stream join
[ https://issues.apache.org/jira/browse/SPARK-31754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Puviarasu updated SPARK-31754: -- Description: When joining 2 streams with watermarking and windowing, we are getting a NullPointerException after running for a few minutes. After the failure we analyzed the checkpoint offsets/sources and found the files for which the application failed. These files do not have any null values in the join columns. We even restarted the job with those files and the application ran. From this we concluded that the exception is not caused by the data from the streams. *Code:* {code:java} val optionsMap1 = Map[String, String]("Path" -> "/path/to/source1", "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" ->"false", "checkpointLocation" -> "/path/to/checkpoint1", "rowsPerSecond" -> "1" ) val optionsMap2 = Map[String, String]("Path" -> "/path/to/source2", "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" ->"false", "checkpointLocation" -> "/path/to/checkpoint2", "rowsPerSecond" -> "1" ) spark.readStream.format("parquet").options(optionsMap1).load().createTempView("source1") spark.readStream.format("parquet").options(optionsMap2).load().createTempView("source2") spark.sql("select * from source1 where eventTime1 is not null and col1 is not null").withWatermark("eventTime1", "30 minutes").createTempView("viewNotNull1") spark.sql("select * from source2 where eventTime2 is not null and col2 is not null").withWatermark("eventTime2", "30 minutes").createTempView("viewNotNull2") spark.sql("select * from viewNotNull1 a join viewNotNull2 b on a.col1 = b.col2 and a.eventTime1 >= b.eventTime2 and a.eventTime1 <= b.eventTime2 + interval 2 hours").createTempView("join") val optionsMap3 = Map[String, String]("compression" -> "snappy","path" -> "/path/to/sink", "checkpointLocation" -> "/path/to/checkpoint3") spark.sql("select * from join").writeStream.outputMode("append").trigger(Trigger.ProcessingTime("5 seconds")).format("parquet").options(optionsMap3).start() {code} *Exception:* {code:java} Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Aborting TaskSet 4.0 because task 0 (partition 0) cannot run anywhere due to node and executor blacklist. 
Most recent failure: Lost task 0.2 in stage 4.0 (TID 6, executor 3): java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source) at org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$OneSideHashJoiner$$anonfun$26.apply(StreamingSymmetricHashJoinExec.scala:412) at org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$OneSideHashJoiner$$anonfun$26.apply(StreamingSymmetricHashJoinExec.scala:412) at org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.findNextValueForIndex(SymmetricHashJoinStateManager.scala:197) at org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.getNext(SymmetricHashJoinStateManager.scala:221) at org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.getNext(SymmetricHashJoinStateManager.scala:157) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:212) at org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$onOutputCompletion$1$1.apply$mcV$spala:338) at org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$onOutputCompletion$1$1.apply(Stream) at org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$onOutputCompletion$1$1.apply(Stream) at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:583) at org.apache.spark.sql.execution.streaming.StateStoreWriter$class.timeTakenMs(statefulOperators.scala:108) at org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec.timeTakenMs(StreamingSymmetricHashJoinExec.scala:126) at org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec.org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$onOutputCompletion$1(StreamingSymmetricHashJ at org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$processPartitions$1.apply$mcV$sp(St:361) at org.apache.spark.util.CompletionIterator$$anon$1.completion(CompletionIterator.scala:44) at org.apache.spark.util.CompletionIterat
[jira] [Created] (SPARK-31754) Spark Structured Streaming: NullPointerException in Stream Stream join
Puviarasu created SPARK-31754: - Summary: Spark Structured Streaming: NullPointerException in Stream Stream join Key: SPARK-31754 URL: https://issues.apache.org/jira/browse/SPARK-31754 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.4.0 Environment: Spark Version : 2.4.0 Hadoop Version : 3.0.0 Reporter: Puviarasu When joining 2 streams with watermarking and windowing, we are getting a NullPointerException after running for a few minutes. After the failure we analyzed the checkpoint offsets/sources and found the files for which the application failed. These files do not have any null values in the join columns. We even restarted the job with those files and the application ran. From this we concluded that the exception is not caused by the data from the streams. *Code:* {code:java} val optionsMap1 = Map[String, String]("Path" -> "/path/to/source1", "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" ->"false", "checkpointLocation" -> "/path/to/checkpoint1", "rowsPerSecond" -> "1" ) val optionsMap2 = Map[String, String]("Path" -> "/path/to/source2", "maxFilesPerTrigger" -> "1", "latestFirst" -> "false", "fileNameOnly" ->"false", "checkpointLocation" -> "/path/to/checkpoint2", "rowsPerSecond" -> "1" ) spark.readStream.format("parquet").options(optionsMap1).load().createTempView("source1") spark.readStream.format("parquet").options(optionsMap2).load().createTempView("source2") spark.sql("select * from source1 where eventTime1 is not null and col1 is not null").withWatermark("eventTime1", "30 minutes").createTempView("viewNotNull1") spark.sql("select * from source2 where eventTime2 is not null and col2 is not null").withWatermark("eventTime2", "30 minutes").createTempView("viewNotNull2") spark.sql("select * from viewNotNull1 a join viewNotNull2 b on a.col1 = b.col2 and a.eventTime1 >= b.eventTime2 and a.eventTime1 <= b.eventTime2 + interval 2 hours").createTempView("join") val optionsMap3 = Map[String, String]("compression" -> "snappy","path" -> "/path/to/sink", "checkpointLocation" -> "/path/to/checkpoint3") spark.sql("select * from join").writeStream.outputMode("append").trigger(Trigger.ProcessingTime("5 seconds")).format("parquet").options(optionsMap3).start() {code} *Exception:* {code:java} Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Aborting TaskSet 4.0 because task 0 (partition 0) cannot run anywhere due to node and executor blacklist. 
Most recent failure: Lost task 0.2 in stage 4.0 (TID 6, executor 3): java.lang.NullPointerException at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source) at org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$OneSideHashJoiner$$anonfun$26.apply(StreamingSymmetricHashJoinExec.scala:412) at org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$OneSideHashJoiner$$anonfun$26.apply(StreamingSymmetricHashJoinExec.scala:412) at org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.findNextValueForIndex(SymmetricHashJoinStateManager.scala:197) at org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.getNext(SymmetricHashJoinStateManager.scala:221) at org.apache.spark.sql.execution.streaming.state.SymmetricHashJoinStateManager$$anon$2.getNext(SymmetricHashJoinStateManager.scala:157) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73) at scala.collection.Iterator$JoinIterator.hasNext(Iterator.scala:212) at org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$onOutputCompletion$1$1.apply$mcV$spala:338) at org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$onOutputCompletion$1$1.apply(Stream) at org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec$$anonfun$org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$onOutputCompletion$1$1.apply(Stream) at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:583) at org.apache.spark.sql.execution.streaming.StateStoreWriter$class.timeTakenMs(statefulOperators.scala:108) at org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec.timeTakenMs(StreamingSymmetricHashJoinExec.scala:126) at org.apache.spark.sql.execution.streaming.StreamingSymmetricHashJoinExec.org$apache$spark$sql$execution$streaming$StreamingSymmetricHashJoinExec$$onOutputCompletion$1(StreamingSymmetricHashJ at org.apache.spark.sql.execution.streaming.StreamingSymmetricHa
[jira] [Commented] (SPARK-31705) Rewrite join condition to conjunctive normal form
[ https://issues.apache.org/jira/browse/SPARK-31705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110836#comment-17110836 ] Apache Spark commented on SPARK-31705: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/28575 > Rewrite join condition to conjunctive normal form > - > > Key: SPARK-31705 > URL: https://issues.apache.org/jira/browse/SPARK-31705 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > Rewrite join condition to [conjunctive normal > form|https://en.wikipedia.org/wiki/Conjunctive_normal_form] to push more > conditions to filter. > PostgreSQL: > {code:sql} > CREATE TABLE lineitem (l_orderkey BIGINT, l_partkey BIGINT, l_suppkey BIGINT, > > l_linenumber INT, l_quantity DECIMAL(10,0), l_extendedprice DECIMAL(10,0), > > l_discount DECIMAL(10,0), l_tax DECIMAL(10,0), l_returnflag varchar(255), > > l_linestatus varchar(255), l_shipdate DATE, l_commitdate DATE, l_receiptdate > DATE, > l_shipinstruct varchar(255), l_shipmode varchar(255), l_comment varchar(255)); > > CREATE TABLE orders ( > o_orderkey BIGINT, o_custkey BIGINT, o_orderstatus varchar(255), > o_totalprice DECIMAL(10,0), o_orderdate DATE, o_orderpriority varchar(255), > o_clerk varchar(255), o_shippriority INT, o_comment varchar(255)); > EXPLAIN > SELECT Count(*) > FROM lineitem, >orders > WHERE l_orderkey = o_orderkey >AND ( ( l_suppkey > 3 >AND o_custkey > 13 ) > OR ( l_suppkey > 1 >AND o_custkey > 11 ) ) >AND l_partkey > 19; > EXPLAIN > SELECT Count(*) > FROM lineitem >JOIN orders > ON l_orderkey = o_orderkey > AND ( ( l_suppkey > 3 > AND o_custkey > 13 ) >OR ( l_suppkey > 1 > AND o_custkey > 11 ) ) > AND l_partkey > 19; > EXPLAIN > SELECT Count(*) > FROM lineitem, >orders > WHERE l_orderkey = o_orderkey >AND NOT ( ( l_suppkey > 3 >AND ( l_suppkey > 2 > OR o_custkey > 13 ) ) > OR ( l_suppkey > 1 >AND o_custkey > 11 ) ) >AND l_partkey > 19; > {code} > {noformat} > postgres=# EXPLAIN > postgres-# SELECT Count(*) > postgres-# FROM lineitem, > postgres-#orders > postgres-# WHERE l_orderkey = o_orderkey > postgres-#AND ( ( l_suppkey > 3 > postgres(#AND o_custkey > 13 ) > postgres(# OR ( l_suppkey > 1 > postgres(#AND o_custkey > 11 ) ) > postgres-#AND l_partkey > 19; >QUERY PLAN > - > Aggregate (cost=21.18..21.19 rows=1 width=8) >-> Hash Join (cost=10.60..21.17 rows=2 width=0) > Hash Cond: (orders.o_orderkey = lineitem.l_orderkey) > Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) > OR ((lineitem.l_suppkey > 1) AND (orders.o_custkey > 11))) > -> Seq Scan on orders (cost=0.00..10.45 rows=17 width=16) >Filter: ((o_custkey > 13) OR (o_custkey > 11)) > -> Hash (cost=10.53..10.53 rows=6 width=16) >-> Seq Scan on lineitem (cost=0.00..10.53 rows=6 width=16) > Filter: ((l_partkey > 19) AND ((l_suppkey > 3) OR > (l_suppkey > 1))) > (9 rows) > postgres=# EXPLAIN > postgres-# SELECT Count(*) > postgres-# FROM lineitem > postgres-#JOIN orders > postgres-# ON l_orderkey = o_orderkey > postgres-# AND ( ( l_suppkey > 3 > postgres(# AND o_custkey > 13 ) > postgres(#OR ( l_suppkey > 1 > postgres(# AND o_custkey > 11 ) ) > postgres-# AND l_partkey > 19; >QUERY PLAN > - > Aggregate (cost=21.18..21.19 rows=1 width=8) >-> Hash Join (cost=10.60..21.17 rows=2 width=0) > Hash Cond: (orders.o_orderkey = lineitem.l_orderkey) > Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) > OR ((lineitem.l_suppkey > 1) AND 
(orders.o_custkey > 11))) > -> Seq Scan on orders (cost=0.00..10.45 rows=17 width=16) >Filter: ((o_custk
[jira] [Assigned] (SPARK-31705) Rewrite join condition to conjunctive normal form
[ https://issues.apache.org/jira/browse/SPARK-31705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31705: Assignee: Apache Spark (was: Yuming Wang) > Rewrite join condition to conjunctive normal form > - > > Key: SPARK-31705 > URL: https://issues.apache.org/jira/browse/SPARK-31705 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Apache Spark >Priority: Major > > Rewrite join condition to [conjunctive normal > form|https://en.wikipedia.org/wiki/Conjunctive_normal_form] to push more > conditions to filter. > PostgreSQL: > {code:sql} > CREATE TABLE lineitem (l_orderkey BIGINT, l_partkey BIGINT, l_suppkey BIGINT, > > l_linenumber INT, l_quantity DECIMAL(10,0), l_extendedprice DECIMAL(10,0), > > l_discount DECIMAL(10,0), l_tax DECIMAL(10,0), l_returnflag varchar(255), > > l_linestatus varchar(255), l_shipdate DATE, l_commitdate DATE, l_receiptdate > DATE, > l_shipinstruct varchar(255), l_shipmode varchar(255), l_comment varchar(255)); > > CREATE TABLE orders ( > o_orderkey BIGINT, o_custkey BIGINT, o_orderstatus varchar(255), > o_totalprice DECIMAL(10,0), o_orderdate DATE, o_orderpriority varchar(255), > o_clerk varchar(255), o_shippriority INT, o_comment varchar(255)); > EXPLAIN > SELECT Count(*) > FROM lineitem, >orders > WHERE l_orderkey = o_orderkey >AND ( ( l_suppkey > 3 >AND o_custkey > 13 ) > OR ( l_suppkey > 1 >AND o_custkey > 11 ) ) >AND l_partkey > 19; > EXPLAIN > SELECT Count(*) > FROM lineitem >JOIN orders > ON l_orderkey = o_orderkey > AND ( ( l_suppkey > 3 > AND o_custkey > 13 ) >OR ( l_suppkey > 1 > AND o_custkey > 11 ) ) > AND l_partkey > 19; > EXPLAIN > SELECT Count(*) > FROM lineitem, >orders > WHERE l_orderkey = o_orderkey >AND NOT ( ( l_suppkey > 3 >AND ( l_suppkey > 2 > OR o_custkey > 13 ) ) > OR ( l_suppkey > 1 >AND o_custkey > 11 ) ) >AND l_partkey > 19; > {code} > {noformat} > postgres=# EXPLAIN > postgres-# SELECT Count(*) > postgres-# FROM lineitem, > postgres-#orders > postgres-# WHERE l_orderkey = o_orderkey > postgres-#AND ( ( l_suppkey > 3 > postgres(#AND o_custkey > 13 ) > postgres(# OR ( l_suppkey > 1 > postgres(#AND o_custkey > 11 ) ) > postgres-#AND l_partkey > 19; >QUERY PLAN > - > Aggregate (cost=21.18..21.19 rows=1 width=8) >-> Hash Join (cost=10.60..21.17 rows=2 width=0) > Hash Cond: (orders.o_orderkey = lineitem.l_orderkey) > Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) > OR ((lineitem.l_suppkey > 1) AND (orders.o_custkey > 11))) > -> Seq Scan on orders (cost=0.00..10.45 rows=17 width=16) >Filter: ((o_custkey > 13) OR (o_custkey > 11)) > -> Hash (cost=10.53..10.53 rows=6 width=16) >-> Seq Scan on lineitem (cost=0.00..10.53 rows=6 width=16) > Filter: ((l_partkey > 19) AND ((l_suppkey > 3) OR > (l_suppkey > 1))) > (9 rows) > postgres=# EXPLAIN > postgres-# SELECT Count(*) > postgres-# FROM lineitem > postgres-#JOIN orders > postgres-# ON l_orderkey = o_orderkey > postgres-# AND ( ( l_suppkey > 3 > postgres(# AND o_custkey > 13 ) > postgres(#OR ( l_suppkey > 1 > postgres(# AND o_custkey > 11 ) ) > postgres-# AND l_partkey > 19; >QUERY PLAN > - > Aggregate (cost=21.18..21.19 rows=1 width=8) >-> Hash Join (cost=10.60..21.17 rows=2 width=0) > Hash Cond: (orders.o_orderkey = lineitem.l_orderkey) > Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) > OR ((lineitem.l_suppkey > 1) AND (orders.o_custkey > 11))) > -> Seq Scan on orders (cost=0.00..10.45 rows=17 width=16) >Filter: ((o_custkey > 13) OR 
(o_custkey > 11)) > -> Hash (cost=10.53..10.53 rows=6 width=16) >
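For readers unfamiliar with the rewrite the ticket proposes, the sketch below is a minimal, self-contained illustration of conjunctive-normal-form conversion. It uses a toy predicate ADT, not Spark's Catalyst expressions, and only shows why distributing OR over AND yields single-table conjuncts such as (l_suppkey > 3 OR l_suppkey > 1) that an optimizer can push below the join, matching the Seq Scan filters in the PostgreSQL plans quoted above.

{code:scala}
// Toy predicate tree (illustration only, not Spark's implementation).
sealed trait Pred
case class Leaf(s: String) extends Pred
case class And(l: Pred, r: Pred) extends Pred
case class Or(l: Pred, r: Pred) extends Pred

// Convert to CNF by distributing OR over AND.
def toCnf(p: Pred): Pred = p match {
  case And(l, r) => And(toCnf(l), toCnf(r))
  case Or(l, r) => (toCnf(l), toCnf(r)) match {
    // (a AND b) OR c  =>  (a OR c) AND (b OR c)
    case (And(a, b), c) => And(toCnf(Or(a, c)), toCnf(Or(b, c)))
    case (a, And(b, c)) => And(toCnf(Or(a, b)), toCnf(Or(a, c)))
    case (a, b)         => Or(a, b)
  }
  case leaf => leaf
}

// (l_suppkey > 3 AND o_custkey > 13) OR (l_suppkey > 1 AND o_custkey > 11)
val cond = Or(
  And(Leaf("l_suppkey > 3"), Leaf("o_custkey > 13")),
  And(Leaf("l_suppkey > 1"), Leaf("o_custkey > 11")))

// The CNF result contains the conjuncts (l_suppkey > 3 OR l_suppkey > 1) and
// (o_custkey > 13 OR o_custkey > 11); each references a single table, which is
// exactly what PostgreSQL pushes into the per-table scan filters above.
println(toCnf(cond))
{code}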
[jira] [Commented] (SPARK-31705) Rewrite join condition to conjunctive normal form
[ https://issues.apache.org/jira/browse/SPARK-31705?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110835#comment-17110835 ] Apache Spark commented on SPARK-31705: -- User 'wangyum' has created a pull request for this issue: https://github.com/apache/spark/pull/28575 > Rewrite join condition to conjunctive normal form > - > > Key: SPARK-31705 > URL: https://issues.apache.org/jira/browse/SPARK-31705 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > Rewrite join condition to [conjunctive normal > form|https://en.wikipedia.org/wiki/Conjunctive_normal_form] to push more > conditions to filter. > PostgreSQL: > {code:sql} > CREATE TABLE lineitem (l_orderkey BIGINT, l_partkey BIGINT, l_suppkey BIGINT, > > l_linenumber INT, l_quantity DECIMAL(10,0), l_extendedprice DECIMAL(10,0), > > l_discount DECIMAL(10,0), l_tax DECIMAL(10,0), l_returnflag varchar(255), > > l_linestatus varchar(255), l_shipdate DATE, l_commitdate DATE, l_receiptdate > DATE, > l_shipinstruct varchar(255), l_shipmode varchar(255), l_comment varchar(255)); > > CREATE TABLE orders ( > o_orderkey BIGINT, o_custkey BIGINT, o_orderstatus varchar(255), > o_totalprice DECIMAL(10,0), o_orderdate DATE, o_orderpriority varchar(255), > o_clerk varchar(255), o_shippriority INT, o_comment varchar(255)); > EXPLAIN > SELECT Count(*) > FROM lineitem, >orders > WHERE l_orderkey = o_orderkey >AND ( ( l_suppkey > 3 >AND o_custkey > 13 ) > OR ( l_suppkey > 1 >AND o_custkey > 11 ) ) >AND l_partkey > 19; > EXPLAIN > SELECT Count(*) > FROM lineitem >JOIN orders > ON l_orderkey = o_orderkey > AND ( ( l_suppkey > 3 > AND o_custkey > 13 ) >OR ( l_suppkey > 1 > AND o_custkey > 11 ) ) > AND l_partkey > 19; > EXPLAIN > SELECT Count(*) > FROM lineitem, >orders > WHERE l_orderkey = o_orderkey >AND NOT ( ( l_suppkey > 3 >AND ( l_suppkey > 2 > OR o_custkey > 13 ) ) > OR ( l_suppkey > 1 >AND o_custkey > 11 ) ) >AND l_partkey > 19; > {code} > {noformat} > postgres=# EXPLAIN > postgres-# SELECT Count(*) > postgres-# FROM lineitem, > postgres-#orders > postgres-# WHERE l_orderkey = o_orderkey > postgres-#AND ( ( l_suppkey > 3 > postgres(#AND o_custkey > 13 ) > postgres(# OR ( l_suppkey > 1 > postgres(#AND o_custkey > 11 ) ) > postgres-#AND l_partkey > 19; >QUERY PLAN > - > Aggregate (cost=21.18..21.19 rows=1 width=8) >-> Hash Join (cost=10.60..21.17 rows=2 width=0) > Hash Cond: (orders.o_orderkey = lineitem.l_orderkey) > Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) > OR ((lineitem.l_suppkey > 1) AND (orders.o_custkey > 11))) > -> Seq Scan on orders (cost=0.00..10.45 rows=17 width=16) >Filter: ((o_custkey > 13) OR (o_custkey > 11)) > -> Hash (cost=10.53..10.53 rows=6 width=16) >-> Seq Scan on lineitem (cost=0.00..10.53 rows=6 width=16) > Filter: ((l_partkey > 19) AND ((l_suppkey > 3) OR > (l_suppkey > 1))) > (9 rows) > postgres=# EXPLAIN > postgres-# SELECT Count(*) > postgres-# FROM lineitem > postgres-#JOIN orders > postgres-# ON l_orderkey = o_orderkey > postgres-# AND ( ( l_suppkey > 3 > postgres(# AND o_custkey > 13 ) > postgres(#OR ( l_suppkey > 1 > postgres(# AND o_custkey > 11 ) ) > postgres-# AND l_partkey > 19; >QUERY PLAN > - > Aggregate (cost=21.18..21.19 rows=1 width=8) >-> Hash Join (cost=10.60..21.17 rows=2 width=0) > Hash Cond: (orders.o_orderkey = lineitem.l_orderkey) > Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) > OR ((lineitem.l_suppkey > 1) AND 
(orders.o_custkey > 11))) > -> Seq Scan on orders (cost=0.00..10.45 rows=17 width=16) >Filter: ((o_custk
[jira] [Assigned] (SPARK-31705) Rewrite join condition to conjunctive normal form
[ https://issues.apache.org/jira/browse/SPARK-31705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31705: Assignee: Yuming Wang (was: Apache Spark) > Rewrite join condition to conjunctive normal form > - > > Key: SPARK-31705 > URL: https://issues.apache.org/jira/browse/SPARK-31705 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Yuming Wang >Assignee: Yuming Wang >Priority: Major > > Rewrite join condition to [conjunctive normal > form|https://en.wikipedia.org/wiki/Conjunctive_normal_form] to push more > conditions to filter. > PostgreSQL: > {code:sql} > CREATE TABLE lineitem (l_orderkey BIGINT, l_partkey BIGINT, l_suppkey BIGINT, > > l_linenumber INT, l_quantity DECIMAL(10,0), l_extendedprice DECIMAL(10,0), > > l_discount DECIMAL(10,0), l_tax DECIMAL(10,0), l_returnflag varchar(255), > > l_linestatus varchar(255), l_shipdate DATE, l_commitdate DATE, l_receiptdate > DATE, > l_shipinstruct varchar(255), l_shipmode varchar(255), l_comment varchar(255)); > > CREATE TABLE orders ( > o_orderkey BIGINT, o_custkey BIGINT, o_orderstatus varchar(255), > o_totalprice DECIMAL(10,0), o_orderdate DATE, o_orderpriority varchar(255), > o_clerk varchar(255), o_shippriority INT, o_comment varchar(255)); > EXPLAIN > SELECT Count(*) > FROM lineitem, >orders > WHERE l_orderkey = o_orderkey >AND ( ( l_suppkey > 3 >AND o_custkey > 13 ) > OR ( l_suppkey > 1 >AND o_custkey > 11 ) ) >AND l_partkey > 19; > EXPLAIN > SELECT Count(*) > FROM lineitem >JOIN orders > ON l_orderkey = o_orderkey > AND ( ( l_suppkey > 3 > AND o_custkey > 13 ) >OR ( l_suppkey > 1 > AND o_custkey > 11 ) ) > AND l_partkey > 19; > EXPLAIN > SELECT Count(*) > FROM lineitem, >orders > WHERE l_orderkey = o_orderkey >AND NOT ( ( l_suppkey > 3 >AND ( l_suppkey > 2 > OR o_custkey > 13 ) ) > OR ( l_suppkey > 1 >AND o_custkey > 11 ) ) >AND l_partkey > 19; > {code} > {noformat} > postgres=# EXPLAIN > postgres-# SELECT Count(*) > postgres-# FROM lineitem, > postgres-#orders > postgres-# WHERE l_orderkey = o_orderkey > postgres-#AND ( ( l_suppkey > 3 > postgres(#AND o_custkey > 13 ) > postgres(# OR ( l_suppkey > 1 > postgres(#AND o_custkey > 11 ) ) > postgres-#AND l_partkey > 19; >QUERY PLAN > - > Aggregate (cost=21.18..21.19 rows=1 width=8) >-> Hash Join (cost=10.60..21.17 rows=2 width=0) > Hash Cond: (orders.o_orderkey = lineitem.l_orderkey) > Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) > OR ((lineitem.l_suppkey > 1) AND (orders.o_custkey > 11))) > -> Seq Scan on orders (cost=0.00..10.45 rows=17 width=16) >Filter: ((o_custkey > 13) OR (o_custkey > 11)) > -> Hash (cost=10.53..10.53 rows=6 width=16) >-> Seq Scan on lineitem (cost=0.00..10.53 rows=6 width=16) > Filter: ((l_partkey > 19) AND ((l_suppkey > 3) OR > (l_suppkey > 1))) > (9 rows) > postgres=# EXPLAIN > postgres-# SELECT Count(*) > postgres-# FROM lineitem > postgres-#JOIN orders > postgres-# ON l_orderkey = o_orderkey > postgres-# AND ( ( l_suppkey > 3 > postgres(# AND o_custkey > 13 ) > postgres(#OR ( l_suppkey > 1 > postgres(# AND o_custkey > 11 ) ) > postgres-# AND l_partkey > 19; >QUERY PLAN > - > Aggregate (cost=21.18..21.19 rows=1 width=8) >-> Hash Join (cost=10.60..21.17 rows=2 width=0) > Hash Cond: (orders.o_orderkey = lineitem.l_orderkey) > Join Filter: (((lineitem.l_suppkey > 3) AND (orders.o_custkey > 13)) > OR ((lineitem.l_suppkey > 1) AND (orders.o_custkey > 11))) > -> Seq Scan on orders (cost=0.00..10.45 rows=17 width=16) >Filter: ((o_custkey > 13) OR 
(o_custkey > 11)) > -> Hash (cost=10.53..10.53 rows=6 width=16) >-
[jira] [Commented] (SPARK-31752) Add sql doc for interval type
[ https://issues.apache.org/jira/browse/SPARK-31752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110813#comment-17110813 ] Apache Spark commented on SPARK-31752: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/28574 > Add sql doc for interval type > - > > Key: SPARK-31752 > URL: https://issues.apache.org/jira/browse/SPARK-31752 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31752) Add sql doc for interval type
[ https://issues.apache.org/jira/browse/SPARK-31752?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110812#comment-17110812 ] Apache Spark commented on SPARK-31752: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/28574 > Add sql doc for interval type > - > > Key: SPARK-31752 > URL: https://issues.apache.org/jira/browse/SPARK-31752 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31752) Add sql doc for interval type
[ https://issues.apache.org/jira/browse/SPARK-31752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31752: Assignee: Apache Spark > Add sql doc for interval type > - > > Key: SPARK-31752 > URL: https://issues.apache.org/jira/browse/SPARK-31752 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31752) Add sql doc for interval type
[ https://issues.apache.org/jira/browse/SPARK-31752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31752: Assignee: (was: Apache Spark) > Add sql doc for interval type > - > > Key: SPARK-31752 > URL: https://issues.apache.org/jira/browse/SPARK-31752 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29458) Document scalar functions usage in APIs in SQL getting started.
[ https://issues.apache.org/jira/browse/SPARK-29458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110808#comment-17110808 ] Apache Spark commented on SPARK-29458: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/28290 > Document scalar functions usage in APIs in SQL getting started. > --- > > Key: SPARK-29458 > URL: https://issues.apache.org/jira/browse/SPARK-29458 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Dilip Biswal >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31753) Add missing keywords in the SQL documents
Takeshi Yamamuro created SPARK-31753: Summary: Add missing keywords in the SQL documents Key: SPARK-31753 URL: https://issues.apache.org/jira/browse/SPARK-31753 Project: Spark Issue Type: Improvement Components: Documentation, SQL Affects Versions: 3.1.0 Reporter: Takeshi Yamamuro Some keywords are missing in the SQL documents and a list of them is as follows. [https://github.com/apache/spark/pull/28290#issuecomment-619321301] {code:java} AFTER CASE/ELSE WHEN/THEN IGNORE NULLS LATERAL VIEW (OUTER)? MAP KEYS TERMINATED BY NULL DEFINED AS LINES TERMINATED BY ESCAPED BY COLLECTION ITEMS TERMINATED BY EXPLAIN LOGICAL PIVOT {code} They should be documented there, too. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31752) Add sql doc for interval type
Kent Yao created SPARK-31752: Summary: Add sql doc for interval type Key: SPARK-31752 URL: https://issues.apache.org/jira/browse/SPARK-31752 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0, 3.1.0 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31102) spark-sql fails to parse when contains comment
[ https://issues.apache.org/jira/browse/SPARK-31102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-31102. -- Fix Version/s: 3.0.0 Assignee: Javier Fuentes Resolution: Fixed Resolved by [https://github.com/apache/spark/pull/27920] > spark-sql fails to parse when contains comment > -- > > Key: SPARK-31102 > URL: https://issues.apache.org/jira/browse/SPARK-31102 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Assignee: Javier Fuentes >Priority: Major > Fix For: 3.0.0 > > > {code:sql} > select > 1, > -- two > 2; > {code} > {noformat} > spark-sql> select > > 1, > > -- two > > 2; > Error in query: > mismatched input '' expecting {'(', 'ADD', 'AFTER', 'ALL', 'ALTER', > 'ANALYZE', 'AND', 'ANTI', 'ANY', 'ARCHIVE', 'ARRAY', 'AS', 'ASC', 'AT', > 'AUTHORIZATION', 'BETWEEN', 'BOTH', 'BUCKET', 'BUCKETS', 'BY', 'CACHE', > 'CASCADE', 'CASE', 'CAST', 'CHANGE', 'CHECK', 'CLEAR', 'CLUSTER', > 'CLUSTERED', 'CODEGEN', 'COLLATE', 'COLLECTION', 'COLUMN', 'COLUMNS', > 'COMMENT', 'COMMIT', 'COMPACT', 'COMPACTIONS', 'COMPUTE', 'CONCATENATE', > 'CONSTRAINT', 'COST', 'CREATE', 'CROSS', 'CUBE', 'CURRENT', 'CURRENT_DATE', > 'CURRENT_TIME', 'CURRENT_TIMESTAMP', 'CURRENT_USER', 'DATA', 'DATABASE', > DATABASES, 'DAY', 'DBPROPERTIES', 'DEFINED', 'DELETE', 'DELIMITED', 'DESC', > 'DESCRIBE', 'DFS', 'DIRECTORIES', 'DIRECTORY', 'DISTINCT', 'DISTRIBUTE', > 'DROP', 'ELSE', 'END', 'ESCAPE', 'ESCAPED', 'EXCEPT', 'EXCHANGE', 'EXISTS', > 'EXPLAIN', 'EXPORT', 'EXTENDED', 'EXTERNAL', 'EXTRACT', 'FALSE', 'FETCH', > 'FIELDS', 'FILTER', 'FILEFORMAT', 'FIRST', 'FOLLOWING', 'FOR', 'FOREIGN', > 'FORMAT', 'FORMATTED', 'FROM', 'FULL', 'FUNCTION', 'FUNCTIONS', 'GLOBAL', > 'GRANT', 'GROUP', 'GROUPING', 'HAVING', 'HOUR', 'IF', 'IGNORE', 'IMPORT', > 'IN', 'INDEX', 'INDEXES', 'INNER', 'INPATH', 'INPUTFORMAT', 'INSERT', > 'INTERSECT', 'INTERVAL', 'INTO', 'IS', 'ITEMS', 'JOIN', 'KEYS', 'LAST', > 'LATERAL', 'LAZY', 'LEADING', 'LEFT', 'LIKE', 'LIMIT', 'LINES', 'LIST', > 'LOAD', 'LOCAL', 'LOCATION', 'LOCK', 'LOCKS', 'LOGICAL', 'MACRO', 'MAP', > 'MATCHED', 'MERGE', 'MINUTE', 'MONTH', 'MSCK', 'NAMESPACE', 'NAMESPACES', > 'NATURAL', 'NO', NOT, 'NULL', 'NULLS', 'OF', 'ON', 'ONLY', 'OPTION', > 'OPTIONS', 'OR', 'ORDER', 'OUT', 'OUTER', 'OUTPUTFORMAT', 'OVER', 'OVERLAPS', > 'OVERLAY', 'OVERWRITE', 'PARTITION', 'PARTITIONED', 'PARTITIONS', 'PERCENT', > 'PIVOT', 'PLACING', 'POSITION', 'PRECEDING', 'PRIMARY', 'PRINCIPALS', > 'PROPERTIES', 'PURGE', 'QUERY', 'RANGE', 'RECORDREADER', 'RECORDWRITER', > 'RECOVER', 'REDUCE', 'REFERENCES', 'REFRESH', 'RENAME', 'REPAIR', 'REPLACE', > 'RESET', 'RESTRICT', 'REVOKE', 'RIGHT', RLIKE, 'ROLE', 'ROLES', 'ROLLBACK', > 'ROLLUP', 'ROW', 'ROWS', 'SCHEMA', 'SECOND', 'SELECT', 'SEMI', 'SEPARATED', > 'SERDE', 'SERDEPROPERTIES', 'SESSION_USER', 'SET', 'MINUS', 'SETS', 'SHOW', > 'SKEWED', 'SOME', 'SORT', 'SORTED', 'START', 'STATISTICS', 'STORED', > 'STRATIFY', 'STRUCT', 'SUBSTR', 'SUBSTRING', 'TABLE', 'TABLES', > 'TABLESAMPLE', 'TBLPROPERTIES', TEMPORARY, 'TERMINATED', 'THEN', 'TO', > 'TOUCH', 'TRAILING', 'TRANSACTION', 'TRANSACTIONS', 'TRANSFORM', 'TRIM', > 'TRUE', 'TRUNCATE', 'TYPE', 'UNARCHIVE', 'UNBOUNDED', 'UNCACHE', 'UNION', > 'UNIQUE', 'UNKNOWN', 'UNLOCK', 'UNSET', 'UPDATE', 'USE', 'USER', 'USING', > 'VALUES', 'VIEW', 'WHEN', 'WHERE', 'WINDOW', 'WITH', 'YEAR', '+', '-', '*', > 'DIV', '~', STRING, BIGINT_LITERAL, SMALLINT_LITERAL, TINYINT_LITERAL, > INTEGER_VALUE, EXPONENT_VALUE, DECIMAL_VALUE, DOUBLE_LITERAL, > 
BIGDECIMAL_LITERAL, IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 3, pos 2) > == SQL == > select > 1, > --^^^ > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31739) Docstring syntax issues prevent proper compilation of documentation
[ https://issues.apache.org/jira/browse/SPARK-31739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-31739: - Fix Version/s: 3.0.0 > Docstring syntax issues prevent proper compilation of documentation > --- > > Key: SPARK-31739 > URL: https://issues.apache.org/jira/browse/SPARK-31739 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.4.5 >Reporter: David Toneian >Assignee: David Toneian >Priority: Trivial > Fix For: 3.0.0, 3.1.0 > > > Some docstrings contain mistakes, like missing or spurious spaces, which > prevent the documentation from being rendered as intended. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31555) Improve cache block migration
[ https://issues.apache.org/jira/browse/SPARK-31555?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110732#comment-17110732 ] Dale Richardson commented on SPARK-31555: - Hi [~holden], happy to have a go at this. > Improve cache block migration > - > > Key: SPARK-31555 > URL: https://issues.apache.org/jira/browse/SPARK-31555 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Holden Karau >Priority: Major > > We should explore the following improvements to cache block migration: > 1) Peer selection (right now may overbalance on certain peers) > 2) Do we need to configure the number of blocks to be migrated at the same > time > 3) Are there any blocks we don't need to replicate (e.g. they are already > stored on the desired number of executors even once we remove the executors > slated for decommissioning). > 4) Do we want to prioritize migrating blocks with no replicas > 5) Log the attempt number for debugging > 6) Clarify the logic for determining the number of replicas > 7) Consider using TestUtils.waitUntilExecutorsUp in tests rather than count > to wait for the executors to come up. imho this is the least important. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31579) Replace floorDiv by / in localRebaseGregorianToJulianDays()
[ https://issues.apache.org/jira/browse/SPARK-31579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110729#comment-17110729 ] Sudharshann D. commented on SPARK-31579: Please see my proof of concept [https://github.com/Sudhar287/spark/pull/1/files|http://example.com] > Replace floorDiv by / in localRebaseGregorianToJulianDays() > --- > > Key: SPARK-31579 > URL: https://issues.apache.org/jira/browse/SPARK-31579 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Minor > Labels: starter > > Most likely utcCal.getTimeInMillis % MILLIS_PER_DAY == 0 but need to check > that for all available time zones in the range of [0001, 2100] years with the > step of 1 hour or maybe smaller. If this hypothesis is confirmed, floorDiv > can be replaced by /, and this should improve performance of > RebaseDateTime.localRebaseGregorianToJulianDays. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
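As a reminder of why the hypothesis matters, the sketch below is plain Scala rather than the RebaseDateTime code path: Math.floorDiv and / agree whenever the dividend is non-negative or an exact multiple of MILLIS_PER_DAY, and differ only for negative, non-multiple values. That is the case the proposed time-zone sweep over [0001, 2100] needs to rule out before the replacement is safe.

{code:scala}
// Standalone illustration of the floorDiv vs / difference (not Spark code).
val MILLIS_PER_DAY = 24L * 60 * 60 * 1000

def daysFloorDiv(millis: Long): Long = Math.floorDiv(millis, MILLIS_PER_DAY)
def daysPlainDiv(millis: Long): Long = millis / MILLIS_PER_DAY

// Exact multiple of a day before the epoch: both operators agree.
assert(daysFloorDiv(-3 * MILLIS_PER_DAY) == daysPlainDiv(-3 * MILLIS_PER_DAY))

// Negative and not a multiple: floorDiv rounds toward negative infinity,
// while / truncates toward zero, so the results differ by one day.
println(daysFloorDiv(-1L)) // -1
println(daysPlainDiv(-1L)) // 0
{code}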
[jira] [Updated] (SPARK-31692) Hadoop confs passed via spark config are not set in URLStream Handler Factory
[ https://issues.apache.org/jira/browse/SPARK-31692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31692: -- Affects Version/s: 2.3.4 2.4.5 > Hadoop confs passed via spark config are not set in URLStream Handler Factory > - > > Key: SPARK-31692 > URL: https://issues.apache.org/jira/browse/SPARK-31692 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.5, 3.0.0 >Reporter: Karuppayya >Assignee: Karuppayya >Priority: Major > Fix For: 3.0.0 > > > Hadoop conf passed via spark config(as "spark.hadoop.*") are not set in > URLStreamHandlerFactory -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25694) URL.setURLStreamHandlerFactory causing incompatible HttpURLConnection issue
[ https://issues.apache.org/jira/browse/SPARK-25694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25694: -- Fix Version/s: 2.4.7 > URL.setURLStreamHandlerFactory causing incompatible HttpURLConnection issue > --- > > Key: SPARK-25694 > URL: https://issues.apache.org/jira/browse/SPARK-25694 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.4.4, 3.0.0 >Reporter: Bo Yang >Assignee: Zhou Jiang >Priority: Minor > Fix For: 3.0.0, 2.4.7 > > > URL.setURLStreamHandlerFactory() in SharedState causes URL.openConnection() > returns FsUrlConnection object, which is not compatible with > HttpURLConnection. This will cause exception when using some third party http > library (e.g. scalaj.http). > The following code in Spark 2.3.0 introduced the issue: > sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala: > {code} > object SharedState extends Logging { ... > URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory()) ... > } > {code} > Here is the example exception when using scalaj.http in Spark: > {code} > StackTrace: scala.MatchError: > org.apache.hadoop.fs.FsUrlConnection:[http://.example.com|http://.example.com/] > (of class org.apache.hadoop.fs.FsUrlConnection) > at > scalaj.http.HttpRequest.scalaj$http$HttpRequest$$doConnection(Http.scala:343) > at scalaj.http.HttpRequest.exec(Http.scala:335) > at scalaj.http.HttpRequest.asString(Http.scala:455) > {code} > > One option to fix the issue is to return null in > URLStreamHandlerFactory.createURLStreamHandler when the protocol is > http/https, so it will use the default behavior and be compatible with > scalaj.http. Following is the code example: > {code} > class SparkUrlStreamHandlerFactory extends URLStreamHandlerFactory with > Logging { > private val fsUrlStreamHandlerFactory = new FsUrlStreamHandlerFactory() > override def createURLStreamHandler(protocol: String): URLStreamHandler = { > val handler = fsUrlStreamHandlerFactory.createURLStreamHandler(protocol) > if (handler == null) { > return null > } > if (protocol != null && > (protocol.equalsIgnoreCase("http") > || protocol.equalsIgnoreCase("https"))) { > // return null to use system default URLStreamHandler > null > } else { > handler > } > } > } > {code} > I would like to get some discussion here before submitting a pull request. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31692) Hadoop confs passed via spark config are not set in URLStream Handler Factory
[ https://issues.apache.org/jira/browse/SPARK-31692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-31692: -- Fix Version/s: 2.4.7 > Hadoop confs passed via spark config are not set in URLStream Handler Factory > - > > Key: SPARK-31692 > URL: https://issues.apache.org/jira/browse/SPARK-31692 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.4, 2.4.5, 3.0.0 >Reporter: Karuppayya >Assignee: Karuppayya >Priority: Major > Fix For: 3.0.0, 2.4.7 > > > Hadoop conf passed via spark config(as "spark.hadoop.*") are not set in > URLStreamHandlerFactory -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31579) Replace floorDiv by / in localRebaseGregorianToJulianDays()
[ https://issues.apache.org/jira/browse/SPARK-31579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110726#comment-17110726 ] Apache Spark commented on SPARK-31579: -- User 'Sudhar287' has created a pull request for this issue: https://github.com/apache/spark/pull/28573 > Replace floorDiv by / in localRebaseGregorianToJulianDays() > --- > > Key: SPARK-31579 > URL: https://issues.apache.org/jira/browse/SPARK-31579 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Minor > Labels: starter > > Most likely utcCal.getTimeInMillis % MILLIS_PER_DAY == 0 but need to check > that for all available time zones in the range of [0001, 2100] years with the > step of 1 hour or maybe smaller. If this hypothesis is confirmed, floorDiv > can be replaced by /, and this should improve performance of > RebaseDateTime.localRebaseGregorianToJulianDays. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31579) Replace floorDiv by / in localRebaseGregorianToJulianDays()
[ https://issues.apache.org/jira/browse/SPARK-31579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31579: Assignee: (was: Apache Spark) > Replace floorDiv by / in localRebaseGregorianToJulianDays() > --- > > Key: SPARK-31579 > URL: https://issues.apache.org/jira/browse/SPARK-31579 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Minor > Labels: starter > > Most likely utcCal.getTimeInMillis % MILLIS_PER_DAY == 0 but need to check > that for all available time zones in the range of [0001, 2100] years with the > step of 1 hour or maybe smaller. If this hypothesis is confirmed, floorDiv > can be replaced by /, and this should improve performance of > RebaseDateTime.localRebaseGregorianToJulianDays. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31579) Replace floorDiv by / in localRebaseGregorianToJulianDays()
[ https://issues.apache.org/jira/browse/SPARK-31579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110725#comment-17110725 ] Apache Spark commented on SPARK-31579: -- User 'Sudhar287' has created a pull request for this issue: https://github.com/apache/spark/pull/28573 > Replace floorDiv by / in localRebaseGregorianToJulianDays() > --- > > Key: SPARK-31579 > URL: https://issues.apache.org/jira/browse/SPARK-31579 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Priority: Minor > Labels: starter > > Most likely utcCal.getTimeInMillis % MILLIS_PER_DAY == 0 but need to check > that for all available time zones in the range of [0001, 2100] years with the > step of 1 hour or maybe smaller. If this hypothesis is confirmed, floorDiv > can be replaced by /, and this should improve performance of > RebaseDateTime.localRebaseGregorianToJulianDays. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31579) Replace floorDiv by / in localRebaseGregorianToJulianDays()
[ https://issues.apache.org/jira/browse/SPARK-31579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31579: Assignee: Apache Spark > Replace floorDiv by / in localRebaseGregorianToJulianDays() > --- > > Key: SPARK-31579 > URL: https://issues.apache.org/jira/browse/SPARK-31579 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Minor > Labels: starter > > Most likely utcCal.getTimeInMillis % MILLIS_PER_DAY == 0 but need to check > that for all available time zones in the range of [0001, 2100] years with the > step of 1 hour or maybe smaller. If this hypothesis is confirmed, floorDiv > can be replaced by /, and this should improve performance of > RebaseDateTime.localRebaseGregorianToJulianDays. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31257) Unify create table syntax to fix ambiguous two different CREATE TABLE syntaxes
[ https://issues.apache.org/jira/browse/SPARK-31257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-31257: - Summary: Unify create table syntax to fix ambiguous two different CREATE TABLE syntaxes (was: Fix ambiguous two different CREATE TABLE syntaxes) > Unify create table syntax to fix ambiguous two different CREATE TABLE syntaxes > -- > > Key: SPARK-31257 > URL: https://issues.apache.org/jira/browse/SPARK-31257 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Jungtaek Lim >Priority: Major > > There's a discussion in dev@ mailing list to point out ambiguous syntaxes for > CREATE TABLE DDL. This issue tracks the efforts to resolve the root issue via > unifying the create table syntax. > https://lists.apache.org/thread.html/rf1acfaaa3de2d3129575199c28e7d529d38f2783e7d3c5be2ac8923d%40%3Cdev.spark.apache.org%3E > We should ensure the new "single" create table syntax is very deterministic > to both devs and end users. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31257) Fix ambiguous two different CREATE TABLE syntaxes
[ https://issues.apache.org/jira/browse/SPARK-31257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-31257: - Affects Version/s: (was: 3.0.0) 3.1.0 Description: There's a discussion in dev@ mailing list to point out ambiguous syntaxes for CREATE TABLE DDL. This issue tracks the efforts to resolve the root issue via unifying the create table syntax. https://lists.apache.org/thread.html/rf1acfaaa3de2d3129575199c28e7d529d38f2783e7d3c5be2ac8923d%40%3Cdev.spark.apache.org%3E We should ensure the new "single" create table syntax is very deterministic to both devs and end users. was: There's a discussion in dev@ mailing list to point out ambiguous syntaxes for CREATE TABLE DDL. This issue tracks the efforts to resolve the problem. https://lists.apache.org/thread.html/rf1acfaaa3de2d3129575199c28e7d529d38f2783e7d3c5be2ac8923d%40%3Cdev.spark.apache.org%3E Note that the priority of this issue is set to blocker as the ambiguity is brought by SPARK-30098 which will be shipped in Spark 3.0.0; before we ship SPARK-30098 we should fix the syntax and ensure the syntax is very deterministic to both devs and end users. Issue Type: Improvement (was: Bug) Priority: Major (was: Blocker) > Fix ambiguous two different CREATE TABLE syntaxes > - > > Key: SPARK-31257 > URL: https://issues.apache.org/jira/browse/SPARK-31257 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Jungtaek Lim >Priority: Major > > There's a discussion in dev@ mailing list to point out ambiguous syntaxes for > CREATE TABLE DDL. This issue tracks the efforts to resolve the root issue via > unifying the create table syntax. > https://lists.apache.org/thread.html/rf1acfaaa3de2d3129575199c28e7d529d38f2783e7d3c5be2ac8923d%40%3Cdev.spark.apache.org%3E > We should ensure the new "single" create table syntax is very deterministic > to both devs and end users. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31718) DataSourceV2 unexpected behavior with partition data distribution
[ https://issues.apache.org/jira/browse/SPARK-31718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-31718: - Fix Version/s: (was: 2.4.0) Target Version/s: (was: 2.4.0) Don't set Fix/Target Version > DataSourceV2 unexpected behavior with partition data distribution > -- > > Key: SPARK-31718 > URL: https://issues.apache.org/jira/browse/SPARK-31718 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.4.0 >Reporter: Serhii >Priority: Major > > Hi team, > > We are using DataSourceV2. > > We have a question regarding the interface > org.apache.spark.sql.sources.v2.writer.DataWriter. > > We have run into the following unexpected behavior. > When we repartition a dataframe, we expect Spark to create a new instance of the > DataWriter interface for each partition and send that partition's data to the > corresponding instance, but sometimes we observe that Spark sends data from > different partitions to the same DataWriter instance. > This behavior sometimes occurs on a YARN cluster. > > If we run the Spark job locally, Spark does create a new DataWriter instance for > each partition after the repartition and publishes the repartitioned data to the > appropriate instances. > > Is there possibly a Spark limit on the number of DataWriter instances? > Can you explain whether this is a bug or expected behavior? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Comment Edited] (SPARK-16423) Inconsistent settings on the first day of a week
[ https://issues.apache.org/jira/browse/SPARK-16423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110577#comment-17110577 ] Eric Sun edited comment on SPARK-16423 at 5/18/20, 8:12 PM: Just to clarify the issue again: {code:scala} scala> spark.sql("SELECT dayofweek('2020-05-18')").show() +---+ |dayofweek(CAST(2020-05-18 AS DATE))| +---+ | 2| +---+ scala> spark.sql("SELECT date_trunc('week', from_unixtime(1589829639, '-MM-dd'))").show() +--+ |date_trunc(week, CAST(from_unixtime(CAST(1589829639 AS BIGINT), -MM-dd) AS TIMESTAMP))| +--+ | 2020-05-18 00:00:00| +--+ {code} * 2020-05-18 is Monday: the dayofweek() => 2, therefore the 1st day of this week is *Sunday*, right? * date_trunc('week', '2020-05-18') => Monday, the 1st of of this week is *Monday*, see a tiny discrepancy? https://dev.mysql.com/doc/refman/8.0/en/date-and-time-functions.html#function_date-format {code:sql} -- MySql behavior SELECT ds, DAYOFWEEK(ds), DATE_FORMAT(ds, '%YW%U'), DATE_FORMAT(ds, '%YW%u') FROM (SELECT FROM_UNIXTIME(1589829639) as ds) x; ds |DAYOFWEEK(ds)|DATE_FORMAT(ds, '%YW%U')|DATE_FORMAT(ds, '%YW%u')|DATE_FORMAT(ds, '%YW%V')| ---|-|||| 2020-05-18 19:20:39|2|2020W20 |2020W21 |2020W20 | {code} The request for this JIRA is: should we allow date_trunc() and dayofweek() support different 1st day of week option (SUNDAY or MONDAY)? was (Author: ericsun2): Just to clarify the issue again: {code:scala} scala> spark.sql("SELECT dayofweek('2020-05-18')").show() +---+ |dayofweek(CAST(2020-05-18 AS DATE))| +---+ | 2| +---+ scala> spark.sql("SELECT date_trunc('week', from_unixtime(1589829639, '-MM-dd'))").show() +--+ |date_trunc(week, CAST(from_unixtime(CAST(1589829639 AS BIGINT), -MM-dd) AS TIMESTAMP))| +--+ | 2020-05-18 00:00:00| +--+ {code} * 2020-05-18 is Monday: the dayofweek() => 2, therefore the 1st day of this week is *Sunday*, right? * date_trunc('week', '2020-05-18') => Monday, the 1st of of this week is *Monday*, see a tiny discrepancy? {code:sql} -- MySql behavior SELECT ds, DAYOFWEEK(ds), DATE_FORMAT(ds, '%YW%U'), DATE_FORMAT(ds, '%YW%u') FROM (SELECT FROM_UNIXTIME(1589829639) as ds) x; ds |DAYOFWEEK(ds)|DATE_FORMAT(ds, '%YW%U')|DATE_FORMAT(ds, '%YW%u')|DATE_FORMAT(ds, '%YW%V')| ---|-|||| 2020-05-18 19:20:39|2|2020W20 |2020W21 |2020W20 | {code} The request for this JIRA is: should we allow date_trunc() and dayofweek() support different 1st day of week option (SUNDAY or MONDAY)? > Inconsistent settings on the first day of a week > > > Key: SPARK-16423 > URL: https://issues.apache.org/jira/browse/SPARK-16423 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Major > Labels: bulk-closed > > For the function {{WeekOfYear}}, we explicitly set the first day of the week > to {{Calendar.MONDAY}}. However, {{FromUnixTime}} does not explicitly set it. > So, we are using the default first day of the week based on the locale > setting (see > https://docs.oracle.com/javase/8/docs/api/java/util/Calendar.html#setFirstDayOfWeek-int-). > > Let's do a survey on what other databases do and make the setting consistent. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16423) Inconsistent settings on the first day of a week
[ https://issues.apache.org/jira/browse/SPARK-16423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110577#comment-17110577 ] Eric Sun edited comment on SPARK-16423 at 5/18/20, 8:11 PM: Just to clarify the issue again: {code:scala} scala> spark.sql("SELECT dayofweek('2020-05-18')").show() +---+ |dayofweek(CAST(2020-05-18 AS DATE))| +---+ | 2| +---+ scala> spark.sql("SELECT date_trunc('week', from_unixtime(1589829639, '-MM-dd'))").show() +--+ |date_trunc(week, CAST(from_unixtime(CAST(1589829639 AS BIGINT), -MM-dd) AS TIMESTAMP))| +--+ | 2020-05-18 00:00:00| +--+ {code} * 2020-05-18 is Monday: the dayofweek() => 2, therefore the 1st day of this week is *Sunday*, right? * date_trunc('week', '2020-05-18') => Monday, the 1st of of this week is *Monday*, see a tiny discrepancy? {code:sql} -- MySql behavior SELECT ds, DAYOFWEEK(ds), DATE_FORMAT(ds, '%YW%U'), DATE_FORMAT(ds, '%YW%u') FROM (SELECT FROM_UNIXTIME(1589829639) as ds) x; ds |DAYOFWEEK(ds)|DATE_FORMAT(ds, '%YW%U')|DATE_FORMAT(ds, '%YW%u')|DATE_FORMAT(ds, '%YW%V')| ---|-|||| 2020-05-18 19:20:39|2|2020W20 |2020W21 |2020W20 | {code} The request for this JIRA is: should we allow date_trunc() and dayofweek() support different 1st day of week option (SUNDAY or MONDAY)? was (Author: ericsun2): Just to clarify the issue again: {code:scala} scala> spark.sql("SELECT dayofweek('2020-05-18')").show() +---+ |dayofweek(CAST(2020-05-18 AS DATE))| +---+ | 2| +---+ scala> spark.sql("SELECT date_trunc('week', from_unixtime(1589829639, '-MM-dd'))").show() +--+ |date_trunc(week, CAST(from_unixtime(CAST(1589829639 AS BIGINT), -MM-dd) AS TIMESTAMP))| +--+ | 2020-05-18 00:00:00| +--+ {code} * 2020-05-18 is Monday: the dayofweek() => 2, therefore the 1st day of this week is Sunday, right? * date_trunc('week', '2020-05-18') => Monday, the 1st of of this week if Monday, see a tiny discrepancy? {code:sql} -- MySql behavior SELECT ds, DAYOFWEEK(ds), DATE_FORMAT(ds, '%YW%U'), DATE_FORMAT(ds, '%YW%u') FROM (SELECT FROM_UNIXTIME(1589829639) as ds) x; ds |DAYOFWEEK(ds)|DATE_FORMAT(ds, '%YW%U')|DATE_FORMAT(ds, '%YW%u')|DATE_FORMAT(ds, '%YW%V')| ---|-|||| 2020-05-18 19:20:39|2|2020W20 |2020W21 |2020W20 | {code} The request for this JIRA is: should we allow date_trunc() and dayofweek() support different 1st day of week option (SUNDAY or MONDAY)? > Inconsistent settings on the first day of a week > > > Key: SPARK-16423 > URL: https://issues.apache.org/jira/browse/SPARK-16423 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Major > Labels: bulk-closed > > For the function {{WeekOfYear}}, we explicitly set the first day of the week > to {{Calendar.MONDAY}}. However, {{FromUnixTime}} does not explicitly set it. > So, we are using the default first day of the week based on the locale > setting (see > https://docs.oracle.com/javase/8/docs/api/java/util/Calendar.html#setFirstDayOfWeek-int-). > > Let's do a survey on what other databases do and make the setting consistent. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16423) Inconsistent settings on the first day of a week
[ https://issues.apache.org/jira/browse/SPARK-16423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110577#comment-17110577 ] Eric Sun commented on SPARK-16423: -- Just to clarify the issue again: {code:scala} scala> spark.sql("SELECT dayofweek('2020-05-18')").show() +---+ |dayofweek(CAST(2020-05-18 AS DATE))| +---+ | 2| +---+ scala> spark.sql("SELECT date_trunc('week', from_unixtime(1589829639, '-MM-dd'))").show() +--+ |date_trunc(week, CAST(from_unixtime(CAST(1589829639 AS BIGINT), -MM-dd) AS TIMESTAMP))| +--+ | 2020-05-18 00:00:00| +--+ {code} * 2020-05-18 is Monday: the dayofweek() => 2, therefore the 1st day of this week is Sunday, right? * date_trunc('week', '2020-05-18') => Monday, the 1st of of this week if Monday, see a tiny discrepancy? {code:sql} -- MySql behavior SELECT ds, DAYOFWEEK(ds), DATE_FORMAT(ds, '%YW%U'), DATE_FORMAT(ds, '%YW%u') FROM (SELECT FROM_UNIXTIME(1589829639) as ds) x; ds |DAYOFWEEK(ds)|DATE_FORMAT(ds, '%YW%U')|DATE_FORMAT(ds, '%YW%u')|DATE_FORMAT(ds, '%YW%V')| ---|-|||| 2020-05-18 19:20:39|2|2020W20 |2020W21 |2020W20 | {code} The request for this JIRA is: should we allow date_trunc() and dayofweek() support different 1st day of week option (SUNDAY or MONDAY)? > Inconsistent settings on the first day of a week > > > Key: SPARK-16423 > URL: https://issues.apache.org/jira/browse/SPARK-16423 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Major > Labels: bulk-closed > > For the function {{WeekOfYear}}, we explicitly set the first day of the week > to {{Calendar.MONDAY}}. However, {{FromUnixTime}} does not explicitly set it. > So, we are using the default first day of the week based on the locale > setting (see > https://docs.oracle.com/javase/8/docs/api/java/util/Calendar.html#setFirstDayOfWeek-int-). > > Let's do a survey on what other databases do and make the setting consistent. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
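For anyone hit by the discrepancy described in the comment above, the snippet below is a possible workaround sketch. It relies only on the dayofweek and date_sub semantics already shown in the comment (no new option or API is assumed) and derives a Sunday-based week start explicitly instead of using the Monday-based date_trunc('week', ...).

{code:scala}
// Workaround sketch: derive a Sunday-based week start explicitly.
// dayofweek returns 1 for Sunday .. 7 for Saturday, so subtracting
// (dayofweek - 1) days lands on the preceding (or same) Sunday.
spark.sql(
  "SELECT date_sub('2020-05-18', dayofweek('2020-05-18') - 1) AS sunday_week_start"
).show()
// +-----------------+
// |sunday_week_start|
// +-----------------+
// |       2020-05-17|
// +-----------------+
{code}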
[jira] [Resolved] (SPARK-20732) Copy cache data when node is being shut down
[ https://issues.apache.org/jira/browse/SPARK-20732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau resolved SPARK-20732. -- Fix Version/s: 3.1.0 Resolution: Fixed > Copy cache data when node is being shut down > > > Key: SPARK-20732 > URL: https://issues.apache.org/jira/browse/SPARK-20732 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Holden Karau >Assignee: Prakhar Jain >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31555) Improve cache block migration
[ https://issues.apache.org/jira/browse/SPARK-31555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau updated SPARK-31555: - Description: We should explore the following improvements to cache block migration: 1) Peer selection (right now may overbalance on certain peers) 2) Do we need to configure the number of blocks to be migrated at the same time 3) Are there any blocks we don't need to replicate (e.g. they are already stored on the desired number of executors even once we remove the executors slated for decommissioning). 4) Do we want to prioritize migrating blocks with no replicas 5) Log the attempt number for debugging 6) Clarify the logic for determining the number of replicas 7) Consider using TestUtils.waitUntilExecutorsUp in tests rather than count to wait for the executors to come up. imho this is the least important. was: We should explore the following improvements to cache block migration: 1) Peer selection (right now may overbalance on certain peers) 2) Do we need to configure the number of blocks to be migrated at the same time 3) Are there any blocks we don't need to replicate (e.g. they are already stored on the desired number of executors even once we remove the executors slated for decommissioning). 4) Do we want to prioritize migrating blocks with no replicas 5) Log the attempt number for debugging 6) Clarify the logic for determining the number of replicas > Improve cache block migration > - > > Key: SPARK-31555 > URL: https://issues.apache.org/jira/browse/SPARK-31555 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Holden Karau >Priority: Major > > We should explore the following improvements to cache block migration: > 1) Peer selection (right now may overbalance on certain peers) > 2) Do we need to configure the number of blocks to be migrated at the same > time > 3) Are there any blocks we don't need to replicate (e.g. they are already > stored on the desired number of executors even once we remove the executors > slated for decommissioning). > 4) Do we want to prioritize migrating blocks with no replicas > 5) Log the attempt number for debugging > 6) Clarify the logic for determining the number of replicas > 7) Consider using TestUtils.waitUntilExecutorsUp in tests rather than count > to wait for the executors to come up. imho this is the least important. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31555) Improve cache block migration
[ https://issues.apache.org/jira/browse/SPARK-31555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau updated SPARK-31555: - Description: We should explore the following improvements to cache block migration: 1) Peer selection (right now may overbalance on certain peers) 2) Do we need to configure the number of blocks to be migrated at the same time 3) Are there any blocks we don't need to replicate (e.g. they are already stored on the desired number of executors even once we remove the executors slated for decommissioning). 4) Do we want to prioritize migrating blocks with no replicas 5) Log the attempt number for debugging 6) Clarify the logic for determining the number of replicas was: We should explore the following improvements to cache block migration: 1) Peer selection (right now may overbalance on certain peers) 2) Do we need to configure the number of blocks to be migrated at the same time 3) Are there any blocks we don't need to replicate (e.g. they are already stored on the desired number of executors even once we remove the executors slated for decommissioning). 4) Do we want to prioritize migrating blocks with no replicas 5) Log the attempt number for debugging > Improve cache block migration > - > > Key: SPARK-31555 > URL: https://issues.apache.org/jira/browse/SPARK-31555 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Holden Karau >Priority: Major > > We should explore the following improvements to cache block migration: > 1) Peer selection (right now may overbalance on certain peers) > 2) Do we need to configure the number of blocks to be migrated at the same > time > 3) Are there any blocks we don't need to replicate (e.g. they are already > stored on the desired number of executors even once we remove the executors > slated for decommissioning). > 4) Do we want to prioritize migrating blocks with no replicas > 5) Log the attempt number for debugging > 6) Clarify the logic for determining the number of replicas -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31555) Improve cache block migration
[ https://issues.apache.org/jira/browse/SPARK-31555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau updated SPARK-31555: - Description: We should explore the following improvements to cache block migration: 1) Peer selection (right now may overbalance on certain peers) 2) Do we need to configure the number of blocks to be migrated at the same time 3) Are there any blocks we don't need to replicate (e.g. they are already stored on the desired number of executors even once we remove the executors slated for decommissioning). 4) Do we want to prioritize migrating blocks with no replicas 5) Log the attempt number for debugging was: We should explore the following improvements to cache block migration: 1) Peer selection (right now may overbalance on certain peers) 2) Do we need to configure the number of blocks to be migrated at the same time 3) Are there any blocks we don't need to replicate (e.g. they are already stored on the desired number of executors even once we remove the executors slated for decommissioning). 4) Do we want to prioritize migrating blocks with no replicas > Improve cache block migration > - > > Key: SPARK-31555 > URL: https://issues.apache.org/jira/browse/SPARK-31555 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Holden Karau >Priority: Major > > We should explore the following improvements to cache block migration: > 1) Peer selection (right now may overbalance on certain peers) > 2) Do we need to configure the number of blocks to be migrated at the same > time > 3) Are there any blocks we don't need to replicate (e.g. they are already > stored on the desired number of executors even once we remove the executors > slated for decommissioning). > 4) Do we want to prioritize migrating blocks with no replicas > 5) Log the attempt number for debugging -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31555) Improve cache block migration
[ https://issues.apache.org/jira/browse/SPARK-31555?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau updated SPARK-31555: - Description: We should explore the following improvements to cache block migration: 1) Peer selection (right now may overbalance on certain peers) 2) Do we need to configure the number of blocks to be migrated at the same time 3) Are there any blocks we don't need to replicate (e.g. they are already stored on the desired number of executors even once we remove the executors slated for decommissioning). 4) Do we want to prioritize migrating blocks with no replicas was: We should explore the following improvements to cache block migration: 1) Peer selection (right now may overbalance on certain peers) 2) Do we need to configure the number of blocks to be migrated at the same time 3) Do we want to prioritize migrating blocks with no replicas > Improve cache block migration > - > > Key: SPARK-31555 > URL: https://issues.apache.org/jira/browse/SPARK-31555 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Holden Karau >Priority: Major > > We should explore the following improvements to cache block migration: > 1) Peer selection (right now may overbalance on certain peers) > 2) Do we need to configure the number of blocks to be migrated at the same > time > 3) Are there any blocks we don't need to replicate (e.g. they are already > stored on the desired number of executors even once we remove the executors > slated for decommissioning). > 4) Do we want to prioritize migrating blocks with no replicas > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30267) avro deserializer: ArrayList cannot be cast to GenericData$Array
[ https://issues.apache.org/jira/browse/SPARK-30267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-30267: Affects Version/s: 3.0.0 > avro deserializer: ArrayList cannot be cast to GenericData$Array > > > Key: SPARK-30267 > URL: https://issues.apache.org/jira/browse/SPARK-30267 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4, 3.0.0 >Reporter: Steven Aerts >Assignee: Steven Aerts >Priority: Major > > On some more complex avro objects, the Avro Deserializer fails with the > following stack trace: > {code} > java.lang.ClassCastException: java.util.ArrayList cannot be cast to > org.apache.avro.generic.GenericData$Array > at > org.apache.spark.sql.avro.AvroDeserializer.$anonfun$newWriter$19(AvroDeserializer.scala:170) > at > org.apache.spark.sql.avro.AvroDeserializer.$anonfun$newWriter$19$adapted(AvroDeserializer.scala:169) > at > org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$1(AvroDeserializer.scala:314) > at > org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$1$adapted(AvroDeserializer.scala:310) > at > org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$2(AvroDeserializer.scala:332) > at > org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$2$adapted(AvroDeserializer.scala:329) > at > org.apache.spark.sql.avro.AvroDeserializer.$anonfun$converter$3(AvroDeserializer.scala:56) > at > org.apache.spark.sql.avro.AvroDeserializer.deserialize(AvroDeserializer.scala:70) > {code} > This is because the Deserializer assumes that an array is always of the very > specific {{org.apache.avro.generic.GenericData$Array}} which is not always > the case. > Making it a normal list works. > A github PR is coming up to fix this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-30267) avro deserializer: ArrayList cannot be cast to GenericData$Array
[ https://issues.apache.org/jira/browse/SPARK-30267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-30267: Fix Version/s: (was: 3.0.0) > avro deserializer: ArrayList cannot be cast to GenericData$Array > > > Key: SPARK-30267 > URL: https://issues.apache.org/jira/browse/SPARK-30267 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: Steven Aerts >Assignee: Steven Aerts >Priority: Major > > On some more complex avro objects, the Avro Deserializer fails with the > following stack trace: > {code} > java.lang.ClassCastException: java.util.ArrayList cannot be cast to > org.apache.avro.generic.GenericData$Array > at > org.apache.spark.sql.avro.AvroDeserializer.$anonfun$newWriter$19(AvroDeserializer.scala:170) > at > org.apache.spark.sql.avro.AvroDeserializer.$anonfun$newWriter$19$adapted(AvroDeserializer.scala:169) > at > org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$1(AvroDeserializer.scala:314) > at > org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$1$adapted(AvroDeserializer.scala:310) > at > org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$2(AvroDeserializer.scala:332) > at > org.apache.spark.sql.avro.AvroDeserializer.$anonfun$getRecordWriter$2$adapted(AvroDeserializer.scala:329) > at > org.apache.spark.sql.avro.AvroDeserializer.$anonfun$converter$3(AvroDeserializer.scala:56) > at > org.apache.spark.sql.avro.AvroDeserializer.deserialize(AvroDeserializer.scala:70) > {code} > This is because the Deserializer assumes that an array is always of the very > specific {{org.apache.avro.generic.GenericData$Array}} which is not always > the case. > Making it a normal list works. > A github PR is coming up to fix this. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28554) implement basic catalog functionalities
[ https://issues.apache.org/jira/browse/SPARK-28554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-28554: Affects Version/s: (was: 3.0.0) 3.1.0 > implement basic catalog functionalities > --- > > Key: SPARK-28554 > URL: https://issues.apache.org/jira/browse/SPARK-28554 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28554) implement basic catalog functionalities
[ https://issues.apache.org/jira/browse/SPARK-28554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-28554: Fix Version/s: (was: 3.0.0) > implement basic catalog functionalities > --- > > Key: SPARK-28554 > URL: https://issues.apache.org/jira/browse/SPARK-28554 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29441) Unable to Alter table column type in spark.
[ https://issues.apache.org/jira/browse/SPARK-29441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110436#comment-17110436 ] krishnendu mukherjee commented on SPARK-29441: -- also column name is not being altered > Unable to Alter table column type in spark. > --- > > Key: SPARK-29441 > URL: https://issues.apache.org/jira/browse/SPARK-29441 > Project: Spark > Issue Type: Improvement > Components: Spark Shell >Affects Versions: 2.3.1 > Environment: spark -2.3 > hadoop -2.4 >Reporter: prudhviraj >Priority: Major > > Unable to alter table column type in spark. > scala> spark.sql("""alter table tablename change col1 col1 string""") > org.apache.spark.sql.AnalysisException: ALTER TABLE CHANGE COLUMN is not > supported for changing column 'col1' with type 'LongType' to 'col1' with type > 'StringType'; -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
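As a hedged workaround sketch (not a recommendation taken from this ticket), the usual way around the unsupported in-place type change in Spark 2.x is to rewrite the table with the column cast to the desired type; the table and column names below are simply the ones from the example.
{code:java}
// Hedged workaround sketch: Spark 2.x cannot change a column's type in place,
// so materialize a copy of the table with the column cast to the new type.
import org.apache.spark.sql.functions.col

spark.table("tablename")
  .withColumn("col1", col("col1").cast("string"))
  .write
  .mode("overwrite")
  .saveAsTable("tablename_casted")   // then swap names once the copy is verified
{code}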
[jira] [Created] (SPARK-31751) spark serde property path overwrites table property location
Nithin created SPARK-31751: -- Summary: spark serde property path overwrites table property location Key: SPARK-31751 URL: https://issues.apache.org/jira/browse/SPARK-31751 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.1 Reporter: Nithin This issue has caused us many data errors. 1) Using Spark (with Hive context enabled): df = spark.createDataFrame([\{"a": "x", "b": "y", "c": "3"}]) df.write.format("orc").option("compression", "ZLIB").mode("overwrite").saveAsTable('test_spark'); 2) From Hive: alter table test_spark rename to test_spark2 3) From spark-sql on the command line (note: not pyspark or spark-shell): select * from test_spark2 will give output NULL NULL NULL Time taken: 0.334 seconds, Fetched 1 row(s) This returns NULL because the PySpark write API adds a serde property called path to the Hive metastore. When Hive renames the table, it does not understand this serde property and keeps it as it is. When spark-sql later reads the table, it honors the serde property first and tries to read from the now non-existent HDFS location. If it raised an error, that would still be acceptable, but returning NULL causes applications to fail badly. Spark claims to support Hive tables, so it should respect the Hive metastore location property rather than the Spark serde property when reading a table. This cannot be classified as expected behaviour. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
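A hedged repair sketch, assuming the table has already been renamed on the Hive side: point the Spark-specific {{path}} serde property back at the table's actual location (the HDFS path below is a placeholder), then verify what Spark will read.
{code:java}
// Hedged workaround sketch: re-align the stale Spark serde property "path" with the
// table's real location after a Hive-side rename. The warehouse path is a placeholder.
spark.sql(
  """ALTER TABLE test_spark2
    |SET SERDEPROPERTIES ('path' = 'hdfs://namenode:8020/warehouse/test_spark2')""".stripMargin)

// Confirm the location and serde properties Spark now sees.
spark.sql("DESCRIBE FORMATTED test_spark2").show(100, truncate = false)
{code}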
[jira] [Commented] (SPARK-31693) Investigate AmpLab Jenkins server network issue
[ https://issues.apache.org/jira/browse/SPARK-31693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110409#comment-17110409 ] Shane Knapp commented on SPARK-31693: - apache.org blacklisted us: ''' That IP was banned 5 days ago for more than 1,000 download page views per 24 hours (1020 >= limit of 1000). Typically this is due to some misconfigured CI system hitting our systems to download packages instead of using a local cache. ''' I asked them to un-ban us while I investigate the root cause. > Investigate AmpLab Jenkins server network issue > --- > > Key: SPARK-31693 > URL: https://issues.apache.org/jira/browse/SPARK-31693 > Project: Spark > Issue Type: Bug > Components: Project Infra >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Critical > > Given the series of failures in the Spark packaging Jenkins job, it seems that > there is a network issue in the AmpLab Jenkins cluster. > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/job/spark-master-maven-snapshots/ > - The node failed to talk to GitBox. (SPARK-31687) -> GitHub is okay. > - The node failed to download the maven mirror. (SPARK-31691) -> The primary > host is okay. > - The node failed to communicate with repository.apache.org. (Current master > branch Jenkins job failure) > {code} > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-deploy-plugin:3.0.0-M1:deploy (default-deploy) > on project spark-parent_2.12: ArtifactDeployerException: Failed to retrieve > remote metadata > org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT/maven-metadata.xml: Could > not transfer metadata > org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT/maven-metadata.xml from/to > apache.snapshots.https > (https://repository.apache.org/content/repositories/snapshots): Transfer > failed for > https://repository.apache.org/content/repositories/snapshots/org/apache/spark/spark-parent_2.12/3.1.0-SNAPSHOT/maven-metadata.xml: > Connect to repository.apache.org:443 [repository.apache.org/207.244.88.140] > failed: Connection timed out (Connection timed out) -> [Help 1] > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31750) Eliminate UpCast if child's dataType is DecimalType
[ https://issues.apache.org/jira/browse/SPARK-31750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31750: Assignee: (was: Apache Spark) > Eliminate UpCast if child's dataType is DecimalType > --- > > Key: SPARK-31750 > URL: https://issues.apache.org/jira/browse/SPARK-31750 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > > > {code:java} > sql("select cast(11 as decimal(38, 0)) as > d") > .write.mode("overwrite") > .parquet(f.getAbsolutePath) > spark.read.parquet(f.getAbsolutePath).as[BigDecimal] > {code} > {code:java} > [info] org.apache.spark.sql.AnalysisException: Cannot up cast `d` from > decimal(38,0) to decimal(38,18). > [info] The type path of the target object is: > [info] - root class: "scala.math.BigDecimal" > [info] You can either add an explicit cast to the input data or choose a > higher precision type of the field in the target object; > [info] at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:3060) > [info] at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$33$$anonfun$applyOrElse$174.applyOrElse(Analyzer.scala:3087) > [info] at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$33$$anonfun$applyOrElse$174.applyOrElse(Analyzer.scala:3071) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309) > [info] at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31750) Eliminate UpCast if child's dataType is DecimalType
[ https://issues.apache.org/jira/browse/SPARK-31750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110367#comment-17110367 ] Apache Spark commented on SPARK-31750: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/28572 > Eliminate UpCast if child's dataType is DecimalType > --- > > Key: SPARK-31750 > URL: https://issues.apache.org/jira/browse/SPARK-31750 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > > > {code:java} > sql("select cast(11 as decimal(38, 0)) as > d") > .write.mode("overwrite") > .parquet(f.getAbsolutePath) > spark.read.parquet(f.getAbsolutePath).as[BigDecimal] > {code} > {code:java} > [info] org.apache.spark.sql.AnalysisException: Cannot up cast `d` from > decimal(38,0) to decimal(38,18). > [info] The type path of the target object is: > [info] - root class: "scala.math.BigDecimal" > [info] You can either add an explicit cast to the input data or choose a > higher precision type of the field in the target object; > [info] at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:3060) > [info] at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$33$$anonfun$applyOrElse$174.applyOrElse(Analyzer.scala:3087) > [info] at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$33$$anonfun$applyOrElse$174.applyOrElse(Analyzer.scala:3071) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309) > [info] at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31750) Eliminate UpCast if child's dataType is DecimalType
[ https://issues.apache.org/jira/browse/SPARK-31750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31750: Assignee: Apache Spark > Eliminate UpCast if child's dataType is DecimalType > --- > > Key: SPARK-31750 > URL: https://issues.apache.org/jira/browse/SPARK-31750 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: wuyi >Assignee: Apache Spark >Priority: Major > > > {code:java} > sql("select cast(11 as decimal(38, 0)) as > d") > .write.mode("overwrite") > .parquet(f.getAbsolutePath) > spark.read.parquet(f.getAbsolutePath).as[BigDecimal] > {code} > {code:java} > [info] org.apache.spark.sql.AnalysisException: Cannot up cast `d` from > decimal(38,0) to decimal(38,18). > [info] The type path of the target object is: > [info] - root class: "scala.math.BigDecimal" > [info] You can either add an explicit cast to the input data or choose a > higher precision type of the field in the target object; > [info] at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:3060) > [info] at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$33$$anonfun$applyOrElse$174.applyOrElse(Analyzer.scala:3087) > [info] at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$33$$anonfun$applyOrElse$174.applyOrElse(Analyzer.scala:3071) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309) > [info] at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31750) Eliminate UpCast if child's dataType is DecimalType
[ https://issues.apache.org/jira/browse/SPARK-31750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-31750: - Summary: Eliminate UpCast if child's dataType is DecimalType (was: Eliminate UpCast if chid's dataType is DecimalType) > Eliminate UpCast if child's dataType is DecimalType > --- > > Key: SPARK-31750 > URL: https://issues.apache.org/jira/browse/SPARK-31750 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: wuyi >Priority: Major > > > {code:java} > sql("select cast(11 as decimal(38, 0)) as > d") > .write.mode("overwrite") > .parquet(f.getAbsolutePath) > spark.read.parquet(f.getAbsolutePath).as[BigDecimal] > {code} > {code:java} > [info] org.apache.spark.sql.AnalysisException: Cannot up cast `d` from > decimal(38,0) to decimal(38,18). > [info] The type path of the target object is: > [info] - root class: "scala.math.BigDecimal" > [info] You can either add an explicit cast to the input data or choose a > higher precision type of the field in the target object; > [info] at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:3060) > [info] at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$33$$anonfun$applyOrElse$174.applyOrElse(Analyzer.scala:3087) > [info] at > org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$33$$anonfun$applyOrElse$174.applyOrElse(Analyzer.scala:3071) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309) > [info] at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309) > [info] at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31750) Eliminate UpCast if chid's dataType is DecimalType
wuyi created SPARK-31750: Summary: Eliminate UpCast if chid's dataType is DecimalType Key: SPARK-31750 URL: https://issues.apache.org/jira/browse/SPARK-31750 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.1.0 Reporter: wuyi {code:java} sql("select cast(11 as decimal(38, 0)) as d") .write.mode("overwrite") .parquet(f.getAbsolutePath) spark.read.parquet(f.getAbsolutePath).as[BigDecimal] {code} {code:java} [info] org.apache.spark.sql.AnalysisException: Cannot up cast `d` from decimal(38,0) to decimal(38,18). [info] The type path of the target object is: [info] - root class: "scala.math.BigDecimal" [info] You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object; [info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:3060) [info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$33$$anonfun$applyOrElse$174.applyOrElse(Analyzer.scala:3087) [info] at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$33$$anonfun$applyOrElse$174.applyOrElse(Analyzer.scala:3071) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309) [info] at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309) [info] at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
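A hedged workaround sketch for the failure above, following the hint in the error message itself: add an explicit cast so the column already carries the {{decimal(38,18)}} type the {{BigDecimal}} encoder expects. The path is a placeholder for {{f.getAbsolutePath}} in the report, and the cast can lose precision or produce nulls for values with more than 20 integer digits.
{code:java}
// Hedged workaround sketch: add the explicit cast suggested by the error message instead of
// relying on the UpCast inserted by the BigDecimal encoder (which targets decimal(38,18)).
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DecimalType
import spark.implicits._

val path = "/tmp/spark-31750-demo"   // placeholder for f.getAbsolutePath in the report
spark.sql("select cast(11 as decimal(38, 0)) as d")
  .write.mode("overwrite").parquet(path)

val ds = spark.read.parquet(path)
  .select(col("d").cast(DecimalType(38, 18)).as("d"))  // matches the encoder's target type
  .as[BigDecimal]
ds.show()
{code}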
[jira] [Resolved] (SPARK-31738) Describe 'L' and 'M' month pattern letters
[ https://issues.apache.org/jira/browse/SPARK-31738?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31738. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28558 [https://github.com/apache/spark/pull/28558] > Describe 'L' and 'M' month pattern letters > -- > > Key: SPARK-31738 > URL: https://issues.apache.org/jira/browse/SPARK-31738 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 3.0.0, 3.1.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > Fix For: 3.0.0 > > > # Describe difference between 'M' and 'L' pattern letters > # Add examples -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
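A hedged illustration of the distinction this documentation change is about (not taken from the PR): Spark 3.0's datetime patterns delegate to {{java.time.format.DateTimeFormatter}}, where 'M' is the context-sensitive month form and 'L' the stand-alone form. English renders both identically, but locales such as Russian do not; the exact strings depend on the JDK's locale data.
{code:java}
// Hedged illustration: 'M' (format/context form) vs 'L' (stand-alone form) month letters.
// Output depends on the JDK/CLDR locale data in use.
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.Locale

val date = LocalDate.of(2020, 5, 18)
val ru = Locale.forLanguageTag("ru")
println(DateTimeFormatter.ofPattern("MMMM", ru).format(date)) // e.g. "мая" (form used inside a date)
println(DateTimeFormatter.ofPattern("LLLL", ru).format(date)) // e.g. "май" (month named on its own)
{code}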
[jira] [Assigned] (SPARK-31739) Docstring syntax issues prevent proper compilation of documentation
[ https://issues.apache.org/jira/browse/SPARK-31739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-31739: Assignee: David Toneian > Docstring syntax issues prevent proper compilation of documentation > --- > > Key: SPARK-31739 > URL: https://issues.apache.org/jira/browse/SPARK-31739 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.4.5 >Reporter: David Toneian >Assignee: David Toneian >Priority: Trivial > > Some docstrings contain mistakes, like missing or spurious spaces, which > prevent the documentation from being rendered as intended. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31739) Docstring syntax issues prevent proper compilation of documentation
[ https://issues.apache.org/jira/browse/SPARK-31739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31739. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28559 [https://github.com/apache/spark/pull/28559] > Docstring syntax issues prevent proper compilation of documentation > --- > > Key: SPARK-31739 > URL: https://issues.apache.org/jira/browse/SPARK-31739 > Project: Spark > Issue Type: Bug > Components: Documentation >Affects Versions: 2.4.5 >Reporter: David Toneian >Assignee: David Toneian >Priority: Trivial > Fix For: 3.1.0 > > > Some docstrings contain mistakes, like missing or spurious spaces, which > prevent the documentation from being rendered as intended. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31749) Allow to set owner reference for the driver pod (cluster mode)
Tamas Jambor created SPARK-31749: Summary: Allow to set owner reference for the driver pod (cluster mode) Key: SPARK-31749 URL: https://issues.apache.org/jira/browse/SPARK-31749 Project: Spark Issue Type: New Feature Components: Kubernetes Affects Versions: 2.4.5 Reporter: Tamas Jambor Currently there is no way to pass ownerReferences to the driver pod in cluster mode. This makes it difficult for the upstream process to clean up pods after they completed. Something like this would be useful: spark.kubernetes.driver.ownerReferences.[Name] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
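For illustration only, a sketch of how the proposed configuration might be supplied. The key pattern comes from the ticket itself; the value encoding and names are purely hypothetical, since this setting does not exist in Spark 2.4.5.
{code:java}
// Hypothetical sketch: this key is only *proposed* in SPARK-31749 and is not a real
// Spark 2.4.5 configuration. The value format and names are made up for illustration.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.kubernetes.driver.ownerReferences.parent-job",
       "apiVersion=batch/v1,kind=Job,name=parent-job,uid=123e4567-e89b-12d3-a456-426614174000")
{code}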
[jira] [Commented] (SPARK-21784) Add ALTER TABLE ADD CONSTRANT DDL to support defining primary key and foreign keys
[ https://issues.apache.org/jira/browse/SPARK-21784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110140#comment-17110140 ] krishnendu mukherjee commented on SPARK-21784: -- has this been addded to spark latest release? > Add ALTER TABLE ADD CONSTRANT DDL to support defining primary key and foreign > keys > -- > > Key: SPARK-21784 > URL: https://issues.apache.org/jira/browse/SPARK-21784 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Suresh Thalamati >Priority: Major > > Currently Spark SQL does not have DDL support to define primary key , and > foreign key constraints. This Jira is to add DDL support to define primary > key and foreign key informational constraint using ALTER TABLE syntax. These > constraints will be used in query optimization and you can find more details > about this in the spec in SPARK-19842 > *Syntax :* > {code} > ALTER TABLE [db_name.]table_name ADD [CONSTRAINT constraintName] > (PRIMARY KEY (col_names) | > FOREIGN KEY (col_names) REFERENCES [db_name.]table_name [(col_names)]) > [VALIDATE | NOVALIDATE] [RELY | NORELY] > {code} > Examples : > {code:sql} > ALTER TABLE employee _ADD CONSTRANT pk_ PRIMARY KEY(empno) VALIDATE RELY > ALTER TABLE department _ADD CONSTRAINT emp_fk_ FOREIGN KEY (mgrno) REFERENCES > employee(empno) NOVALIDATE NORELY > {code} > *Constraint name generated by the system:* > {code:sql} > ALTER TABLE department ADD PRIMARY KEY(deptno) VALIDATE RELY > ALTER TABLE employee ADD FOREIGN KEY (workdept) REFERENCES department(deptno) > VALIDATE RELY; > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31710) result is the not the same when query and execute jobs
[ https://issues.apache.org/jira/browse/SPARK-31710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110103#comment-17110103 ] Apache Spark commented on SPARK-31710: -- User 'GuoPhilipse' has created a pull request for this issue: https://github.com/apache/spark/pull/28570 > result is the not the same when query and execute jobs > -- > > Key: SPARK-31710 > URL: https://issues.apache.org/jira/browse/SPARK-31710 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 > Environment: hdp:2.7.7 > spark:2.4.5 >Reporter: philipse >Priority: Major > > Hi Team > Steps to reproduce. > {code:java} > create table test(id bigint); > insert into test select 1586318188000; > create table test1(id bigint) partitioned by (year string); > insert overwrite table test1 partition(year) select 234,cast(id as TIMESTAMP) > from test; > {code} > let's check the result. > Case 1: > *select * from test1;* > 234 | 52238-06-04 13:06:400.0 > --the result is wrong > Case 2: > *select 234,cast(id as TIMESTAMP) from test;* > > java.lang.IllegalArgumentException: Timestamp format must be -mm-dd > hh:mm:ss[.f] > at java.sql.Timestamp.valueOf(Timestamp.java:237) > at > org.apache.hive.jdbc.HiveBaseResultSet.evaluate(HiveBaseResultSet.java:441) > at > org.apache.hive.jdbc.HiveBaseResultSet.getColumnValue(HiveBaseResultSet.java:421) > at > org.apache.hive.jdbc.HiveBaseResultSet.getString(HiveBaseResultSet.java:530) > at org.apache.hive.beeline.Rows$Row.(Rows.java:166) > at org.apache.hive.beeline.BufferedRows.(BufferedRows.java:43) > at org.apache.hive.beeline.BeeLine.print(BeeLine.java:1756) > at org.apache.hive.beeline.Commands.execute(Commands.java:826) > at org.apache.hive.beeline.Commands.sql(Commands.java:670) > at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:974) > at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:810) > at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:767) > at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:480) > at org.apache.hive.beeline.BeeLine.main(BeeLine.java:463) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at org.apache.hadoop.util.RunJar.run(RunJar.java:226) > at org.apache.hadoop.util.RunJar.main(RunJar.java:141) > Error: Unrecognized column type:TIMESTAMP_TYPE (state=,code=0) > > I try hive,it works well,and the convert is fine and correct > {code:java} > select 234,cast(id as TIMESTAMP) from test; > 234 2020-04-08 11:56:28 > {code} > Two questions: > q1: > if we forbid this convert,should we keep all cases the same? > q2: > if we allow the convert in some cases, should we decide the long length, for > the code seems to force to convert to ns with times*100 nomatter how long > the data is,if it convert to timestamp with incorrect length, we can raise > the error. > {code:java} > // // converting seconds to us > private[this] def longToTimestamp(t: Long): Long = t * 100L{code} > > Thanks! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
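A hedged workaround sketch for the report above: Spark's {{CAST(bigint AS TIMESTAMP)}} interprets the value as seconds since the epoch, while {{1586318188000}} is in milliseconds, which is why the result lands in year 52238. Dividing by 1000 before casting yields the intended timestamp; this assumes the {{test}} table from the reproduction steps.
{code:java}
// Hedged workaround sketch: treat the stored bigint as epoch *milliseconds* explicitly
// instead of letting CAST interpret it as epoch seconds. Assumes the `test` table above.
spark.sql("""
  SELECT 234,
         CAST(id / 1000 AS TIMESTAMP) AS ts   -- id holds epoch milliseconds
  FROM test
""").show(truncate = false)
{code}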
[jira] [Commented] (SPARK-31748) Document resource module in PySpark doc and rename/move classes
[ https://issues.apache.org/jira/browse/SPARK-31748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110064#comment-17110064 ] Apache Spark commented on SPARK-31748: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/28569 > Document resource module in PySpark doc and rename/move classes > --- > > Key: SPARK-31748 > URL: https://issues.apache.org/jira/browse/SPARK-31748 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > SPARK-29641 and SPARK-28234 added new pyspark.resrouce module. > It should be documented as it's an API. > Also, the current structure is as follows: > {code} > pyspark > ├── resource > │ ├── executorrequests.py > │ │ ├── class ExecutorResourceRequest > │ │ └── class ExecutorResourceRequests > │ ├── taskrequests.py > │ │ ├── class TaskResourceRequest > │ │ └── class TaskResourceRequests > │ ├── resourceprofilebuilder.py > │ │ └── class ResourceProfileBuilder > │ ├── resourceprofile.py > │ │ └── class ResourceProfile > └── resourceinformation > └── class ResourceInformation > {code} > Might better put into fewer and simpler modules. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31748) Document resource module in PySpark doc and rename/move classes
[ https://issues.apache.org/jira/browse/SPARK-31748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31748: Assignee: (was: Apache Spark) > Document resource module in PySpark doc and rename/move classes > --- > > Key: SPARK-31748 > URL: https://issues.apache.org/jira/browse/SPARK-31748 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Priority: Major > > SPARK-29641 and SPARK-28234 added new pyspark.resrouce module. > It should be documented as it's an API. > Also, the current structure is as follows: > {code} > pyspark > ├── resource > │ ├── executorrequests.py > │ │ ├── class ExecutorResourceRequest > │ │ └── class ExecutorResourceRequests > │ ├── taskrequests.py > │ │ ├── class TaskResourceRequest > │ │ └── class TaskResourceRequests > │ ├── resourceprofilebuilder.py > │ │ └── class ResourceProfileBuilder > │ ├── resourceprofile.py > │ │ └── class ResourceProfile > └── resourceinformation > └── class ResourceInformation > {code} > Might better put into fewer and simpler modules. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31748) Document resource module in PySpark doc and rename/move classes
[ https://issues.apache.org/jira/browse/SPARK-31748?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31748: Assignee: Apache Spark > Document resource module in PySpark doc and rename/move classes > --- > > Key: SPARK-31748 > URL: https://issues.apache.org/jira/browse/SPARK-31748 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Core >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > SPARK-29641 and SPARK-28234 added new pyspark.resrouce module. > It should be documented as it's an API. > Also, the current structure is as follows: > {code} > pyspark > ├── resource > │ ├── executorrequests.py > │ │ ├── class ExecutorResourceRequest > │ │ └── class ExecutorResourceRequests > │ ├── taskrequests.py > │ │ ├── class TaskResourceRequest > │ │ └── class TaskResourceRequests > │ ├── resourceprofilebuilder.py > │ │ └── class ResourceProfileBuilder > │ ├── resourceprofile.py > │ │ └── class ResourceProfile > └── resourceinformation > └── class ResourceInformation > {code} > Might better put into fewer and simpler modules. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31748) Document resource module in PySpark doc and rename/move classes
Hyukjin Kwon created SPARK-31748: Summary: Document resource module in PySpark doc and rename/move classes Key: SPARK-31748 URL: https://issues.apache.org/jira/browse/SPARK-31748 Project: Spark Issue Type: Improvement Components: PySpark, Spark Core Affects Versions: 3.0.0 Reporter: Hyukjin Kwon SPARK-29641 and SPARK-28234 added the new pyspark.resource module. It should be documented as it's an API. Also, the current structure is as follows: {code} pyspark ├── resource │ ├── executorrequests.py │ │ ├── class ExecutorResourceRequest │ │ └── class ExecutorResourceRequests │ ├── taskrequests.py │ │ ├── class TaskResourceRequest │ │ └── class TaskResourceRequests │ ├── resourceprofilebuilder.py │ │ └── class ResourceProfileBuilder │ ├── resourceprofile.py │ │ └── class ResourceProfile └── resourceinformation └── class ResourceInformation {code} It might be better to put these into fewer, simpler modules. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31710) result is the not the same when query and execute jobs
[ https://issues.apache.org/jira/browse/SPARK-31710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110054#comment-17110054 ] Apache Spark commented on SPARK-31710: -- User 'GuoPhilipse' has created a pull request for this issue: https://github.com/apache/spark/pull/28568 > result is the not the same when query and execute jobs > -- > > Key: SPARK-31710 > URL: https://issues.apache.org/jira/browse/SPARK-31710 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 > Environment: hdp:2.7.7 > spark:2.4.5 >Reporter: philipse >Priority: Major > > Hi Team > Steps to reproduce. > {code:java} > create table test(id bigint); > insert into test select 1586318188000; > create table test1(id bigint) partitioned by (year string); > insert overwrite table test1 partition(year) select 234,cast(id as TIMESTAMP) > from test; > {code} > let's check the result. > Case 1: > *select * from test1;* > 234 | 52238-06-04 13:06:400.0 > --the result is wrong > Case 2: > *select 234,cast(id as TIMESTAMP) from test;* > > java.lang.IllegalArgumentException: Timestamp format must be -mm-dd > hh:mm:ss[.f] > at java.sql.Timestamp.valueOf(Timestamp.java:237) > at > org.apache.hive.jdbc.HiveBaseResultSet.evaluate(HiveBaseResultSet.java:441) > at > org.apache.hive.jdbc.HiveBaseResultSet.getColumnValue(HiveBaseResultSet.java:421) > at > org.apache.hive.jdbc.HiveBaseResultSet.getString(HiveBaseResultSet.java:530) > at org.apache.hive.beeline.Rows$Row.(Rows.java:166) > at org.apache.hive.beeline.BufferedRows.(BufferedRows.java:43) > at org.apache.hive.beeline.BeeLine.print(BeeLine.java:1756) > at org.apache.hive.beeline.Commands.execute(Commands.java:826) > at org.apache.hive.beeline.Commands.sql(Commands.java:670) > at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:974) > at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:810) > at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:767) > at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:480) > at org.apache.hive.beeline.BeeLine.main(BeeLine.java:463) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at org.apache.hadoop.util.RunJar.run(RunJar.java:226) > at org.apache.hadoop.util.RunJar.main(RunJar.java:141) > Error: Unrecognized column type:TIMESTAMP_TYPE (state=,code=0) > > I try hive,it works well,and the convert is fine and correct > {code:java} > select 234,cast(id as TIMESTAMP) from test; > 234 2020-04-08 11:56:28 > {code} > Two questions: > q1: > if we forbid this convert,should we keep all cases the same? > q2: > if we allow the convert in some cases, should we decide the long length, for > the code seems to force to convert to ns with times*100 nomatter how long > the data is,if it convert to timestamp with incorrect length, we can raise > the error. > {code:java} > // // converting seconds to us > private[this] def longToTimestamp(t: Long): Long = t * 100L{code} > > Thanks! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31710) result is the not the same when query and execute jobs
[ https://issues.apache.org/jira/browse/SPARK-31710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17110043#comment-17110043 ] Apache Spark commented on SPARK-31710: -- User 'GuoPhilipse' has created a pull request for this issue: https://github.com/apache/spark/pull/28567 > result is the not the same when query and execute jobs > -- > > Key: SPARK-31710 > URL: https://issues.apache.org/jira/browse/SPARK-31710 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 > Environment: hdp:2.7.7 > spark:2.4.5 >Reporter: philipse >Priority: Major > > Hi Team > Steps to reproduce. > {code:java} > create table test(id bigint); > insert into test select 1586318188000; > create table test1(id bigint) partitioned by (year string); > insert overwrite table test1 partition(year) select 234,cast(id as TIMESTAMP) > from test; > {code} > let's check the result. > Case 1: > *select * from test1;* > 234 | 52238-06-04 13:06:400.0 > --the result is wrong > Case 2: > *select 234,cast(id as TIMESTAMP) from test;* > > java.lang.IllegalArgumentException: Timestamp format must be -mm-dd > hh:mm:ss[.f] > at java.sql.Timestamp.valueOf(Timestamp.java:237) > at > org.apache.hive.jdbc.HiveBaseResultSet.evaluate(HiveBaseResultSet.java:441) > at > org.apache.hive.jdbc.HiveBaseResultSet.getColumnValue(HiveBaseResultSet.java:421) > at > org.apache.hive.jdbc.HiveBaseResultSet.getString(HiveBaseResultSet.java:530) > at org.apache.hive.beeline.Rows$Row.(Rows.java:166) > at org.apache.hive.beeline.BufferedRows.(BufferedRows.java:43) > at org.apache.hive.beeline.BeeLine.print(BeeLine.java:1756) > at org.apache.hive.beeline.Commands.execute(Commands.java:826) > at org.apache.hive.beeline.Commands.sql(Commands.java:670) > at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:974) > at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:810) > at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:767) > at org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:480) > at org.apache.hive.beeline.BeeLine.main(BeeLine.java:463) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at org.apache.hadoop.util.RunJar.run(RunJar.java:226) > at org.apache.hadoop.util.RunJar.main(RunJar.java:141) > Error: Unrecognized column type:TIMESTAMP_TYPE (state=,code=0) > > I try hive,it works well,and the convert is fine and correct > {code:java} > select 234,cast(id as TIMESTAMP) from test; > 234 2020-04-08 11:56:28 > {code} > Two questions: > q1: > if we forbid this convert,should we keep all cases the same? > q2: > if we allow the convert in some cases, should we decide the long length, for > the code seems to force to convert to ns with times*100 nomatter how long > the data is,if it convert to timestamp with incorrect length, we can raise > the error. > {code:java} > // // converting seconds to us > private[this] def longToTimestamp(t: Long): Long = t * 100L{code} > > Thanks! > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20427) Issue with Spark interpreting Oracle datatype NUMBER
[ https://issues.apache.org/jira/browse/SPARK-20427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17107376#comment-17107376 ] Rafael edited comment on SPARK-20427 at 5/18/20, 7:27 AM: -- Hey guys, I encountered an issue related to precision issues. Now the code expects the Decimal type we need to have in JDBC metadata precision and scale. [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L402-L414] I found out that in the OracleDB it is valid to have Decimal without these data. When I do a query read metadata for such column I'm getting DATA_PRECISION = Null, and DATA_SCALE = Null. Then when I run the `spark-sql` I'm getting such error: {code:java} java.lang.IllegalArgumentException: requirement failed: Decimal precision 45 exceeds max precision 38 at scala.Predef$.require(Predef.scala:224) at org.apache.spark.sql.types.Decimal.set(Decimal.scala:114) at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$3$$anonfun$12.apply(JdbcUtils.scala:407) {code} Do you have a work around how spark-sql can work with such cases? UPDATE: Solved with the custom scheme. was (Author: kyrdan): Hey guys, I encountered an issue related to precision issues. Now the code expects the Decimal type we need to have in JDBC metadata precision and scale. [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L402-L414] I found out that in the OracleDB it is valid to have Decimal without these data. When I do a query read metadata for such column I'm getting DATA_PRECISION = Null, and DATA_SCALE = Null. Then when I run the `spark-sql` I'm getting such error: {code:java} java.lang.IllegalArgumentException: requirement failed: Decimal precision 45 exceeds max precision 38 at scala.Predef$.require(Predef.scala:224) at org.apache.spark.sql.types.Decimal.set(Decimal.scala:114) at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:465) at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$3$$anonfun$12.apply(JdbcUtils.scala:407) {code} Do you have a work around how spark-sql can work with such cases? > Issue with Spark interpreting Oracle datatype NUMBER > > > Key: SPARK-20427 > URL: https://issues.apache.org/jira/browse/SPARK-20427 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.0 >Reporter: Alexander Andrushenko >Assignee: Yuming Wang >Priority: Major > Fix For: 2.3.0 > > > In Oracle exists data type NUMBER. When defining a filed in a table of type > NUMBER the field has two components, precision and scale. > For example, NUMBER(p,s) has precision p and scale s. > Precision can range from 1 to 38. > Scale can range from -84 to 127. > When reading such a filed Spark can create numbers with precision exceeding > 38. In our case it has created fields with precision 44, > calculated as sum of the precision (in our case 34 digits) and the scale (10): > "...java.lang.IllegalArgumentException: requirement failed: Decimal precision > 44 exceeds max precision 38...". > The result was, that a data frame was read from a table on one schema but > could not be inserted in the identical table on other schema. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
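The "custom scheme" the commenter mentions presumably refers to the JDBC reader's {{customSchema}} option; below is a hedged sketch of that workaround, with placeholder connection details, table, and column names.
{code:java}
// Hedged sketch of the customSchema workaround: tell the JDBC reader which Spark types to
// use for Oracle NUMBER columns declared without precision/scale, instead of letting the
// inferred precision exceed Spark's maximum of 38. All identifiers below are placeholders.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/SERVICE")
  .option("dbtable", "SOME_SCHEMA.SOME_TABLE")
  .option("user", "scott")
  .option("password", "tiger")
  .option("customSchema", "AMOUNT DECIMAL(38, 10), ID STRING")
  .load()
{code}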