[jira] [Resolved] (LIVY-987) NPE when waiting for thrift session to start timeout.

2023-09-05 Thread Marco Gaido (Jira)


 [ 
https://issues.apache.org/jira/browse/LIVY-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido resolved LIVY-987.
--
Resolution: Fixed

Issue resolved by https://github.com/apache/incubator-livy/pull/416.

> NPE when waiting for thrift session to start timeout.
> -
>
> Key: LIVY-987
> URL: https://issues.apache.org/jira/browse/LIVY-987
> Project: Livy
>  Issue Type: Bug
>Reporter: Jianzhen Wu
>Assignee: Jianzhen Wu
>Priority: Major
> Fix For: 0.9.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
>  
> Livy waits up to 10 minutes for the session to start. If startup takes longer 
> than that, a TimeoutException is thrown. The TimeoutException has no cause, so 
> when Livy executes throw e.getCause, an NPE occurs.
> *Livy Code*
> {code:java}
>   Try(Await.result(future, maxSessionWait)) match {
> case Success(session) => session
> case Failure(e) => throw e.getCause
>   } {code}
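> A minimal illustrative sketch (not the actual change merged for this issue) of a 
> null-safe rethrow that falls back to the exception itself when it has no cause:
> {code:java}
>   Try(Await.result(future, maxSessionWait)) match {
>     case Success(session) => session
>     // A TimeoutException carries no cause, so avoid throwing null (the NPE below):
>     case Failure(e) => throw Option(e.getCause).getOrElse(e)
>   } {code}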
> *Error Log*
> {code:java}
> 23/08/25 16:01:41 INFO  LivyExecuteStatementOperation: (Error executing 
> query, currentState RUNNING, ,java.lang.NullPointerException)
> 23/08/25 16:01:41 ERROR  LivyExecuteStatementOperation: Error running hive 
> query:
> org.apache.hive.service.cli.HiveSQLException: java.lang.NullPointerException
>         at 
> org.apache.livy.thriftserver.LivyExecuteStatementOperation.execute(LivyExecuteStatementOperation.scala:186)
>         at 
> org.apache.livy.thriftserver.LivyExecuteStatementOperation$$anon$2$$anon$3.run(LivyExecuteStatementOperation.scala:105)
>         at 
> org.apache.livy.thriftserver.LivyExecuteStatementOperation$$anon$2$$anon$3.run(LivyExecuteStatementOperation.scala:102)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:2038)
>         at 
> org.apache.livy.thriftserver.LivyExecuteStatementOperation$$anon$2.run(LivyExecuteStatementOperation.scala:115)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException
>         at 
> org.apache.livy.thriftserver.LivyThriftSessionManager.getLivySession(LivyThriftSessionManager.scala:99)
>         at 
> org.apache.livy.thriftserver.LivyExecuteStatementOperation.rpcClient$lzycompute(LivyExecuteStatementOperation.scala:65)
>         at 
> org.apache.livy.thriftserver.LivyExecuteStatementOperation.rpcClient(LivyExecuteStatementOperation.scala:58)
>         at 
> org.apache.livy.thriftserver.LivyExecuteStatementOperation.execute(LivyExecuteStatementOperation.scala:173)
>  {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (LIVY-987) NPE when waiting for thrift session to start timeout.

2023-09-05 Thread Marco Gaido (Jira)


 [ 
https://issues.apache.org/jira/browse/LIVY-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated LIVY-987:
-
Fix Version/s: 0.9.0

> NPE when waiting for thrift session to start timeout.
> -
>
> Key: LIVY-987
> URL: https://issues.apache.org/jira/browse/LIVY-987
> Project: Livy
>  Issue Type: Bug
>Reporter: Jianzhen Wu
>Assignee: Jianzhen Wu
>Priority: Major
> Fix For: 0.9.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
>  
> Livy waits up to 10 minutes for the session to start. If startup takes longer 
> than that, a TimeoutException is thrown. The TimeoutException has no cause, so 
> when Livy executes throw e.getCause, an NPE occurs.
> *Livy Code*
> {code:java}
>   Try(Await.result(future, maxSessionWait)) match {
> case Success(session) => session
> case Failure(e) => throw e.getCause
>   } {code}
> *Error Log*
> {code:java}
> 23/08/25 16:01:41 INFO  LivyExecuteStatementOperation: (Error executing 
> query, currentState RUNNING, ,java.lang.NullPointerException)
> 23/08/25 16:01:41 ERROR  LivyExecuteStatementOperation: Error running hive 
> query:
> org.apache.hive.service.cli.HiveSQLException: java.lang.NullPointerException
>         at 
> org.apache.livy.thriftserver.LivyExecuteStatementOperation.execute(LivyExecuteStatementOperation.scala:186)
>         at 
> org.apache.livy.thriftserver.LivyExecuteStatementOperation$$anon$2$$anon$3.run(LivyExecuteStatementOperation.scala:105)
>         at 
> org.apache.livy.thriftserver.LivyExecuteStatementOperation$$anon$2$$anon$3.run(LivyExecuteStatementOperation.scala:102)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:2038)
>         at 
> org.apache.livy.thriftserver.LivyExecuteStatementOperation$$anon$2.run(LivyExecuteStatementOperation.scala:115)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException
>         at 
> org.apache.livy.thriftserver.LivyThriftSessionManager.getLivySession(LivyThriftSessionManager.scala:99)
>         at 
> org.apache.livy.thriftserver.LivyExecuteStatementOperation.rpcClient$lzycompute(LivyExecuteStatementOperation.scala:65)
>         at 
> org.apache.livy.thriftserver.LivyExecuteStatementOperation.rpcClient(LivyExecuteStatementOperation.scala:58)
>         at 
> org.apache.livy.thriftserver.LivyExecuteStatementOperation.execute(LivyExecuteStatementOperation.scala:173)
>  {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (LIVY-987) NPE when waiting for thrift session to start timeout.

2023-09-05 Thread Marco Gaido (Jira)


 [ 
https://issues.apache.org/jira/browse/LIVY-987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated LIVY-987:
-
Component/s: Thriftserver

> NPE when waiting for thrift session to start timeout.
> -
>
> Key: LIVY-987
> URL: https://issues.apache.org/jira/browse/LIVY-987
> Project: Livy
>  Issue Type: Bug
>  Components: Thriftserver
>Reporter: Jianzhen Wu
>Assignee: Jianzhen Wu
>Priority: Major
> Fix For: 0.9.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
>  
> Livy waits up to 10 minutes for the session to start. If startup takes longer 
> than that, a TimeoutException is thrown. The TimeoutException has no cause, so 
> when Livy executes throw e.getCause, an NPE occurs.
> *Livy Code*
> {code:java}
>   Try(Await.result(future, maxSessionWait)) match {
> case Success(session) => session
> case Failure(e) => throw e.getCause
>   } {code}
> *Error Log*
> {code:java}
> 23/08/25 16:01:41 INFO  LivyExecuteStatementOperation: (Error executing 
> query, currentState RUNNING, ,java.lang.NullPointerException)
> 23/08/25 16:01:41 ERROR  LivyExecuteStatementOperation: Error running hive 
> query:
> org.apache.hive.service.cli.HiveSQLException: java.lang.NullPointerException
>         at 
> org.apache.livy.thriftserver.LivyExecuteStatementOperation.execute(LivyExecuteStatementOperation.scala:186)
>         at 
> org.apache.livy.thriftserver.LivyExecuteStatementOperation$$anon$2$$anon$3.run(LivyExecuteStatementOperation.scala:105)
>         at 
> org.apache.livy.thriftserver.LivyExecuteStatementOperation$$anon$2$$anon$3.run(LivyExecuteStatementOperation.scala:102)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:2038)
>         at 
> org.apache.livy.thriftserver.LivyExecuteStatementOperation$$anon$2.run(LivyExecuteStatementOperation.scala:115)
>         at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException
>         at 
> org.apache.livy.thriftserver.LivyThriftSessionManager.getLivySession(LivyThriftSessionManager.scala:99)
>         at 
> org.apache.livy.thriftserver.LivyExecuteStatementOperation.rpcClient$lzycompute(LivyExecuteStatementOperation.scala:65)
>         at 
> org.apache.livy.thriftserver.LivyExecuteStatementOperation.rpcClient(LivyExecuteStatementOperation.scala:58)
>         at 
> org.apache.livy.thriftserver.LivyExecuteStatementOperation.execute(LivyExecuteStatementOperation.scala:173)
>  {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (SPARK-14516) Clustering evaluator

2023-03-27 Thread Marco Gaido (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-14516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16069815#comment-16069815
 ] 

Marco Gaido edited comment on SPARK-14516 at 3/27/23 9:42 AM:
--

Hello everybody,

I have a proposal for a very efficient Silhouette implementation in a 
distributed environment. Here you can find the link with all the details of the 
solution. As soon as I finish the implementation and the tests, I will 
post the PR for this: 
-[https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view|https://drive.google.com/file/d/0B0Hyo__bG_3fdkNvSVNYX2E3ZU0/view].-
 (please refer to https://arxiv.org/abs/2303.14102).

Please tell me if you have any comments or doubts about it.

Thanks.


was (Author: mgaido):
Hello everybody,

I have a proposal for a very efficient Silhouette implementation in a 
distributed environment. Here you can find the link with all the details of the 
solution. As soon as I will finish all the implementation and the tests I will 
post the PR for this: 
https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view.

Please tell me if you have any comment, doubt on it.

Thanks.


> Clustering evaluator
> 
>
> Key: SPARK-14516
> URL: https://issues.apache.org/jira/browse/SPARK-14516
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Ruifeng Zheng
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.3.0
>
>
> MLlib does not have any general-purpose clustering metrics with a ground 
> truth.
> In 
> [Scikit-Learn](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics),
>  there are several kinds of metrics for this.
> It may be meaningful to add some clustering metrics into MLlib.
> This should be added as a {{ClusteringEvaluator}} class extending 
> {{Evaluator}} in spark.ml.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Commented] (SPARK-14516) Clustering evaluator

2023-03-27 Thread Marco Gaido (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-14516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17705265#comment-17705265
 ] 

Marco Gaido commented on SPARK-14516:
-

As there are some issues with the Google Doc sharing, I prepared a paper that 
explains the metric and its definition; please disregard the previous link and 
refer to [https://arxiv.org/abs/2303.14102] if you are interested. It will also 
be easier for me to update it in the future if needed. Thanks.
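
For reference, a minimal usage sketch of the evaluator that came out of this work 
in spark.ml (column names are the spark.ml defaults; {{predictions}} is assumed to 
be the output of a clustering model):
{code:java}
import org.apache.spark.ml.evaluation.ClusteringEvaluator

// Computes the mean Silhouette coefficient (squared Euclidean distance by default)
// over a DataFrame with "features" and "prediction" columns.
val evaluator = new ClusteringEvaluator()
  .setFeaturesCol("features")
  .setPredictionCol("prediction")
  .setMetricName("silhouette")
val silhouette = evaluator.evaluate(predictions)
{code}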

> Clustering evaluator
> 
>
> Key: SPARK-14516
> URL: https://issues.apache.org/jira/browse/SPARK-14516
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Ruifeng Zheng
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.3.0
>
>
> MLlib does not have any general-purpose clustering metrics with a ground 
> truth.
> In 
> [Scikit-Learn](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics),
>  there are several kinds of metrics for this.
> It may be meaningful to add some clustering metrics into MLlib.
> This should be added as a {{ClusteringEvaluator}} class extending 
> {{Evaluator}} in spark.ml.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[jira] [Commented] (LIVY-899) The state of interactive session is always idle when using thrift protocol.

2022-11-15 Thread Marco Gaido (Jira)


[ 
https://issues.apache.org/jira/browse/LIVY-899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17634337#comment-17634337
 ] 

Marco Gaido commented on LIVY-899:
--

Can you reproduce the issue with a UT? If so, please create a PR with the UT 
and the fix; I am happy to review it. Thanks.

> The state of interactive session is always idle when using thrift protocol.
> ---
>
> Key: LIVY-899
> URL: https://issues.apache.org/jira/browse/LIVY-899
> Project: Livy
>  Issue Type: Bug
>  Components: Thriftserver
>Affects Versions: 0.8.0
>Reporter: Jianzhen Wu
>Priority: Major
> Attachments: image-2022-11-15-20-42-01-214.png, 
> image-2022-11-15-20-42-25-472.png, image-2022-11-15-20-48-54-653.png
>
>
> In the REST API, the ReplDriver broadcasts its ReplState to the RSCClient when 
> handling a ReplJobRequest, via the stateChangedCallback function.
> !image-2022-11-15-20-42-01-214.png|width=242,height=200!
> But in the Thrift service, the RSCDriver does not broadcast the ReplState to 
> the RSCClient when handling a JobRequest.
> !image-2022-11-15-20-42-25-472.png|width=241,height=199!
> I would like to discuss with you how to resolve this issue.
> Here's what I think.
> !image-2022-11-15-20-48-54-653.png|width=228,height=136!
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Assigned] (LIVY-899) The state of interactive session is always idle when using thrift protocol.

2022-11-15 Thread Marco Gaido (Jira)


 [ 
https://issues.apache.org/jira/browse/LIVY-899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido reassigned LIVY-899:


Assignee: (was: Marco Gaido)

> The state of interactive session is always idle when using thrift protocol.
> ---
>
> Key: LIVY-899
> URL: https://issues.apache.org/jira/browse/LIVY-899
> Project: Livy
>  Issue Type: Bug
>  Components: Thriftserver
>Affects Versions: 0.8.0
>Reporter: Jianzhen Wu
>Priority: Major
> Attachments: image-2022-11-15-20-42-01-214.png, 
> image-2022-11-15-20-42-25-472.png, image-2022-11-15-20-48-54-653.png
>
>
> In the REST API, the ReplDriver broadcasts its ReplState to the RSCClient when 
> handling a ReplJobRequest, via the stateChangedCallback function.
> !image-2022-11-15-20-42-01-214.png|width=242,height=200!
> But in the Thrift service, the RSCDriver does not broadcast the ReplState to 
> the RSCClient when handling a JobRequest.
> !image-2022-11-15-20-42-25-472.png|width=241,height=199!
> I would like to discuss with you how to resolve this issue.
> Here's what I think.
> !image-2022-11-15-20-48-54-653.png|width=228,height=136!
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (SPARK-36673) Incorrect Unions of struct with mismatched field name case

2021-09-06 Thread Marco Gaido (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-36673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17410488#comment-17410488
 ] 

Marco Gaido commented on SPARK-36673:
-

AFAIK, in SQL the names in the struct are case sensitive, while the names of the 
normal fields are not. I am not sure about the right behavior here, but maybe I 
would expect an error at analysis time. Definitely, the current behavior is not 
correct.

> Incorrect Unions of struct with mismatched field name case
> --
>
> Key: SPARK-36673
> URL: https://issues.apache.org/jira/browse/SPARK-36673
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.1, 3.2.0
>Reporter: Shardul Mahadik
>Priority: Major
>
> If a nested field has different casing on two sides of the union, the 
> resultant schema of the union will contain both fields in its schema
> {code:java}
> scala> val df1 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS 
> INNER")))
> df1: org.apache.spark.sql.DataFrame = [id: bigint, nested: struct<INNER: bigint>]
> val df2 = spark.range(2).withColumn("nested", struct(expr("id * 5 AS inner")))
> df2: org.apache.spark.sql.DataFrame = [id: bigint, nested: struct<inner: bigint>]
> scala> df1.union(df2).printSchema
> root
>  |-- id: long (nullable = false)
>  |-- nested: struct (nullable = false)
>  ||-- INNER: long (nullable = false)
>  ||-- inner: long (nullable = false)
>  {code}
> This seems like a bug. I would expect that Spark SQL would either just union 
> by index or, if the user has requested {{unionByName}}, match fields 
> case-insensitively when {{spark.sql.caseSensitive}} is {{false}}.
> However the output data only has one nested column
> {code:java}
> scala> df1.union(df2).show()
> +---+------+
> | id|nested|
> +---+------+
> |  0|   {0}|
> |  1|   {5}|
> |  0|   {0}|
> |  1|   {5}|
> +---+------+
> {code}
> Trying to project fields of {{nested}} throws an error:
> {code:java}
> scala> df1.union(df2).select("nested.*").show()
> java.lang.ArrayIndexOutOfBoundsException: 1
>   at org.apache.spark.sql.types.StructType.apply(StructType.scala:414)
>   at 
> org.apache.spark.sql.catalyst.expressions.GetStructField.dataType(complexTypeExtractors.scala:108)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.toAttribute(namedExpressions.scala:192)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Project.$anonfun$output$1(basicLogicalOperators.scala:63)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
>   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:108)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Project.output(basicLogicalOperators.scala:63)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Union.$anonfun$output$3(basicLogicalOperators.scala:260)
>   at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
>   at scala.collection.immutable.List.foreach(List.scala:392)
>   at scala.collection.TraversableLike.map(TraversableLike.scala:238)
>   at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
>   at scala.collection.immutable.List.map(List.scala:298)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.Union.output(basicLogicalOperators.scala:260)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.outputSet$lzycompute(QueryPlan.scala:49)
>   at 
> org.apache.spark.sql.catalyst.plans.QueryPlan.outputSet(QueryPlan.scala:49)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ColumnPruning$$anonfun$apply$8.applyOrElse(Optimizer.scala:747)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ColumnPruning$$anonfun$apply$8.applyOrElse(Optimizer.scala:695)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:316)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:316)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:171)
>   at 
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:169)
>   at 
> 

[jira] [Resolved] (LIVY-754) precision and scale are not encoded in decimal type

2020-05-20 Thread Marco Gaido (Jira)


 [ 
https://issues.apache.org/jira/browse/LIVY-754?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido resolved LIVY-754.
--
Fix Version/s: 0.8.0
 Assignee: Wing Yew Poon
   Resolution: Fixed

Issue resolved by PR: https://github.com/apache/incubator-livy/pull/288.

> precision and scale are not encoded in decimal type
> ---
>
> Key: LIVY-754
> URL: https://issues.apache.org/jira/browse/LIVY-754
> Project: Livy
>  Issue Type: Bug
>  Components: Thriftserver
>Affects Versions: 0.7.0
>Reporter: Wing Yew Poon
>Assignee: Wing Yew Poon
>Priority: Major
> Fix For: 0.8.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> The Livy Thrift server support for decimal type in 0.7 is inadequate.
> Before LIVY-699, decimal is mapped to the catch-all string type. With 
> LIVY-699, decimal is mapped to a decimal type that is inadequate in that it 
> does not encode the precision and scale. The type in Livy is represented by a 
> BasicDataType case class which contains a String field, name; and a DataType 
> (an enum) field, dataType. In the case of decimal, the dataType is 
> DataType.DECIMAL. The precision and scale of the decimal are not encoded.
> When the DataType is converted to a TTypeDesc for sending a Thrift response 
> to a client request for result set metadata, the TTypeDesc contains a 
> TPrimitiveTypeEntry(TTypeId.DECIMAL_TYPE) without TTypeQualifiers (which are 
> needed to capture the precision and scale). This results in problems for 
> clients. E.g., if we connect to the Thrift server in beeline and do a select 
> from a table with a column of decimal type, we get
> {noformat}
> java.lang.NullPointerException
>   at org.apache.hive.jdbc.JdbcColumn.columnPrecision(JdbcColumn.java:310)
>   at 
> org.apache.hive.jdbc.JdbcColumn.columnDisplaySize(JdbcColumn.java:262)
>   at 
> org.apache.hive.jdbc.HiveResultSetMetaData.getColumnDisplaySize(HiveResultSetMetaData.java:63)
>   at 
> org.apache.hive.beeline.IncrementalRows.<init>(IncrementalRows.java:57)
>   at 
> org.apache.hive.beeline.IncrementalRowsWithNormalization.<init>(IncrementalRowsWithNormalization.java:47)
>   at org.apache.hive.beeline.BeeLine.print(BeeLine.java:2322)
>   at org.apache.hive.beeline.Commands.executeInternal(Commands.java:1026)
>   at org.apache.hive.beeline.Commands.execute(Commands.java:1215)
>   at org.apache.hive.beeline.Commands.sql(Commands.java:1144)
>   at org.apache.hive.beeline.BeeLine.dispatch(BeeLine.java:1497)
>   at org.apache.hive.beeline.BeeLine.execute(BeeLine.java:1355)
>   at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:1134)
>   at org.apache.hive.beeline.BeeLine.begin(BeeLine.java:1082)
>   at 
> org.apache.hive.beeline.BeeLine.mainWithInputRedirection(BeeLine.java:546)
>   at org.apache.hive.beeline.BeeLine.main(BeeLine.java:528)
> {noformat}
> Note: You have to use "--verbose" with beeline to see the stack trace for the 
> NPE.
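> A rough sketch (not the actual patch for this issue) of how precision and scale 
> can be attached to the Thrift type descriptor via TTypeQualifiers; the exact 
> generated names come from the Hive rpc thrift bindings:
> {code:java}
> import scala.collection.JavaConverters._
> import org.apache.hive.service.rpc.thrift._
> 
> def decimalTypeDesc(precision: Int, scale: Int): TTypeDesc = {
>   val primitive = new TPrimitiveTypeEntry(TTypeId.DECIMAL_TYPE)
>   // The qualifiers carry the precision and scale that clients like beeline need.
>   primitive.setTypeQualifiers(new TTypeQualifiers(Map(
>     TCLIServiceConstants.PRECISION -> TTypeQualifierValue.i32Value(precision),
>     TCLIServiceConstants.SCALE -> TTypeQualifierValue.i32Value(scale)).asJava))
>   val desc = new TTypeDesc()
>   desc.addToTypes(TTypeEntry.primitiveEntry(primitive))
>   desc
> }
> {code}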



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (LIVY-752) Livy TS does not accept any connections when limits are set on connections

2020-03-16 Thread Marco Gaido (Jira)


 [ 
https://issues.apache.org/jira/browse/LIVY-752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido resolved LIVY-752.
--
Fix Version/s: 0.8.0
 Assignee: Wing Yew Poon
   Resolution: Fixed

Issue resolved by PR: [https://github.com/apache/incubator-livy/pull/284].

> Livy TS does not accept any connections when limits are set on connections
> --
>
> Key: LIVY-752
> URL: https://issues.apache.org/jira/browse/LIVY-752
> Project: Livy
>  Issue Type: Bug
>  Components: Thriftserver
>Affects Versions: 0.7.0
>Reporter: Wing Yew Poon
>Assignee: Wing Yew Poon
>Priority: Major
> Fix For: 0.8.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I set livy.server.thrift.limit.connections.per.user=20 on my Livy Server. 
> When I try to connect to it, I get
> {noformat}
> 2020-02-28 17:13:30,443 WARN 
> org.apache.livy.thriftserver.cli.ThriftBinaryCLIService: Error opening 
> session: 
> java.lang.NullPointerException
>   at 
> org.apache.livy.thriftserver.LivyThriftSessionManager.incrementConnectionsCount(LivyThriftSessionManager.scala:438)
>   at 
> org.apache.livy.thriftserver.LivyThriftSessionManager.incrementConnections(LivyThriftSessionManager.scala:425)
>   at 
> org.apache.livy.thriftserver.LivyThriftSessionManager.openSession(LivyThriftSessionManager.scala:222)
>   at 
> org.apache.livy.thriftserver.LivyCLIService.openSessionWithImpersonation(LivyCLIService.scala:121)
>   at 
> org.apache.livy.thriftserver.cli.ThriftCLIService.getSessionHandle(ThriftCLIService.scala:324)
>   at 
> org.apache.livy.thriftserver.cli.ThriftCLIService.OpenSession(ThriftCLIService.scala:203)
>   at 
> org.apache.hive.service.rpc.thrift.TCLIService$Processor$OpenSession.getResult(TCLIService.java:1497)
>   at 
> org.apache.hive.service.rpc.thrift.TCLIService$Processor$OpenSession.getResult(TCLIService.java:1482)
>   at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
>   at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
>   at 
> org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
>   at 
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> {noformat}
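> A hedged sketch (illustrative only, not the merged fix) of a null-safe per-user 
> connection counter that avoids an NPE when no counter exists yet for the user:
> {code:java}
> import java.util.concurrent.ConcurrentHashMap
> import java.util.concurrent.atomic.AtomicLong
> 
> class ConnectionLimiter(maxPerUser: Long) {
>   private val perUser = new ConcurrentHashMap[String, AtomicLong]()
> 
>   // Initialize the counter lazily instead of assuming it was created at login.
>   def increment(user: String): Boolean = {
>     perUser.putIfAbsent(user, new AtomicLong(0))
>     val counter = perUser.get(user)
>     if (counter.incrementAndGet() > maxPerUser) { counter.decrementAndGet(); false }
>     else true
>   }
> }
> {code}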



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (LIVY-745) more than 11 concurrent clients lead to java.io.IOException: Unable to connect to provided ports 10000~10010

2020-02-15 Thread Marco Gaido (Jira)


 [ 
https://issues.apache.org/jira/browse/LIVY-745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido resolved LIVY-745.
--
Fix Version/s: 0.7.0
 Assignee: Wing Yew Poon
   Resolution: Fixed

Issue resolved by pull request 
https://github.com/apache/incubator-livy/pull/275.

> more than 11 concurrent clients lead to java.io.IOException: Unable to 
> connect to provided ports 10000~10010
> 
>
> Key: LIVY-745
> URL: https://issues.apache.org/jira/browse/LIVY-745
> Project: Livy
>  Issue Type: Bug
>  Components: RSC
>Affects Versions: 0.6.0
>Reporter: Wing Yew Poon
>Assignee: Wing Yew Poon
>Priority: Major
> Fix For: 0.7.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> In testing scalability of the Livy Thrift server, I am simultaneously 
> starting multiple connections to it. When there are more than 11 connections 
> started simultaneously, the 12th (and subsequent) connection will fail with:
> {noformat}
> 2020-01-10 13:53:28,686 ERROR 
> org.apache.livy.thriftserver.LivyExecuteStatementOperation: Error running 
> hive query: 
> org.apache.hive.service.cli.HiveSQLException: java.lang.RuntimeException: 
> java.io.IOException: Unable to connect to provided ports 10000~10010
> {noformat}
> Here is the excerpt from the Livy server log:
> {noformat}
> 2020-01-10 13:53:28,138 INFO 
> org.apache.livy.server.interactive.InteractiveSession$: Creating Interactive 
> session 0: [owner: systest, request: [kind: spark, proxyUser: None, 
> heartbeatTimeoutInSecond: 0]]
> ...
> 2020-01-10 13:53:28,147 INFO 
> org.apache.livy.server.interactive.InteractiveSession$: Creating Interactive 
> session 1: [owner: systest, request: [kind: spark, proxyUser: None, 
> heartbeatTimeoutInSecond: 0]]
> ...
> 2020-01-10 13:53:28,196 INFO 
> org.apache.livy.server.interactive.InteractiveSession$: Creating Interactive 
> session 2: [owner: systest, request: [kind: spark, proxyUser: None, 
> heartbeatTimeoutInSecond: 0]]
> ...
> 2020-01-10 13:53:28,247 INFO 
> org.apache.livy.server.interactive.InteractiveSession$: Creating Interactive 
> session 3: [owner: systest, request: [kind: spark, proxyUser: None, 
> heartbeatTimeoutInSecond: 0]]
> ...
> 2020-01-10 13:53:28,304 INFO 
> org.apache.livy.server.interactive.InteractiveSession$: Creating Interactive 
> session 4: [owner: systest, request: [kind: spark, proxyUser: None, 
> heartbeatTimeoutInSecond: 0]]
> 2020-01-10 13:53:28,329 DEBUG org.apache.livy.rsc.rpc.RpcServer: RPC not able 
> to connect port 10000 Address already in use
> 2020-01-10 13:53:28,329 DEBUG org.apache.livy.rsc.rpc.RpcServer: RPC not able 
> to connect port 10000 Address already in use
> 2020-01-10 13:53:28,329 DEBUG org.apache.livy.rsc.rpc.RpcServer: RPC not able 
> to connect port 10000 Address already in use
> 2020-01-10 13:53:28,329 DEBUG org.apache.livy.rsc.rpc.RpcServer: RPC not able 
> to connect port 10000 Address already in use
> 2020-01-10 13:53:28,329 INFO org.apache.livy.rsc.rpc.RpcServer: Connected to 
> the port 10000
> ...
> 2020-01-10 13:53:28,331 INFO org.apache.livy.rsc.rpc.RpcServer: Connected to 
> the port 10001
> ...
> 2020-01-10 13:53:28,335 DEBUG org.apache.livy.rsc.rpc.RpcServer: RPC not able 
> to connect port 10001 Address already in use
> 2020-01-10 13:53:28,335 DEBUG org.apache.livy.rsc.rpc.RpcServer: RPC not able 
> to connect port 10001 Address already in use
> 2020-01-10 13:53:28,335 DEBUG org.apache.livy.rsc.rpc.RpcServer: RPC not able 
> to connect port 10001 Address already in use
> 2020-01-10 13:53:28,338 INFO org.apache.livy.rsc.rpc.RpcServer: Connected to 
> the port 10002
> 2020-01-10 13:53:28,338 DEBUG org.apache.livy.rsc.rpc.RpcServer: RPC not able 
> to connect port 10002 Address already in use
> ...
> 2020-01-10 13:53:28,339 DEBUG org.apache.livy.rsc.rpc.RpcServer: RPC not able 
> to connect port 10002 Address already in use
> 2020-01-10 13:53:28,341 INFO org.apache.livy.rsc.rpc.RpcServer: Connected to 
> the port 10003
> ...
> 2020-01-10 13:53:28,341 DEBUG org.apache.livy.rsc.rpc.RpcServer: RPC not able 
> to connect port 10003 Address already in use
> ...
> 2020-01-10 13:53:28,343 INFO org.apache.livy.rsc.rpc.RpcServer: Connected to 
> the port 10004
> ...
> 2020-01-10 13:53:28,362 INFO 
> org.apache.livy.server.interactive.InteractiveSession$: Creating Interactive 
> session 5: [owner: systest, request: [kind: spark, proxyUser: None, 
> heartbeatTimeoutInSecond: 0]]
> 2020-01-10 13:53:28,367 DEBUG org.apache.livy.rsc.rpc.RpcServer: RPC not able 
> to connect port 10000 Address already in use
> 2020-01-10 13:53:28,369 DEBUG org.apache.livy.rsc.rpc.RpcServer: RPC not able 
> to connect port 10001 Address already in use
> 2020-01-10 13:53:28,371 DEBUG 

[jira] [Comment Edited] (LIVY-718) Support multi-active high availability in Livy

2019-12-30 Thread Marco Gaido (Jira)


[ 
https://issues.apache.org/jira/browse/LIVY-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005512#comment-17005512
 ] 

Marco Gaido edited comment on LIVY-718 at 12/30/19 6:29 PM:


bq. IIRC JDBC had a REST and an RPC mode. The RPC mode might not be HA without 
a fat client but perhaps the REST mode could. Does Hive JDBC support HA on the 
Hive Thrift server? Then maybe the hive JDBC client now supports server side 
transitions. If not, then we may have the caveat that HA won't work for such 
connections. I am not super familiar with the JDBC client.

Actually, JDBC has either HTTP or binary mode, but it relates only to the 
communication between the Livy server and the client and it is unrelated to how 
the Livy server interacts with Spark (ie. via RPC).


was (Author: mgaido):
> IIRC JDBC had a REST and an RPC mode. The RPC mode might not be HA without a 
> fat client but perhaps the REST mode could. Does Hive JDBC support HA on the 
> Hive Thrift server? Then maybe the hive JDBC client now supports server side 
> transitions. If not, then we may have the caveat that HA won't work for such 
> connections. I am not super familiar with the JDBC client.

Actually, JDBC has either HTTP or binary mode, but it relates only to the 
communication between the Livy server and the client and it is unrelated to how 
the Livy server interacts with Spark (ie. via RPC).

> Support multi-active high availability in Livy
> --
>
> Key: LIVY-718
> URL: https://issues.apache.org/jira/browse/LIVY-718
> Project: Livy
>  Issue Type: Epic
>  Components: RSC, Server
>Reporter: Yiheng Wang
>Priority: Major
>
> In this JIRA we want to discuss how to implement multi-active high 
> availability in Livy.
> Currently, Livy only supports single node recovery. This is not sufficient in 
> some production environments. In our scenario, the Livy server serves many 
> notebook and JDBC services. We want to make Livy service more fault-tolerant 
> and scalable.
> There are already some proposals in the community for high availability, but 
> they are either incomplete or cover only active-standby. So we 
> propose a multi-active high availability design to achieve the following 
> goals:
> # One or more servers will serve the client requests at the same time.
> # Sessions are allocated among different servers.
> # When one node crashes, the affected sessions will be moved to other active 
> services.
> Here's our design document, please review and comment:
> https://docs.google.com/document/d/1bD3qYZpw14_NuCcSGUOfqQ0pqvSbCQsOLFuZp26Ohjc/edit?usp=sharing
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (LIVY-718) Support multi-active high availability in Livy

2019-12-30 Thread Marco Gaido (Jira)


[ 
https://issues.apache.org/jira/browse/LIVY-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005512#comment-17005512
 ] 

Marco Gaido commented on LIVY-718:
--

> IIRC JDBC had a REST and an RPC mode. The RPC mode might not be HA without a 
> fat client but perhaps the REST mode could. Does Hive JDBC support HA on the 
> Hive Thrift server? Then maybe the hive JDBC client now supports server side 
> transitions. If not, then we may have the caveat that HA won't work for such 
> connections. I am not super familiar with the JDBC client.

Actually, JDBC has either HTTP or binary mode, but it relates only to the 
communication between the Livy server and the client and it is unrelated to how 
the Livy server interacts with Spark (ie. via RPC).

> Support multi-active high availability in Livy
> --
>
> Key: LIVY-718
> URL: https://issues.apache.org/jira/browse/LIVY-718
> Project: Livy
>  Issue Type: Epic
>  Components: RSC, Server
>Reporter: Yiheng Wang
>Priority: Major
>
> In this JIRA we want to discuss how to implement multi-active high 
> availability in Livy.
> Currently, Livy only supports single node recovery. This is not sufficient in 
> some production environments. In our scenario, the Livy server serves many 
> notebook and JDBC services. We want to make Livy service more fault-tolerant 
> and scalable.
> There are already some proposals in the community for high availability, but 
> they are either incomplete or cover only active-standby. So we 
> propose a multi-active high availability design to achieve the following 
> goals:
> # One or more servers will serve the client requests at the same time.
> # Sessions are allocated among different servers.
> # When one node crashes, the affected sessions will be moved to other active 
> services.
> Here's our design document, please review and comment:
> https://docs.google.com/document/d/1bD3qYZpw14_NuCcSGUOfqQ0pqvSbCQsOLFuZp26Ohjc/edit?usp=sharing
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (LIVY-718) Support multi-active high availability in Livy

2019-12-29 Thread Marco Gaido (Jira)


[ 
https://issues.apache.org/jira/browse/LIVY-718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005063#comment-17005063
 ] 

Marco Gaido commented on LIVY-718:
--

Thanks for your comments [~bikassaha]. I am not sure (actually I think it is 
not true) that all the metadata information is present in the Spark driver 
process; there is metadata which is frequently accessed/changed (e.g. the 
ongoing statements for a given session) on the Livy server side (at least for 
the thrift part). Indeed, there is definitely metadata currently kept in the 
server memory which would need to be saved for HA's sake. Hence, I am afraid 
that, at least for the thrift case, the usage of a slow storage like HDFS 
would require a significant revisit of the thrift part.

I agree that active-active is by far the most desirable choice. I see, though, 
that it is not easy to implement, IMHO, because, for the metadata above, it 
would require a distributed state store as the source of truth. 
Given your negative opinion on ZK, I hardly see any other system which would 
fit (a relational DB cluster maybe? but not easier to maintain than ZK for 
sure, I'd say). Hence I am drawn to consider that we would need to trade off 
things here, unless I am very mistaken on the point above: namely, the REST 
part has really no significant metadata on Livy server side and we keep the 
thrift one out of scope here.


> Support multi-active high availability in Livy
> --
>
> Key: LIVY-718
> URL: https://issues.apache.org/jira/browse/LIVY-718
> Project: Livy
>  Issue Type: Epic
>  Components: RSC, Server
>Reporter: Yiheng Wang
>Priority: Major
>
> In this JIRA we want to discuss how to implement multi-active high 
> availability in Livy.
> Currently, Livy only supports single node recovery. This is not sufficient in 
> some production environments. In our scenario, the Livy server serves many 
> notebook and JDBC services. We want to make Livy service more fault-tolerant 
> and scalable.
> There are already some proposals in the community for high availability, but 
> they are either incomplete or cover only active-standby. So we 
> propose a multi-active high availability design to achieve the following 
> goals:
> # One or more servers will serve the client requests at the same time.
> # Sessions are allocated among different servers.
> # When one node crashes, the affected sessions will be moved to other active 
> services.
> Here's our design document, please review and comment:
> https://docs.google.com/document/d/1bD3qYZpw14_NuCcSGUOfqQ0pqvSbCQsOLFuZp26Ohjc/edit?usp=sharing
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (SPARK-29667) implicitly convert mismatched datatypes on right side of "IN" operator

2019-12-04 Thread Marco Gaido (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16987712#comment-16987712
 ] 

Marco Gaido commented on SPARK-29667:
-

I can't agree more with you [~hyukjin.kwon]. I think that having different 
coercion rules for the two types of IN is very confusing. It'd be great for 
such things to be consistent across the whole framework, in order to avoid 
"surprises" for users, IMHO.

> implicitly convert mismatched datatypes on right side of "IN" operator
> --
>
> Key: SPARK-29667
> URL: https://issues.apache.org/jira/browse/SPARK-29667
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Jessie Lin
>Priority: Minor
>
> Ran into an error on this SQL.
> Mismatched columns:
> {code}
> [(a.`id`:decimal(28,0), db1.table1.`id`:decimal(18,0))] 
> {code}
> The SQL AND clause:
> {code}
>   AND   a.id in (select id from db1.table1 where col1 = 1 group by id)
> {code}
> Once I cast {{decimal(18,0)}} to {{decimal(28,0)}} explicitly above, the SQL 
> ran just fine. Can the SQL engine cast implicitly in this case?
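> For illustration, the explicit cast that works today (the outer query is 
> hypothetical; table and column names are taken from this report):
> {code:java}
> // Casting the subquery side so both sides of the IN share decimal(28,0).
> spark.sql("""
>   SELECT a.*
>   FROM some_table a
>   WHERE a.id IN (SELECT CAST(id AS DECIMAL(28,0))
>                  FROM db1.table1 WHERE col1 = 1 GROUP BY id)
> """)
> {code}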
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Assigned] (LIVY-705) Support getting keystore password from Hadoop credential provider for Thrift server

2019-11-06 Thread Marco Gaido (Jira)


 [ 
https://issues.apache.org/jira/browse/LIVY-705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido reassigned LIVY-705:


Assignee: Wing Yew Poon

> Support getting keystore password from Hadoop credential provider for Thrift 
> server
> ---
>
> Key: LIVY-705
> URL: https://issues.apache.org/jira/browse/LIVY-705
> Project: Livy
>  Issue Type: Improvement
>  Components: Thriftserver
>Affects Versions: 0.6.0
>Reporter: Wing Yew Poon
>Assignee: Wing Yew Poon
>Priority: Major
> Fix For: 0.7.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> LIVY-475 added support for getting the keystore password and key password 
> from a Hadoop credential provider file. The keystore password is also needed 
> for SSL/TLS support in the Thrift server. We should extend the support for 
> getting the keystore password from the Hadoop credential provider to the 
> Thrift server as well.
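> A minimal sketch (not the actual change for this issue) of reading a password 
> through the Hadoop credential provider API; the alias name below is hypothetical:
> {code:java}
> import org.apache.hadoop.conf.Configuration
> 
> // Configuration.getPassword consults any configured credential providers first
> // and falls back to a clear-text property of the same name if none is set.
> def keystorePassword(hadoopConf: Configuration): Option[String] =
>   Option(hadoopConf.getPassword("livy.keystore.password")).map(new String(_))
> {code}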



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (LIVY-705) Support getting keystore password from Hadoop credential provider for Thrift server

2019-11-06 Thread Marco Gaido (Jira)


 [ 
https://issues.apache.org/jira/browse/LIVY-705?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido resolved LIVY-705.
--
Fix Version/s: 0.7.0
   Resolution: Fixed

Issue resolved by PR https://github.com/apache/incubator-livy/pull/253.

> Support getting keystore password from Hadoop credential provider for Thrift 
> server
> ---
>
> Key: LIVY-705
> URL: https://issues.apache.org/jira/browse/LIVY-705
> Project: Livy
>  Issue Type: Improvement
>  Components: Thriftserver
>Affects Versions: 0.6.0
>Reporter: Wing Yew Poon
>Priority: Major
> Fix For: 0.7.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> LIVY-475 added support for getting the keystore password and key password 
> from a Hadoop credential provider file. The keystore password is also needed 
> for SSL/TLS support in the Thrift server. We should extend the support for 
> getting the keystore password from the Hadoop credential provider to the 
> Thrift server as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (LIVY-356) Add support for LDAP authentication in Livy Server

2019-10-13 Thread Marco Gaido (Jira)


 [ 
https://issues.apache.org/jira/browse/LIVY-356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido resolved LIVY-356.
--
Fix Version/s: 0.7.0
 Assignee: mingchao zhao
   Resolution: Fixed

Issue resolved by https://github.com/apache/incubator-livy/pull/231.

> Add support for LDAP authentication in Livy Server
> --
>
> Key: LIVY-356
> URL: https://issues.apache.org/jira/browse/LIVY-356
> Project: Livy
>  Issue Type: New Feature
>  Components: Server
>Affects Versions: 0.4.0
>Reporter: Janki Akhani
>Assignee: mingchao zhao
>Priority: Major
> Fix For: 0.7.0
>
>  Time Spent: 50m
>  Remaining Estimate: 0h
>
> Currently, Livy doesn't support LDAP authentication from the client (sparkmagic) 
> to the server (Livy). We need to add LDAP authentication as that's the preferable 
> method for security reasons. We won't be able to use Knox for this 
> purpose. That is why I am raising this PR, which contains LDAP authentication. 
> I have upgraded hadoop.version in livy-main to 2.8.0 as this version contains 
> LdapAuthenticationHandler. The link for it is below:
> https://insight.io/github.com/apache/hadoop/blob/HEAD/hadoop-common-project/hadoop-auth/src/main/java/org/apache/hadoop/security/authentication/server/LdapAuthenticationHandler.java



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (LIVY-667) Support query a lot of data.

2019-09-23 Thread Marco Gaido (Jira)


[ 
https://issues.apache.org/jira/browse/LIVY-667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16935852#comment-16935852
 ] 

Marco Gaido commented on LIVY-667:
--

[~yihengw] yes, that's true. The point is that the data needs to be transferred 
anyway through JDBC, so having very large datasets going over the wire may not 
be very efficient. Moreover, in case you have a single very big partition, you 
can always repartition it and avoid the issue. My point here is that there may 
be workarounds for the use case and I don't expect this problem to be faced in 
usual use cases, so designing something for this corner case feels like 
overkill to me. It is also possible to do what is suggested here manually, i.e. 
create a table with the result of a query (this writes to HDFS) and then read 
the table...
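
For what it's worth, a minimal sketch of the repartition workaround mentioned above 
(assuming a DataFrame {{df}} whose largest partition is too big for the driver):
{code:java}
// More, smaller partitions bound the size of each batch that toLocalIterator
// pulls onto the driver when incrementalCollect is enabled.
val rows = df.repartition(200).rdd.toLocalIterator
rows.foreach { row =>
  // Stand-in for streaming the row out to the client.
  println(row)
}
{code}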

> Support query a lot of data.
> 
>
> Key: LIVY-667
> URL: https://issues.apache.org/jira/browse/LIVY-667
> Project: Livy
>  Issue Type: Bug
>  Components: Thriftserver
>Affects Versions: 0.6.0
>Reporter: runzhiwang
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When livy.server.thrift.incrementalCollect is enabled, the Thrift server uses 
> toLocalIterator to load one partition at a time instead of the whole RDD, to 
> avoid OutOfMemory. However, if the largest partition is too big, the 
> OutOfMemory still occurs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (LIVY-667) Support query a lot of data.

2019-09-23 Thread Marco Gaido (Jira)


[ 
https://issues.apache.org/jira/browse/LIVY-667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16935616#comment-16935616
 ] 

Marco Gaido commented on LIVY-667:
--

[~runzhiwang] let me cite you in the JIRA body:

> When livy.server.thrift.incrementalCollect is enabled, the Thrift server uses 
> toLocalIterator to load one partition at a time

this means that if the driver is as big as the executors, there should not be an 
OOM, since the partition was already held in memory on the executors. And in 
general, having a driver as large as the executors should not be a big deal in 
terms of memory occupation on the cluster, percentage-wise, as I expect there to 
be many more executors than drivers.

> Support query a lot of data.
> 
>
> Key: LIVY-667
> URL: https://issues.apache.org/jira/browse/LIVY-667
> Project: Livy
>  Issue Type: Bug
>  Components: Thriftserver
>Affects Versions: 0.6.0
>Reporter: runzhiwang
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When livy.server.thrift.incrementalCollect is enabled, the Thrift server uses 
> toLocalIterator to load one partition at a time instead of the whole RDD, to 
> avoid OutOfMemory. However, if the largest partition is too big, the 
> OutOfMemory still occurs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (LIVY-683) Livy SQLInterpreter get empty array when extract date format rows to json

2019-09-21 Thread Marco Gaido (Jira)


 [ 
https://issues.apache.org/jira/browse/LIVY-683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido resolved LIVY-683.
--
Resolution: Duplicate

> Livy SQLInterpreter get empty array when extract date format rows to json
> -
>
> Key: LIVY-683
> URL: https://issues.apache.org/jira/browse/LIVY-683
> Project: Livy
>  Issue Type: Bug
>  Components: REPL
>Reporter: Zhefeng Wang
>Priority: Major
> Attachments: image-2019-09-17-14-23-31-715.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I submitted a SQL query with a to_date select: 
> {code:java}
> select to_date(update_time) from sec_ods.ods_binlog_task_list_d_whole where 
> concat(YEAR, MONTH, DAY )=20190306 limit 10
> {code}
> In Livy:
> !image-2019-09-17-14-23-31-715.png|width=429,height=424!
> we can see the data is composed of empty arrays,
> while in Spark the same SQL gets the correct result:
> {code:java}
> to_date(sec_ods.ods_binlog_task_list_d_whole.`update_time`)
> 2019-03-04
> 2019-03-04
> 2019-03-04
> 2019-03-04
> 2019-03-04
> 2019-03-04
> 2019-03-04
> 2019-03-04
> 2019-03-04
> 2019-03-04
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (LIVY-683) Livy SQLInterpreter get empty array when extract date format rows to json

2019-09-21 Thread Marco Gaido (Jira)


 [ 
https://issues.apache.org/jira/browse/LIVY-683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido reassigned LIVY-683:


Assignee: (was: Zhefeng Wang)

> Livy SQLInterpreter get empty array when extract date format rows to json
> -
>
> Key: LIVY-683
> URL: https://issues.apache.org/jira/browse/LIVY-683
> Project: Livy
>  Issue Type: Bug
>  Components: REPL
>Reporter: Zhefeng Wang
>Priority: Major
> Attachments: image-2019-09-17-14-23-31-715.png
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> I submitted a SQL query with a to_date select: 
> {code:java}
> select to_date(update_time) from sec_ods.ods_binlog_task_list_d_whole where 
> concat(YEAR, MONTH, DAY )=20190306 limit 10
> {code}
> In Livy:
> !image-2019-09-17-14-23-31-715.png|width=429,height=424!
> we can see the data is composed of empty arrays,
> while in Spark the same SQL gets the correct result:
> {code:java}
> to_date(sec_ods.ods_binlog_task_list_d_whole.`update_time`)
> 2019-03-04
> 2019-03-04
> 2019-03-04
> 2019-03-04
> 2019-03-04
> 2019-03-04
> 2019-03-04
> 2019-03-04
> 2019-03-04
> 2019-03-04
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (LIVY-667) Support query a lot of data.

2019-09-20 Thread Marco Gaido (Jira)


[ 
https://issues.apache.org/jira/browse/LIVY-667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934489#comment-16934489
 ] 

Marco Gaido commented on LIVY-667:
--

[~runzhiwang] I'd argue that if the data which is present in a single partition 
doesn't fit in the driver, you just need to increase the memory on the driver...

> Support query a lot of data.
> 
>
> Key: LIVY-667
> URL: https://issues.apache.org/jira/browse/LIVY-667
> Project: Livy
>  Issue Type: Bug
>  Components: Thriftserver
>Affects Versions: 0.6.0
>Reporter: runzhiwang
>Priority: Major
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> When livy.server.thrift.incrementalCollect is enabled, the Thrift server uses 
> toLocalIterator to load one partition at a time instead of the whole RDD, to 
> avoid OutOfMemory. However, if the largest partition is too big, the 
> OutOfMemory still occurs.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (SPARK-29123) DecimalType multiplication precision loss

2019-09-19 Thread Marco Gaido (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933560#comment-16933560
 ] 

Marco Gaido commented on SPARK-29123:
-

[~benny] the point here is: Spark can represent decimals with max precision 38. 
When defining the result type, multiple options are possible.

We decided to follow the SQL Server implementation by default (you can check 
their docs too). There isn't much more documentation because there isn't much 
more to say. Since the precision is limited, there are 2 options:

 - in case you do not allow precision loss, an overflow can happen. In that 
case Spark returns NULL.
 - in case you allow precision loss, precision loss is preferred over overflow. 
This is the behavior SQLServer has and it is ANSI compliant.

You can see the PR for SPARK-22036 for more details.


> DecimalType multiplication precision loss 
> --
>
> Key: SPARK-29123
> URL: https://issues.apache.org/jira/browse/SPARK-29123
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Benny Lu
>Priority: Major
>
> When doing multiplication with PySpark, it seems PySpark is losing precision.
> For example, when multiplying two decimals of precision (38,10), it returns 
> (38,6) instead of (38,10). It also truncates the result to three decimal 
> places, which is an incorrect result. 
> {code:java}
> from decimal import Decimal
> from pyspark.sql.types import DecimalType, StructType, StructField
> schema = StructType([StructField("amount", DecimalType(38,10)), 
> StructField("fx", DecimalType(38,10))])
> df = spark.createDataFrame([(Decimal(233.00), Decimal(1.1403218880))], 
> schema=schema)
> df.printSchema()
> df = df.withColumn("amount_usd", df.amount * df.fx)
> df.printSchema()
> df.show()
> {code}
> Result
> {code:java}
> >>> df.printSchema()
> root
>  |-- amount: decimal(38,10) (nullable = true)
>  |-- fx: decimal(38,10) (nullable = true)
>  |-- amount_usd: decimal(38,6) (nullable = true)
> >>> df = df.withColumn("amount_usd", df.amount * df.fx)
> >>> df.printSchema()
> root
>  |-- amount: decimal(38,10) (nullable = true)
>  |-- fx: decimal(38,10) (nullable = true)
>  |-- amount_usd: decimal(38,6) (nullable = true)
> >>> df.show()
> +------+------------+----------+
> |amount|          fx|amount_usd|
> +------+------------+----------+
> |233.00|1.1403218880|265.695000|
> +------+------------+----------+
> {code}
> When rounding to two decimals, it returns 265.70 but the correct result 
> should be 265.69499 and when rounded to two decimals, it should be 265.69.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Commented] (SPARK-29123) DecimalType multiplication precision loss

2019-09-19 Thread Marco Gaido (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16933128#comment-16933128
 ] 

Marco Gaido commented on SPARK-29123:
-

You can set {{spark.sql.decimalOperations.allowPrecisionLoss}} to {{false}} if 
you do not want to risk truncations in your operations. Otherwise, properly 
tuning the precision and scale of your input schema helps too.
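
A minimal sketch of the setting mentioned above, in Scala for brevity ({{df}} is 
assumed to have the decimal(38,10) columns from the report; the commented types 
follow from the rules described in the other comment):
{code:java}
import org.apache.spark.sql.functions.col

// Must be set before the multiplication is analyzed.
spark.conf.set("spark.sql.decimalOperations.allowPrecisionLoss", "false")

// With precision loss disallowed the scale is kept instead of being reduced, so
// decimal(38,10) * decimal(38,10) should yield decimal(38,20); values whose
// integral part no longer fits return NULL instead of silently losing digits.
val out = df.withColumn("amount_usd", col("amount") * col("fx"))
out.printSchema()
{code}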

> DecimalType multiplication precision loss 
> --
>
> Key: SPARK-29123
> URL: https://issues.apache.org/jira/browse/SPARK-29123
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.3
>Reporter: Benny Lu
>Priority: Major
>
> When doing multiplication with PySpark, it seems PySpark is losing precision.
> For example, when multiplying two decimals of precision (38,10), it returns 
> (38,6) instead of (38,10). It also truncates the result to three decimal 
> places, which is an incorrect result. 
> {code:java}
> from decimal import Decimal
> from pyspark.sql.types import DecimalType, StructType, StructField
> schema = StructType([StructField("amount", DecimalType(38,10)), 
> StructField("fx", DecimalType(38,10))])
> df = spark.createDataFrame([(Decimal(233.00), Decimal(1.1403218880))], 
> schema=schema)
> df.printSchema()
> df = df.withColumn("amount_usd", df.amount * df.fx)
> df.printSchema()
> df.show()
> {code}
> Result
> {code:java}
> >>> df.printSchema()
> root
>  |-- amount: decimal(38,10) (nullable = true)
>  |-- fx: decimal(38,10) (nullable = true)
>  |-- amount_usd: decimal(38,6) (nullable = true)
> >>> df = df.withColumn("amount_usd", df.amount * df.fx)
> >>> df.printSchema()
> root
>  |-- amount: decimal(38,10) (nullable = true)
>  |-- fx: decimal(38,10) (nullable = true)
>  |-- amount_usd: decimal(38,6) (nullable = true)
> >>> df.show()
> +------+------------+----------+
> |amount|          fx|amount_usd|
> +------+------------+----------+
> |233.00|1.1403218880|265.695000|
> +------+------------+----------+
> {code}
> When rounding to two decimals, it returns 265.70 but the correct result 
> should be 265.69499 and when rounded to two decimals, it should be 265.69.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[jira] [Commented] (SPARK-29038) SPIP: Support Spark Materialized View

2019-09-10 Thread Marco Gaido (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16926650#comment-16926650
 ] 

Marco Gaido commented on SPARK-29038:
-

[~cltlfcjin] currently Spark has something similar, which is query caching, 
where the user can also select the level of caching performed. My 
understanding is that your proposal is to do something very similar, just with 
a different syntax, more DB oriented. Is my understanding correct?
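
For context, a quick sketch of what query-result caching with a chosen storage 
level looks like today (table and column names here are made up):
{code:java}
import org.apache.spark.storage.StorageLevel

// Cache the result of a query at an explicit storage level and reuse it by name;
// a materialized view would additionally persist it and refresh it automatically.
val byDept = spark.sql("SELECT dept, sum(amount) AS total FROM sales GROUP BY dept")
byDept.persist(StorageLevel.MEMORY_AND_DISK)
byDept.createOrReplaceTempView("sales_by_dept")
spark.sql("SELECT * FROM sales_by_dept WHERE total > 1000").show()
{code}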

> SPIP: Support Spark Materialized View
> -
>
> Key: SPARK-29038
> URL: https://issues.apache.org/jira/browse/SPARK-29038
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> Materialized view is an important approach in DBMS to cache data to 
> accelerate queries. By creating a materialized view through SQL, the data 
> that can be cached is very flexible, and needs to be configured arbitrarily 
> according to specific usage scenarios. The Materialization Manager 
> automatically updates the cache data according to changes in detail source 
> tables, simplifying user work. When user submit query, Spark optimizer 
> rewrites the execution plan based on the available materialized view to 
> determine the optimal execution plan.
> Details in [design 
> doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-29038) SPIP: Support Spark Materialized View

2019-09-10 Thread Marco Gaido (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16926650#comment-16926650
 ] 

Marco Gaido edited comment on SPARK-29038 at 9/10/19 1:40 PM:
--

[~cltlfcjin] currently Spark has something similar, which is query caching, 
where the user can also select the level of caching performed. My understanding 
is that your proposal is to do something very similar, just with a different, 
more DB-oriented syntax. Is my understanding correct?


was (Author: mgaido):
[~cltlfcjin] currently spark has a something similar, which is query caching, 
where the user can also select the level of caching performed. My 
undersatanding is that your proposal is to do something very similar, just with 
a different syntax, more DB oriented. Is my understanding correct?

> SPIP: Support Spark Materialized View
> -
>
> Key: SPARK-29038
> URL: https://issues.apache.org/jira/browse/SPARK-29038
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Lantao Jin
>Priority: Major
>
> Materialized view is an important approach in DBMS to cache data to 
> accelerate queries. By creating a materialized view through SQL, the data 
> that can be cached is very flexible, and needs to be configured arbitrarily 
> according to specific usage scenarios. The Materialization Manager 
> automatically updates the cache data according to changes in detail source 
> tables, simplifying user work. When user submit query, Spark optimizer 
> rewrites the execution plan based on the available materialized view to 
> determine the optimal execution plan.
> Details in [design 
> doc|https://docs.google.com/document/d/1q5pjSWoTNVc9zsAfbNzJ-guHyVwPsEroIEP8Cca179A/edit?usp=sharing]



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (LIVY-489) Expose a JDBC endpoint for Livy

2019-09-07 Thread Marco Gaido (Jira)


[ 
https://issues.apache.org/jira/browse/LIVY-489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924894#comment-16924894
 ] 

Marco Gaido commented on LIVY-489:
--

The right place is the user mailing list; please see 
https://livy.apache.org/community/.

> Expose a JDBC endpoint for Livy
> ---
>
> Key: LIVY-489
> URL: https://issues.apache.org/jira/browse/LIVY-489
> Project: Livy
>  Issue Type: New Feature
>  Components: API, Server
>Affects Versions: 0.6.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 0.6.0
>
>
> Many users and BI tools use JDBC connections in order to retrieve data. As 
> Livy exposes only a REST API, this is a limitation in its adoption. Hence, 
> adding a JDBC endpoint may be a very useful feature, which could also make 
> Livy a more attractive solution for end users to adopt.
> Moreover, currently, Spark exposes a JDBC interface, but this has many 
> limitations, including that all the queries are submitted to the same 
> application, therefore there is no isolation/security, which can be offered 
> by Livy, making a Livy JDBC API a better solution for companies/users who 
> want to use Spark in order to run their queries through JDBC.
> In order to make the transition from existing solutions to the new JDBC 
> server seamless, the proposal is to use the Hive thrift-server and extend it 
> as it was done by the STS.
> [Here, you can find the design 
> doc.|https://docs.google.com/document/d/18HAR_VnQLegbYyzGg8f4zwD4GtDP5q_t3K21eXecZC4/edit]
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (LIVY-489) Expose a JDBC endpoint for Livy

2019-09-07 Thread Marco Gaido (Jira)


[ 
https://issues.apache.org/jira/browse/LIVY-489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924817#comment-16924817
 ] 

Marco Gaido commented on LIVY-489:
--

Yes, sure. The point is: that template is not exhaustive and does not contain 
all the configs which can be set.

> Expose a JDBC endpoint for Livy
> ---
>
> Key: LIVY-489
> URL: https://issues.apache.org/jira/browse/LIVY-489
> Project: Livy
>  Issue Type: New Feature
>  Components: API, Server
>Affects Versions: 0.6.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 0.6.0
>
>
> Many users and BI tools use JDBC connections in order to retrieve data. As 
> Livy exposes only a REST API, this is a limitation in its adoption. Hence, 
> adding a JDBC endpoint may be a very useful feature, which could also make 
> Livy a more attractive solution for end users to adopt.
> Moreover, currently, Spark exposes a JDBC interface, but this has many 
> limitations, including that all the queries are submitted to the same 
> application, therefore there is no isolation/security, which can be offered 
> by Livy, making a Livy JDBC API a better solution for companies/users who 
> want to use Spark in order to run their queries through JDBC.
> In order to make the transition from existing solutions to the new JDBC 
> server seamless, the proposal is to use the Hive thrift-server and extend it 
> as it was done by the STS.
> [Here, you can find the design 
> doc.|https://docs.google.com/document/d/18HAR_VnQLegbYyzGg8f4zwD4GtDP5q_t3K21eXecZC4/edit]
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (LIVY-489) Expose a JDBC endpoint for Livy

2019-09-07 Thread Marco Gaido (Jira)


[ 
https://issues.apache.org/jira/browse/LIVY-489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924812#comment-16924812
 ] 

Marco Gaido commented on LIVY-489:
--

[~vonatzki] nothing to be sorry about. Please see 
https://github.com/apache/incubator-livy/blob/master/server/src/main/scala/org/apache/livy/LivyConf.scala#L101.
 {{livy.server.thrift.enabled}} must be set to {{true}}, and the port is controlled by 
https://github.com/apache/incubator-livy/blob/master/server/src/main/scala/org/apache/livy/LivyConf.scala#L109.
 Then you should be able to connect using the beeline you can find in the repo 
or with any Hive 3 client. Thanks.
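
For anyone following along, a rough sketch of what that amounts to (the {{livy.server.thrift.port}} key and the 10090 default are my reading of the LivyConf link above, so please double-check them against your version):

{code}
# livy.conf -- enable the Thrift/JDBC endpoint
livy.server.thrift.enabled = true
livy.server.thrift.port = 10090

# then connect with the beeline shipped in the Livy repo (or any Hive 3 client)
beeline -u jdbc:hive2://<livy-host>:10090 -n <user>
{code}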

> Expose a JDBC endpoint for Livy
> ---
>
> Key: LIVY-489
> URL: https://issues.apache.org/jira/browse/LIVY-489
> Project: Livy
>  Issue Type: New Feature
>  Components: API, Server
>Affects Versions: 0.6.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 0.6.0
>
>
> Many users and BI tools use JDBC connections in order to retrieve data. As 
> Livy exposes only a REST API, this is a limitation in its adoption. Hence, 
> adding a JDBC endpoint may be a very useful feature, which could also make 
> Livy a more attractive solution for end users to adopt.
> Moreover, currently, Spark exposes a JDBC interface, but this has many 
> limitations, including that all the queries are submitted to the same 
> application, therefore there is no isolation/security, which can be offered 
> by Livy, making a Livy JDBC API a better solution for companies/users who 
> want to use Spark in order to run their queries through JDBC.
> In order to make the transition from existing solutions to the new JDBC 
> server seamless, the proposal is to use the Hive thrift-server and extend it 
> as it was done by the STS.
> [Here, you can find the design 
> doc.|https://docs.google.com/document/d/18HAR_VnQLegbYyzGg8f4zwD4GtDP5q_t3K21eXecZC4/edit]
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (SPARK-29009) Returning pojo from udf not working

2019-09-07 Thread Marco Gaido (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924774#comment-16924774
 ] 

Marco Gaido commented on SPARK-29009:
-

Why do you think this is a bug? If you want a struct to be returned by your 
UDF, you should return a {{Row}}.
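
To make that concrete, a minimal sketch of a UDF registered with a struct return type and returning a {{Row}} (the UDF name, the inline schema and the values are illustrative; the schema mirrors what {{Encoders.bean(SegmentStub.class).schema()}} would produce, with bean fields ordered alphabetically):

{code:java}
import static org.apache.spark.sql.functions.callUDF;
import static org.apache.spark.sql.functions.col;

import java.sql.Date;
import java.time.LocalDate;

import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class RowReturningUdf {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("row-udf-example").master("local[*]").getOrCreate();

    // Same shape as the bean-derived schema; bean encoders order fields
    // alphabetically, so the Row values below must follow that order.
    StructType schema = new StructType()
        .add("healthPointRatio", DataTypes.IntegerType)
        .add("id", DataTypes.IntegerType)
        .add("statusDateTime", DataTypes.DateType);

    spark.udf().register("parse_segment",
        (UDF2<String, String, Row>) (s1, s2) ->
            RowFactory.create(2, 1, Date.valueOf(LocalDate.now())),
        schema);

    spark.sql("SELECT 'one' AS foe1, 'two' AS foe2")
        .select(callUDF("parse_segment", col("foe1"), col("foe2")))
        .show(false);

    spark.stop();
  }
}
{code}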

> Returning pojo from udf not working
> ---
>
> Key: SPARK-29009
> URL: https://issues.apache.org/jira/browse/SPARK-29009
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Tomasz Belina
>Priority: Major
>
>  It looks like Spark is unable to construct a row from the POJO returned from the UDF.
> Give POJO:
> {code:java}
> public class SegmentStub {
> private int id;
> private Date statusDateTime;
> private int healthPointRatio;
> }
> {code}
> Registration of the UDF:
> {code:java}
> public class ParseResultsUdf {
>     public String registerUdf(SparkSession sparkSession) {
>         Encoder<SegmentStub> encoder = Encoders.bean(SegmentStub.class);
>         final StructType schema = encoder.schema();
>         sparkSession.udf().register(UDF_NAME,
>             (UDF2<String, String, SegmentStub>) (s, s2) ->
>                 new SegmentStub(1, Date.valueOf(LocalDate.now()), 2),
>             schema
>         );
>         return UDF_NAME;
>     }
> }
> {code}
> Test code:
> {code:java}
> List<String[]> strings = Arrays.asList(new String[]{"one", "two"},
>         new String[]{"3", "4"});
> JavaRDD<Row> rowJavaRDD =
>         sparkContext.parallelize(strings).map(RowFactory::create);
> StructType schema = DataTypes
>         .createStructType(new StructField[] {
>                 DataTypes.createStructField("foe1", DataTypes.StringType, false),
>                 DataTypes.createStructField("foe2", DataTypes.StringType, false) });
> Dataset<Row> dataFrame =
>         sparkSession.sqlContext().createDataFrame(rowJavaRDD, schema);
> Seq<Column> columnSeq = new Set.Set2<>(col("foe1"), col("foe2")).toSeq();
> dataFrame.select(callUDF(udfName, columnSeq)).show();
> {code}
>  throws exception: 
> {code:java}
> Caused by: java.lang.IllegalArgumentException: The value (SegmentStub(id=1, 
> statusDateTime=2019-09-06, healthPointRatio=2)) of the type (udf.SegmentStub) 
> cannot be converted to struct
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:262)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:238)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:103)
>   at 
> org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:396)
>   ... 21 more
> }
> {code}
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (LIVY-489) Expose a JDBC endpoint for Livy

2019-09-04 Thread Marco Gaido (Jira)


[ 
https://issues.apache.org/jira/browse/LIVY-489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922363#comment-16922363
 ] 

Marco Gaido commented on LIVY-489:
--

Hi [~vonatzki]. Please check the design doc attached here. Moreover, please 
check the configurations added in LivyConf. You'll see that the thriftserver 
must be enabled and configured properly, and the default values for the 
configurations are there.

As a suggestion, if it is possible for you, I'd recommend building Livy from the 
master branch, even though the thriftserver is also present in the 0.6.0 
version. Indeed, 0.6.0 was the first release including it, so you might find 
issues which have since been fixed on current master.

Thanks.

> Expose a JDBC endpoint for Livy
> ---
>
> Key: LIVY-489
> URL: https://issues.apache.org/jira/browse/LIVY-489
> Project: Livy
>  Issue Type: New Feature
>  Components: API, Server
>Affects Versions: 0.6.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 0.6.0
>
>
> Many users and BI tools use JDBC connections in order to retrieve data. As 
> Livy exposes only a REST API, this is a limitation in its adoption. Hence, 
> adding a JDBC endpoint may be a very useful feature, which could also make 
> Livy a more attractive solution for end users to adopt.
> Moreover, currently, Spark exposes a JDBC interface, but this has many 
> limitations, including that all the queries are submitted to the same 
> application, therefore there is no isolation/security, which can be offered 
> by Livy, making a Livy JDBC API a better solution for companies/users who 
> want to use Spark in order to run their queries through JDBC.
> In order to make the transition from existing solutions to the new JDBC 
> server seamless, the proposal is to use the Hive thrift-server and extend it 
> as it was done by the STS.
> [Here, you can find the design 
> doc.|https://docs.google.com/document/d/18HAR_VnQLegbYyzGg8f4zwD4GtDP5q_t3K21eXecZC4/edit]
>  



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (SPARK-28610) Support larger buffer for sum of long

2019-09-03 Thread Marco Gaido (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28610?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16921471#comment-16921471
 ] 

Marco Gaido commented on SPARK-28610:
-

Hi [~Gengliang.Wang]. That's a different thing: you are doing 3 {{Add}}s and 
then a sum of 1 number. To reproduce this, you should create a table with 3 
rows with those values and sum them. Thanks.
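
Concretely, the reproduction described above looks roughly like this (a sketch; the app name and the way the rows are created are my own choices):

{code:java}
import static org.apache.spark.sql.functions.sum;

import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SumOverflowRepro {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("sum-overflow-repro").master("local[*]").getOrCreate();

    // Three separate rows, so the values are aggregated at runtime instead of
    // being constant-folded into a single literal Add expression.
    Dataset<Row> df = spark
        .createDataset(Arrays.asList(100L, Long.MAX_VALUE, -1000L), Encoders.LONG())
        .toDF("a");

    // With the overflow-checking flag discussed in this ticket turned on, the
    // intermediate sum overflows even though the final result fits in a long.
    df.select(sum(df.col("a"))).show(false);

    spark.stop();
  }
}
{code}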

> Support larger buffer for sum of long
> -
>
> Key: SPARK-28610
> URL: https://issues.apache.org/jira/browse/SPARK-28610
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Marco Gaido
>Priority: Major
>
> The sum of a long field currently uses a buffer of type long.
> When the flag for throwing exceptions on overflow for arithmetic operations 
> is turned on, this is a problem in case there are intermediate overflows 
> which are then resolved by other rows. Indeed, in such a case, we are 
> throwing an exception, while the result is representable in a long value. An 
> example of this issue can be seen running:
> {code}
> val df = sc.parallelize(Seq(100L, Long.MaxValue, -1000L)).toDF("a")
> df.select(sum($"a")).show()
> {code}
> According to [~cloud_fan]'s suggestion in 
> https://github.com/apache/spark/pull/21599, we should introduce a flag in 
> order to let users choose among a wider datatype for the sum buffer using a 
> config, so that the above issue can be fixed.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28939) SQL configuration are not always propagated

2019-09-01 Thread Marco Gaido (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-28939:

Description: 
The SQL configurations are propagated to executors in order to be effective.
Unfortunately, in some cases, we fail to propagate them, making them 
ineffective.

The problem happens every time {{rdd}} or {{queryExecution.toRdd}} are used. 
And this is pretty frequent in the codebase.

Please notice that there are 2 parts of this issue:
 - when a user directly uses those APIs
 - when Spark invokes them (eg. throughout the ML lib and other usages or the 
{{describe}} method on the {{Dataset}} class)



  was:
The SQL configurations are propagated to executors in order to be effective.
Unfortunately, in some cases, we are missing to propagate them, making them 
uneffective.

For an example, please see the {{describe}} method on the {{Dataset}} class.


> SQL configuration are not always propagated
> ---
>
> Key: SPARK-28939
> URL: https://issues.apache.org/jira/browse/SPARK-28939
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Marco Gaido
>Priority: Major
>
> The SQL configurations are propagated to executors in order to be effective.
> Unfortunately, in some cases, we fail to propagate them, making them 
> ineffective.
> The problem happens every time {{rdd}} or {{queryExecution.toRdd}} are used. 
> And this is pretty frequent in the codebase.
> Please notice that there are 2 parts of this issue:
>  - when a user directly uses those APIs
>  - when Spark invokes them (eg. throughout the ML lib and other usages or the 
> {{describe}} method on the {{Dataset}} class)



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28939) SQL configuration are not always propagated

2019-09-01 Thread Marco Gaido (Jira)
Marco Gaido created SPARK-28939:
---

 Summary: SQL configuration are not always propagated
 Key: SPARK-28939
 URL: https://issues.apache.org/jira/browse/SPARK-28939
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.4
Reporter: Marco Gaido


The SQL configurations are propagated to executors in order to be effective.
Unfortunately, in some cases, we fail to propagate them, making them 
ineffective.

For an example, please see the {{describe}} method on the {{Dataset}} class.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28916) Generated SpecificSafeProjection.apply method grows beyond 64 KB when use SparkSQL

2019-08-31 Thread Marco Gaido (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920153#comment-16920153
 ] 

Marco Gaido commented on SPARK-28916:
-

I think the problem is related to subexpression elimination. I've not been able 
to confirm this, since for some reason I am not able to disable it: even though I 
set the config to false, it is performed anyway. Maybe I am missing something 
there. Anyway, you may try setting 
{{spark.sql.subexpressionElimination.enabled}} to {{false}}. Meanwhile I am 
working on a fix. Thanks.
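
For reference, a quick way to try that suggestion from a session (sketch only; as noted above it may not take effect for every code path, and the property can also be passed as {{--conf}}):

{code:java}
import org.apache.spark.sql.SparkSession;

public class DisableSubexpressionElimination {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("subexpr-elimination-workaround").master("local[*]").getOrCreate();

    // Attempt to disable subexpression elimination for queries in this session.
    spark.conf().set("spark.sql.subexpressionElimination.enabled", "false");
    System.out.println(spark.conf().get("spark.sql.subexpressionElimination.enabled"));

    spark.stop();
  }
}
{code}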

> Generated SpecificSafeProjection.apply method grows beyond 64 KB when use  
> SparkSQL
> ---
>
> Key: SPARK-28916
> URL: https://issues.apache.org/jira/browse/SPARK-28916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.3
>Reporter: MOBIN
>Priority: Major
>
> Can be reproduced by the following steps:
> 1. Create a table with 5000 fields
> 2. val data=spark.sql("select * from spark64kb limit 10");
> 3. data.describe()
> Then, the following error occurred:
> {code:java}
> WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 (TID 0, localhost, 
> executor 1): org.codehaus.janino.InternalCompilerException: failed to 
> compile: org.codehaus.janino.InternalCompilerException: Compiling 
> "GeneratedClass": Code of method 
> "apply(Ljava/lang/Object;)Ljava/lang/Object;" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection"
>  grows beyond 64 KB
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1298)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1376)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1373)
>   at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>   at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>   at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>   at 
> org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
>   at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
>   at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
>   at 
> org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1238)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:143)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.generate(GenerateMutableProjection.scala:44)
>   at 
> org.apache.spark.sql.execution.SparkPlan.newMutableProjection(SparkPlan.scala:385)
>   at 
> org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3$$anonfun$4.apply(SortAggregateExec.scala:96)
>   at 
> org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3$$anonfun$4.apply(SortAggregateExec.scala:95)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator.generateProcessRow(AggregationIterator.scala:180)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator.(AggregationIterator.scala:199)
>   at 
> org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.(SortBasedAggregationIterator.scala:40)
>   at 
> org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:86)
>   at 
> org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:77)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$12.apply(RDD.scala:823)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$12.apply(RDD.scala:823)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:121)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> 

[jira] [Commented] (SPARK-28916) Generated SpecificSafeProjection.apply method grows beyond 64 KB when use SparkSQL

2019-08-31 Thread Marco Gaido (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920074#comment-16920074
 ] 

Marco Gaido commented on SPARK-28916:
-

Thanks for reporting this. I am checking it.

> Generated SpecificSafeProjection.apply method grows beyond 64 KB when use  
> SparkSQL
> ---
>
> Key: SPARK-28916
> URL: https://issues.apache.org/jira/browse/SPARK-28916
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1, 2.4.3
>Reporter: MOBIN
>Priority: Major
>
> Can be reproduced by the following steps:
> 1. Create a table with 5000 fields
> 2. val data=spark.sql("select * from spark64kb limit 10");
> 3. data.describe()
> Then, the following error occurred:
> {code:java}
> WARN scheduler.TaskSetManager: Lost task 0.0 in stage 1.0 (TID 0, localhost, 
> executor 1): org.codehaus.janino.InternalCompilerException: failed to 
> compile: org.codehaus.janino.InternalCompilerException: Compiling 
> "GeneratedClass": Code of method 
> "apply(Ljava/lang/Object;)Ljava/lang/Object;" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection"
>  grows beyond 64 KB
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1298)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1376)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1373)
>   at 
> org.spark_project.guava.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
>   at 
> org.spark_project.guava.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
>   at 
> org.spark_project.guava.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
>   at 
> org.spark_project.guava.cache.LocalCache$Segment.get(LocalCache.java:2257)
>   at org.spark_project.guava.cache.LocalCache.get(LocalCache.java:4000)
>   at org.spark_project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
>   at 
> org.spark_project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.compile(CodeGenerator.scala:1238)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.create(GenerateMutableProjection.scala:143)
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.GenerateMutableProjection$.generate(GenerateMutableProjection.scala:44)
>   at 
> org.apache.spark.sql.execution.SparkPlan.newMutableProjection(SparkPlan.scala:385)
>   at 
> org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3$$anonfun$4.apply(SortAggregateExec.scala:96)
>   at 
> org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3$$anonfun$4.apply(SortAggregateExec.scala:95)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator.generateProcessRow(AggregationIterator.scala:180)
>   at 
> org.apache.spark.sql.execution.aggregate.AggregationIterator.(AggregationIterator.scala:199)
>   at 
> org.apache.spark.sql.execution.aggregate.SortBasedAggregationIterator.(SortBasedAggregationIterator.scala:40)
>   at 
> org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:86)
>   at 
> org.apache.spark.sql.execution.aggregate.SortAggregateExec$$anonfun$doExecute$1$$anonfun$3.apply(SortAggregateExec.scala:77)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$12.apply(RDD.scala:823)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndexInternal$1$$anonfun$12.apply(RDD.scala:823)
>   at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
>   at org.apache.spark.scheduler.Task.run(Task.scala:121)
>   at 
> org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
>   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.codehaus.janino.InternalCompilerException: Compiling 
> "GeneratedClass": Code of method 
> "apply(Ljava/lang/Object;)Ljava/lang/Object;" of class 
> 

[jira] [Commented] (SPARK-28934) Add `spark.sql.compatiblity.mode`

2019-08-31 Thread Marco Gaido (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920072#comment-16920072
 ] 

Marco Gaido commented on SPARK-28934:
-

Hi [~smilegator]! Thanks for opening this. I am wondering whether it may be 
worth reopening SPARK-28610 and enabling the option for the pgSQL compatibility 
mode. [~cloud_fan] what do you think?

> Add `spark.sql.compatiblity.mode`
> -
>
> Key: SPARK-28934
> URL: https://issues.apache.org/jira/browse/SPARK-28934
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Xiao Li
>Priority: Major
>
> This issue aims to add `spark.sql.compatiblity.mode` whose values are `spark` 
> or `pgSQL` case-insensitively to control PostgreSQL compatibility features.
>  
> Apache Spark 3.0.0 can start with `spark.sql.parser.ansi.enabled=false` and 
> `spark.sql.compatiblity.mode=spark`.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (LIVY-650) Remove schema from ResultSet

2019-08-29 Thread Marco Gaido (Jira)


 [ 
https://issues.apache.org/jira/browse/LIVY-650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido resolved LIVY-650.
--
Fix Version/s: 0.7.0
   Resolution: Fixed

Issue resolved by https://github.com/apache/incubator-livy/pull/213.

> Remove schema from ResultSet
> 
>
> Key: LIVY-650
> URL: https://issues.apache.org/jira/browse/LIVY-650
> Project: Livy
>  Issue Type: Improvement
>  Components: Thriftserver
>Affects Versions: 0.7.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 0.7.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> The {{ResultSet}} class currently contains the schema even though it is never 
> used.
> We can get rid of it, which is a benefit because then we don't serialize this 
> string, which may be pretty big.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Resolved] (SPARK-28610) Support larger buffer for sum of long

2019-08-28 Thread Marco Gaido (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28610?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido resolved SPARK-28610.
-
Resolution: Won't Fix

Since the perf regression introduced by the change would be too high, this 
won't be fixed. Thanks.

> Support larger buffer for sum of long
> -
>
> Key: SPARK-28610
> URL: https://issues.apache.org/jira/browse/SPARK-28610
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Marco Gaido
>Priority: Major
>
> The sum of a long field currently uses a buffer of type long.
> When the flag for throwing exceptions on overflow for arithmetic operations 
> is turned on, this is a problem in case there are intermediate overflows 
> which are then resolved by other rows. Indeed, in such a case, we are 
> throwing an exception, while the result is representable in a long value. An 
> example of this issue can be seen running:
> {code}
> val df = sc.parallelize(Seq(100L, Long.MaxValue, -1000L)).toDF("a")
> df.select(sum($"a")).show()
> {code}
> According to [~cloud_fan]'s suggestion in 
> https://github.com/apache/spark/pull/21599, we should introduce a flag in 
> order to let users choose among a wider datatype for the sum buffer using a 
> config, so that the above issue can be fixed.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (LIVY-650) Remove schema from ResultSet

2019-08-25 Thread Marco Gaido (Jira)
Marco Gaido created LIVY-650:


 Summary: Remove schema from ResultSet
 Key: LIVY-650
 URL: https://issues.apache.org/jira/browse/LIVY-650
 Project: Livy
  Issue Type: Improvement
  Components: Thriftserver
Affects Versions: 0.7.0
Reporter: Marco Gaido
Assignee: Marco Gaido


The {{ResultSet}} class currently contains the schema even though it is never used.
We can get rid of it, which is a benefit because then we don't serialize this 
string, which may be pretty big.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (LIVY-573) Add tests for operation logs retrieval

2019-08-19 Thread Marco Gaido (Jira)


 [ 
https://issues.apache.org/jira/browse/LIVY-573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido reassigned LIVY-573:


Assignee: Yiheng Wang

> Add tests for operation logs retrieval
> --
>
> Key: LIVY-573
> URL: https://issues.apache.org/jira/browse/LIVY-573
> Project: Livy
>  Issue Type: Improvement
>  Components: Tests, Thriftserver
>Reporter: Marco Gaido
>Assignee: Yiheng Wang
>Priority: Trivial
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> Our current tests do not cover the retrieval of operation logs. We should try 
> and add coverage for it if possible.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (LIVY-627) Implement GetCrossReference metadata operation

2019-08-08 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/LIVY-627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16902733#comment-16902733
 ] 

Marco Gaido commented on LIVY-627:
--

Hi [~micahzhao]. Actually, in Spark there is no primary key/foreign key at all, 
so there is little we can do here, I think. Those operations are not supported by 
Spark itself.

> Implement GetCrossReference metadata operation
> --
>
> Key: LIVY-627
> URL: https://issues.apache.org/jira/browse/LIVY-627
> Project: Livy
>  Issue Type: Sub-task
>  Components: Thriftserver
>Reporter: Yiheng Wang
>Priority: Minor
>
> We should support GetCrossReference metadata operation in Livy thrift server.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (LIVY-575) Implement missing metadata operations

2019-08-06 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/LIVY-575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900970#comment-16900970
 ] 

Marco Gaido commented on LIVY-575:
--

No problem, thank you. :)

> Implement missing metadata operations
> -
>
> Key: LIVY-575
> URL: https://issues.apache.org/jira/browse/LIVY-575
> Project: Livy
>  Issue Type: Improvement
>  Components: Thriftserver
>Reporter: Marco Gaido
>Priority: Minor
>
> Many metadata operations (eg. table list retrieval, schema retrieval, ...) 
> are currently not implemented. We should implement them.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (LIVY-627) Implement GetCrossReference metadata operation

2019-08-06 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/LIVY-627?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido reassigned LIVY-627:


Assignee: (was: Yiheng Wang)

> Implement GetCrossReference metadata operation
> --
>
> Key: LIVY-627
> URL: https://issues.apache.org/jira/browse/LIVY-627
> Project: Livy
>  Issue Type: Sub-task
>  Components: Thriftserver
>Reporter: Yiheng Wang
>Priority: Minor
>
> We should support GetCrossReference metadata operation in Livy thrift server.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Comment Edited] (LIVY-575) Implement missing metadata operations

2019-08-06 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/LIVY-575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900956#comment-16900956
 ] 

Marco Gaido edited comment on LIVY-575 at 8/6/19 12:12 PM:
---

[~yihengw] we assign JIRAs when the related PRs are merged. You can instead 
leave a comment on them if you plan to work on them, so no one else will spend 
time on them. Thanks.


was (Author: mgaido):
[~yihengw] we assign JIRAs when the related PRs are merged. Thanks.

> Implement missing metadata operations
> -
>
> Key: LIVY-575
> URL: https://issues.apache.org/jira/browse/LIVY-575
> Project: Livy
>  Issue Type: Improvement
>  Components: Thriftserver
>Reporter: Marco Gaido
>Priority: Minor
>
> Many metadata operations (eg. table list retrieval, schema retrieval, ...) 
> are currently not implemented. We should implement them.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (LIVY-628) Implement GetDelegationToken metadata operation

2019-08-06 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/LIVY-628?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido reassigned LIVY-628:


Assignee: (was: Yiheng Wang)

> Implement GetDelegationToken metadata operation
> ---
>
> Key: LIVY-628
> URL: https://issues.apache.org/jira/browse/LIVY-628
> Project: Livy
>  Issue Type: Sub-task
>  Components: Thriftserver
>Reporter: Yiheng Wang
>Priority: Minor
>
> We should support GetDelegationToken metadata operation in Livy thrift server.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (LIVY-626) Implement GetPrimaryKeys metadata operation

2019-08-06 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/LIVY-626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido reassigned LIVY-626:


Assignee: (was: Yiheng Wang)

> Implement GetPrimaryKeys metadata operation
> ---
>
> Key: LIVY-626
> URL: https://issues.apache.org/jira/browse/LIVY-626
> Project: Livy
>  Issue Type: Sub-task
>  Components: Thriftserver
>Reporter: Yiheng Wang
>Priority: Minor
>
> We should support GetPrimaryKeys metadata operation in Livy thrift server.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (LIVY-623) Implement GetTables metadata operation

2019-08-06 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/LIVY-623?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido reassigned LIVY-623:


Assignee: (was: Yiheng Wang)

> Implement GetTables metadata operation
> --
>
> Key: LIVY-623
> URL: https://issues.apache.org/jira/browse/LIVY-623
> Project: Livy
>  Issue Type: Sub-task
>  Components: Thriftserver
>Reporter: Yiheng Wang
>Priority: Minor
>
> We should support GetTables metadata operation in Livy thrift server.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (LIVY-624) Implement GetColumns metadata operation

2019-08-06 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/LIVY-624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido reassigned LIVY-624:


Assignee: (was: Yiheng Wang)

> Implement GetColumns metadata operation
> ---
>
> Key: LIVY-624
> URL: https://issues.apache.org/jira/browse/LIVY-624
> Project: Livy
>  Issue Type: Sub-task
>  Components: Thriftserver
>Reporter: Yiheng Wang
>Priority: Minor
>
> We should support GetColumns metadata operation in Livy thrift server.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (LIVY-625) Implement GetFunctions metadata operation

2019-08-06 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/LIVY-625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido reassigned LIVY-625:


Assignee: (was: Yiheng Wang)

> Implement GetFunctions metadata operation
> -
>
> Key: LIVY-625
> URL: https://issues.apache.org/jira/browse/LIVY-625
> Project: Livy
>  Issue Type: Sub-task
>  Components: Thriftserver
>Reporter: Yiheng Wang
>Priority: Minor
>
> We should support GetFunctions metadata operation in Livy thrift server.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (LIVY-575) Implement missing metadata operations

2019-08-06 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/LIVY-575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido reassigned LIVY-575:


Assignee: (was: Yiheng Wang)

> Implement missing metadata operations
> -
>
> Key: LIVY-575
> URL: https://issues.apache.org/jira/browse/LIVY-575
> Project: Livy
>  Issue Type: Improvement
>  Components: Thriftserver
>Reporter: Marco Gaido
>Priority: Minor
>
> Many metadata operations (eg. table list retrieval, schema retrieval, ...) 
> are currently not implemented. We should implement them.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (LIVY-575) Implement missing metadata operations

2019-08-06 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/LIVY-575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16900956#comment-16900956
 ] 

Marco Gaido commented on LIVY-575:
--

[~yihengw] we assign JIRAs when the related PRs are merged. Thanks.

> Implement missing metadata operations
> -
>
> Key: LIVY-575
> URL: https://issues.apache.org/jira/browse/LIVY-575
> Project: Livy
>  Issue Type: Improvement
>  Components: Thriftserver
>Reporter: Marco Gaido
>Priority: Minor
>
> Many metadata operations (eg. table list retrieval, schema retrieval, ...) 
> are currently not implemented. We should implement them.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (SPARK-28611) Histogram's height is diffrent

2019-08-04 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899585#comment-16899585
 ] 

Marco Gaido commented on SPARK-28611:
-

Mmmh... that's weird! How can you get a different result than what was obtained on 
the CI? How are you running the code?

> Histogram's height is diffrent
> --
>
> Key: SPARK-28611
> URL: https://issues.apache.org/jira/browse/SPARK-28611
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.3
>Reporter: Yuming Wang
>Priority: Major
>
> {code:sql}
> CREATE TABLE desc_col_table (key int COMMENT 'column_comment') USING PARQUET;
> -- Test output for histogram statistics
> SET spark.sql.statistics.histogram.enabled=true;
> SET spark.sql.statistics.histogram.numBins=2;
> INSERT INTO desc_col_table values 1, 2, 3, 4;
> ANALYZE TABLE desc_col_table COMPUTE STATISTICS FOR COLUMNS key;
> DESC EXTENDED desc_col_table key;
> {code}
> {noformat}
> spark-sql> DESC EXTENDED desc_col_table key;
> col_name  key
> data_type int
> comment   column_comment
> min   1
> max   4
> num_nulls 0
> distinct_count4
> avg_col_len   4
> max_col_len   4
> histogram height: 4.0, num_of_bins: 2
> bin_0 lower_bound: 1.0, upper_bound: 2.0, distinct_count: 2
> bin_1 lower_bound: 2.0, upper_bound: 4.0, distinct_count: 2
> {noformat}
> But our result is:
> https://github.com/apache/spark/blob/v2.4.3/sql/core/src/test/resources/sql-tests/results/describe-table-column.sql.out#L231-L242



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28610) Support larger buffer for sum of long

2019-08-03 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-28610:
---

 Summary: Support larger buffer for sum of long
 Key: SPARK-28610
 URL: https://issues.apache.org/jira/browse/SPARK-28610
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Marco Gaido


The sum of a long field currently uses a buffer of type long.

When the flag for throwing exceptions on overflow for arithmetic operations is 
turned on, this is a problem in case there are intermediate overflows which are 
then resolved by other rows. Indeed, in such a case, we are throwing an 
exception, while the result is representable in a long value. An example of 
this issue can be seen running:

{code}
val df = sc.parallelize(Seq(100L, Long.MaxValue, -1000L)).toDF("a")
df.select(sum($"a")).show()
{code}

According to [~cloud_fan]'s suggestion in 
https://github.com/apache/spark/pull/21599, we should introduce a flag in order 
to let users choose among a wider datatype for the sum buffer using a config, 
so that the above issue can be fixed.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28512) New optional mode: throw runtime exceptions on casting failures

2019-07-25 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892692#comment-16892692
 ] 

Marco Gaido commented on SPARK-28512:
-

Thanks for pinging me [~maropu]. It is not the same issue, I think, because in 
SPARK-28470 we are dealing only with cases where there is an overflow. Here 
there is no overflow: the value is simply not valid for the target type.
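
To illustrate the distinction (a stand-alone sketch of the current default, non-throwing behaviour):

{code:java}
import org.apache.spark.sql.SparkSession;

public class CastInvalidValueExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("cast-invalid-value-example").master("local[*]").getOrCreate();

    // No overflow is involved: 'abc' is simply not a valid integer, and with
    // the default behaviour the cast silently yields NULL.
    spark.sql("SELECT CAST('abc' AS INT) AS bad_cast").show();

    spark.stop();
  }
}
{code}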

> New optional mode: throw runtime exceptions on casting failures
> ---
>
> Key: SPARK-28512
> URL: https://issues.apache.org/jira/browse/SPARK-28512
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
>
> In popular DBMS like MySQL/PostgreSQL/Oracle, runtime exceptions are thrown 
> on casting, e.g. cast('abc' as int) 
> While in Spark, the result is silently converted to null. This is by design, 
> since we don't want a long-running job aborted by some casting failure. But 
> there are scenarios that users want to make sure all the data conversion are 
> correct, like the way they use MySQL/PostgreSQL/Oracle.
> If the changes touch too much code, we can limit the new optional mode to 
> table insertion first. By default the new behavior is disabled.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (LIVY-575) Implement missing metadata operations

2019-07-23 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/LIVY-575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890784#comment-16890784
 ] 

Marco Gaido commented on LIVY-575:
--

Oh, you're right. Sorry for the mistake.

> Implement missing metadata operations
> -
>
> Key: LIVY-575
> URL: https://issues.apache.org/jira/browse/LIVY-575
> Project: Livy
>  Issue Type: Improvement
>  Components: Thriftserver
>Reporter: Marco Gaido
>Priority: Minor
>
> Many metadata operations (eg. table list retrieval, schema retrieval, ...) 
> are currently not implemented. We should implement them.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (LIVY-571) When an exception happens running a query, we should report it to end user

2019-07-23 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/LIVY-571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16890785#comment-16890785
 ] 

Marco Gaido commented on LIVY-571:
--

Issue resolved by PR https://github.com/apache/incubator-livy/pull/182.

> When an exception happens running a query, we should report it to end user
> --
>
> Key: LIVY-571
> URL: https://issues.apache.org/jira/browse/LIVY-571
> Project: Livy
>  Issue Type: Bug
>  Components: Thriftserver
>Reporter: Marco Gaido
>Assignee: Jeffrey(Xilang) Yan
>Priority: Major
> Fix For: 0.7.0
>
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When a query fails with an exception on the Livy thriftserver, instead of 
> reporting this exception to the end user, a meaningless one is reported. E.g., 
> with Hive support not enabled on Spark, the following query causes:
> {code}
> 0: jdbc:hive2://localhost:10090/> create table test as select a.* from 
> (select 1, "2") a;
> Error: java.lang.RuntimeException: java.util.NoSuchElementException: 
> Statement 820bb5c2-018b-46ea-9b7f-b0e3b9c31c46 not found in session 
> acf3712b-1f08-4111-950f-559fc3f3f10c.
> org.apache.livy.thriftserver.session.ThriftSessionState.statementNotFound(ThriftSessionState.java:118)
> org.apache.livy.thriftserver.session.ThriftSessionState.cleanupStatement(ThriftSessionState.java:107)
> org.apache.livy.thriftserver.session.CleanupStatementJob.call(CleanupStatementJob.java:43)
> org.apache.livy.thriftserver.session.CleanupStatementJob.call(CleanupStatementJob.java:26)
> org.apache.livy.rsc.driver.JobWrapper.call(JobWrapper.java:64)
> org.apache.livy.rsc.driver.JobWrapper.call(JobWrapper.java:31)
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
> {code}
> Looking at the logs, of course the real problem is:
> {code}
> 19/03/23 10:40:32 ERROR LivyExecuteStatementOperation: Error running hive 
> query: 
> org.apache.hive.service.cli.HiveSQLException: 
> java.util.concurrent.ExecutionException: java.lang.RuntimeException: 
> org.apache.spark.sql.AnalysisException: Hive support is required to CREATE 
> Hive TABLE (AS SELECT);;
> 'CreateTable `test`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> ErrorIfExists
> +- Project [1#1, 2#2]
>+- SubqueryAlias `a`
>   +- Project [1 AS 1#1, 2 AS 2#2]
>  +- OneRowRelation
> org.apache.spark.sql.execution.datasources.HiveOnlyCheck$$anonfun$apply$12.apply(rules.scala:392)
> org.apache.spark.sql.execution.datasources.HiveOnlyCheck$$anonfun$apply$12.apply(rules.scala:390)
> org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:117)
> org.apache.spark.sql.execution.datasources.HiveOnlyCheck$.apply(rules.scala:390)
> org.apache.spark.sql.execution.datasources.HiveOnlyCheck$.apply(rules.scala:388)
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$2.apply(CheckAnalysis.scala:386)
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$2.apply(CheckAnalysis.scala:386)
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:386)
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:95)
> org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:108)
> org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:105)
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:105)
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
> org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:79)
> org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
> org.apache.livy.thriftserver.session.SqlJob.executeSql(SqlJob.java:74)
> org.apache.livy.thriftserver.session.SqlJob.call(SqlJob.java:64)
> org.apache.livy.thriftserver.session.SqlJob.call(SqlJob.java:35)
> org.apache.livy.rsc.driver.JobWrapper.call(JobWrapper.java:64)
> org.apache.livy.rsc.driver.JobWrapper.call(JobWrapper.java:31)
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 

[jira] [Issue Comment Deleted] (LIVY-575) Implement missing metadata operations

2019-07-23 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/LIVY-575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated LIVY-575:
-
Comment: was deleted

(was: Issue resolved by PR https://github.com/apache/incubator-livy/pull/182.)

> Implement missing metadata operations
> -
>
> Key: LIVY-575
> URL: https://issues.apache.org/jira/browse/LIVY-575
> Project: Livy
>  Issue Type: Improvement
>  Components: Thriftserver
>Reporter: Marco Gaido
>Priority: Minor
>
> Many metadata operations (eg. table list retrieval, schema retrieval, ...) 
> are currently not implemented. We should implement them.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (LIVY-571) When an exception happens running a query, we should report it to end user

2019-07-22 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/LIVY-571?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido reassigned LIVY-571:


Assignee: Jeffrey(Xilang) Yan

> When an exception happens running a query, we should report it to end user
> --
>
> Key: LIVY-571
> URL: https://issues.apache.org/jira/browse/LIVY-571
> Project: Livy
>  Issue Type: Bug
>  Components: Thriftserver
>Reporter: Marco Gaido
>Assignee: Jeffrey(Xilang) Yan
>Priority: Major
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> When a query fails with an exception on the Livy thriftserver, instead of 
> reporting this exception to the end user, a meaningless one is reported. E.g., 
> with Hive support not enabled on Spark, the following query causes:
> {code}
> 0: jdbc:hive2://localhost:10090/> create table test as select a.* from 
> (select 1, "2") a;
> Error: java.lang.RuntimeException: java.util.NoSuchElementException: 
> Statement 820bb5c2-018b-46ea-9b7f-b0e3b9c31c46 not found in session 
> acf3712b-1f08-4111-950f-559fc3f3f10c.
> org.apache.livy.thriftserver.session.ThriftSessionState.statementNotFound(ThriftSessionState.java:118)
> org.apache.livy.thriftserver.session.ThriftSessionState.cleanupStatement(ThriftSessionState.java:107)
> org.apache.livy.thriftserver.session.CleanupStatementJob.call(CleanupStatementJob.java:43)
> org.apache.livy.thriftserver.session.CleanupStatementJob.call(CleanupStatementJob.java:26)
> org.apache.livy.rsc.driver.JobWrapper.call(JobWrapper.java:64)
> org.apache.livy.rsc.driver.JobWrapper.call(JobWrapper.java:31)
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> java.lang.Thread.run(Thread.java:748)
> {code}
> Looking at the logs, of course the real problem is:
> {code}
> 19/03/23 10:40:32 ERROR LivyExecuteStatementOperation: Error running hive 
> query: 
> org.apache.hive.service.cli.HiveSQLException: 
> java.util.concurrent.ExecutionException: java.lang.RuntimeException: 
> org.apache.spark.sql.AnalysisException: Hive support is required to CREATE 
> Hive TABLE (AS SELECT);;
> 'CreateTable `test`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, 
> ErrorIfExists
> +- Project [1#1, 2#2]
>+- SubqueryAlias `a`
>   +- Project [1 AS 1#1, 2 AS 2#2]
>  +- OneRowRelation
> org.apache.spark.sql.execution.datasources.HiveOnlyCheck$$anonfun$apply$12.apply(rules.scala:392)
> org.apache.spark.sql.execution.datasources.HiveOnlyCheck$$anonfun$apply$12.apply(rules.scala:390)
> org.apache.spark.sql.catalyst.trees.TreeNode.foreach(TreeNode.scala:117)
> org.apache.spark.sql.execution.datasources.HiveOnlyCheck$.apply(rules.scala:390)
> org.apache.spark.sql.execution.datasources.HiveOnlyCheck$.apply(rules.scala:388)
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$2.apply(CheckAnalysis.scala:386)
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$2.apply(CheckAnalysis.scala:386)
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.checkAnalysis(CheckAnalysis.scala:386)
> org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:95)
> org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:108)
> org.apache.spark.sql.catalyst.analysis.Analyzer$$anonfun$executeAndCheck$1.apply(Analyzer.scala:105)
> org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
> org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:105)
> org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:57)
> org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:55)
> org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:47)
> org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:79)
> org.apache.spark.sql.SparkSession.sql(SparkSession.scala:642)
> org.apache.livy.thriftserver.session.SqlJob.executeSql(SqlJob.java:74)
> org.apache.livy.thriftserver.session.SqlJob.call(SqlJob.java:64)
> org.apache.livy.thriftserver.session.SqlJob.call(SqlJob.java:35)
> org.apache.livy.rsc.driver.JobWrapper.call(JobWrapper.java:64)
> org.apache.livy.rsc.driver.JobWrapper.call(JobWrapper.java:31)
> java.util.concurrent.FutureTask.run(FutureTask.java:266)
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 

[jira] [Resolved] (LIVY-575) Implement missing metadata operations

2019-07-22 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/LIVY-575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido resolved LIVY-575.
--
   Resolution: Fixed
Fix Version/s: 0.7.0

Issue resolved by PR https://github.com/apache/incubator-livy/pull/182.

> Implement missing metadata operations
> -
>
> Key: LIVY-575
> URL: https://issues.apache.org/jira/browse/LIVY-575
> Project: Livy
>  Issue Type: Improvement
>  Components: Thriftserver
>Reporter: Marco Gaido
>Priority: Minor
> Fix For: 0.7.0
>
>
> Many metadata operations (eg. table list retrieval, schema retrieval, ...) 
> are currently not implemented. We should implement them.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (SPARK-28470) Honor spark.sql.decimalOperations.nullOnOverflow in Cast

2019-07-22 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16889979#comment-16889979
 ] 

Marco Gaido commented on SPARK-28470:
-

Thanks for checking this Wenchen! I will work on this ASAP. Thanks.

> Honor spark.sql.decimalOperations.nullOnOverflow in Cast
> 
>
> Key: SPARK-28470
> URL: https://issues.apache.org/jira/browse/SPARK-28470
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Priority: Major
>
> cast long to decimal or decimal to decimal can overflow, we should respect 
> the new config if overflow happens.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (LIVY-598) Upgrade Livy jackson dependency to 2.9.9

2019-07-21 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/LIVY-598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido resolved LIVY-598.
--
   Resolution: Fixed
Fix Version/s: 0.7.0

Issue resolved by https://github.com/apache/incubator-livy/pull/174.

> Upgrade Livy jackson dependency to 2.9.9
> 
>
> Key: LIVY-598
> URL: https://issues.apache.org/jira/browse/LIVY-598
> Project: Livy
>  Issue Type: Improvement
>Reporter: Arun Mahadevan
>Assignee: Arun Mahadevan
>Priority: Major
> Fix For: 0.7.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Upgrade the jackson dependency to 2.9.9 which fixes CVE-2019-12086. 
> Spark has also recently upgraded to jackson version 2.9.9.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (LIVY-598) Upgrade Livy jackson dependency to 2.9.9

2019-07-21 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/LIVY-598?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido reassigned LIVY-598:


Assignee: Arun Mahadevan

> Upgrade Livy jackson dependency to 2.9.9
> 
>
> Key: LIVY-598
> URL: https://issues.apache.org/jira/browse/LIVY-598
> Project: Livy
>  Issue Type: Improvement
>Reporter: Arun Mahadevan
>Assignee: Arun Mahadevan
>Priority: Major
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Upgrade the jackson dependency to 2.9.9 which fixes CVE-2019-12086. 
> Spark has also recently upgraded to jackson version 2.9.9.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Comment Edited] (SPARK-28225) Unexpected behavior for Window functions

2019-07-20 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16889518#comment-16889518
 ] 

Marco Gaido edited comment on SPARK-28225 at 7/20/19 2:43 PM:
--

Let me cite the PostgreSQL documentation to explain the behavior:

??When an aggregate function is used as a window function, it aggregates over 
the rows within the current row's window frame. An aggregate used with ORDER BY 
and the default window frame definition produces a "running sum" type of 
behavior, which may or may not be what's wanted. To obtain aggregation over the 
whole partition, omit ORDER BY or use ROWS BETWEEN UNBOUNDED PRECEDING AND 
UNBOUNDED FOLLOWING. Other frame specifications can be used to obtain other 
effects.??

So the returned values seem correct to me.


was (Author: mgaido):
Let me cite PostgreSQL documentation to explain you the behavior:

{noformat}
When an aggregate function is used as a window function, it aggregates over the 
rows within the current row's window frame. An aggregate used with ORDER BY and 
the default window frame definition produces a "running sum" type of behavior, 
which may or may not be what's wanted. To obtain aggregation over the whole 
partition, omit ORDER BY or use ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED 
FOLLOWING. Other frame specifications can be used to obtain other effects.
{noformat}

So the returned values seem correct to me.


> Unexpected behavior for Window functions
> 
>
> Key: SPARK-28225
> URL: https://issues.apache.org/jira/browse/SPARK-28225
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Andrew Leverentz
>Priority: Major
>
> I've noticed some odd behavior when combining the "first" aggregate function 
> with an ordered Window.
> In particular, I'm working with columns created using the syntax
> {code}
> first($"y", ignoreNulls = true).over(Window.orderBy($"x"))
> {code}
> Below, I'm including some code which reproduces this issue in a Databricks 
> notebook.
> *Code:*
> {code:java}
> import org.apache.spark.sql.functions.first
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{StructType,StructField,IntegerType}
> val schema = StructType(Seq(
>   StructField("x", IntegerType, false),
>   StructField("y", IntegerType, true),
>   StructField("z", IntegerType, true)
> ))
> val input =
>   spark.createDataFrame(sc.parallelize(Seq(
> Row(101, null, 11),
> Row(102, null, 12),
> Row(103, null, 13),
> Row(203, 24, null),
> Row(201, 26, null),
> Row(202, 25, null)
>   )), schema = schema)
> input.show
> val output = input
>   .withColumn("u1", first($"y", ignoreNulls = 
> true).over(Window.orderBy($"x".asc_nulls_last)))
>   .withColumn("u2", first($"y", ignoreNulls = 
> true).over(Window.orderBy($"x".asc)))
>   .withColumn("u3", first($"y", ignoreNulls = 
> true).over(Window.orderBy($"x".desc_nulls_last)))
>   .withColumn("u4", first($"y", ignoreNulls = 
> true).over(Window.orderBy($"x".desc)))
>   .withColumn("u5", first($"z", ignoreNulls = 
> true).over(Window.orderBy($"x".asc_nulls_last)))
>   .withColumn("u6", first($"z", ignoreNulls = 
> true).over(Window.orderBy($"x".asc)))
>   .withColumn("u7", first($"z", ignoreNulls = 
> true).over(Window.orderBy($"x".desc_nulls_last)))
>   .withColumn("u8", first($"z", ignoreNulls = 
> true).over(Window.orderBy($"x".desc)))
> output.show
> {code}
> *Expectation:*
> Based on my understanding of how ordered-Window and aggregate functions work, 
> the results I expected to see were:
>  * u1 = u2 = constant value of 26
>  * u3 = u4 = constant value of 24
>  * u5 = u6 = constant value of 11
>  * u7 = u8 = constant value of 13
> However, columns u1, u2, u7, and u8 contain some unexpected nulls. 
> *Results:*
> {code:java}
> +---+----+----+----+----+---+---+---+---+----+----+
> |  x|   y|   z|  u1|  u2| u3| u4| u5| u6|  u7|  u8|
> +---+----+----+----+----+---+---+---+---+----+----+
> |203|  24|null|  26|  26| 24| 24| 11| 11|null|null|
> |202|  25|null|  26|  26| 24| 24| 11| 11|null|null|
> |201|  26|null|  26|  26| 24| 24| 11| 11|null|null|
> |103|null|  13|null|null| 24| 24| 11| 11|  13|  13|
> |102|null|  12|null|null| 24| 24| 11| 11|  13|  13|
> |101|null|  11|null|null| 24| 24| 11| 11|  13|  13|
> +---+----+----+----+----+---+---+---+---+----+----+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28225) Unexpected behavior for Window functions

2019-07-20 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16889518#comment-16889518
 ] 

Marco Gaido commented on SPARK-28225:
-

Let me cite the PostgreSQL documentation to explain the behavior:

{noformat}
When an aggregate function is used as a window function, it aggregates over the 
rows within the current row's window frame. An aggregate used with ORDER BY and 
the default window frame definition produces a "running sum" type of behavior, 
which may or may not be what's wanted. To obtain aggregation over the whole 
partition, omit ORDER BY or use ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED 
FOLLOWING. Other frame specifications can be used to obtain other effects.
{noformat}

So the returned values seem correct to me.
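
As a concrete sketch of the frame specification mentioned in the quote, using
the DataFrame API and the names ({{input}}, {{x}}, {{y}}) from the reporter's
snippet quoted below (untested, shown only to illustrate the point):

{code:scala}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.first

// Frame covering the whole partition, i.e. the DataFrame equivalent of
// ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
val wholePartition = Window
  .orderBy($"x".asc_nulls_last)
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

// With the explicit frame, u1 becomes the constant 26 the reporter expected,
// instead of the "first non-null value seen so far" that the default frame
// (unbounded preceding to current row) produces.
input.withColumn("u1", first($"y", ignoreNulls = true).over(wholePartition)).show()
{code}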


> Unexpected behavior for Window functions
> 
>
> Key: SPARK-28225
> URL: https://issues.apache.org/jira/browse/SPARK-28225
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Andrew Leverentz
>Priority: Major
>
> I've noticed some odd behavior when combining the "first" aggregate function 
> with an ordered Window.
> In particular, I'm working with columns created using the syntax
> {code}
> first($"y", ignoreNulls = true).over(Window.orderBy($"x"))
> {code}
> Below, I'm including some code which reproduces this issue in a Databricks 
> notebook.
> *Code:*
> {code:java}
> import org.apache.spark.sql.functions.first
> import org.apache.spark.sql.expressions.Window
> import org.apache.spark.sql.Row
> import org.apache.spark.sql.types.{StructType,StructField,IntegerType}
> val schema = StructType(Seq(
>   StructField("x", IntegerType, false),
>   StructField("y", IntegerType, true),
>   StructField("z", IntegerType, true)
> ))
> val input =
>   spark.createDataFrame(sc.parallelize(Seq(
> Row(101, null, 11),
> Row(102, null, 12),
> Row(103, null, 13),
> Row(203, 24, null),
> Row(201, 26, null),
> Row(202, 25, null)
>   )), schema = schema)
> input.show
> val output = input
>   .withColumn("u1", first($"y", ignoreNulls = 
> true).over(Window.orderBy($"x".asc_nulls_last)))
>   .withColumn("u2", first($"y", ignoreNulls = 
> true).over(Window.orderBy($"x".asc)))
>   .withColumn("u3", first($"y", ignoreNulls = 
> true).over(Window.orderBy($"x".desc_nulls_last)))
>   .withColumn("u4", first($"y", ignoreNulls = 
> true).over(Window.orderBy($"x".desc)))
>   .withColumn("u5", first($"z", ignoreNulls = 
> true).over(Window.orderBy($"x".asc_nulls_last)))
>   .withColumn("u6", first($"z", ignoreNulls = 
> true).over(Window.orderBy($"x".asc)))
>   .withColumn("u7", first($"z", ignoreNulls = 
> true).over(Window.orderBy($"x".desc_nulls_last)))
>   .withColumn("u8", first($"z", ignoreNulls = 
> true).over(Window.orderBy($"x".desc)))
> output.show
> {code}
> *Expectation:*
> Based on my understanding of how ordered-Window and aggregate functions work, 
> the results I expected to see were:
>  * u1 = u2 = constant value of 26
>  * u3 = u4 = constant value of 24
>  * u5 = u6 = constant value of 11
>  * u7 = u8 = constant value of 13
> However, columns u1, u2, u7, and u8 contain some unexpected nulls. 
> *Results:*
> {code:java}
> +---+----+----+----+----+---+---+---+---+----+----+
> |  x|   y|   z|  u1|  u2| u3| u4| u5| u6|  u7|  u8|
> +---+----+----+----+----+---+---+---+---+----+----+
> |203|  24|null|  26|  26| 24| 24| 11| 11|null|null|
> |202|  25|null|  26|  26| 24| 24| 11| 11|null|null|
> |201|  26|null|  26|  26| 24| 24| 11| 11|null|null|
> |103|null|  13|null|null| 24| 24| 11| 11|  13|  13|
> |102|null|  12|null|null| 24| 24| 11| 11|  13|  13|
> |101|null|  11|null|null| 24| 24| 11| 11|  13|  13|
> +---+----+----+----+----+---+---+---+---+----+----+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28386) Cannot resolve ORDER BY columns with GROUP BY and HAVING

2019-07-20 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16889422#comment-16889422
 ] 

Marco Gaido commented on SPARK-28386:
-

I think this is a duplicate of SPARK-26741. I have a PR for it, but it seems a 
bit stuck. Any help in reviewing it would be greatly appreciated. Thanks.

> Cannot resolve ORDER BY columns with GROUP BY and HAVING
> 
>
> Key: SPARK-28386
> URL: https://issues.apache.org/jira/browse/SPARK-28386
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> How to reproduce:
> {code:sql}
> CREATE TABLE test_having (a int, b int, c string, d string) USING parquet;
> INSERT INTO test_having VALUES (0, 1, '', 'A');
> INSERT INTO test_having VALUES (1, 2, '', 'b');
> INSERT INTO test_having VALUES (2, 2, '', 'c');
> INSERT INTO test_having VALUES (3, 3, '', 'D');
> INSERT INTO test_having VALUES (4, 3, '', 'e');
> INSERT INTO test_having VALUES (5, 3, '', 'F');
> INSERT INTO test_having VALUES (6, 4, '', 'g');
> INSERT INTO test_having VALUES (7, 4, '', 'h');
> INSERT INTO test_having VALUES (8, 4, '', 'I');
> INSERT INTO test_having VALUES (9, 4, '', 'j');
> SELECT lower(c), count(c) FROM test_having
>   GROUP BY lower(c) HAVING count(*) > 2
>   ORDER BY lower(c);
> {code}
> {noformat}
> spark-sql> SELECT lower(c), count(c) FROM test_having
>  > GROUP BY lower(c) HAVING count(*) > 2
>  > ORDER BY lower(c);
> Error in query: cannot resolve '`c`' given input columns: [lower(c), 
> count(c)]; line 3 pos 19;
> 'Sort ['lower('c) ASC NULLS FIRST], true
> +- Project [lower(c)#158, count(c)#159L]
>+- Filter (count(1)#161L > cast(2 as bigint))
>   +- Aggregate [lower(c#7)], [lower(c#7) AS lower(c)#158, count(c#7) AS 
> count(c)#159L, count(1) AS count(1)#161L]
>  +- SubqueryAlias test_having
> +- Relation[a#5,b#6,c#7,d#8] parquet
> {noformat}
> But it works when setting an alias:
> {noformat}
> spark-sql> SELECT lower(c) withAias, count(c) FROM test_having
>  > GROUP BY lower(c) HAVING count(*) > 2
>  > ORDER BY withAias;
> 3
>   4
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (LIVY-575) Implement missing metadata operations

2019-07-18 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/LIVY-575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887734#comment-16887734
 ] 

Marco Gaido commented on LIVY-575:
--

Hi [~yihengw]. Thanks for your help on this. I think it would be great to have 
them in a single PR. Looking forward to it! Thanks.

> Implement missing metadata operations
> -
>
> Key: LIVY-575
> URL: https://issues.apache.org/jira/browse/LIVY-575
> Project: Livy
>  Issue Type: Improvement
>  Components: Thriftserver
>Reporter: Marco Gaido
>Priority: Minor
>
> Many metadata operations (eg. table list retrieval, schema retrieval, ...) 
> are currently not implemented. We should implement them.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (SPARK-23758) MLlib 2.4 Roadmap

2019-07-16 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16886406#comment-16886406
 ] 

Marco Gaido commented on SPARK-23758:
-

[~dongjoon] it seems weird to set the affected version to 3.0 for this one. 
Shall we close it instead?

> MLlib 2.4 Roadmap
> -
>
> Key: SPARK-23758
> URL: https://issues.apache.org/jira/browse/SPARK-23758
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> h1. Roadmap process
> This roadmap is a master list for MLlib improvements we are working on during 
> this release.  This includes ML-related changes in PySpark and SparkR.
> *What is planned for the next release?*
> * This roadmap lists issues which at least one Committer has prioritized.  
> See details below in "Instructions for committers."
> * This roadmap only lists larger or more critical issues.
> *How can contributors influence this roadmap?*
> * If you believe an issue should be in this roadmap, please discuss the issue 
> on JIRA and/or the dev mailing list.  Make sure to ping Committers since at 
> least one must agree to shepherd the issue.
> * For general discussions, use this JIRA or the dev mailing list.  For 
> specific issues, please comment on those issues or the mailing list.
> * Vote for & watch issues which are important to you.
> ** MLlib, sorted by: [Votes | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20votes%20DESC]
>  or [Watchers | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20Watchers%20DESC]
> ** SparkR, sorted by: [Votes | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(SparkR)%20ORDER%20BY%20votes%20DESC]
>  or [Watchers | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(SparkR)%20ORDER%20BY%20Watchers%20DESC]
> h2. Target Version and Priority
> This section describes the meaning of Target Version and Priority.
> || Category | Target Version | Priority | Shepherd | Put on roadmap? | In 
> next release? ||
> | [1 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Blocker%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)]
>  | next release | Blocker | *must* | *must* | *must* |
> | [2 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Critical%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)]
>  | next release | Critical | *must* | yes, unless small | *best effort* |
> | [3 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Major%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)]
>  | next release | Major | *must* | optional | *best effort* |
> | [4 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Minor%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)]
>  | next release | Minor | optional | no | maybe |
> | [5 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Trivial%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)]
>  | next release | Trivial | optional | no | maybe |
> | [6 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20"In%20Progress"%2C%20Reopened)%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20"Target%20Version%2Fs"%20in%20(EMPTY)%20AND%20Shepherd%20not%20in%20(EMPTY)%20ORDER%20BY%20priority%20DESC]
>  | (empty) | (any) | yes | no | maybe |
> | [7 | 
> 

[jira] [Commented] (SPARK-28222) Feature importance outputs different values in GBT and Random Forest in 2.3.3 and 2.4 pyspark version

2019-07-13 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884412#comment-16884412
 ] 

Marco Gaido commented on SPARK-28222:
-

[~eneriwrt] do you have a simple repro for this? I can try to check it if I 
have an example to debug.
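
In case it helps, the repro does not need to be big; a sketch of the kind of
script I mean is below (illustrative only: the data, seed and parameters are
made up, and the resulting {{featureImportances}} would need to be compared
between 2.3.3, 2.4.x and sklearn on the same data):

{code:scala}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.linalg.Vectors
import spark.implicits._

// Tiny, fixed dataset so the run is deterministic across versions.
val train = Seq(
  (0.0, Vectors.dense(1.0, 0.0, 3.0)),
  (0.0, Vectors.dense(1.5, 0.0, 2.5)),
  (1.0, Vectors.dense(0.0, 1.0, 0.5)),
  (1.0, Vectors.dense(0.2, 1.1, 0.4))
).toDF("label", "features")

val model = new RandomForestClassifier()
  .setNumTrees(10)
  .setSeed(42L)
  .fit(train)

// This is the vector to compare between Spark versions (and against sklearn).
println(model.featureImportances)
{code}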

> Feature importance outputs different values in GBT and Random Forest in 2.3.3 
> and 2.4 pyspark version
> -
>
> Key: SPARK-28222
> URL: https://issues.apache.org/jira/browse/SPARK-28222
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3
>Reporter: eneriwrt
>Priority: Minor
>
> Feature importance values obtained in a binary classification project differ 
> depending on whether version 2.3.3 or 2.4.0 is used. It happens in Random 
> Forest and GBT. It turns out that the values that match the sklearn output are 
> the ones from the 2.3.3 version.
> As an example:
> *SPARK 2.4*
>  MODEL RandomForestClassifier_gini [0.0, 0.4117930839002269, 
> 0.06894132653061226, 0.15857667209786705, 0.2974447311021076, 
> 0.06324418636918638]
>  MODEL RandomForestClassifier_entropy [0.0, 0.3864372497988694, 
> 0.06578883597468652, 0.17433924485055197, 0.31754597164210124, 
> 0.055888697733790925]
>  MODEL GradientBoostingClassifier [0.0, 0.7556, 
> 0.24438, 0.0, 1.4602196686471875e-17, 0.0]
> *SPARK 2.3.3*
>  MODEL RandomForestClassifier_gini [0.0, 0.40957086167800455, 
> 0.06894132653061226, 0.16413222765342259, 0.2974447311021076, 
> 0.05991085303585305]
>  MODEL RandomForestClassifier_entropy [0.0, 0.3864372497988694, 
> 0.06578883597468652, 0.18789704501922055, 0.30398817147343266, 
> 0.055888697733790925]
>  MODEL GradientBoostingClassifier [0.0, 0.7555, 
> 0.24438, 0.0, 2.4326753518951276e-17, 0.0]



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28316) Decimal precision issue

2019-07-13 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884398#comment-16884398
 ] 

Marco Gaido commented on SPARK-28316:
-

Well, IIUC, this is just the result of Postgres having no limit on decimal 
precision, while Spark's decimal max precision is 38. Our decimal 
implementation is drawn from SQLServer's (and from Hive's, which in turn 
follows SQLServer).
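
For the multiplication example in this ticket, the truncation to 6 decimal
digits falls out of how the exact result type gets squeezed into the 38-digit
maximum. A rough sketch of that adjustment (it mirrors the SQLServer-style rule
Spark uses; the function and constant names here are illustrative, not the
actual internals):

{code:scala}
// DECIMAL(p1, s1) * DECIMAL(p2, s2) needs precision p1 + p2 + 1 and scale
// s1 + s2. When that exceeds 38 digits, the scale is reduced (but, by default,
// never below 6) so that the integral digits are preserved.
def boundedMultiplyType(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) = {
  val maxPrecision = 38
  val minAdjustedScale = 6
  val precision = p1 + p2 + 1
  val scale = s1 + s2
  if (precision <= maxPrecision) {
    (precision, scale)
  } else {
    val intDigits = precision - scale
    val adjustedScale =
      math.max(maxPrecision - intDigits, math.min(scale, minAdjustedScale))
    (maxPrecision, adjustedScale)
  }
}

boundedMultiplyType(38, 10, 38, 10)
// (38, 6) -- hence the 6 decimal digits in the Spark output above
{code}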

> Decimal precision issue
> ---
>
> Key: SPARK-28316
> URL: https://issues.apache.org/jira/browse/SPARK-28316
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Multiply check:
> {code:sql}
> -- Spark SQL
> spark-sql> select cast(-34338492.215397047 as decimal(38, 10)) * 
> cast(-34338492.215397047 as decimal(38, 10));
> 1179132047626883.596862
> -- PostgreSQL
> postgres=# select cast(-34338492.215397047 as numeric(38, 10)) * 
> cast(-34338492.215397047 as numeric(38, 10));
>?column?
> ---
>  1179132047626883.59686213585632020900
> (1 row)
> {code}
> Division check:
> {code:sql}
> -- Spark SQL
> spark-sql> select cast(93901.57763026 as decimal(38, 10)) / cast(4.31 as 
> decimal(38, 10));
> 21786.908963
> -- PostgreSQL
> postgres=# select cast(93901.57763026 as numeric(38, 10)) / cast(4.31 as 
> numeric(38, 10));
>   ?column?
> 
>  21786.908962937355
> (1 row)
> {code}
> POWER(10, LN(value)) check:
> {code:sql}
> -- Spark SQL
> spark-sql> SELECT CAST(POWER(cast('10' as decimal(38, 18)), 
> LN(ABS(round(cast(-24926804.04504742 as decimal(38, 10)),200 AS 
> decimal(38, 10));
> 107511333880051856
> -- PostgreSQL
> postgres=# SELECT CAST(POWER(cast('10' as numeric(38, 18)), 
> LN(ABS(round(cast(-24926804.04504742 as numeric(38, 10)),200 AS 
> numeric(38, 10));
>  power
> ---
>  107511333880052007.0414112467
> (1 row)
> {code}
> AVG, STDDEV and VARIANCE returns double type:
> {code:sql}
> -- Spark SQL
> spark-sql> create temporary view t1 as select * from values
>  >   (cast(-24926804.04504742 as decimal(38, 10))),
>  >   (cast(16397.038491 as decimal(38, 10))),
>  >   (cast(7799461.4119 as decimal(38, 10)))
>  >   as t1(t);
> spark-sql> SELECT AVG(t), STDDEV(t), VARIANCE(t) FROM t1;
> -5703648.53155214   1.7096528995154984E7   2.922913036821751E14
> -- PostgreSQL
> postgres=# SELECT AVG(t), STDDEV(t), VARIANCE(t)  from (values 
> (cast(-24926804.04504742 as decimal(38, 10))), (cast(16397.038491 as 
> decimal(38, 10))), (cast(7799461.4119 as decimal(38, 10 t1(t);
>   avg  |stddev |   
> variance
> ---+---+--
>  -5703648.53155214 | 17096528.99515498420743029415 | 
> 292291303682175.094017569588
> (1 row)
> {code}
> EXP returns double type:
> {code:sql}
> -- Spark SQL
> spark-sql> select exp(cast(1.0 as decimal(31,30)));
> 2.718281828459045
> -- PostgreSQL
> postgres=# select exp(cast(1.0 as decimal(31,30)));
>exp
> --
>  2.718281828459045235360287471353
> (1 row)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28324) The LOG function using 10 as the base, but Spark using E

2019-07-13 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884397#comment-16884397
 ] 

Marco Gaido commented on SPARK-28324:
-

+1 for [~srowen]'s opinion. I don't think it is a good idea to change the 
behavior here.
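
For anyone relying on a base-10 logarithm, it is already available without
changing the default base of {{log}}; a quick sketch (the values in the
comments are from memory, not a pasted run):

{code:scala}
// log(x) is the natural logarithm; log(base, x) and log10(x) give the
// Postgres-style base-10 result.
spark.sql("SELECT log(10), log(10, 10), log10(10)").show()
// log(10)     ~ 2.302585092994046   (natural log, current Spark default)
// log(10, 10) = 1.0                 (explicit base 10)
// log10(10)   = 1.0
{code}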

> The LOG function using 10 as the base, but Spark using E
> 
>
> Key: SPARK-28324
> URL: https://issues.apache.org/jira/browse/SPARK-28324
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Spark SQL:
> {code:sql}
> spark-sql> select log(10);
> 2.302585092994046
> {code}
> PostgreSQL:
> {code:sql}
> postgres=# select log(10);
>  log
> -
>1
> (1 row)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28348) Avoid cast twice for decimal type

2019-07-11 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883165#comment-16883165
 ] 

Marco Gaido commented on SPARK-28348:
-

Mmmh, yes, you're right.

> Avoid cast twice for decimal type
> -
>
> Key: SPARK-28348
> URL: https://issues.apache.org/jira/browse/SPARK-28348
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.3
>Reporter: Yuming Wang
>Priority: Major
>
> Spark 2.1:
> {code:scala}
> scala> sql("select cast(cast(-34338492.215397047 as decimal(38, 10)) * 
> cast(-34338492.215397047 as decimal(38, 10)) as decimal(38, 
> 18))").show(false);
> +---+
> |CAST((CAST(-34338492.215397047 AS DECIMAL(38,10)) * CAST(-34338492.215397047 
> AS DECIMAL(38,10))) AS DECIMAL(38,18))|
> +---+
> |1179132047626883.596862135856320209  
>   |
> +---+
> {code}
> Spark 2.3:
> {code:scala}
> scala> sql("select cast(cast(-34338492.215397047 as decimal(38, 10)) * 
> cast(-34338492.215397047 as decimal(38, 10)) as decimal(38, 
> 18))").show(false);
> +---+
> |CAST((CAST(-34338492.215397047 AS DECIMAL(38,10)) * CAST(-34338492.215397047 
> AS DECIMAL(38,10))) AS DECIMAL(38,18))|
> +---+
> |1179132047626883.596862  
>   |
> +---+
> {code}
> I think we do not need to cast result to {{decimal(38, 6)}} and then cast 
> result to {{decimal(38, 18)}} for this case.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28348) Avoid cast twice for decimal type

2019-07-11 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882773#comment-16882773
 ] 

Marco Gaido commented on SPARK-28348:
-

No, I don't think that's a good idea. Setting the result type to {{decimal(p3, 
s3)}} could cause the overflow issues that 
https://issues.apache.org/jira/browse/SPARK-22036 was meant to address. On the 
other hand, avoiding the cast would change the result type. So I don't see a 
good way to change this.

> Avoid cast twice for decimal type
> -
>
> Key: SPARK-28348
> URL: https://issues.apache.org/jira/browse/SPARK-28348
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.3
>Reporter: Yuming Wang
>Priority: Major
>
> Spark 2.1:
> {code:scala}
> scala> sql("select cast(cast(-34338492.215397047 as decimal(38, 10)) * 
> cast(-34338492.215397047 as decimal(38, 10)) as decimal(38, 
> 18))").show(false);
> +---+
> |CAST((CAST(-34338492.215397047 AS DECIMAL(38,10)) * CAST(-34338492.215397047 
> AS DECIMAL(38,10))) AS DECIMAL(38,18))|
> +---+
> |1179132047626883.596862135856320209  
>   |
> +---+
> {code}
> Spark 2.3:
> {code:scala}
> scala> sql("select cast(cast(-34338492.215397047 as decimal(38, 10)) * 
> cast(-34338492.215397047 as decimal(38, 10)) as decimal(38, 
> 18))").show(false);
> +---+
> |CAST((CAST(-34338492.215397047 AS DECIMAL(38,10)) * CAST(-34338492.215397047 
> AS DECIMAL(38,10))) AS DECIMAL(38,18))|
> +---+
> |1179132047626883.596862  
>   |
> +---+
> {code}
> I think we do not need to cast result to {{decimal(38, 6)}} and then cast 
> result to {{decimal(38, 18)}} for this case.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28348) Avoid cast twice for decimal type

2019-07-11 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882741#comment-16882741
 ] 

Marco Gaido commented on SPARK-28348:
-

There is no cast to {{decimal(38, 6)}}. The reason why the result is "truncated" 
at the 6th decimal digit is explained in 
https://issues.apache.org/jira/browse/SPARK-22036 and it is controlled by 
{{spark.sql.decimalOperations.allowPrecisionLoss}}. Honestly, I see no issue 
here.
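
As a rough illustration of that flag (a sketch from memory, not a pasted
transcript):

{code:scala}
// With precision loss disallowed, the multiplication result keeps its full
// scale, decimal(38, 20); the exact product still fits in 38 digits, so no
// truncation to 6 decimal digits should happen.
spark.conf.set("spark.sql.decimalOperations.allowPrecisionLoss", "false")
spark.sql(
  """SELECT CAST(-34338492.215397047 AS DECIMAL(38, 10)) *
    |       CAST(-34338492.215397047 AS DECIMAL(38, 10))""".stripMargin).show(false)
// expected: 1179132047626883.59686213585632020900
{code}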

> Avoid cast twice for decimal type
> -
>
> Key: SPARK-28348
> URL: https://issues.apache.org/jira/browse/SPARK-28348
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.3
>Reporter: Yuming Wang
>Priority: Major
>
> Spark 2.1:
> {code:scala}
> scala> sql("select cast(cast(-34338492.215397047 as decimal(38, 10)) * 
> cast(-34338492.215397047 as decimal(38, 10)) as decimal(38, 
> 18))").show(false);
> +---+
> |CAST((CAST(-34338492.215397047 AS DECIMAL(38,10)) * CAST(-34338492.215397047 
> AS DECIMAL(38,10))) AS DECIMAL(38,18))|
> +---+
> |1179132047626883.596862135856320209  
>   |
> +---+
> {code}
> Spark 2.3:
> {code:scala}
> scala> sql("select cast(cast(-34338492.215397047 as decimal(38, 10)) * 
> cast(-34338492.215397047 as decimal(38, 10)) as decimal(38, 
> 18))").show(false);
> +---+
> |CAST((CAST(-34338492.215397047 AS DECIMAL(38,10)) * CAST(-34338492.215397047 
> AS DECIMAL(38,10))) AS DECIMAL(38,18))|
> +---+
> |1179132047626883.596862  
>   |
> +---+
> {code}
> I think we do not need to cast result to {{decimal(38, 6)}} and then cast 
> result to {{decimal(38, 18)}} for this case.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-28348) Avoid cast twice for decimal type

2019-07-11 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16882741#comment-16882741
 ] 

Marco Gaido edited comment on SPARK-28348 at 7/11/19 8:23 AM:
--

There is no cast to {{decimal(38, 6)}}. The reason why the result is 
"truncated" at the 6th decimal digit is explained in 
https://issues.apache.org/jira/browse/SPARK-22036 and it is controlled by 
{{spark.sql.decimalOperations.allowPrecisionLoss}}. Honestly, I see no issue 
here.


was (Author: mgaido):
There is no cast to `decimal(38, 6)`. The reson why the result is "truncated" 
at the 6th scale number is explained in 
https://issues.apache.org/jira/browse/SPARK-22036 and it it controlled by 
{{spark.sql.decimalOperations.allowPrecisionLoss}}. I see no issue here 
honestly.

> Avoid cast twice for decimal type
> -
>
> Key: SPARK-28348
> URL: https://issues.apache.org/jira/browse/SPARK-28348
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.3
>Reporter: Yuming Wang
>Priority: Major
>
> Spark 2.1:
> {code:scala}
> scala> sql("select cast(cast(-34338492.215397047 as decimal(38, 10)) * 
> cast(-34338492.215397047 as decimal(38, 10)) as decimal(38, 
> 18))").show(false);
> +---+
> |CAST((CAST(-34338492.215397047 AS DECIMAL(38,10)) * CAST(-34338492.215397047 
> AS DECIMAL(38,10))) AS DECIMAL(38,18))|
> +---+
> |1179132047626883.596862135856320209  
>   |
> +---+
> {code}
> Spark 2.3:
> {code:scala}
> scala> sql("select cast(cast(-34338492.215397047 as decimal(38, 10)) * 
> cast(-34338492.215397047 as decimal(38, 10)) as decimal(38, 
> 18))").show(false);
> +---+
> |CAST((CAST(-34338492.215397047 AS DECIMAL(38,10)) * CAST(-34338492.215397047 
> AS DECIMAL(38,10))) AS DECIMAL(38,18))|
> +---+
> |1179132047626883.596862  
>   |
> +---+
> {code}
> I think we do not need to cast result to {{decimal(38, 6)}} and then cast 
> result to {{decimal(38, 18)}} for this case.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28322) DIV support decimal type

2019-07-10 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16881798#comment-16881798
 ] 

Marco Gaido commented on SPARK-28322:
-

Thanks for pinging me, [~yumwang]. I'll work on this over the weekend. Thanks!
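
In the meantime, a possible workaround sketch (untested, and only equivalent to
an integral {{div}} for non-negative operands) is to do the division on
decimals and truncate explicitly:

{code:scala}
// div currently requires integral operands, but FLOOR over the decimal
// division gives the same integral quotient for non-negative inputs.
spark.sql("SELECT FLOOR(CAST(10 AS DECIMAL) / CAST(3 AS DECIMAL))").show()
// expected: 3
{code}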

> DIV support decimal type
> 
>
> Key: SPARK-28322
> URL: https://issues.apache.org/jira/browse/SPARK-28322
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> Spark SQL:
> {code:sql}
> spark-sql> SELECT DIV(CAST(10 AS DECIMAL), CAST(3 AS DECIMAL));
> Error in query: cannot resolve '(CAST(10 AS DECIMAL(10,0)) div CAST(3 AS 
> DECIMAL(10,0)))' due to data type mismatch: '(CAST(10 AS DECIMAL(10,0)) div 
> CAST(3 AS DECIMAL(10,0)))' requires integral type, not decimal(10,0); line 1 
> pos 7;
> 'Project [unresolvedalias((cast(10 as decimal(10,0)) div cast(3 as 
> decimal(10,0))), None)]
> +- OneRowRelation
> {code}
> PostgreSQL:
> {code:sql}
> postgres=# SELECT DIV(CAST(10 AS DECIMAL), CAST(3 AS DECIMAL));
>  div
> -
>3
> (1 row)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28067) Incorrect results in decimal aggregation with whole-stage code gen enabled

2019-07-08 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16880700#comment-16880700
 ] 

Marco Gaido commented on SPARK-28067:
-

I cannot reproduce in 2.4.0 either:

{code}

spark-2.4.0-bin-hadoop2.7 xxx$ ./bin/spark-shell
2019-07-08 22:52:11 WARN  NativeCodeLoader:62 - Unable to load native-hadoop 
library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
Spark context Web UI available at http://xxx:4040
Spark context available as 'sc' (master = local[*], app id = 
local-1562619141279).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.0
      /_/
 
Using Scala version 2.11.12 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_152)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val df = Seq(
 |  (BigDecimal("1000"), 1),
 |  (BigDecimal("1000"), 1),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2)).toDF("decNum", "intNum")
df: org.apache.spark.sql.DataFrame = [decNum: decimal(38,18), intNum: int]

scala> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, 
"intNum").agg(sum("decNum"))
df2: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]

scala> df2.show(40,false)
+-----------+
|sum(decNum)|
+-----------+
|       null|
+-----------+
{code}

> Incorrect results in decimal aggregation with whole-stage code gen enabled
> --
>
> Key: SPARK-28067
> URL: https://issues.apache.org/jira/browse/SPARK-28067
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.4.0
> Environment: Ubuntu LTS 16.04
> Oracle Java 1.8.0_201
> spark-2.4.3-bin-without-hadoop
> spark-shell
>Reporter: Mark Sirek
>Priority: Minor
>  Labels: correctness
>
> The following test case involving a join followed by a sum aggregation 
> returns the wrong answer for the sum:
>  
> {code:java}
> val df = Seq(
>  (BigDecimal("1000"), 1),
>  (BigDecimal("1000"), 1),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2)).toDF("decNum", "intNum")
> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, 
> "intNum").agg(sum("decNum"))
> scala> df2.show(40,false)
>  ---
> sum(decNum)
> ---
> 4000.00
> ---
>  
> {code}
>  
> The result should be 104000..
> It appears a partial sum is computed for each join key, as the result 
> returned would be the answer for all rows matching intNum === 1.
> If only the rows with intNum === 2 are included, the answer given is null:
>  
> {code:java}
> scala> val df3 = df.filter($"intNum" === lit(2))
>  df3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [decNum: 
> decimal(38,18), intNum: int]
> scala> val df4 = df3.withColumnRenamed("decNum", "decNum2").join(df3, 
> "intNum").agg(sum("decNum"))
>  df4: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]
> scala> df4.show(40,false)
>  ---
> sum(decNum)
> ---
> null
> ---
>  
> {code}
>  
> The correct answer, 10., doesn't fit in 
> the DataType picked for the result, decimal(38,18), so an overflow occurs, 
> which Spark then converts to null.
> The first example, which doesn't filter out the intNum === 1 values should 
> also return null, indicating overflow, but it doesn't.  This may mislead the 
> user to think a valid sum was computed.
> If whole-stage code gen is turned off:
> spark.conf.set("spark.sql.codegen.wholeStage", false)
> ... incorrect results are not returned because the 

[jira] [Created] (SPARK-28235) Decimal sum return type

2019-07-02 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-28235:
---

 Summary: Decimal sum return type
 Key: SPARK-28235
 URL: https://issues.apache.org/jira/browse/SPARK-28235
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Marco Gaido


Our implementation of decimal operations follows SQLServer behavior. As per 
https://docs.microsoft.com/it-it/sql/t-sql/functions/sum-transact-sql?view=sql-server-2017,
the result of a sum operation should be `DECIMAL(38, s)`, while currently we are 
setting it to `DECIMAL(10 + p, s)`. This means that with large datasets we may 
run into overflow even though the value could have been represented with higher 
precision; SQLServer returns correct results in that case.
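
A quick way to see the current behavior from a spark-shell (a sketch; the
column names are only for the example):

{code:scala}
import org.apache.spark.sql.functions.sum
import spark.implicits._

// The input column is DECIMAL(10, 2); today the aggregate widens it by 10
// integral digits, so the sum comes back as DECIMAL(20, 2) rather than the
// DECIMAL(38, 2) that the SQLServer rule would give.
Seq("1.23", "4.56").toDF("s")
  .selectExpr("CAST(s AS DECIMAL(10, 2)) AS d")
  .agg(sum($"d"))
  .printSchema()
// root
//  |-- sum(d): decimal(20,2) (nullable = true)
{code}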



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28222) Feature importance outputs different values in GBT and Random Forest in 2.3.3 and 2.4 pyspark version

2019-07-02 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16877267#comment-16877267
 ] 

Marco Gaido commented on SPARK-28222:
-

Mmmmh, there has been a bug fix for it (see SPARK-26721), but it should be in 
3.0 only AFAIK. The question is: which is the right value? Can you compare it 
with other libs like sklearn?

> Feature importance outputs different values in GBT and Random Forest in 2.3.3 
> and 2.4 pyspark version
> -
>
> Key: SPARK-28222
> URL: https://issues.apache.org/jira/browse/SPARK-28222
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3
>Reporter: eneriwrt
>Priority: Minor
>
> Feature importance values obtained in a binary classification project differ 
> depending on whether version 2.3.3 or 2.4.0 is used. It happens in Random 
> Forest and GBT.
> As an example:
> *SPARK 2.4*
> MODEL RandomForestClassifier_gini [0.0, 0.4117930839002269, 
> 0.06894132653061226, 0.15857667209786705, 0.2974447311021076, 
> 0.06324418636918638]
> MODEL RandomForestClassifier_entropy [0.0, 0.3864372497988694, 
> 0.06578883597468652, 0.17433924485055197, 0.31754597164210124, 
> 0.055888697733790925]
> MODEL GradientBoostingClassifier [0.0, 0.7556, 
> 0.24438, 0.0, 1.4602196686471875e-17, 0.0]
> *SPARK 2.3.3*
> MODEL RandomForestClassifier_gini [0.0, 0.40957086167800455, 
> 0.06894132653061226, 0.16413222765342259, 0.2974447311021076, 
> 0.05991085303585305]
> MODEL RandomForestClassifier_entropy [0.0, 0.3864372497988694, 
> 0.06578883597468652, 0.18789704501922055, 0.30398817147343266, 
> 0.055888697733790925]
> MODEL GradientBoostingClassifier [0.0, 0.7555, 
> 0.24438, 0.0, 2.4326753518951276e-17, 0.0]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28186) array_contains returns null instead of false when one of the items in the array is null

2019-07-01 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16876509#comment-16876509
 ] 

Marco Gaido commented on SPARK-28186:
-

You're right about that. The equivalent in Postgres is {{=ANY}}, which behaves 
like current Spark. So I don't see a strong motivation to change the current 
Spark behavior.
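
For callers who want plain two-valued logic anyway, wrapping the call is
enough; a sketch against the {{tbl}} temp view from the snippet quoted below:

{code:scala}
// COALESCE collapses the NULL ("unknown") case to false, which is the
// semantics the reporter expected for "item not found, but the array
// contains a null".
spark.sql("""
  SELECT id, vals,
         COALESCE(array_contains(vals, 'd'), false) AS has_d
  FROM tbl
""").show()
// row 1: false, row 2: false (instead of null)
{code}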

> array_contains returns null instead of false when one of the items in the 
> array is null
> ---
>
> Key: SPARK-28186
> URL: https://issues.apache.org/jira/browse/SPARK-28186
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Alex Kushnir
>Priority: Major
>
> If an array contains a null item, array_contains returns true when the item is 
> found, but when the item is not found it returns null instead of false.
> Seq(
> (1, Seq("a", "b", "c")),
> (2, Seq("a", "b", null, "c"))
> ).toDF("id", "vals").createOrReplaceTempView("tbl")
> spark.sql("select id, vals, array_contains(vals, 'a') as has_a, 
> array_contains(vals, 'd') as has_d from tbl").show
> +---+----------+-----+-----+
> | id|      vals|has_a|has_d|
> +---+----------+-----+-----+
> |  1| [a, b, c]| true|false|
> |  2|[a, b,, c]| true| null|
> +---+----------+-----+-----+



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28186) array_contains returns null instead of false when one of the items in the array is null

2019-07-01 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16876357#comment-16876357
 ] 

Marco Gaido commented on SPARK-28186:
-

Do you know of any SQL DB with the behavior you are suggesting?

> array_contains returns null instead of false when one of the items in the 
> array is null
> ---
>
> Key: SPARK-28186
> URL: https://issues.apache.org/jira/browse/SPARK-28186
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Alex Kushnir
>Priority: Major
>
> If an array contains a null item, array_contains returns true when the item is 
> found, but when the item is not found it returns null instead of false.
> Seq(
> (1, Seq("a", "b", "c")),
> (2, Seq("a", "b", null, "c"))
> ).toDF("id", "vals").createOrReplaceTempView("tbl")
> spark.sql("select id, vals, array_contains(vals, 'a') as has_a, 
> array_contains(vals, 'd') as has_d from tbl").show
> +---+----------+-----+-----+
> | id|      vals|has_a|has_d|
> +---+----------+-----+-----+
> |  1| [a, b, c]| true|false|
> |  2|[a, b,, c]| true| null|
> +---+----------+-----+-----+



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28186) array_contains returns null instead of false when one of the items in the array is null

2019-06-29 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16875447#comment-16875447
 ] 

Marco Gaido commented on SPARK-28186:
-

This is the right behavior AFAIK. Why are you saying it is wrong?

> array_contains returns null instead of false when one of the items in the 
> array is null
> ---
>
> Key: SPARK-28186
> URL: https://issues.apache.org/jira/browse/SPARK-28186
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Alex Kushnir
>Priority: Major
>
> If an array contains a null item, array_contains returns true when the item is 
> found, but when the item is not found it returns null instead of false.
> Seq(
> (1, Seq("a", "b", "c")),
> (2, Seq("a", "b", null, "c"))
> ).toDF("id", "vals").createOrReplaceTempView("tbl")
> spark.sql("select id, vals, array_contains(vals, 'a') as has_a, 
> array_contains(vals, 'd') as has_d from tbl").show
> +---+----------+-----+-----+
> | id|      vals|has_a|has_d|
> +---+----------+-----+-----+
> |  1| [a, b, c]| true|false|
> |  2|[a, b,, c]| true| null|
> +---+----------+-----+-----+



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28201) Revisit MakeDecimal behavior on overflow

2019-06-28 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28201?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874765#comment-16874765
 ] 

Marco Gaido commented on SPARK-28201:
-

I'll create a PR for this ASAP.

> Revisit MakeDecimal behavior on overflow
> 
>
> Key: SPARK-28201
> URL: https://issues.apache.org/jira/browse/SPARK-28201
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Marco Gaido
>Priority: Major
>
> As pointed out in 
> https://github.com/apache/spark/pull/20350#issuecomment-505997469, in special 
> cases of decimal aggregation we are using the `MakeDecimal` operator.
> This operator does not have a well-defined behavior in case of overflow; what 
> it does currently is:
>  - if codegen is enabled, it returns null;
>  - in interpreted mode, it throws an `IllegalArgumentException`.
> So we should make its behavior uniform with other similar cases and, in 
> particular, honor the value of the conf introduced in SPARK-23179 and behave 
> accordingly, i.e.:
>  - return null if the flag is true;
>  - throw an `ArithmeticException` if the flag is false.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28201) Revisit MakeDecimal behavior on overflow

2019-06-28 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-28201:
---

 Summary: Revisit MakeDecimal behavior on overflow
 Key: SPARK-28201
 URL: https://issues.apache.org/jira/browse/SPARK-28201
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Marco Gaido


As pointed out in 
https://github.com/apache/spark/pull/20350#issuecomment-505997469, in special 
cases of decimal aggregation we are using the `MakeDecimal` operator.

This operator does not have a well-defined behavior in case of overflow; what 
it does currently is:

 - if codegen is enabled, it returns null;
 - in interpreted mode, it throws an `IllegalArgumentException`.

So we should make its behavior uniform with other similar cases and, in 
particular, honor the value of the conf introduced in SPARK-23179 and behave 
accordingly, i.e.:

 - return null if the flag is true;
 - throw an `ArithmeticException` if the flag is false.
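
A minimal sketch of the intended policy (the names below are illustrative only,
not the actual Spark internals):

{code:scala}
// Both the codegen and the interpreted paths should funnel overflow through
// the same policy: null when the flag is set, ArithmeticException otherwise.
def makeDecimal(unscaled: Long, precision: Int, scale: Int,
                nullOnOverflow: Boolean): java.math.BigDecimal = {
  val value = java.math.BigDecimal.valueOf(unscaled, scale)
  if (value.precision > precision) {
    if (nullOnOverflow) null
    else throw new ArithmeticException(
      s"Decimal precision ${value.precision} exceeds max precision $precision")
  } else {
    value
  }
}
{code}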



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28200) Overflow handling in `ExpressionEncoder`

2019-06-28 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-28200:
---

 Summary: Overflow handling in `ExpressionEncoder`
 Key: SPARK-28200
 URL: https://issues.apache.org/jira/browse/SPARK-28200
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Marco Gaido


As pointed out in https://github.com/apache/spark/pull/20350, we are currently 
not checking the overflow when serializing a java/scala `BigDecimal` in 
`ExpressionEncoder` / `ScalaReflection`.

We should add this check there too.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28067) Incorrect results in decimal aggregation with whole-stage code gen enabled

2019-06-24 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870880#comment-16870880
 ] 

Marco Gaido commented on SPARK-28067:
-

No, it is the same. Are you sure about your configs?

{code}
macmarco:spark mark9$ git log -5 --oneline
5ad1053f3e (HEAD, apache/master) [SPARK-28128][PYTHON][SQL] Pandas Grouped UDFs 
skip empty partitions
113f8c8d13 [SPARK-28132][PYTHON] Update document type conversion for Pandas 
UDFs (pyarrow 0.13.0, pandas 0.24.2, Python 3.7)
9b9d81b821 [SPARK-28131][PYTHON] Update document type conversion between Python 
data and SQL types in normal UDFs (Python 3.7)
54da3bbfb2 [SPARK-28127][SQL] Micro optimization on TreeNode's mapChildren 
method
47f54b1ec7 [SPARK-28118][CORE] Add `spark.eventLog.compression.codec` 
configuration
macmarco:spark mark9$ ./bin/spark-shell 
19/06/24 09:17:58 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
Spark context Web UI available at http://.:4040
Spark context available as 'sc' (master = local[*], app id = 
local-1561360686725).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
      /_/
 
Using Scala version 2.12.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_152)
Type in expressions to have them evaluated.
Type :help for more information.

scala> val df = Seq(
 |  (BigDecimal("1000"), 1),
 |  (BigDecimal("1000"), 1),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2),
 |  (BigDecimal("1000"), 2)).toDF("decNum", "intNum")
df: org.apache.spark.sql.DataFrame = [decNum: decimal(38,18), intNum: int]

scala> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, 
"intNum").agg(sum("decNum"))
df2: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]

scala> df2.explain
== Physical Plan ==
*(2) HashAggregate(keys=[], functions=[sum(decNum#14)])
+- Exchange SinglePartition
   +- *(1) HashAggregate(keys=[], functions=[partial_sum(decNum#14)])
  +- *(1) Project [decNum#14]
 +- *(1) BroadcastHashJoin [intNum#8], [intNum#15], Inner, BuildLeft
:- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, 
int, false] as bigint)))
:  +- LocalTableScan [intNum#8]
+- LocalTableScan [decNum#14, intNum#15]



scala> df2.show(40,false)
+-----------+
|sum(decNum)|
+-----------+
|       null|
+-----------+
{code}

> Incorrect results in decimal aggregation with whole-stage code gen enabled
> --
>
> Key: SPARK-28067
> URL: https://issues.apache.org/jira/browse/SPARK-28067
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.4.0
> Environment: Ubuntu LTS 16.04
> Oracle Java 1.8.0_201
> spark-2.4.3-bin-without-hadoop
> spark-shell
>Reporter: Mark Sirek
>Priority: Minor
>  Labels: correctness
>
> The following test case involving a join followed by a sum aggregation 
> returns the wrong answer for the sum:
>  
> {code:java}
> val df = Seq(
>  (BigDecimal("1000"), 1),
>  (BigDecimal("1000"), 1),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2)).toDF("decNum", "intNum")
> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, 
> "intNum").agg(sum("decNum"))
> scala> df2.show(40,false)
>  ---
> sum(decNum)
> ---
> 4000.00
> ---
>  
> {code}
>  
> The result should be 

[jira] [Commented] (SPARK-28135) ceil/ceiling/floor/power returns incorrect values

2019-06-22 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870360#comment-16870360
 ] 

Marco Gaido commented on SPARK-28135:
-

[~Tonix517] tickets are assigned only once the PR is merged and the ticket is 
closed. So please go ahead and submit the PR, and the committer who eventually 
merges it will assign the ticket to you. Thanks.

> ceil/ceiling/floor/power returns incorrect values
> -
>
> Key: SPARK-28135
> URL: https://issues.apache.org/jira/browse/SPARK-28135
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> {noformat}
> spark-sql> select ceil(double(1.2345678901234e+200)), 
> ceiling(double(1.2345678901234e+200)), floor(double(1.2345678901234e+200)), 
> power('1', 'NaN');
> 9223372036854775807   9223372036854775807 9223372036854775807 NaN
> {noformat}
> {noformat}
> postgres=# select ceil(1.2345678901234e+200::float8), 
> ceiling(1.2345678901234e+200::float8), floor(1.2345678901234e+200::float8), 
> power('1', 'NaN');
>  ceil |   ceiling|floor | power
> --+--+--+---
>  1.2345678901234e+200 | 1.2345678901234e+200 | 1.2345678901234e+200 | 1
> (1 row)
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28067) Incorrect results in decimal aggregation with whole-stage code gen enabled

2019-06-22 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870293#comment-16870293
 ] 

Marco Gaido commented on SPARK-28067:
-

I cannot reproduce this on master: it always returns null with whole-stage 
codegen enabled.

> Incorrect results in decimal aggregation with whole-stage code gen enabled
> --
>
> Key: SPARK-28067
> URL: https://issues.apache.org/jira/browse/SPARK-28067
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0, 2.4.0
> Environment: Ubuntu LTS 16.04
> Oracle Java 1.8.0_201
> spark-2.4.3-bin-without-hadoop
> spark-shell
>Reporter: Mark Sirek
>Priority: Minor
>  Labels: correctness
>
> The following test case involving a join followed by a sum aggregation 
> returns the wrong answer for the sum:
>  
> {code:java}
> val df = Seq(
>  (BigDecimal("1000"), 1),
>  (BigDecimal("1000"), 1),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2),
>  (BigDecimal("1000"), 2)).toDF("decNum", "intNum")
> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, 
> "intNum").agg(sum("decNum"))
> scala> df2.show(40,false)
> +-----------+
> |sum(decNum)|
> +-----------+
> |4000.00    |
> +-----------+
>  
> {code}
>  
> The result should be 104000..
> It appears a partial sum is computed for each join key, as the result 
> returned would be the answer for all rows matching intNum === 1.
> If only the rows with intNum === 2 are included, the answer given is null:
>  
> {code:java}
> scala> val df3 = df.filter($"intNum" === lit(2))
>  df3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [decNum: 
> decimal(38,18), intNum: int]
> scala> val df4 = df3.withColumnRenamed("decNum", "decNum2").join(df3, 
> "intNum").agg(sum("decNum"))
>  df4: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]
> scala> df4.show(40,false)
> +-----------+
> |sum(decNum)|
> +-----------+
> |null       |
> +-----------+
>  
> {code}
>  
> The correct answer, 10., doesn't fit in 
> the DataType picked for the result, decimal(38,18), so an overflow occurs, 
> which Spark then converts to null.
> The first example, which doesn't filter out the intNum === 1 values, should 
> also return null, indicating overflow, but it doesn't. This may mislead the 
> user into thinking a valid sum was computed.
> If whole-stage code gen is turned off:
> spark.conf.set("spark.sql.codegen.wholeStage", false)
> ... incorrect results are not returned because the overflow is caught as an 
> exception:
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 39 
> exceeds max precision 38
>  
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28060) Float/Double type can not accept some special inputs

2019-06-22 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870288#comment-16870288
 ] 

Marco Gaido commented on SPARK-28060:
-

This is a duplicate of SPARK-27768, isn't it? Or rather, is SPARK-27768 a 
subpart of this one? Either way, shall we close either this one or SPARK-27768?

> Float/Double type can not accept some special inputs
> 
>
> Key: SPARK-28060
> URL: https://issues.apache.org/jira/browse/SPARK-28060
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> ||Query||Spark SQL||PostgreSQL||
> |SELECT float('nan');|NULL|NaN|
> |SELECT float('   NAN  ');|NULL|NaN|
> |SELECT float('infinity');|NULL|Infinity|
> |SELECT float('  -INFINiTY   ');|NULL|-Infinity|
> ||Query||Spark SQL||PostgreSQL||
> |SELECT double('nan');|NULL|NaN|
> |SELECT double('   NAN  ');|NULL|NaN|
> |SELECT double('infinity');|NULL|Infinity|
> |SELECT double('  -INFINiTY   ');|NULL|-Infinity|
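Purely as an illustration of the lenient parsing implied by the PostgreSQL column above (a hypothetical helper, not Spark or PostgreSQL code): trim the input and match NaN/Infinity case-insensitively before falling back to a plain numeric parse.

{code:scala}
// Hypothetical helper; the accepted spellings are assumptions for illustration.
def parseDoubleLenient(s: String): Option[Double] = s.trim.toLowerCase match {
  case "nan"                                     => Some(Double.NaN)
  case "inf" | "infinity" | "+inf" | "+infinity" => Some(Double.PositiveInfinity)
  case "-inf" | "-infinity"                      => Some(Double.NegativeInfinity)
  case other                                     => scala.util.Try(other.toDouble).toOption
}

parseDoubleLenient("   NAN  ")        // Some(NaN)
parseDoubleLenient("  -INFINiTY   ")  // Some(-Infinity)
{code}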



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27820) case insensitive resolver should be used in GetMapValue

2019-06-22 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16870286#comment-16870286
 ] 

Marco Gaido commented on SPARK-27820:
-

+1 for [~hyukjin.kwon]'s comment.

> case insensitive resolver should be used in GetMapValue
> ---
>
> Key: SPARK-27820
> URL: https://issues.apache.org/jira/browse/SPARK-27820
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.1
>Reporter: Michel Lemay
>Priority: Minor
>
> When extracting a key value from a MapType, it calls GetMapValue 
> (complexTypeExtractors.scala) and only uses the map type ordering. It should 
> use the resolver instead.
> Starting spark with: {{spark-shell --conf spark.sql.caseSensitive=false}}
> Given dataframe:
>  {{val df = List(Map("a" -> 1), Map("A" -> 2)).toDF("m")}}
> And executing any of these will return only one row: the column name is 
> matched case-insensitively, but the map keys are matched case-sensitively.
> {{df.filter($"M.A".isNotNull).count}}
>  {{df.filter($"M"("A").isNotNull).count 
> df.filter($"M".getField("A").isNotNull).count}}
>  
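As a possible workaround until the resolver is honoured, a minimal sketch with a hypothetical UDF that looks the key up ignoring case, so both the {{Map("a" -> 1)}} and {{Map("A" -> 2)}} rows match:

{code:scala}
// Hypothetical workaround, not the proposed fix in GetMapValue.
import org.apache.spark.sql.functions.{lit, udf}

val getIgnoreCase = udf { (m: Map[String, Int], key: String) =>
  m.collectFirst { case (k, v) if k.equalsIgnoreCase(key) => v }
}

// df.filter(getIgnoreCase($"m", lit("a")).isNotNull).count  // 2 instead of 1
{code}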



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28024) Incorrect numeric values when out of range

2019-06-21 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16869879#comment-16869879
 ] 

Marco Gaido commented on SPARK-28024:
-

[~joshrosen] thanks for linking them. Yes, I did try having them in, because it 
is a very confusing and unexpected behavior for many users, especially when 
migrating workloads from other SQL systems. Moreover, having them as configs 
lets users choose the behavior they prefer. But I received negative feedback 
on them, as you can see. I hope that, since several other people have since 
argued this is a problem, those PRs may be reconsidered. Moreover, we are now 
approaching 3.0, so it may be a good moment for them. I am updating them to 
resolve conflicts. This issue may be closed as a duplicate IMHO. Thanks.

> Incorrect numeric values when out of range
> --
>
> Key: SPARK-28024
> URL: https://issues.apache.org/jira/browse/SPARK-28024
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Critical
>  Labels: correctness
> Attachments: SPARK-28024.png
>
>
> For example
> Case 1:
> {code:sql}
> select tinyint(128) * tinyint(2); -- 0
> select smallint(2147483647) * smallint(2); -- -2
> select int(2147483647) * int(2); -- -2
> SELECT smallint((-32768)) * smallint(-1); -- -32768
> {code}
> Case 2:
> {code:sql}
> spark-sql> select cast('10e-70' as float), cast('-10e-70' as float);
> 0.0   -0.0
> {code}
> Case 3:
> {code:sql}
> spark-sql> select cast('10e-400' as double), cast('-10e-400' as double);
> 0.0   -0.0
> {code}
> Case 4:
> {code:sql}
> spark-sql> select exp(-1.2345678901234E200);
> 0.0
> postgres=# select exp(-1.2345678901234E200);
> ERROR:  value overflows numeric format
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24149) Automatic namespaces discovery in HDFS federation

2019-05-30 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16851896#comment-16851896
 ] 

Marco Gaido commented on SPARK-24149:
-

That's true. The point is: if you want to access a different cluster, it is 
pretty natural to expect that you need to add information for accessing 
credentials. But it is pretty weird that, when accessing different paths of the 
same cluster, you have to specify several configs.

> Automatic namespaces discovery in HDFS federation
> -
>
> Key: SPARK-24149
> URL: https://issues.apache.org/jira/browse/SPARK-24149
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.4.0
>
>
> Hadoop 3 introduced HDFS federation.
> Spark fails to write to different namespaces when Hadoop federation is turned 
> on and the cluster is secure. This happens because Spark looks for the 
> delegation token only for the configured defaultFS and not for all the 
> available namespaces. A workaround is to use the property 
> {{spark.yarn.access.hadoopFileSystems}}.
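For reference, the workaround mentioned above sketched as a SparkConf setting (the nameservice URIs below are placeholders): every federated namespace has to be listed explicitly so that a delegation token is obtained for each of them, not only for the defaultFS.

{code:scala}
import org.apache.spark.SparkConf

// Placeholder nameservice URIs; adjust to the cluster's federated namespaces.
val conf = new SparkConf()
  .set("spark.yarn.access.hadoopFileSystems",
       "hdfs://nameservice1,hdfs://nameservice2")
{code}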



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24149) Automatic namespaces discovery in HDFS federation

2019-05-25 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16848097#comment-16848097
 ] 

Marco Gaido commented on SPARK-24149:
-

[~Dhruve Ashar] the use case for this change is, for instance, a partitioned 
table whose partitions are on different namespaces and there is no viewFS 
configured. In that case, a user running a query on that table may or may not 
get an exception when reading it. Please notice that the user running a query 
may be different from the user who created it, so he/she may not be aware of 
this situation, and understanding what the problem is may be pretty hard.

> Automatic namespaces discovery in HDFS federation
> -
>
> Key: SPARK-24149
> URL: https://issues.apache.org/jira/browse/SPARK-24149
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.4.0
>
>
> Hadoop 3 introduced HDFS federation.
> Spark fails to write to different namespaces when Hadoop federation is turned 
> on and the cluster is secure. This happens because Spark looks for the 
> delegation token only for the configured defaultFS and not for all the 
> available namespaces. A workaround is to use the property 
> {{spark.yarn.access.hadoopFileSystems}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27761) Make UDF nondeterministic by default(?)

2019-05-20 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16843914#comment-16843914
 ] 

Marco Gaido commented on SPARK-27761:
-

Yes, I think this is a good idea. The default behavior would then be what most 
users expect.

> Make UDF nondeterministic by default(?)
> ---
>
> Key: SPARK-27761
> URL: https://issues.apache.org/jira/browse/SPARK-27761
> Project: Spark
>  Issue Type: Brainstorming
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Sunitha Kambhampati
>Priority: Minor
>
> Opening this issue as a followup from a discussion/question on this PR for an 
> optimization involving deterministic udf: 
> https://github.com/apache/spark/pull/24593#pullrequestreview-237361795  
> "We even should discuss whether all UDFs must be deterministic or 
> non-deterministic by default."
> Basically, today in Spark 2.4 Scala UDFs are implicitly marked deterministic 
> by default. To mark a UDF as non-deterministic, users need to call the method 
> asNondeterministic().
> The concerns expressed are that users are not aware of this property and its 
> implications.
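A minimal sketch of the current API described above: a Scala UDF is treated as deterministic unless the author explicitly opts out.

{code:scala}
import org.apache.spark.sql.functions.udf

// Deterministic by default: the optimizer is free to collapse, reorder, or re-execute it.
val plusOne = udf((x: Int) => x + 1)

// Explicit opt-out via asNondeterministic(), the call this ticket discusses.
val noisy = udf((x: Int) => x + scala.util.Random.nextInt(10)).asNondeterministic()
{code}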



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27684) Reduce ScalaUDF conversion overheads for primitives

2019-05-15 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16840242#comment-16840242
 ] 

Marco Gaido commented on SPARK-27684:
-

I can try to work on it, but most likely I will start on it next week.

> Reduce ScalaUDF conversion overheads for primitives
> ---
>
> Key: SPARK-27684
> URL: https://issues.apache.org/jira/browse/SPARK-27684
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Josh Rosen
>Priority: Major
>
> I believe that we can reduce ScalaUDF overheads when operating over primitive 
> types.
> In [ScalaUDF's 
> doGenCode|https://github.com/apache/spark/blob/5a8aad01c2aaf0ceef8e9a3cfabbd2e88c8d9f0d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala#L991]
>  we have logic to convert UDF function input types from Catalyst internal 
> types to Scala types (for example, this is used to convert UTF8Strings to 
> Java Strings). Similarly, we convert UDF return types.
> However, UDF input argument conversion is effectively a no-op for primitive 
> types because {{CatalystTypeConverters.createToScalaConverter()}} returns 
> {{identity}} in those cases. UDF result conversion is a little trickier 
> because {{createToCatalystConverter()}} returns [a 
> function|https://github.com/apache/spark/blob/5a8aad01c2aaf0ceef8e9a3cfabbd2e88c8d9f0d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala#L413]
>  that handles {{Option[Primitive]}}, but it might be the case that the 
> Option-boxing is unusable via ScalaUDF (in which case the conversion truly is 
> an {{identity}} no-op).
> These unnecessary no-op conversions could be quite expensive because each 
> call involves an index into the {{references}} array to get the converters, a 
> second index into the converters array to get the correct converter for the 
> nth input argument, and, finally, the converter invocation itself:
> {code:java}
> Object project_arg_0 = false ? null : ((scala.Function1[]) references[1] /* 
> converters */)[0].apply(project_value_3);{code}
> In these cases, I believe that we can reduce lookup / invocation overheads by 
> modifying the ScalaUDF code generation to eliminate the conversion calls for 
> primitives and directly assign the unconverted result, e.g.
> {code:java}
> Object project_arg_0 = false ? null : project_value_3;{code}
> To cleanly handle the case where we have a multi-argument UDF accepting a 
> mixture of primitive and non-primitive types, we might be able to keep the 
> {{converters}} array the same size (so indexes stay the same) but omit the 
> invocation of the converters for the primitive arguments (e.g. {{converters}} 
> is sparse / contains unused entries in case of primitives).
> I spotted this optimization while trying to construct some quick benchmarks 
> to measure UDF invocation overheads. For example:
> {code:java}
> spark.udf.register("identity", (x: Int) => x)
> sql("select id, id * 2, id * 3 from range(1000 * 1000 * 1000)").rdd.count() 
> // ~ 52 seconds
> sql("select identity(id), identity(id * 2), identity(id * 3) from range(1000 
> * 1000 * 1000)").rdd.count() // ~84 seconds{code}
> I'm curious to see whether the optimization suggested here can close this 
> performance gap. It'd also be a good idea to construct more principled 
> microbenchmarks covering multi-argument UDFs, projections involving multiple 
> UDFs over different input and output types, etc.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27684) Reduce ScalaUDF conversion overheads for primitives

2019-05-14 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839297#comment-16839297
 ] 

Marco Gaido commented on SPARK-27684:
-

I agree with this too.

> Reduce ScalaUDF conversion overheads for primitives
> ---
>
> Key: SPARK-27684
> URL: https://issues.apache.org/jira/browse/SPARK-27684
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Josh Rosen
>Priority: Major
>
> I believe that we can reduce ScalaUDF overheads when operating over primitive 
> types.
> In [ScalaUDF's 
> doGenCode|https://github.com/apache/spark/blob/5a8aad01c2aaf0ceef8e9a3cfabbd2e88c8d9f0d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ScalaUDF.scala#L991]
>  we have logic to convert UDF function input types from Catalyst internal 
> types to Scala types (for example, this is used to convert UTF8Strings to 
> Java Strings). Similarly, we convert UDF return types.
> However, UDF input argument conversion is effectively a no-op for primitive 
> types because {{CatalystTypeConverters.createToScalaConverter()}} returns 
> {{identity}} in those cases. UDF result conversion is a little trickier 
> because {{createToCatalystConverter()}} returns [a 
> function|https://github.com/apache/spark/blob/5a8aad01c2aaf0ceef8e9a3cfabbd2e88c8d9f0d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystTypeConverters.scala#L413]
>  that handles {{Option[Primitive]}}, but it might be the case that the 
> Option-boxing is unusable via ScalaUDF (in which case the conversion truly is 
> an {{identity}} no-op).
> These unnecessary no-op conversions could be quite expensive because each 
> call involves an index into the {{references}} array to get the converters, a 
> second index into the converters array to get the correct converter for the 
> nth input argument, and, finally, the converter invocation itself:
> {code:java}
> Object project_arg_0 = false ? null : ((scala.Function1[]) references[1] /* 
> converters */)[0].apply(project_value_3);{code}
> In these cases, I believe that we can reduce lookup / invocation overheads by 
> modifying the ScalaUDF code generation to eliminate the conversion calls for 
> primitives and directly assign the unconverted result, e.g.
> {code:java}
> Object project_arg_0 = false ? null : project_value_3;{code}
> To cleanly handle the case where we have a multi-argument UDF accepting a 
> mixture of primitive and non-primitive types, we might be able to keep the 
> {{converters}} array the same size (so indexes stay the same) but omit the 
> invocation of the converters for the primitive arguments (e.g. {{converters}} 
> is sparse / contains unused entries in case of primitives).
> I spotted this optimization while trying to construct some quick benchmarks 
> to measure UDF invocation overheads. For example:
> {code:java}
> spark.udf.register("identity", (x: Int) => x)
> sql("select id, id * 2, id * 3 from range(1000 * 1000 * 1000)").rdd.count() 
> // ~ 52 seconds
> sql("select identity(id), identity(id * 2), identity(id * 3) from range(1000 
> * 1000 * 1000)").rdd.count() // ~84 seconds{code}
> I'm curious to see whether the optimization suggested here can close this 
> performance gap. It'd also be a good idea to construct more principled 
> microbenchmarks covering multi-argument UDFs, projections involving multiple 
> UDFs over different input and output types, etc.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27685) `union` doesn't promote non-nullable columns of struct to nullable

2019-05-14 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-27685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839295#comment-16839295
 ] 

Marco Gaido commented on SPARK-27685:
-

This is a duplicate of SPARK-26812.

> `union` doesn't promote non-nullable columns of struct to nullable
> --
>
> Key: SPARK-27685
> URL: https://issues.apache.org/jira/browse/SPARK-27685
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Huon Wilson
>Priority: Major
>  Labels: correctness
>
> When doing a {{union}} of two dataframes, a column that is nullable in one of 
> the dataframes will be nullable in the union, promoting the non-nullable one 
> to be nullable. 
> This doesn't happen properly for columns nested as subcolumns of a 
> {{struct}}. It seems to just take the nullability of the first dataframe in 
> the union, meaning a nullable column will become non-nullable, resulting in 
> invalid values.
> {code:scala}
> case class X(x: Option[Long])
> case class Nested(nested: X)
> // First, just work with normal columns
> val df1 = Seq(1L, 2L).toDF("x")
> val df2 = Seq(Some(3L), None).toDF("x")
> df1.printSchema
> // root
> //  |-- x: long (nullable = false)
> df2.printSchema
> // root
> //  |-- x: long (nullable = true)
> (df1 union df2).printSchema
> // root
> //  |-- x: long (nullable = true)
> (df1 union df2).as[X].collect
> // res19: Array[X] = Array(X(Some(1)), X(Some(2)), X(Some(3)), X(None))
> (df1 union df2).select("*").show
> // ++
> // |   x|
> // ++
> // |   1|
> // |   2|
> // |   3|
> // |null|
> // ++
> // Now, the same with the 'x' column within a struct:
> val struct1 = df1.select(struct('x) as "nested")
> val struct2 = df2.select(struct('x) as "nested")
> struct1.printSchema
> // root
> //  |-- nested: struct (nullable = false)
> //  ||-- x: long (nullable = false)
> struct2.printSchema
> // root
> //  |-- nested: struct (nullable = false)
> //  ||-- x: long (nullable = true)
> // BAD: the x column is not nullable
> (struct1 union struct2).printSchema
> // root
> //  |-- nested: struct (nullable = false)
> //  ||-- x: long (nullable = false)
> // BAD: the last x value became "Some(0)", instead of "None"
> (struct1 union struct2).as[Nested].collect
> // res23: Array[Nested] = Array(Nested(X(Some(1))), Nested(X(Some(2))), 
> Nested(X(Some(3))), Nested(X(Some(0
> // BAD: showing just the nested columns throws a NPE
> (struct1 union struct2).select("nested.*").show
> // java.lang.NullPointerException
> //  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply(Unknown
>  Source)
> //  at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collectFromPlan$1.apply(Dataset.scala:3387)
> // ...
> //  at org.apache.spark.sql.Dataset.show(Dataset.scala:714)
> //  ... 49 elided
> // Flipping the order makes x nullable as desired
> (struct2 union struct1).printSchema
> // root
> //  |-- nested: struct (nullable = false)
> //  ||-- x: long (nullable = true)
> (struct2 union struct1).as[Y].collect
> // res26: Array[Y] = Array(Y(X(Some(3))), Y(X(None)), Y(X(Some(1))), 
> Y(X(Some(2
> (struct2 union struct1).select("nested.*").show
> // ++
> // |   x|
> // ++
> // |   3|
> // |null|
> // |   1|
> // |   2|
> // ++
> {code}
> Note the three {{BAD}} lines, where the union of structs became non-nullable 
> and resulted in invalid values and exceptions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-27685) `union` doesn't promote non-nullable columns of struct to nullable

2019-05-14 Thread Marco Gaido (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido resolved SPARK-27685.
-
Resolution: Duplicate

> `union` doesn't promote non-nullable columns of struct to nullable
> --
>
> Key: SPARK-27685
> URL: https://issues.apache.org/jira/browse/SPARK-27685
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Huon Wilson
>Priority: Major
>  Labels: correctness
>
> When doing a {{union}} of two dataframes, a column that is nullable in one of 
> the dataframes will be nullable in the union, promoting the non-nullable one 
> to be nullable. 
> This doesn't happen properly for columns nested as subcolumns of a 
> {{struct}}. It seems to just take the nullability of the first dataframe in 
> the union, meaning a nullable column will become non-nullable, resulting in 
> invalid values.
> {code:scala}
> case class X(x: Option[Long])
> case class Nested(nested: X)
> // First, just work with normal columns
> val df1 = Seq(1L, 2L).toDF("x")
> val df2 = Seq(Some(3L), None).toDF("x")
> df1.printSchema
> // root
> //  |-- x: long (nullable = false)
> df2.printSchema
> // root
> //  |-- x: long (nullable = true)
> (df1 union df2).printSchema
> // root
> //  |-- x: long (nullable = true)
> (df1 union df2).as[X].collect
> // res19: Array[X] = Array(X(Some(1)), X(Some(2)), X(Some(3)), X(None))
> (df1 union df2).select("*").show
> // ++
> // |   x|
> // ++
> // |   1|
> // |   2|
> // |   3|
> // |null|
> // ++
> // Now, the same with the 'x' column within a struct:
> val struct1 = df1.select(struct('x) as "nested")
> val struct2 = df2.select(struct('x) as "nested")
> struct1.printSchema
> // root
> //  |-- nested: struct (nullable = false)
> //  ||-- x: long (nullable = false)
> struct2.printSchema
> // root
> //  |-- nested: struct (nullable = false)
> //  ||-- x: long (nullable = true)
> // BAD: the x column is not nullable
> (struct1 union struct2).printSchema
> // root
> //  |-- nested: struct (nullable = false)
> //  ||-- x: long (nullable = false)
> // BAD: the last x value became "Some(0)", instead of "None"
> (struct1 union struct2).as[Nested].collect
> // res23: Array[Nested] = Array(Nested(X(Some(1))), Nested(X(Some(2))), 
> Nested(X(Some(3))), Nested(X(Some(0
> // BAD: showing just the nested columns throws a NPE
> (struct1 union struct2).select("nested.*").show
> // java.lang.NullPointerException
> //  at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificSafeProjection.apply(Unknown
>  Source)
> //  at 
> org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$collectFromPlan$1.apply(Dataset.scala:3387)
> // ...
> //  at org.apache.spark.sql.Dataset.show(Dataset.scala:714)
> //  ... 49 elided
> // Flipping the order makes x nullable as desired
> (struct2 union struct1).printSchema
> // root
> //  |-- nested: struct (nullable = false)
> //  ||-- x: long (nullable = true)
> (struct2 union struct1).as[Y].collect
> // res26: Array[Y] = Array(Y(X(Some(3))), Y(X(None)), Y(X(Some(1))), 
> Y(X(Some(2
> (struct2 union struct1).select("nested.*").show
> // ++
> // |   x|
> // ++
> // |   3|
> // |null|
> // |   1|
> // |   2|
> // ++
> {code}
> Note the three {{BAD}} lines, where the union of structs became non-nullable 
> and resulted in invalid values and exceptions.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26182) Cost increases when optimizing scalaUDF

2019-05-09 Thread Marco Gaido (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16836486#comment-16836486
 ] 

Marco Gaido commented on SPARK-26182:
-

Actually you just need to mark it {{asNondeterministic}} to avoid the double 
execution.
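A minimal sketch of that suggestion (the UDF body here is hypothetical): registering the UDF as non-deterministic prevents CollapseProject from inlining, and therefore executing, it twice.

{code:scala}
import org.apache.spark.sql.functions.udf

// Hypothetical splitUDF; marking it non-deterministic avoids the double execution.
val splitUDF = udf((x: String) => Map("a" -> x, "b" -> x)).asNondeterministic()
spark.udf.register("splitUDF", splitUDF)
{code}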

> Cost increases when optimizing scalaUDF
> ---
>
> Key: SPARK-26182
> URL: https://issues.apache.org/jira/browse/SPARK-26182
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.4.0
>Reporter: bupt_ljy
>Priority: Major
>
> Let's assume that we have a UDF called splitUDF which outputs a map.
>  The SQL
> {code:java}
> select
> g['a'], g['b']
> from
>( select splitUDF(x) as g from table) tbl
> {code}
> will be optimized to the same logical plan of
> {code:java}
> select splitUDF(x)['a'], splitUDF(x)['b'] from table
> {code}
> which means that the splitUDF is executed twice instead of once.
> The optimization is from CollapseProject. 
>  I'm not sure whether this is a bug or not. Please tell me if I was wrong 
> about this.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


