[jira] [Commented] (SPARK-21595) introduction of spark.sql.windowExec.buffer.spill.threshold in spark 2.2 breaks existing workflow

2020-04-20 Thread Rakesh Shah (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-21595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088302#comment-17088302
 ] 

Rakesh Shah commented on SPARK-21595:
-

Hi [~sreiling], I am also facing the same issue: my shuffle is taking a long 
time.

Here I am joining two tables in Spark, but before that point a window 
function is used.

When I debugged I saw the info below. When I tried to change the property, I was 
not able to change it; it still shows the same value.

 

20/04/21 04:15:18 INFO Executor: Finished task 935.0 in stage 43.0 (TID 28873). 
26714 bytes result sent to driver
20/04/21 04:18:20 INFO ExternalAppendOnlyUnsafeRowArray: Reached spill 
threshold of 4096 rows, switching to 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter
20/04/21 04:18:20 INFO ExternalAppendOnlyUnsafeRowArray: Reached spill 
threshold of 4096 rows, switching to 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter
20/04/21 04:18:20 INFO ExternalAppendOnlyUnsafeRowArray: Reached spill 
threshold of 4096 rows, switching to 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter
20/04/21 04:20:49 INFO ExternalAppendOnlyUnsafeRowArray: Reached spill 
threshold of 4096 rows, switching to 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter
20/04/21 04:20:49 INFO ExternalAppendOnlyUnsafeRowArray: Reached spill 
threshold of 4096 rows, switching to 
org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter
20/04/21 04:20:49 INFO ExternalAppendOnlyUnsafeRowArray: Reached spill 
threshold of 4096 rows, switching to org.apache.spark.util.collection.uns

 

Can you please help me with this?
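
For reference, a minimal PySpark sketch (not from this ticket) of the usual ways to set this property; the value 2097152 is simply the one reported to work in the description below and may not suit every workload:

{code:python}
from pyspark.sql import SparkSession

# Set the threshold when the session is created ...
spark = (SparkSession.builder
         .config("spark.sql.windowExec.buffer.spill.threshold", 2097152)
         .getOrCreate())

# ... or at runtime, before the query that uses the window function is planned.
spark.conf.set("spark.sql.windowExec.buffer.spill.threshold", 2097152)
print(spark.conf.get("spark.sql.windowExec.buffer.spill.threshold"))
{code}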

> introduction of spark.sql.windowExec.buffer.spill.threshold in spark 2.2 
> breaks existing workflow
> -
>
> Key: SPARK-21595
> URL: https://issues.apache.org/jira/browse/SPARK-21595
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, PySpark
>Affects Versions: 2.2.0
> Environment: pyspark on linux
>Reporter: Stephan Reiling
>Assignee: Tejas Patil
>Priority: Minor
>  Labels: documentation, regression
> Fix For: 2.2.1, 2.3.0
>
>
> My pyspark code has the following statement:
> {code:java}
> # assign row key for tracking
> df = df.withColumn(
> 'association_idx',
> sqlf.row_number().over(
> Window.orderBy('uid1', 'uid2')
> )
> )
> {code}
> where df is a long, skinny (450M rows, 10 columns) dataframe. So this creates 
> one large window for the whole dataframe to sort over.
> In spark 2.1 this works without problem, in spark 2.2 this fails either with 
> out of memory exception or too many open files exception, depending on memory 
> settings (which is what I tried first to fix this).
> Monitoring the blockmgr, I see that spark 2.1 creates 152 files, spark 2.2 
> creates >110,000 files.
> In the log I see the following messages (110,000 of these):
> {noformat}
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Spilling data because number of 
> spilledRecords crossed the threshold 4096
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Thread 156 spilling sort data of 
> 64.1 MB to disk (0  time so far)
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Spilling data because number of 
> spilledRecords crossed the threshold 4096
> 17/08/01 08:55:37 INFO UnsafeExternalSorter: Thread 156 spilling sort data of 
> 64.1 MB to disk (1  time so far)
> {noformat}
> So I started hunting for clues in UnsafeExternalSorter, without luck. What I 
> had missed was this one message:
> {noformat}
> 17/08/01 08:55:37 INFO ExternalAppendOnlyUnsafeRowArray: Reached spill 
> threshold of 4096 rows, switching to 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter
> {noformat}
> Which allowed me to track down the issue. 
> By changing the configuration to include:
> {code:java}
> spark.sql.windowExec.buffer.spill.threshold   2097152
> {code}
> I got it to work again and with the same performance as spark 2.1.
> I have workflows where I use windowing functions that do not fail, but take a 
> performance hit due to the excessive spilling when using the default of 4096.
> I think to make it easier to track down these issues this config variable 
> should be included in the configuration documentation. 
> Maybe 4096 is too small of a default value?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-25390) Data source V2 API refactoring

2020-04-20 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-25390:

Summary: Data source V2 API refactoring  (was: data source V2 API 
refactoring)

> Data source V2 API refactoring
> --
>
> Key: SPARK-25390
> URL: https://issues.apache.org/jira/browse/SPARK-25390
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently it's not very clear how we should abstract data source v2 API. The 
> abstraction should be unified between batch and streaming, or similar but 
> have a well-defined difference between batch and streaming. And the 
> abstraction should also include catalog/table.
> An example of the abstraction:
> {code}
> batch: catalog -> table -> scan
> streaming: catalog -> table -> stream -> scan
> {code}
> We should refactor the data source v2 API according to the abstraction



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31494) flatten the result dataframe of ANOVATest

2020-04-20 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng reassigned SPARK-31494:


Assignee: zhengruifeng

> flatten the result dataframe of ANOVATest
> -
>
> Key: SPARK-31494
> URL: https://issues.apache.org/jira/browse/SPARK-31494
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>
> add a new method
> {code:java}
> @Since("3.1.0")
> def test(
> dataset: DataFrame,
> featuresCol: String,
> labelCol: String,
> flatten: Boolean): DataFrame {code}
>  
> Similar to new {{test}} method in {{ChiSquareTest}}, it will:
> 1, support df operation on the returned df;
> 2, make driver no longer a bottleneck when dim is high
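
For illustration, a minimal PySpark sketch of calling the test; whether the Python wrapper exposes the new {{flatten}} argument is an assumption here (the signature above is Scala), so that call is left commented out:

{code:python}
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import ANOVATest
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(3.0, Vectors.dense(1.7, 4.4)),
     (2.0, Vectors.dense(3.1, 9.8)),
     (3.0, Vectors.dense(2.5, 7.2))],
    ["label", "features"])

# Current behaviour: a single row with array-valued columns.
ANOVATest.test(df, "features", "label").show(truncate=False)

# Proposed behaviour: one row per feature (flatten=True mirrors the Scala
# signature above; assumed, not confirmed, for the Python API).
# ANOVATest.test(df, "features", "label", flatten=True).show(truncate=False)
{code}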



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31494) flatten the result dataframe of ANOVATest

2020-04-20 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-31494.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28270
[https://github.com/apache/spark/pull/28270]

> flatten the result dataframe of ANOVATest
> -
>
> Key: SPARK-31494
> URL: https://issues.apache.org/jira/browse/SPARK-31494
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 3.1.0
>
>
> add a new method
> {code:java}
> @Since("3.1.0")
> def test(
> dataset: DataFrame,
> featuresCol: String,
> labelCol: String,
> flatten: Boolean): DataFrame {code}
>  
> Similar to new {{test}} method in {{ChiSquareTest}}, it will:
> 1, support df operation on the returned df;
> 2, make driver no longer a bottleneck when dim is high



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28554) implement basic catalog functionalities

2020-04-20 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088284#comment-17088284
 ] 

Wenchen Fan commented on SPARK-28554:
-

It is still in progress at https://github.com/apache/spark/pull/27345 . I'll get 
back to it after the 3.0 release.

> implement basic catalog functionalities
> ---
>
> Key: SPARK-28554
> URL: https://issues.apache.org/jira/browse/SPARK-28554
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30624) JDBCV2 with catalog functionalities

2020-04-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30624?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30624.
-
  Assignee: (was: Wenchen Fan)
Resolution: Duplicate

> JDBCV2 with catalog functionalities
> ---
>
> Key: SPARK-30624
> URL: https://issues.apache.org/jira/browse/SPARK-30624
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30949) Driver cores in kubernetes are coupled with container resources, not spark.driver.cores

2020-04-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-30949.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 27695
[https://github.com/apache/spark/pull/27695]

> Driver cores in kubernetes are coupled with container resources, not 
> spark.driver.cores
> ---
>
> Key: SPARK-30949
> URL: https://issues.apache.org/jira/browse/SPARK-30949
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Onur Satici
>Assignee: Onur Satici
>Priority: Major
> Fix For: 3.1.0
>
>
> Drivers submitted in kubernetes cluster mode set the parallelism of various 
> components like 'RpcEnv', 'MemoryManager', 'BlockManager' by inferring the 
> number of available cores via:
> {code:java}
> Runtime.getRuntime().availableProcessors()
> {code}
> By using this, spark applications running on java 8 or older incorrectly get 
> the total number of cores in the host, ignoring the cgroup limits set by 
> kubernetes (https://bugs.openjdk.java.net/browse/JDK-6515172). Java 9 and 
> newer runtimes do not have this problem.
> Orthogonal to this, it is currently not possible to decouple the resource limits 
> on the driver container from the amount of parallelism of the various network 
> and memory components listed above.
> My proposal is to use the 'spark.driver.cores' configuration to get the 
> amount of parallelism, like we do for YARN 
> (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L2762-L2767).
>  This will enable users to specify 'spark.driver.cores' to set parallelism, 
> and specify 'spark.kubernetes.driver.requests.cores' to limit the resource 
> requests of the driver container. Further, this will remove the need to call 
> 'availableProcessors()', thus the same number of cores will be used for 
> parallelism independent of the java runtime version.
>  
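
For illustration, a minimal PySpark sketch (not from the ticket) of the two settings being decoupled; the Kubernetes request key is reproduced exactly as written in the description above, and in cluster mode these would normally be passed via spark-submit --conf rather than built in code:

{code:python}
from pyspark import SparkConf

# Desired parallelism for RpcEnv/MemoryManager/BlockManager vs. the
# container CPU request; key names are taken from the description above.
conf = (SparkConf()
        .set("spark.driver.cores", "4")
        .set("spark.kubernetes.driver.requests.cores", "2"))

for key, value in conf.getAll():
    print(key, value)
{code}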



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30949) Driver cores in kubernetes are coupled with container resources, not spark.driver.cores

2020-04-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-30949:
-

Assignee: Onur Satici

> Driver cores in kubernetes are coupled with container resources, not 
> spark.driver.cores
> ---
>
> Key: SPARK-30949
> URL: https://issues.apache.org/jira/browse/SPARK-30949
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Onur Satici
>Assignee: Onur Satici
>Priority: Major
>
> Drivers submitted in kubernetes cluster mode set the parallelism of various 
> components like 'RpcEnv', 'MemoryManager', 'BlockManager' by inferring the 
> number of available cores via:
> {code:java}
> Runtime.getRuntime().availableProcessors()
> {code}
> By using this, spark applications running on java 8 or older incorrectly get 
> the total number of cores in the host, ignoring the cgroup limits set by 
> kubernetes (https://bugs.openjdk.java.net/browse/JDK-6515172). Java 9 and 
> newer runtimes do not have this problem.
> Orthogonal to this, it is currently not possible to decouple the resource limits 
> on the driver container from the amount of parallelism of the various network 
> and memory components listed above.
> My proposal is to use the 'spark.driver.cores' configuration to get the 
> amount of parallelism, like we do for YARN 
> (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L2762-L2767).
>  This will enable users to specify 'spark.driver.cores' to set parallelism, 
> and specify 'spark.kubernetes.driver.requests.cores' to limit the resource 
> requests of the driver container. Further, this will remove the need to call 
> 'availableProcessors()', thus the same number of cores will be used for 
> parallelism independent of the java runtime version.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30420) Commands involved with namespace go thru the new resolution framework.

2020-04-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30420.
-
Fix Version/s: 3.0.0
 Assignee: Terry Kim
   Resolution: Fixed

> Commands involved with namespace go thru the new resolution framework.
> --
>
> Key: SPARK-30420
> URL: https://issues.apache.org/jira/browse/SPARK-30420
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
> Fix For: 3.0.0
>
>
> V2 commands that need to resolve namespace should go thru new resolution 
> framework introduced in 
> [SPARK-30214|https://issues.apache.org/jira/browse/SPARK-30214]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30420) Commands involved with namespace go thru the new resolution framework.

2020-04-20 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088281#comment-17088281
 ] 

Wenchen Fan commented on SPARK-30420:
-

Yea we didn't revert this and we don't need to. Let me close it.

> Commands involved with namespace go thru the new resolution framework.
> --
>
> Key: SPARK-30420
> URL: https://issues.apache.org/jira/browse/SPARK-30420
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Terry Kim
>Priority: Major
>
> V2 commands that need to resolve namespace should go thru new resolution 
> framework introduced in 
> [SPARK-30214|https://issues.apache.org/jira/browse/SPARK-30214]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31502) document identifier in SQL Reference

2020-04-20 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-31502:
--

 Summary: document identifier in SQL Reference
 Key: SPARK-31502
 URL: https://issues.apache.org/jira/browse/SPARK-31502
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, SQL
Affects Versions: 3.0.0
Reporter: Huaxin Gao


document identifier in SQL Reference



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31501) AQE update UI should not cause deadlock

2020-04-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31501:
---

Assignee: Wei Xue

> AQE update UI should not cause deadlock
> ---
>
> Key: SPARK-31501
> URL: https://issues.apache.org/jira/browse/SPARK-31501
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Wei Xue
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31501) AQE update UI should not cause deadlock

2020-04-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31501.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28275
[https://github.com/apache/spark/pull/28275]

> AQE update UI should not cause deadlock
> ---
>
> Key: SPARK-31501
> URL: https://issues.apache.org/jira/browse/SPARK-31501
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Wei Xue
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31492) flatten the result dataframe of FValueTest

2020-04-20 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng reassigned SPARK-31492:


Assignee: zhengruifeng

> flatten the result dataframe of FValueTest
> --
>
> Key: SPARK-31492
> URL: https://issues.apache.org/jira/browse/SPARK-31492
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>
> add a new method
> {code:java}
> @Since("3.1.0")
> def test(
> dataset: DataFrame,
> featuresCol: String,
> labelCol: String,
> flatten: Boolean): DataFrame {code}
>  
> Similar to new {{test}} method in {{ChiSquareTest}}, it will:
> 1, support df operation on the returned df;
> 2, make driver no longer a bottleneck when dim is high
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31492) flatten the result dataframe of FValueTest

2020-04-20 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-31492.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28268
[https://github.com/apache/spark/pull/28268]

> flatten the result dataframe of FValueTest
> --
>
> Key: SPARK-31492
> URL: https://issues.apache.org/jira/browse/SPARK-31492
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 3.1.0
>
>
> add a new method
> {code:java}
> @Since("3.1.0")
> def test(
> dataset: DataFrame,
> featuresCol: String,
> labelCol: String,
> flatten: Boolean): DataFrame {code}
>  
> Similar to new {{test}} method in {{ChiSquareTest}}, it will:
> 1, support df operation on the returned df;
> 2, make driver no longer a bottleneck when dim is high
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30661) KMeans blockify input vectors

2020-04-20 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-30661.
--
Resolution: Not A Problem

> KMeans blockify input vectors
> -
>
> Key: SPARK-30661
> URL: https://issues.apache.org/jira/browse/SPARK-30661
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31429) Add additional fields in ExpressionDescription for more granular category in documentation

2020-04-20 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31429.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28224
[https://github.com/apache/spark/pull/28224]

> Add additional fields in ExpressionDescription for more granular category in 
> documentation
> --
>
> Key: SPARK-31429
> URL: https://issues.apache.org/jira/browse/SPARK-31429
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 3.0.0
>
>
> Add additional fields in ExpressionDescription so we can have more granular 
> category in function documentation. For example, we want to group window 
> functions into finer categories such as ranking functions and analytic 
> functions.
> See Hyukjin's comment below for more details;
> https://github.com/apache/spark/pull/28170#issuecomment-611917191



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31429) Add additional fields in ExpressionDescription for more granular category in documentation

2020-04-20 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-31429:


Assignee: Takeshi Yamamuro

> Add additional fields in ExpressionDescription for more granular category in 
> documentation
> --
>
> Key: SPARK-31429
> URL: https://issues.apache.org/jira/browse/SPARK-31429
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Takeshi Yamamuro
>Priority: Major
>
> Add additional fields in ExpressionDescription so we can have more granular 
> category in function documentation. For example, we want to group window 
> functions into finer categories such as ranking functions and analytic 
> functions.
> See Hyukjin's comment below for more details;
> https://github.com/apache/spark/pull/28170#issuecomment-611917191



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28006) User-defined grouped transform pandas_udf for window operations

2020-04-20 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-28006.
--
Resolution: Won't Fix

Let me close this ticket for now as discussed in the PR because we can work 
around it via {{groupby().applyInPandas()}}.
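
For readers landing here, a minimal sketch of that workaround using the Spark 3.0 API, mirroring the subtract_mean example from the description below (column and schema names are illustrative):

{code:python}
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)], ("id", "v"))

# Each group arrives as a pandas DataFrame; return it with the transformed column.
def subtract_mean(pdf: pd.DataFrame) -> pd.DataFrame:
    return pdf.assign(v=pdf["v"] - pdf["v"].mean())

df.groupby("id").applyInPandas(subtract_mean, schema="id long, v double").show()
{code}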

> User-defined grouped transform pandas_udf for window operations
> ---
>
> Key: SPARK-28006
> URL: https://issues.apache.org/jira/browse/SPARK-28006
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: Li Jin
>Priority: Major
>
> Currently, pandas_udf supports the "grouped aggregate" type that can be used with 
> bounded and unbounded windows. There is another set of use cases that can 
> benefit from a "grouped transform" type pandas_udf.
> Grouped transform is defined as a N -> N mapping over a group. For example, 
> "compute zscore for values in the group using the grouped mean and grouped 
> stdev", or "rank the values in the group".
> Currently, in order to do this, user needs to use "grouped apply", for 
> example:
> {code:java}
> @pandas_udf(schema, GROUPED_MAP)
> def subtract_mean(pdf):
> v = pdf['v']
> pdf['v'] = v - v.mean()
> return pdf
> df.groupby('id').apply(subtract_mean)
> # +---++
> # | id|   v|
> # +---++
> # |  1|-0.5|
> # |  1| 0.5|
> # |  2|-3.0|
> # |  2|-1.0|
> # |  2| 4.0|
> # +---++{code}
> This approach has a few downsides:
>  * Specifying the full return schema is complicated for the user although the 
> function only changes one column.
>  * Hard-coding the column name 'v' inside the udf makes the udf less reusable.
>  * The entire dataframe is serialized to pass to Python although only one 
> column is needed.
> Here we propose a new type of pandas_udf to work with these types of use 
> cases:
> {code:java}
> df = spark.createDataFrame(
> [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
> ("id", "v"))
> @pandas_udf('double', GROUPED_XFORM)
> def subtract_mean(v):
> return v - v.mean()
> w = Window.partitionBy('id')
> df = df.withColumn('v', subtract_mean(df['v']).over(w))
> # +---++
> # | id|   v|
> # +---++
> # |  1|-0.5|
> # |  1| 0.5|
> # |  2|-3.0|
> # |  2|-1.0|
> # |  2| 4.0|
> # +---++{code}
> Which addresses the above downsides.
>  * The user only needs to specify the output type of a single column.
>  * The column being zscored is decoupled from the udf implementation
>  * We only need to send one column to Python worker and concat the result 
> with the original dataframe (this is what grouped aggregate is doing already)
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31481) Spark 3.0 - Full List of Pod Template not available

2020-04-20 Thread Pradeep Misra (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31481?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pradeep Misra updated SPARK-31481:
--
Component/s: Kubernetes

> Spark 3.0 - Full List of Pod Template not available
> ---
>
> Key: SPARK-31481
> URL: https://issues.apache.org/jira/browse/SPARK-31481
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Kubernetes
>Affects Versions: 3.0.0
>Reporter: Pradeep Misra
>Priority: Minor
>
> A full list of Pod Template values is not available and the link is not working.
> URL - 
> [https://spark.apache.org/docs/3.0.0-preview/running-on-kubernetes.html#pod-template-properties]
> Section - Pod Template
> Hyperlink not working - full_list, as shown below:
> Link - For details, see the [full 
> list|https://spark.apache.org/docs/3.0.0-preview/running-on-kubernetes.html#pod-template-properties]
>  of pod template values that will be overwritten by spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30420) Commands involved with namespace go thru the new resolution framework.

2020-04-20 Thread Terry Kim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088186#comment-17088186
 ] 

Terry Kim commented on SPARK-30420:
---

No, I don't think this is reverted. [~cloud_fan], does this need to be reverted?

> Commands involved with namespace go thru the new resolution framework.
> --
>
> Key: SPARK-30420
> URL: https://issues.apache.org/jira/browse/SPARK-30420
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Terry Kim
>Priority: Major
>
> V2 commands that need to resolve namespace should go thru new resolution 
> framework introduced in 
> [SPARK-30214|https://issues.apache.org/jira/browse/SPARK-30214]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30162) Add PushedFilters to metadata in Parquet DSv2 implementation

2020-04-20 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-30162:
-
Fix Version/s: (was: 3.0.0)

> Add PushedFilters to metadata in Parquet DSv2 implementation
> 
>
> Key: SPARK-30162
> URL: https://issues.apache.org/jira/browse/SPARK-30162
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: pyspark 3.0 preview
> Ubuntu/Centos
> pyarrow 0.14.1 
>Reporter: Nasir Ali
>Assignee: Hyukjin Kwon
>Priority: Minor
> Attachments: Screenshot from 2020-01-01 21-01-18.png, Screenshot from 
> 2020-01-01 21-01-32.png
>
>
> Filters are not pushed down in Spark 3.0 preview. Also the output of 
> "explain" method is different. It is hard to debug in 3.0 whether filters 
> were pushed down or not. Below code could reproduce the bug:
>  
> {code:java}
> // code placeholder
> df = spark.createDataFrame([("usr1",17.00, "2018-03-10T15:27:18+00:00"),
> ("usr1",13.00, "2018-03-11T12:27:18+00:00"),
> ("usr1",25.00, "2018-03-12T11:27:18+00:00"),
> ("usr1",20.00, "2018-03-13T15:27:18+00:00"),
> ("usr1",17.00, "2018-03-14T12:27:18+00:00"),
> ("usr2",99.00, "2018-03-15T11:27:18+00:00"),
> ("usr2",156.00, "2018-03-22T11:27:18+00:00"),
> ("usr2",17.00, "2018-03-31T11:27:18+00:00"),
> ("usr2",25.00, "2018-03-15T11:27:18+00:00"),
> ("usr2",25.00, "2018-03-16T11:27:18+00:00")
> ],
>["user","id", "ts"])
> df = df.withColumn('ts', df.ts.cast('timestamp'))
> df.write.partitionBy("user").parquet("/home/cnali/data/")
> df2 = spark.read.load("/home/cnali/data/")
> df2.filter("user=='usr2'").explain(True)
> {code}
> {code:java}
> // Spark 2.4 output
> == Parsed Logical Plan ==
> 'Filter ('user = usr2)
> +- Relation[id#38,ts#39,user#40] parquet
>
> == Analyzed Logical Plan ==
> id: double, ts: timestamp, user: string
> Filter (user#40 = usr2)
> +- Relation[id#38,ts#39,user#40] parquet
>
> == Optimized Logical Plan ==
> Filter (isnotnull(user#40) && (user#40 = usr2))
> +- Relation[id#38,ts#39,user#40] parquet
>
> == Physical Plan ==
> *(1) FileScan parquet [id#38,ts#39,user#40] Batched: true, Format: Parquet, 
> Location: InMemoryFileIndex[file:/home/cnali/data], PartitionCount: 1, 
> PartitionFilters: [isnotnull(user#40), (user#40 = usr2)], PushedFilters: [], 
> ReadSchema: struct
> {code}
> {code:java}
> // Spark 3.0.0-preview output
> == Parsed Logical Plan ==
> 'Filter ('user = usr2)
> +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data
>
> == Analyzed Logical Plan ==
> id: double, ts: timestamp, user: string
> Filter (user#2 = usr2)
> +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data
>
> == Optimized Logical Plan ==
> Filter (isnotnull(user#2) AND (user#2 = usr2))
> +- RelationV2[id#0, ts#1, user#2] parquet file:/home/cnali/data
>
> == Physical Plan ==
> *(1) Project [id#0, ts#1, user#2]
> +- *(1) Filter (isnotnull(user#2) AND (user#2 = usr2))
>+- *(1) ColumnarToRow
>   +- BatchScan[id#0, ts#1, user#2] ParquetScan Location: 
> InMemoryFileIndex[file:/home/cnali/data], ReadSchema: 
> struct
> {code}
> I have tested it on a much larger dataset. Spark 3.0 tries to load the whole data 
> and then apply the filter, whereas Spark 2.4 pushes down the filter. The above 
> output shows that Spark 2.4 applied the partition filter but the Spark 3.0 preview did not.
>  
> Minor: in Spark 3.0 "explain()" output is truncated (maybe fixed length?) and 
> it's hard to debug.  spark.sql.orc.cache.stripe.details.size=1 doesn't 
> work.
>  
> {code:java}
> // pyspark 3 shell output
> $ pyspark
> Python 3.6.8 (default, Aug  7 2019, 17:28:10) 
> [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> Warning: Ignoring non-spark config property: 
> java.io.dir=/md2k/data1,/md2k/data2,/md2k/data3,/md2k/data4,/md2k/data5,/md2k/data6,/md2k/data7,/md2k/data8
> 19/12/09 07:05:36 WARN NativeCodeLoader: Unable to load native-hadoop library 
> for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> 19/12/09 07:05:36 WARN SparkConf: Note that spark.local.dir will be 
> overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in 
> mesos/standalone/kubernetes and LOCAL_DIRS in YARN).
> Welcome to
>     __

[jira] [Updated] (SPARK-23367) Include python document style checking

2020-04-20 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-23367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-23367:
-
Fix Version/s: (was: 3.0.0)

> Include python document style checking
> --
>
> Key: SPARK-23367
> URL: https://issues.apache.org/jira/browse/SPARK-23367
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 2.2.1
>Reporter: Rekha Joshi
>Priority: Minor
>
> As per discussions [PR#20378 |https://github.com/apache/spark/pull/20378] 
> this jira is to include python doc style checking in spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28554) implement basic catalog functionalities

2020-04-20 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088181#comment-17088181
 ] 

Sean R. Owen commented on SPARK-28554:
--

Is this still 'resolved' or (partly) 'reverted'?

> implement basic catalog functionalities
> ---
>
> Key: SPARK-28554
> URL: https://issues.apache.org/jira/browse/SPARK-28554
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30420) Commands involved with namespace go thru the new resolution framework.

2020-04-20 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-30420:
-
Fix Version/s: (was: 3.0.0)

This was reverted, right?

> Commands involved with namespace go thru the new resolution framework.
> --
>
> Key: SPARK-30420
> URL: https://issues.apache.org/jira/browse/SPARK-30420
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Terry Kim
>Priority: Major
>
> V2 commands that need to resolve namespace should go thru new resolution 
> framework introduced in 
> [SPARK-30214|https://issues.apache.org/jira/browse/SPARK-30214]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29663) Support sum with interval type values

2020-04-20 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-29663:
-
Fix Version/s: (was: 3.0.0)

> Support sum with interval type values
> -
>
> Key: SPARK-29663
> URL: https://issues.apache.org/jira/browse/SPARK-29663
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Priority: Major
>
> {code:sql}
> postgres=# SELECT i, Sum(cast(v as interval)) OVER (ORDER BY i ROWS BETWEEN 
> CURRENT ROW AND UNBOUNDED FOLLOWING) FROM (VALUES(1,'1 sec'),(2,'2 
> sec'),(3,NULL),(4,NULL)) t(i,v); 
> i | sum ---+-- 
> 1 | 00:00:03 
> 2 | 00:00:02
> 3 | 
> 4 | 
> (4 rows)
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29688) Support average with interval type values

2020-04-20 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-29688:
-
Fix Version/s: (was: 3.0.0)

> Support average with interval type values
> -
>
> Key: SPARK-29688
> URL: https://issues.apache.org/jira/browse/SPARK-29688
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Priority: Major
>
> Add average aggregate support for Spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30659) LogisticRegression blockify input vectors

2020-04-20 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-30659:
-
Fix Version/s: (was: 3.0.0)

> LogisticRegression blockify input vectors
> -
>
> Key: SPARK-30659
> URL: https://issues.apache.org/jira/browse/SPARK-30659
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30642) LinearSVC blockify input vectors

2020-04-20 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-30642:
-
Fix Version/s: (was: 3.0.0)

> LinearSVC blockify input vectors
> 
>
> Key: SPARK-30642
> URL: https://issues.apache.org/jira/browse/SPARK-30642
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30535) Migrate ALTER TABLE commands to the new resolution framework

2020-04-20 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30535?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-30535:
-
Fix Version/s: (was: 3.0.0)

> Migrate ALTER TABLE commands to the new resolution framework
> 
>
> Key: SPARK-30535
> URL: https://issues.apache.org/jira/browse/SPARK-30535
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Terry Kim
>Priority: Major
>
> Migrate ALTER TABLE commands to the new resolution framework introduced in 
> SPARK-30214



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30660) LinearRegression blockify input vectors

2020-04-20 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-30660:
-
Fix Version/s: (was: 3.0.0)

> LinearRegression blockify input vectors
> ---
>
> Key: SPARK-30660
> URL: https://issues.apache.org/jira/browse/SPARK-30660
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Minor
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31158) Connect Spark Cluster Exception on IDEA develop

2020-04-20 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-31158.
--
Resolution: Invalid

This is also not enough information to evaluate the cause.

> Connect Spark Cluster Exception on IDEA develop
> ---
>
> Key: SPARK-31158
> URL: https://issues.apache.org/jira/browse/SPARK-31158
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 1.5.1
> Environment: IDEA+Spark1.5.1
> Spark1.5.1  three node on VM
>Reporter: Yang Ren
>Priority: Major
>   Original Estimate: 2m
>  Remaining Estimate: 2m
>
> Connect Spark Cluster Exception on IDEA develop:
> WARN ReliableDeliverySupervisor: Association with remote system 
> [akka.tcp://sparkMaster@192.168.159.129:7077] has failed, address is now 
> gated for [5000] ms. Reason: [Disassociated] 
>  ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread 
> Thread[appclient-registration-retry-thread,5,main]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31158) Connect Spark Cluster Exception on IDEA develop

2020-04-20 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-31158:
-
Fix Version/s: (was: 1.5.1)
   Labels:   (was: test)

Nor the Fix Version. We wouldn't consider a JIRA that (only) affects EOL 
versions; 1.5 is very old.

> Connect Spark Cluster Exception on IDEA develop
> ---
>
> Key: SPARK-31158
> URL: https://issues.apache.org/jira/browse/SPARK-31158
> Project: Spark
>  Issue Type: Bug
>  Components: Project Infra
>Affects Versions: 1.5.1
> Environment: IDEA+Spark1.5.1
> Spark1.5.1  three node on VM
>Reporter: Yang Ren
>Priority: Major
>   Original Estimate: 2m
>  Remaining Estimate: 2m
>
> Connect Spark Cluster Exception on IDEA develop:
> WARN ReliableDeliverySupervisor: Association with remote system 
> [akka.tcp://sparkMaster@192.168.159.129:7077] has failed, address is now 
> gated for [5000] ms. Reason: [Disassociated] 
>  ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread 
> Thread[appclient-registration-retry-thread,5,main]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31235) Separates different categories of applications

2020-04-20 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-31235:
-
Fix Version/s: (was: 3.0.0)

(Don't set Fix Version)

> Separates different categories of applications
> --
>
> Key: SPARK-31235
> URL: https://issues.apache.org/jira/browse/SPARK-31235
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: wangzhun
>Priority: Minor
>
> The current application defaults to the SPARK type. 
> In fact, different types of applications have different characteristics and 
> are suitable for different scenarios. For example: SPARK-SQL, SPARK-STREAMING.
> I recommend distinguishing them by the parameter `spark.yarn.applicationType` 
> so that we can more easily manage and maintain different types of 
> applications.
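
For illustration, a minimal sketch of setting the proposed parameter; the key is the one proposed in this ticket (not an existing documented conf), and "SPARK-SQL" is just one of the example categories above:

{code:python}
from pyspark import SparkConf

# `spark.yarn.applicationType` is the parameter proposed in this ticket.
conf = SparkConf().set("spark.yarn.applicationType", "SPARK-SQL")
print(conf.get("spark.yarn.applicationType"))
{code}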



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31440) Improve SQL Rest API

2020-04-20 Thread Eren Avsarogullari (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eren Avsarogullari updated SPARK-31440:
---
Attachment: (was: improved_version.json)

> Improve SQL Rest API
> 
>
> Key: SPARK-31440
> URL: https://issues.apache.org/jira/browse/SPARK-31440
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Eren Avsarogullari
>Priority: Major
> Attachments: current_version.json, improved_version.json
>
>
> The SQL Rest API exposes query execution metrics as a public API. This Jira aims to 
> apply the following improvements to the SQL Rest API by aligning it with the Spark UI.
> *Proposed Improvements:*
> 1- Support Physical Operations and group metrics per physical operation by 
> aligning with the Spark UI.
> 2- Support *wholeStageCodegenId* for Physical Operations.
> 3- *nodeId* can be useful for grouping metrics and sorting physical 
> operations (according to execution order) to differentiate the same operators (if 
> used multiple times during the same query execution) and their metrics.
> 4- Filter *empty* metrics by aligning with the Spark UI - SQL Tab. Currently, 
> the Spark UI does not show empty metrics.
> 5- Remove line breaks (*\n*) from *metricValue*.
> 6- *planDescription* can be an *optional* HTTP parameter to avoid network cost, 
> especially for complex jobs creating big plans.
> 7- The *metrics* attribute needs to be exposed at the bottom as 
> *metricDetails*. In particular, this can be useful when the 
> *metricDetails* array size is high.
> 8- Reversing the order of *metricDetails* aims to match the Spark UI by following 
> the Physical Operators' execution order.
> *Attachments:*
>  Please find both the *current* and *improved* versions of the results attached 
> for the following SQL Rest Endpoint:
> {code:java}
> curl -X GET 
> http://localhost:4040/api/v1/applications/$appId/sql/$executionId?details=true{code}
>  
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31440) Improve SQL Rest API

2020-04-20 Thread Eren Avsarogullari (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eren Avsarogullari updated SPARK-31440:
---
Attachment: improved_version.json

> Improve SQL Rest API
> 
>
> Key: SPARK-31440
> URL: https://issues.apache.org/jira/browse/SPARK-31440
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Eren Avsarogullari
>Priority: Major
> Attachments: current_version.json, improved_version.json
>
>
> The SQL Rest API exposes query execution metrics as a public API. This Jira aims to 
> apply the following improvements to the SQL Rest API by aligning it with the Spark UI.
> *Proposed Improvements:*
> 1- Support Physical Operations and group metrics per physical operation by 
> aligning with the Spark UI.
> 2- Support *wholeStageCodegenId* for Physical Operations.
> 3- *nodeId* can be useful for grouping metrics and sorting physical 
> operations (according to execution order) to differentiate the same operators (if 
> used multiple times during the same query execution) and their metrics.
> 4- Filter *empty* metrics by aligning with the Spark UI - SQL Tab. Currently, 
> the Spark UI does not show empty metrics.
> 5- Remove line breaks (*\n*) from *metricValue*.
> 6- *planDescription* can be an *optional* HTTP parameter to avoid network cost, 
> especially for complex jobs creating big plans.
> 7- The *metrics* attribute needs to be exposed at the bottom as 
> *metricDetails*. In particular, this can be useful when the 
> *metricDetails* array size is high.
> 8- Reversing the order of *metricDetails* aims to match the Spark UI by following 
> the Physical Operators' execution order.
> *Attachments:*
>  Please find both the *current* and *improved* versions of the results attached 
> for the following SQL Rest Endpoint:
> {code:java}
> curl -X GET 
> http://localhost:4040/api/v1/applications/$appId/sql/$executionId?details=true{code}
>  
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31501) AQE update UI should not cause deadlock

2020-04-20 Thread Wei Xue (Jira)
Wei Xue created SPARK-31501:
---

 Summary: AQE update UI should not cause deadlock
 Key: SPARK-31501
 URL: https://issues.apache.org/jira/browse/SPARK-31501
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wei Xue






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31389) Add codegen-on test coverage for some tests in SQLMetricsSuite

2020-04-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31389:
--
Summary: Add codegen-on test coverage for some tests in SQLMetricsSuite  
(was: Ensure all tests in SQLMetricsSuite run with both codegen on and off)

> Add codegen-on test coverage for some tests in SQLMetricsSuite
> --
>
> Key: SPARK-31389
> URL: https://issues.apache.org/jira/browse/SPARK-31389
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Srinivas Rishindra Pothireddi
>Assignee: Srinivas Rishindra Pothireddi
>Priority: Minor
> Fix For: 3.1.0
>
>
> Many tests in SQLMetricsSuite run only with codegen turned off. Some complex 
> code paths (for example, generated code in "SortMergeJoin metrics") aren't 
> exercised at all. The generated code should be tested as well.
> *List of tests that run with codegen off*
> Filter metrics, SortMergeJoin metrics, SortMergeJoin(outer) metrics, 
> BroadcastHashJoin metrics,  ShuffledHashJoin metrics, 
> BroadcastHashJoin(outer) metrics, BroadcastNestedLoopJoin metrics, 
> BroadcastLeftSemiJoinHash metrics, CartesianProduct metrics,  
> SortMergeJoin(left-anti) metrics
>  
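
As background, a minimal PySpark sketch (not the suite's Scala helpers) of the conf toggle that lets the same join be exercised with whole-stage codegen both on and off; the tables here are made up:

{code:python}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
left = spark.range(100).withColumnRenamed("id", "k")
right = spark.range(50).withColumnRenamed("id", "k")

for enabled in ("true", "false"):
    # spark.sql.codegen.wholeStage controls whole-stage code generation.
    spark.conf.set("spark.sql.codegen.wholeStage", enabled)
    print(enabled, left.join(right, "k").count())
{code}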



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31389) Ensure all tests in SQLMetricsSuite run with both codegen on and off

2020-04-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-31389:
--
Issue Type: Improvement  (was: Bug)

> Ensure all tests in SQLMetricsSuite run with both codegen on and off
> 
>
> Key: SPARK-31389
> URL: https://issues.apache.org/jira/browse/SPARK-31389
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Srinivas Rishindra Pothireddi
>Assignee: Srinivas Rishindra Pothireddi
>Priority: Minor
> Fix For: 3.1.0
>
>
> Many tests in SQLMetricsSuite run only with codegen turned off. Some complex 
> code paths (for example, generated code in "SortMergeJoin metrics") aren't 
> exercised at all. The generated code should be tested as well.
> *List of tests that run with codegen off*
> Filter metrics, SortMergeJoin metrics, SortMergeJoin(outer) metrics, 
> BroadcastHashJoin metrics,  ShuffledHashJoin metrics, 
> BroadcastHashJoin(outer) metrics, BroadcastNestedLoopJoin metrics, 
> BroadcastLeftSemiJoinHash metrics, CartesianProduct metrics,  
> SortMergeJoin(left-anti) metrics
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31389) Ensure all tests in SQLMetricsSuite run with both codegen on and off

2020-04-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-31389:
-

Assignee: Srinivas Rishindra Pothireddi

> Ensure all tests in SQLMetricsSuite run with both codegen on and off
> 
>
> Key: SPARK-31389
> URL: https://issues.apache.org/jira/browse/SPARK-31389
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Srinivas Rishindra Pothireddi
>Assignee: Srinivas Rishindra Pothireddi
>Priority: Minor
>
> Many tests in SQLMetricsSuite run only with codegen turned off. Some complex 
> code paths (for example, generated code in "SortMergeJoin metrics") aren't 
> exercised at all. The generated code should be tested as well.
> *List of tests that run with codegen off*
> Filter metrics, SortMergeJoin metrics, SortMergeJoin(outer) metrics, 
> BroadcastHashJoin metrics,  ShuffledHashJoin metrics, 
> BroadcastHashJoin(outer) metrics, BroadcastNestedLoopJoin metrics, 
> BroadcastLeftSemiJoinHash metrics, CartesianProduct metrics,  
> SortMergeJoin(left-anti) metrics
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31440) Improve SQL Rest API

2020-04-20 Thread Eren Avsarogullari (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eren Avsarogullari updated SPARK-31440:
---
Description: 
The SQL Rest API exposes query execution metrics as a public API. This Jira aims to 
apply the following improvements to the SQL Rest API by aligning it with the Spark UI.

*Proposed Improvements:*
1- Support Physical Operations and group metrics per physical operation by 
aligning with the Spark UI.
2- Support *wholeStageCodegenId* for Physical Operations.
3- *nodeId* can be useful for grouping metrics and sorting physical operations 
(according to execution order) to differentiate the same operators (if used 
multiple times during the same query execution) and their metrics.
4- Filter *empty* metrics by aligning with the Spark UI - SQL Tab. Currently, the Spark 
UI does not show empty metrics.
5- Remove line breaks (*\n*) from *metricValue*.
6- *planDescription* can be an *optional* HTTP parameter to avoid network cost, 
especially for complex jobs creating big plans.
7- The *metrics* attribute needs to be exposed at the bottom as 
*metricDetails*. In particular, this can be useful when the 
*metricDetails* array size is high.
8- Reversing the order of *metricDetails* aims to match the Spark UI by following 
the Physical Operators' execution order.

*Attachments:*
 Please find both the *current* and *improved* versions of the results attached 
for the following SQL Rest Endpoint:
{code:java}
curl -X GET 
http://localhost:4040/api/v1/applications/$appId/sql/$executionId?details=true{code}
 
  

  was:
The SQL Rest API exposes query execution metrics as a public API. This Jira aims to 
apply the following improvements to the SQL Rest API by aligning it with the Spark UI.

*Proposed Improvements:*
 1- Support Physical Operations and group metrics per operation, aligning with the 
Spark UI.
 2- *nodeId* can be useful for grouping metrics as well as for sorting, and to 
differentiate identical operators and their metrics.
 3- Filter *blank* metrics, aligning with the Spark UI - SQL Tab.
 4- Remove *\n* from *metricValue(s)*.
 5- *planDescription* can be an optional Http parameter to avoid network cost 
(especially for complex jobs creating big plans).
 6- The *metrics* attribute needs to be exposed at the bottom as 
*metricDetails*. This order matches the Spark UI by following the 
execution order.

*Attachments:*
 Please find both the *current* and *improved* versions of the results attached 
for the following SQL Rest endpoint:
{code:java}
curl -X GET 
http://localhost:4040/api/v1/applications/$appId/sql/$executionId?details=true{code}
 
 


> Improve SQL Rest API
> 
>
> Key: SPARK-31440
> URL: https://issues.apache.org/jira/browse/SPARK-31440
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Eren Avsarogullari
>Priority: Major
> Attachments: current_version.json, improved_version.json
>
>
> The SQL Rest API exposes query execution metrics as a public API. This Jira aims to 
> apply the following improvements to the SQL Rest API by aligning it with the Spark UI.
> *Proposed Improvements:*
> 1- Support Physical Operations and group metrics per physical operation, aligning 
> with the Spark UI.
> 2- Support *wholeStageCodegenId* for Physical Operations.
> 3- *nodeId* can be useful for grouping metrics and sorting physical 
> operations (according to execution order), and to differentiate identical operators 
> (if used multiple times during the same query execution) and their metrics.
> 4- Filter out *empty* metrics, aligning with the Spark UI - SQL Tab. Currently, 
> the Spark UI does not show empty metrics.
> 5- Remove line breaks (*\n*) from *metricValue*.
> 6- *planDescription* can be an *optional* HTTP parameter to avoid network cost, 
> especially for complex jobs that create big plans.
> 7- The *metrics* attribute needs to be exposed at the bottom as 
> *metricDetails*. This is especially useful when the *metricDetails* array is large.
> 8- Reverse the order of *metricDetails* to match the Spark UI by following the 
> Physical Operators' execution order.
> *Attachments:*
>  Please find both the *current* and *improved* versions of the results 
> attached for the following SQL Rest endpoint:
> {code:java}
> curl -X GET 
> http://localhost:4040/api/v1/applications/$appId/sql/$executionId?details=true{code}
>  
>   



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31389) Ensure all tests in SQLMetricsSuite run with both codegen on and off

2020-04-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31389.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28173
[https://github.com/apache/spark/pull/28173]

> Ensure all tests in SQLMetricsSuite run with both codegen on and off
> 
>
> Key: SPARK-31389
> URL: https://issues.apache.org/jira/browse/SPARK-31389
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 3.1.0
>Reporter: Srinivas Rishindra Pothireddi
>Assignee: Srinivas Rishindra Pothireddi
>Priority: Minor
> Fix For: 3.1.0
>
>
> Many tests in SQLMetricsSuite run only with codegen turned off. Some complex 
> code paths (for example, generated code in "SortMergeJoin metrics") aren't 
> exercised at all. The generated code should be tested as well.
> *List of tests that run with codegen off*
> Filter metrics, SortMergeJoin metrics, SortMergeJoin(outer) metrics, 
> BroadcastHashJoin metrics,  ShuffledHashJoin metrics, 
> BroadcastHashJoin(outer) metrics, BroadcastNestedLoopJoin metrics, 
> BroadcastLeftSemiJoinHash metrics, CartesianProduct metrics,  
> SortMergeJoin(left-anti) metrics
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28684) Hive module support JDK 11

2020-04-20 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088076#comment-17088076
 ] 

Dongjoon Hyun commented on SPARK-28684:
---

Thank you so much, [~yumwang]. I resolved this umbrella JIRA issue as `Done`.

> Hive module support JDK 11
> --
>
> Key: SPARK-28684
> URL: https://issues.apache.org/jira/browse/SPARK-28684
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> This is an umbrella JIRA for Hive module to support JDK 11.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-28684) Hive module support JDK 11

2020-04-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-28684.
---
Resolution: Done

> Hive module support JDK 11
> --
>
> Key: SPARK-28684
> URL: https://issues.apache.org/jira/browse/SPARK-28684
> Project: Spark
>  Issue Type: Umbrella
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> This is an umbrella JIRA for Hive module to support JDK 11.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31452) Do not create partition spec for 0-size partitions

2020-04-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31452.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28226
[https://github.com/apache/spark/pull/28226]

> Do not create partition spec for 0-size partitions
> --
>
> Key: SPARK-31452
> URL: https://issues.apache.org/jira/browse/SPARK-31452
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.1.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29245) CCE during creating HiveMetaStoreClient

2020-04-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29245.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28148
[https://github.com/apache/spark/pull/28148]

> CCE during creating HiveMetaStoreClient 
> 
>
> Key: SPARK-29245
> URL: https://issues.apache.org/jira/browse/SPARK-29245
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> From a `master` branch build, when I try to connect to an external HMS, I hit 
> the following.
> {code}
> 19/09/25 10:58:46 ERROR hive.log: Got exception: java.lang.ClassCastException 
> class [Ljava.lang.Object; cannot be cast to class [Ljava.net.URI; 
> ([Ljava.lang.Object; and [Ljava.net.URI; are in module java.base of loader 
> 'bootstrap')
> java.lang.ClassCastException: class [Ljava.lang.Object; cannot be cast to 
> class [Ljava.net.URI; ([Ljava.lang.Object; and [Ljava.net.URI; are in module 
> java.base of loader 'bootstrap')
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:200)
>   at 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:70)
> {code}
> With HIVE-21508, I can get the following.
> {code}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
>       /_/
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.4)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> sql("show databases").show
> +------------+
> |databaseName|
> +------------+
> |  .  |
> ...
> {code}
> With 2.3.7-SNAPSHOT, the following basic operations are tested.
> - SHOW DATABASES / TABLES
> - DESC DATABASE / TABLE
> - CREATE / DROP / USE DATABASE
> - CREATE / DROP / INSERT / LOAD / SELECT TABLE
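
For context, a small illustration (plain JDK behavior, not Hive or Spark code) of what 
typically produces the ClassCastException above: since JDK 9 (JDK-6260652), 
{{Arrays.asList(...).toArray()}} returns {{Object[]}} rather than an array of the 
element type, so an unchecked cast to {{URI[]}} that happened to work on JDK 8 now 
throws. The URI below is hypothetical.
{code:scala}
import java.net.URI
import java.util.Arrays

val uris = Arrays.asList(new URI("thrift://example-hms:9083"))  // hypothetical HMS URI

// Unchecked cast: fine on JDK 8, ClassCastException on JDK 9 and newer,
// because toArray() now returns Object[] instead of URI[].
// val broken = uris.toArray.asInstanceOf[Array[URI]]

// Safe on every JDK: the typed overload returns a real URI[].
val fine: Array[URI] = uris.toArray(new Array[URI](0))
{code}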



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31381) Upgrade built-in Hive 2.3.6 to 2.3.7

2020-04-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31381.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28148
[https://github.com/apache/spark/pull/28148]

> Upgrade built-in Hive 2.3.6 to 2.3.7
> 
>
> Key: SPARK-31381
> URL: https://issues.apache.org/jira/browse/SPARK-31381
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> Hive 2.3.7 fixed these issues:
> HIVE-21508: ClassCastException when initializing HiveMetaStoreClient on JDK10 
> or newer
> HIVE-21980: Parsing time can be high in case of deeply nested subqueries
> HIVE-22249: Support Parquet through HCatalog



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31381) Upgrade built-in Hive 2.3.6 to 2.3.7

2020-04-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-31381:
-

Assignee: Yuming Wang

> Upgrade built-in Hive 2.3.6 to 2.3.7
> 
>
> Key: SPARK-31381
> URL: https://issues.apache.org/jira/browse/SPARK-31381
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Yuming Wang
>Assignee: Yuming Wang
>Priority: Major
>
> Hive 2.3.7 fixed these issues:
> HIVE-21508: ClassCastException when initializing HiveMetaStoreClient on JDK10 
> or newer
> HIVE-21980: Parsing time can be high in case of deeply nested subqueries
> HIVE-22249: Support Parquet through HCatalog



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29245) CCE during creating HiveMetaStoreClient

2020-04-20 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-29245:
-

Assignee: Yuming Wang

> CCE during creating HiveMetaStoreClient 
> 
>
> Key: SPARK-29245
> URL: https://issues.apache.org/jira/browse/SPARK-29245
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Dongjoon Hyun
>Assignee: Yuming Wang
>Priority: Major
>
> From a `master` branch build, when I try to connect to an external HMS, I hit 
> the following.
> {code}
> 19/09/25 10:58:46 ERROR hive.log: Got exception: java.lang.ClassCastException 
> class [Ljava.lang.Object; cannot be cast to class [Ljava.net.URI; 
> ([Ljava.lang.Object; and [Ljava.net.URI; are in module java.base of loader 
> 'bootstrap')
> java.lang.ClassCastException: class [Ljava.lang.Object; cannot be cast to 
> class [Ljava.net.URI; ([Ljava.lang.Object; and [Ljava.net.URI; are in module 
> java.base of loader 'bootstrap')
>   at 
> org.apache.hadoop.hive.metastore.HiveMetaStoreClient.<init>(HiveMetaStoreClient.java:200)
>   at 
> org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.<init>(SessionHiveMetaStoreClient.java:70)
> {code}
> With HIVE-21508, I can get the following.
> {code}
> Welcome to
>       ____              __
>      / __/__  ___ _____/ /__
>     _\ \/ _ \/ _ `/ __/  '_/
>    /___/ .__/\_,_/_/ /_/\_\   version 3.0.0-SNAPSHOT
>       /_/
> Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.4)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> sql("show databases").show
> +------------+
> |databaseName|
> +------------+
> |  .  |
> ...
> {code}
> With 2.3.7-SNAPSHOT, the following basic operations are tested.
> - SHOW DATABASES / TABLES
> - DESC DATABASE / TABLE
> - CREATE / DROP / USE DATABASE
> - CREATE / DROP / INSERT / LOAD / SELECT TABLE



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31234) ResetCommand should not wipe out all configs

2020-04-20 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-31234.
-
Resolution: Fixed

> ResetCommand should not wipe out all configs
> 
>
> Key: SPARK-31234
> URL: https://issues.apache.org/jira/browse/SPARK-31234
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.3, 2.2.3, 2.3.4, 2.4.5, 3.1.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Blocker
> Fix For: 2.4.6, 3.0.0
>
>
> Currently, ResetCommand clears all configurations, including SQL configs, 
> static SQL configs and Spark context level configs.
> For example:
> {noformat}
> spark-sql> set xyz=abc;
> xyz   abc
> spark-sql> set;
> spark.app.id  local-1585055396930
> spark.app.name    SparkSQL::10.242.189.214
> spark.driver.host 10.242.189.214
> spark.driver.port 65094
> spark.executor.id driver
> spark.jars
> spark.master  local[*]
> spark.sql.catalogImplementation   hive
> spark.sql.hive.version    1.2.1
> spark.submit.deployMode   client
> xyz   abc
> spark-sql> reset;
> spark-sql> set;
> spark-sql> set spark.sql.hive.version;
> spark.sql.hive.version    1.2.1
> spark-sql> set spark.app.id;
> spark.app.id
> {noformat}
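
A minimal sketch of the behavior the fix aims for, assuming a {{SparkSession}} named 
{{spark}} (sketch only, not the test added by the fix):
{code:scala}
// RESET should clear runtime SQL confs back to their defaults, but it should not
// wipe out static SQL confs or Spark context level configs such as spark.app.id.
spark.sql("SET xyz=abc")
spark.sql("RESET")
assert(spark.conf.getOption("xyz").isEmpty)               // runtime conf cleared
assert(spark.conf.getOption("spark.app.id").isDefined)    // context conf preserved
{code}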



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31500) collect_set() of BinaryType returns duplicate elements

2020-04-20 Thread Eric Wasserman (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Wasserman updated SPARK-31500:
---
Affects Version/s: 2.4.5

> collect_set() of BinaryType returns duplicate elements
> --
>
> Key: SPARK-31500
> URL: https://issues.apache.org/jira/browse/SPARK-31500
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4, 2.4.5
>Reporter: Eric Wasserman
>Priority: Major
>
> The collect_set() aggregate function should produce a set of distinct 
> elements. When the column argument's type is BinaryType, this is not the case.
>  
> Example:
> {{import org.apache.spark.sql.functions._}}
>  {{import org.apache.spark.sql.expressions.Window}}
> {{case class R(id: String, value: String, bytes: Array[Byte])}}
>  {{def makeR(id: String, value: String) = R(id, value, value.getBytes)}}
>  {{val df = Seq(makeR("a", "dog"), makeR("a", "cat"), makeR("a", "cat"), 
> makeR("b", "fish")).toDF()}}
>  
> {{// In the example below "bytesSet" erroneously has duplicates but 
> "stringSet" does not (as expected).}}
> {{df.agg(collect_set('value) as "stringSet", collect_set('bytes) as 
> "byteSet").show(truncate=false)}}
>  
> {{// The same problem is displayed when using window functions.}}
>  {{val win = Window.partitionBy('id).rowsBetween(Window.unboundedPreceding, 
> Window.unboundedFollowing)}}
>  {{val result = df.select(}}
>   collect_set('value).over(win) as "stringSet",
>   collect_set('bytes).over(win) as "bytesSet"
>  {{)}}
>  {{.select('stringSet, 'bytesSet, size('stringSet) as "stringSetSize", 
> size('bytesSet) as "bytesSetSize")}}
>  {{.show()}}
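
Until this is fixed, a possible workaround sketch (using the {{df}} from the example 
above): compare on a string encoding of the bytes, e.g. base64, so that set semantics 
apply, and decode with {{unbase64}} afterwards if the raw bytes are needed.
{code:scala}
import org.apache.spark.sql.functions._

// Sketch only: collect_set over a base64 string dedupes correctly because string
// equality is value-based, unlike Array[Byte] equality.
val deduped = df.agg(collect_set(base64(col("bytes"))) as "bytesSetB64")
deduped.show(truncate = false)
{code}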



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17636) Parquet predicate pushdown for nested fields

2020-04-20 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-17636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-17636:

Priority: Critical  (was: Minor)

> Parquet predicate pushdown for nested fields
> 
>
> Key: SPARK-17636
> URL: https://issues.apache.org/jira/browse/SPARK-17636
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 1.6.2, 1.6.3, 2.0.2
>Reporter: Mitesh
>Assignee: DB Tsai
>Priority: Critical
> Fix For: 3.0.0
>
>
> There's a *PushedFilters* for a simple numeric field, but not for a numeric 
> field inside a struct. Not sure if this is a Spark limitation caused by 
> Parquet, or a limitation in Spark alone.
> {noformat}
> scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", 
> "sale_id")
> res5: org.apache.spark.sql.DataFrame = [day_timestamp: 
> struct, sale_id: bigint]
> scala> res5.filter("sale_id > 4").queryExecution.executedPlan
> res9: org.apache.spark.sql.execution.SparkPlan =
> Filter[23814] [args=(sale_id#86324L > 
> 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
> s3a://some/parquet/file, PushedFilters: [GreaterThan(sale_id,4)]
> scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan
> res10: org.apache.spark.sql.execution.SparkPlan =
> Filter[23815] [args=(day_timestamp#86302.timestamp > 
> 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
> s3a://some/parquet/file
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31500) collect_set() of BinaryType returns duplicate elements

2020-04-20 Thread Eric Wasserman (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Wasserman updated SPARK-31500:
---
Description: 
The collect_set() aggregate function should produce a set of distinct elements. 
When the column argument's type is BinaryType, this is not the case.

 

Example:

{{import org.apache.spark.sql.functions._}}
 {{import org.apache.spark.sql.expressions.Window}}

{{case class R(id: String, value: String, bytes: Array[Byte])}}
 {{def makeR(id: String, value: String) = R(id, value, value.getBytes)}}
 {{val df = Seq(makeR("a", "dog"), makeR("a", "cat"), makeR("a", "cat"), 
makeR("b", "fish")).toDF()}}

 

{{// In the example below "bytesSet" erroneously has duplicates but "stringSet" 
does not (as expected).}}

{{df.agg(collect_set('value) as "stringSet", collect_set('bytes) as 
"byteSet").show(truncate=false)}}

 

{{// The same problem is displayed when using window functions.}}
 {{val win = Window.partitionBy('id).rowsBetween(Window.unboundedPreceding, 
Window.unboundedFollowing)}}
 {{val result = df.select(}}
  collect_set('value).over(win) as "stringSet",
  collect_set('bytes).over(win) as "bytesSet"
 {{)}}
 {{.select('stringSet, 'bytesSet, size('stringSet) as "stringSetSize", 
size('bytesSet) as "bytesSetSize")}}
 {{.show()}}

  was:
The collect_set() aggregate function should produce a set of distinct elements. 
When the column argument's type is BinaryType, this is not the case.

 

Example:

{{import org.apache.spark.sql.functions._}}
 {{import org.apache.spark.sql.expressions.Window}}

{{case class R(id: String, value: String, bytes: Array[Byte])}}
 {{def makeR(id: String, value: String) = R(id, value, value.getBytes)}}
 {{val df = Seq(makeR("a", "dog"), makeR("a", "cat"), makeR("a", "cat"), 
makeR("b", "fish")).toDF()}}

 

{{// In the example below "bytesSet" erroneously has duplicates but "stringSet" 
does not (as expected).}}

{{df.agg(collect_set('value) as "stringSet", collect_set('bytes) as 
"byteSet").show(truncate=false)}}

 

{{// The same problem is displayed when using window functions.}}
 {{val win = Window.partitionBy('id).rowsBetween(Window.unboundedPreceding, 
Window.unboundedFollowing)}}
 {{val result = df.select(}}
 \{{ collect_set('value).over(win) as "stringSet",}}
 \{{ collect_set('bytes).over(win) as "bytesSet"}}
 {{)}}
 {{.select('stringSet, 'bytesSet, size('stringSet) as "stringSetSize", 
size('bytesSet) as "bytesSetSize")}}
 {{.show()}}


> collect_set() of BinaryType returns duplicate elements
> --
>
> Key: SPARK-31500
> URL: https://issues.apache.org/jira/browse/SPARK-31500
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Eric Wasserman
>Priority: Major
>
> The collect_set() aggregate function should produce a set of distinct 
> elements. When the column argument's type is BinaryType, this is not the case.
>  
> Example:
> {{import org.apache.spark.sql.functions._}}
>  {{import org.apache.spark.sql.expressions.Window}}
> {{case class R(id: String, value: String, bytes: Array[Byte])}}
>  {{def makeR(id: String, value: String) = R(id, value, value.getBytes)}}
>  {{val df = Seq(makeR("a", "dog"), makeR("a", "cat"), makeR("a", "cat"), 
> makeR("b", "fish")).toDF()}}
>  
> {{// In the example below "bytesSet" erroneously has duplicates but 
> "stringSet" does not (as expected).}}
> {{df.agg(collect_set('value) as "stringSet", collect_set('bytes) as 
> "byteSet").show(truncate=false)}}
>  
> {{// The same problem is displayed when using window functions.}}
>  {{val win = Window.partitionBy('id).rowsBetween(Window.unboundedPreceding, 
> Window.unboundedFollowing)}}
>  {{val result = df.select(}}
>   collect_set('value).over(win) as "stringSet",
>   collect_set('bytes).over(win) as "bytesSet"
>  {{)}}
>  {{.select('stringSet, 'bytesSet, size('stringSet) as "stringSetSize", 
> size('bytesSet) as "bytesSetSize")}}
>  {{.show()}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31500) collect_set() of BinaryType returns duplicate elements

2020-04-20 Thread Eric Wasserman (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Wasserman updated SPARK-31500:
---
Description: 
The collect_set() aggregate function should produce a set of distinct elements. 
When the column argument's type is BinaryType, this is not the case.

 

Example:

{{import org.apache.spark.sql.functions._}}
 {{import org.apache.spark.sql.expressions.Window}}

{{case class R(id: String, value: String, bytes: Array[Byte])}}
 {{def makeR(id: String, value: String) = R(id, value, value.getBytes)}}
 {{val df = Seq(makeR("a", "dog"), makeR("a", "cat"), makeR("a", "cat"), 
makeR("b", "fish")).toDF()}}

 

{{// In the example below "bytesSet" erroneously has duplicates but "stringSet" 
does not (as expected).}}

{{df.agg(collect_set('value) as "stringSet", collect_set('bytes) as 
"byteSet").show(truncate=false)}}

 

{{// The same problem is displayed when using window functions.}}
 {{val win = Window.partitionBy('id).rowsBetween(Window.unboundedPreceding, 
Window.unboundedFollowing)}}
 {{val result = df.select(}}
 \{{ collect_set('value).over(win) as "stringSet",}}
 \{{ collect_set('bytes).over(win) as "bytesSet"}}
 {{)}}
 {{.select('stringSet, 'bytesSet, size('stringSet) as "stringSetSize", 
size('bytesSet) as "bytesSetSize")}}
 {{.show()}}

  was:
The collect_set() aggregate function should produce a set of distinct elements. 
When the column argument's type is BinaryType, this is not the case.

 

Example:

{{import org.apache.spark.sql.functions._}}
 {{import org.apache.spark.sql.expressions.Window}}{\{case }}

{{case class R(id: String, value: String, bytes: Array[Byte])}}
 {{def makeR(id: String, value: String) = R(id, value, value.getBytes)}}
 {{val df = Seq(makeR("a", "dog"), makeR("a", "cat"), makeR("a", "cat"), 
makeR("b", "fish")).toDF()}}

 

{{// In the example below "bytesSet" erroneously has duplicates but "stringSet" 
does not (as expected).}}

{{df.agg(collect_set('value) as "stringSet", collect_set('bytes) as 
"byteSet").show(truncate=false)}}

 

{{// The same problem is displayed when using window functions.}}
 {{val win = Window.partitionBy('id).rowsBetween(Window.unboundedPreceding, 
Window.unboundedFollowing)}}
 {{val result = df.select(}}
 \{{ collect_set('value).over(win) as "stringSet",}}
 \{{ collect_set('bytes).over(win) as "bytesSet"}}
 {{)}}
 {{.select('stringSet, 'bytesSet, size('stringSet) as "stringSetSize", 
size('bytesSet) as "bytesSetSize")}}
 {{.show()}}


> collect_set() of BinaryType returns duplicate elements
> --
>
> Key: SPARK-31500
> URL: https://issues.apache.org/jira/browse/SPARK-31500
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Eric Wasserman
>Priority: Major
>
> The collect_set() aggregate function should produce a set of distinct 
> elements. When the column argument's type is BinaryType, this is not the case.
>  
> Example:
> {{import org.apache.spark.sql.functions._}}
>  {{import org.apache.spark.sql.expressions.Window}}
> {{case class R(id: String, value: String, bytes: Array[Byte])}}
>  {{def makeR(id: String, value: String) = R(id, value, value.getBytes)}}
>  {{val df = Seq(makeR("a", "dog"), makeR("a", "cat"), makeR("a", "cat"), 
> makeR("b", "fish")).toDF()}}
>  
> {{// In the example below "bytesSet" erroneously has duplicates but 
> "stringSet" does not (as expected).}}
> {{df.agg(collect_set('value) as "stringSet", collect_set('bytes) as 
> "byteSet").show(truncate=false)}}
>  
> {{// The same problem is displayed when using window functions.}}
>  {{val win = Window.partitionBy('id).rowsBetween(Window.unboundedPreceding, 
> Window.unboundedFollowing)}}
>  {{val result = df.select(}}
>  \{{ collect_set('value).over(win) as "stringSet",}}
>  \{{ collect_set('bytes).over(win) as "bytesSet"}}
>  {{)}}
>  {{.select('stringSet, 'bytesSet, size('stringSet) as "stringSetSize", 
> size('bytesSet) as "bytesSetSize")}}
>  {{.show()}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31500) collect_set() of BinaryType returns duplicate elements

2020-04-20 Thread Eric Wasserman (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Eric Wasserman updated SPARK-31500:
---
Description: 
The collect_set() aggregate function should produce a set of distinct elements. 
When the column argument's type is BinaryType, this is not the case.

 

Example:

{{import org.apache.spark.sql.functions._}}
 {{import org.apache.spark.sql.expressions.Window}}{\{case }}

{{case class R(id: String, value: String, bytes: Array[Byte])}}
 {{def makeR(id: String, value: String) = R(id, value, value.getBytes)}}
 {{val df = Seq(makeR("a", "dog"), makeR("a", "cat"), makeR("a", "cat"), 
makeR("b", "fish")).toDF()}}

 

{{// In the example below "bytesSet" erroneously has duplicates but "stringSet" 
does not (as expected).}}

{{df.agg(collect_set('value) as "stringSet", collect_set('bytes) as 
"byteSet").show(truncate=false)}}

 

{{// The same problem is displayed when using window functions.}}
 {{val win = Window.partitionBy('id).rowsBetween(Window.unboundedPreceding, 
Window.unboundedFollowing)}}
 {{val result = df.select(}}
 \{{ collect_set('value).over(win) as "stringSet",}}
 \{{ collect_set('bytes).over(win) as "bytesSet"}}
 {{)}}
 {{.select('stringSet, 'bytesSet, size('stringSet) as "stringSetSize", 
size('bytesSet) as "bytesSetSize")}}
 {{.show()}}

  was:
The collect_set() aggregate function should produce a set of distinct elements. 
When the column argument's type is BinaryType, this is not the case.

 

Example:

{{import org.apache.spark.sql.functions._}}
{{import org.apache.spark.sql.expressions.Window}}{{case }}

{{case class R(id: String, value: String, bytes: Array[Byte])}}
{{def makeR(id: String, value: String) = R(id, value, value.getBytes)}}
{{val df = Seq(makeR("a", "dog"), makeR("a", "cat"), makeR("a", "cat"), 
makeR("b", "fish")).toDF()}}

 

{{// In the example below "bytesSet" erroneously has duplicates but "stringSet" 
does not (as expected).}}

{{df.agg(collect_set('value) as "stringSet", collect_set('bytes) as 
"byteSet").show(truncate=false)}}

 

{{// The same problem is displayed when using window functions.}}
{{val win = Window.partitionBy('id).rowsBetween(Window.unboundedPreceding, 
Window.unboundedFollowing)}}
{{val result = df.select(}}
{{ collect_set('value).over(win) as "stringSet",}}
{{ collect_set('bytes).over(win) as "bytesSet"}}
{{)}}
{{.select('stringSet, 'bytesSet, size('stringSet) as "stringSetSize", 
size('bytesSet) as "bytesSetSize")}}
{{.show()}}
{{}}


> collect_set() of BinaryType returns duplicate elements
> --
>
> Key: SPARK-31500
> URL: https://issues.apache.org/jira/browse/SPARK-31500
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Eric Wasserman
>Priority: Major
>
> The collect_set() aggregate function should produce a set of distinct 
> elements. When the column argument's type is BinaryType, this is not the case.
>  
> Example:
> {{import org.apache.spark.sql.functions._}}
>  {{import org.apache.spark.sql.expressions.Window}}{\{case }}
> {{case class R(id: String, value: String, bytes: Array[Byte])}}
>  {{def makeR(id: String, value: String) = R(id, value, value.getBytes)}}
>  {{val df = Seq(makeR("a", "dog"), makeR("a", "cat"), makeR("a", "cat"), 
> makeR("b", "fish")).toDF()}}
>  
> {{// In the example below "bytesSet" erroneously has duplicates but 
> "stringSet" does not (as expected).}}
> {{df.agg(collect_set('value) as "stringSet", collect_set('bytes) as 
> "byteSet").show(truncate=false)}}
>  
> {{// The same problem is displayed when using window functions.}}
>  {{val win = Window.partitionBy('id).rowsBetween(Window.unboundedPreceding, 
> Window.unboundedFollowing)}}
>  {{val result = df.select(}}
>  \{{ collect_set('value).over(win) as "stringSet",}}
>  \{{ collect_set('bytes).over(win) as "bytesSet"}}
>  {{)}}
>  {{.select('stringSet, 'bytesSet, size('stringSet) as "stringSetSize", 
> size('bytesSet) as "bytesSetSize")}}
>  {{.show()}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31500) collect_set() of BinaryType returns duplicate elements

2020-04-20 Thread Eric Wasserman (Jira)
Eric Wasserman created SPARK-31500:
--

 Summary: collect_set() of BinaryType returns duplicate elements
 Key: SPARK-31500
 URL: https://issues.apache.org/jira/browse/SPARK-31500
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.4
Reporter: Eric Wasserman


The collect_set() aggregate function should produce a set of distinct elements. 
When the column argument's type is BinaryType, this is not the case.

 

Example:

{{import org.apache.spark.sql.functions._}}
{{import org.apache.spark.sql.expressions.Window}}{{case }}

{{case class R(id: String, value: String, bytes: Array[Byte])}}
{{def makeR(id: String, value: String) = R(id, value, value.getBytes)}}
{{val df = Seq(makeR("a", "dog"), makeR("a", "cat"), makeR("a", "cat"), 
makeR("b", "fish")).toDF()}}

 

{{// In the example below "bytesSet" erroneously has duplicates but "stringSet" 
does not (as expected).}}

{{df.agg(collect_set('value) as "stringSet", collect_set('bytes) as 
"byteSet").show(truncate=false)}}

 

{{// The same problem is displayed when using window functions.}}
{{val win = Window.partitionBy('id).rowsBetween(Window.unboundedPreceding, 
Window.unboundedFollowing)}}
{{val result = df.select(}}
{{ collect_set('value).over(win) as "stringSet",}}
{{ collect_set('bytes).over(win) as "bytesSet"}}
{{)}}
{{.select('stringSet, 'bytesSet, size('stringSet) as "stringSetSize", 
size('bytesSet) as "bytesSetSize")}}
{{.show()}}
{{}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31475) Broadcast stage in AQE did not timeout

2020-04-20 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-31475.
-
Fix Version/s: 3.0.0
 Assignee: Wei Xue
   Resolution: Fixed

> Broadcast stage in AQE did not timeout
> --
>
> Key: SPARK-31475
> URL: https://issues.apache.org/jira/browse/SPARK-31475
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wei Xue
>Assignee: Wei Xue
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31172) RecordBinaryComparator Tests failing on Big Endian Platform (s390x)

2020-04-20 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-31172:
-
Fix Version/s: (was: 2.4.6)
   (was: 3.1.0)
   (was: 3.0.0)

Please don't set Fix version.

[~jineshpatel] The changes you cite are necessary to fix a bug, and the logic 
you're talking about is intentional: we need to compare as unsigned in some 
places, for example, regardless of architecture.

That said, I do not have a big-endian machine to test on. I think we'd need more 
detail, or a clearer proposal of the problem and fix, to evaluate this.
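
To make the byte-order point concrete, a small illustration (plain JDK, not Spark 
code): the same 8 bytes read as a long give different values under big- and 
little-endian order, and signed vs. unsigned comparison can also disagree, which is 
why the comparator pins these down instead of deferring to the platform.
{code:scala}
import java.nio.{ByteBuffer, ByteOrder}

val bytes = Array[Byte](0, 0, 0, 0, 0, 0, 0, 1)

// The same bytes, read as a long under each byte order:
val asBig    = ByteBuffer.wrap(bytes).order(ByteOrder.BIG_ENDIAN).getLong     // 1L
val asLittle = ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getLong  // 1L << 56

// Signed and unsigned comparison disagree once the top bit is set:
assert(java.lang.Long.compare(-1L, 1L) < 0)          // -1 is smaller when signed
assert(java.lang.Long.compareUnsigned(-1L, 1L) > 0)  // -1 is the largest unsigned value
{code}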

> RecordBinaryComparator Tests failing on Big Endian Platform (s390x)
> ---
>
> Key: SPARK-31172
> URL: https://issues.apache.org/jira/browse/SPARK-31172
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5
> Environment: Architecture: Big Endian s390x
> Operating Systems:
>  * RHEL 7.x
>  * RHEL 8.x
>  * Ubuntu 16.04
>  * Ubuntu 18.04
>  * Ubuntu 19.10
>  * SLES 12 SP4 and SLES 12 SP5
>  * SLES 15 SP1
>Reporter: Jinesh Patel
>Priority: Minor
>
> Failing Test Cases in the RecordBinaryComparatorSuite:
>  * testBinaryComparatorWhenSubtractionIsDivisibleByMaxIntValue
>  * testBinaryComparatorWhenSubtractionCanOverflowLongValue
> Test cases failed after the change related to:
> [Github Pull Request 
> #26548|https://github.com/apache/spark/pull/26548#issuecomment-554645859]
> Test Case: testBinaryComparatorWhenSubtractionIsDivisibleByMaxIntValue
>  * Fails due to changing the compare from `<` to `>` as the test condition 
> (In little endian this is valid when the bytes are reversed, but not for big 
> endian)
> Test Case: testBinaryComparatorWhenSubtractionCanOverflowLongValue
>  * Fails due to using Long.compareUnsigned
>  * Possible Fix: Use signed compare for big endian platforms.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30975) Rename config for spark.<>.memoryOverhead

2020-04-20 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-30975:
-
Fix Version/s: (was: 2.4.6)

> Rename config for spark.<>.memoryOverhead
> -
>
> Key: SPARK-30975
> URL: https://issues.apache.org/jira/browse/SPARK-30975
> Project: Spark
>  Issue Type: Task
>  Components: Documentation, Spark Submit
>Affects Versions: 2.4.5
>Reporter: Miquel Angel Andreu
>Priority: Minor
>
> The Spark configuration was changed recently and we have to keep the code 
> consistent, so we need to rename OverHeadMemory in the code to the new 
> name: {{spark.executor.memoryOverhead}}.
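
For reference, the renamed config is set like any other Spark conf; a minimal sketch 
(the value is only an example):
{code:scala}
import org.apache.spark.SparkConf

// Sketch only: the current name, which replaced the older
// spark.yarn.executor.memoryOverhead style names.
val conf = new SparkConf().set("spark.executor.memoryOverhead", "1g")
{code}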



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30975) Rename config for spark.<>.memoryOverhead

2020-04-20 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-30975.
--
Resolution: Won't Fix

> Rename config for spark.<>.memoryOverhead
> -
>
> Key: SPARK-30975
> URL: https://issues.apache.org/jira/browse/SPARK-30975
> Project: Spark
>  Issue Type: Task
>  Components: Documentation, Spark Submit
>Affects Versions: 2.4.5
>Reporter: Miquel Angel Andreu
>Priority: Minor
>
> The Spark configuration was changed recently and we have to keep the code 
> consistent, so we need to rename OverHeadMemory in the code to the new 
> name: {{spark.executor.memoryOverhead}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6305) Add support for log4j 2.x to Spark

2020-04-20 Thread nirav patel (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-6305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087908#comment-17087908
 ] 

nirav patel commented on SPARK-6305:


I think Spark should decouple logging and allow the application owner to select any 
SLF4J implementation (Log4j 2, Logback). Log4j 1 is ridiculous; I am struggling 
to get rolling logs set up with it.

> Add support for log4j 2.x to Spark
> --
>
> Key: SPARK-6305
> URL: https://issues.apache.org/jira/browse/SPARK-6305
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Reporter: Tal Sliwowicz
>Priority: Minor
>
> log4j 2 requires replacing the slf4j binding and adding the log4j jars in the 
> classpath. Since there are shaded jars, it must be done during the build.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-31431) CalendarInterval encoder support

2020-04-20 Thread Amandeep (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amandeep updated SPARK-31431:
-
Comment: was deleted

(was: Hi Team,

I would like to work on this improvement, please let me know how to proceed.

Thanks,

Amandeep)

> CalendarInterval encoder support
> 
>
> Key: SPARK-31431
> URL: https://issues.apache.org/jira/browse/SPARK-31431
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> CalendarInterval can be converted to/from the internal Spark SQL 
> representation when it is a member of a Scala product type (e.g. tuples, case 
> classes, etc.) but not as a primitive type.
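
For illustration, a rough sketch of the asymmetry described above, assuming a 
{{SparkSession}} named {{spark}} on a 3.x build (sketch only, not a proposed API):
{code:scala}
import org.apache.spark.unsafe.types.CalendarInterval
import spark.implicits._

case class WithInterval(i: CalendarInterval)

// Works today: CalendarInterval as a field of a product type (case class / tuple).
val ok = Seq(WithInterval(new CalendarInterval(0, 1, 0L))).toDS()

// Not supported today: no encoder for CalendarInterval as a top-level type.
// val fails = Seq(new CalendarInterval(0, 1, 0L)).toDS()
{code}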



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31431) CalendarInterval encoder support

2020-04-20 Thread Amandeep (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087881#comment-17087881
 ] 

Amandeep commented on SPARK-31431:
--

Hi Team,

I would like to work on this improvement, please let me know how to proceed.

Thanks,

Amandeep

> CalendarInterval encoder support
> 
>
> Key: SPARK-31431
> URL: https://issues.apache.org/jira/browse/SPARK-31431
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Kent Yao
>Priority: Major
>
> CalendarInterval can be converted to/from the internal Spark SQL 
> representation when it is a member of a Scala product type (e.g. tuples, case 
> classes, etc.) but not as a primitive type.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30055) Allow configurable restart policy of driver and executor pods

2020-04-20 Thread Ed Mitchell (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087857#comment-17087857
 ] 

Ed Mitchell commented on SPARK-30055:
-

I agree with this. Having the restart policy default to Never removes the 
flexibility that lets Kubernetes restart pods if they run out of memory or 
terminate in some undefined way.

You can also access the logs of previously restarted containers by doing: 
{noformat}
kubectl -n <namespace> logs <pod> --previous{noformat}
I understand not wanting to set "Always" on the executor pod, to allow Spark to 
control graceful termination of executors, but shouldn't we at least set it to 
"OnFailure", to allow OOMKilled executors to come back up?

As far as the driver is concerned, our client mode setup has the driver pod 
running as a Deployment, which means the restart policy is Always. No reason we 
can't allow Always or OnFailure for the driver restart policy, imo.

> Allow configurable restart policy of driver and executor pods
> -
>
> Key: SPARK-30055
> URL: https://issues.apache.org/jira/browse/SPARK-30055
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Kevin Hogeland
>Priority: Major
>
> The current Kubernetes scheduler hard-codes the restart policy for all pods 
> to be "Never". To restart a failed application, all pods have to be deleted 
> and rescheduled, which is very slow and clears any caches the processes may 
> have built. Spark should allow a configurable restart policy for both drivers 
> and executors for immediate restart of crashed/killed drivers/executors as 
> long as the pods are not evicted. (This is not about eviction resilience, 
> that's described in this issue: SPARK-23980)
> Also, as far as I can tell, there's no reason the executors should be set to 
> never restart. Should that be configurable or should it just be changed to 
> Always?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31463) Enhance JsonDataSource by replacing jackson with simdjson

2020-04-20 Thread Shashanka Balakuntala Srinivasa (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087836#comment-17087836
 ] 

Shashanka Balakuntala Srinivasa commented on SPARK-31463:
-

Hi, is anyone working on this issue? 
If not, can I have some details on the implementation if we are moving from 
Jackson to simdjson?

> Enhance JsonDataSource by replacing jackson with simdjson
> -
>
> Key: SPARK-31463
> URL: https://issues.apache.org/jira/browse/SPARK-31463
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3
>Reporter: Steven Moy
>Priority: Minor
>
> I came across this VLDB paper: [https://arxiv.org/pdf/1902.08318.pdf] on how 
> to improve JSON reading speed. We use Spark to process terabytes of JSON, so 
> we are trying to find ways to improve JSON parsing speed.
>  
> [https://lemire.me/blog/2020/03/31/we-released-simdjson-0-3-the-fastest-json-parser-in-the-world-is-even-better/]
>  
> [https://github.com/simdjson/simdjson/issues/93]
>  
> Is anyone in the open-source community interested in leading this effort to 
> integrate simdjson into the Spark JSON data source API?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-27913) Spark SQL's native ORC reader implements its own schema evolution

2020-04-20 Thread Giri Dandu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087813#comment-17087813
 ] 

Giri Dandu edited comment on SPARK-27913 at 4/20/20, 2:52 PM:
--

[~hyukjin.kwon]

 

With the ORC mergeSchema option it is still *NOT* working in Spark 2.4.5.

 
{code:java}
scala> spark.conf.getAll
res15: Map[String,String] = 
Map(spark.driver.host -> 192.168.7.124, spark.sql.orc.mergeSchema -> true, 
spark.driver.port -> 54231, spark.repl.class.uri -> 
spark://192.168.7.124:54231/classes, spark.jars -> "", 
spark.repl.class.outputDir -> 
/private/var/folders/6h/nkpdlpcd0h34sq6x2fmz896wt2p_sp/T/spark-373735c9-6837-4734-bb13-e8457848a70e/repl-0852551f-cfa5-4b4a-aa2c-ac129818bbc2,
 spark.app.name -> Spark shell, spark.ui.showConsoleProgress -> true, 
spark.executor.id -> driver, spark.submit.deployMode -> client, spark.master -> 
local[*], spark.home -> /Users/gdandu/Downloads/spark-2.4.5-bin-hadoop2.7, 
spark.sql.catalogImplementation -> hive, spark.app.id -> local-1587393426045)

scala> spark.sql("drop table test_broken_orc");
res16: org.apache.spark.sql.DataFrame = []

scala> spark.sql("create external table test_broken_orc(a struct) 
stored as orc location '/tmp/test_broken_2'");
res17: org.apache.spark.sql.DataFrame = []

scala> spark.sql("insert into table test_broken_orc select named_struct(\"f1\", 
1)");
res18: org.apache.spark.sql.DataFrame = []

scala> spark.sql("select * from test_broken_orc");
res19: org.apache.spark.sql.DataFrame = [a: struct]

scala> res19.show
+---+
|  a|
+---+
|[1]|
+---+

scala> spark.sql("drop table test_broken_orc");
res21: org.apache.spark.sql.DataFrame = []

scala> spark.sql("create external table test_broken_orc(a struct) stored as orc location '/tmp/test_broken_2'");
res22: org.apache.spark.sql.DataFrame = []

scala> spark.sql("select * from test_broken_orc");
res23: org.apache.spark.sql.DataFrame = [a: struct]

scala> res23.show
20/04/20 10:46:23 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 
5)java.lang.ArrayIndexOutOfBoundsException: 1 at 
org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49) at 
org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133)
 at 
org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123)
 at 
org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
 at 
org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
 at 
org.apache.spark.sql.execution.datasources.orc.OrcDeserializer.deserialize(OrcDeserializer.scala:64)
 at 
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$7.apply(OrcFileFormat.scala:230)
 at 
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$7.apply(OrcFileFormat.scala:230)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:410) at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:104)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source) at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255) 
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247) 
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
 at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:310) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:310) at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at 
org.apache.spark.scheduler.Task.run(Task.scala:123) at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 

[jira] [Comment Edited] (SPARK-27913) Spark SQL's native ORC reader implements its own schema evolution

2020-04-20 Thread Giri Dandu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087813#comment-17087813
 ] 

Giri Dandu edited comment on SPARK-27913 at 4/20/20, 2:52 PM:
--

[~hyukjin.kwon]

 

With the ORC mergeSchema option it is still *NOT* working in Spark 2.4.5.

 
{code:java}
scala> spark.conf.getAll
res15: Map[String,String] = 
Map(spark.driver.host -> 192.168.7.124, spark.sql.orc.mergeSchema -> true, 
spark.driver.port -> 54231, spark.repl.class.uri -> 
spark://192.168.7.124:54231/classes, spark.jars -> "", 
spark.repl.class.outputDir -> 
/private/var/folders/6h/nkpdlpcd0h34sq6x2fmz896wt2p_sp/T/spark-373735c9-6837-4734-bb13-e8457848a70e/repl-0852551f-cfa5-4b4a-aa2c-ac129818bbc2,
 spark.app.name -> Spark shell, spark.ui.showConsoleProgress -> true, 
spark.executor.id -> driver, spark.submit.deployMode -> client, spark.master -> 
local[*], spark.home -> /Users/gdandu/Downloads/spark-2.4.5-bin-hadoop2.7, 
spark.sql.catalogImplementation -> hive, spark.app.id -> local-1587393426045)

scala> spark.sql("drop table test_broken_orc");
res16: org.apache.spark.sql.DataFrame = []

scala> spark.sql("create external table test_broken_orc(a struct) 
stored as orc location '/tmp/test_broken_2'");
res17: org.apache.spark.sql.DataFrame = []

scala> spark.sql("insert into table test_broken_orc select named_struct(\"f1\", 
1)");
res18: org.apache.spark.sql.DataFrame = []

scala> spark.sql("select * from test_broken_orc");
res19: org.apache.spark.sql.DataFrame = [a: struct]

scala> res19.show
+---+
|  a|
+---+
|[1]|
+---+

scala> spark.sql("drop table test_broken_orc");
res21: org.apache.spark.sql.DataFrame = []

scala> spark.sql("create external table test_broken_orc(a struct) stored as orc location '/tmp/test_broken_2'");
res22: org.apache.spark.sql.DataFrame = []

scala> spark.sql("select * from test_broken_orc");
res23: org.apache.spark.sql.DataFrame = [a: struct]

scala> res23.show
20/04/20 10:46:23 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 
5)java.lang.ArrayIndexOutOfBoundsException: 1 at 
org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49) at 
org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133)
 at 
org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123)
 at 
org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
 at 
org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
 at 
org.apache.spark.sql.execution.datasources.orc.OrcDeserializer.deserialize(OrcDeserializer.scala:64)
 at 
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$7.apply(OrcFileFormat.scala:230)
 at 
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$7.apply(OrcFileFormat.scala:230)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:410) at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:104)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source) at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255) 
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247) 
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
 at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:310) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:310) at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at 
org.apache.spark.scheduler.Task.run(Task.scala:123) at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 

[jira] [Commented] (SPARK-27913) Spark SQL's native ORC reader implements its own schema evolution

2020-04-20 Thread Giri Dandu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087813#comment-17087813
 ] 

Giri Dandu commented on SPARK-27913:


[~hyukjin.kwon]

 

With the ORC mergeSchema option it is still *NOT* working in Spark 2.4.5.

 
{code:java}
scala> spark.conf.getAll
res15: Map[String,String] = 
Map(spark.driver.host -> 192.168.7.124, spark.sql.orc.mergeSchema -> true, 
spark.driver.port -> 54231, spark.repl.class.uri -> 
spark://192.168.7.124:54231/classes, spark.jars -> "", 
spark.repl.class.outputDir -> 
/private/var/folders/6h/nkpdlpcd0h34sq6x2fmz896wt2p_sp/T/spark-373735c9-6837-4734-bb13-e8457848a70e/repl-0852551f-cfa5-4b4a-aa2c-ac129818bbc2,
 spark.app.name -> Spark shell, spark.ui.showConsoleProgress -> true, 
spark.executor.id -> driver, spark.submit.deployMode -> client, spark.master -> 
local[*], spark.home -> /Users/gdandu/Downloads/spark-2.4.5-bin-hadoop2.7, 
spark.sql.catalogImplementation -> hive, spark.app.id -> local-1587393426045)

scala> spark.sql("drop table test_broken_orc");res16: 
org.apache.spark.sql.DataFrame = []

scala> spark.sql("create external table test_broken_orc(a struct) stored as orc location '/tmp/test_broken_2'");
res17: org.apache.spark.sql.DataFrame = []

scala> spark.sql("insert into table test_broken_orc select named_struct(\"f1\", 1)");
res18: org.apache.spark.sql.DataFrame = []

scala> spark.sql("select * from test_broken_orc");
res19: org.apache.spark.sql.DataFrame = [a: struct]

scala> res19.show
+---+
|  a|
+---+
|[1]|
+---+

scala> spark.sql("drop table test_broken_orc");
res21: org.apache.spark.sql.DataFrame = []

scala> spark.sql("create external table test_broken_orc(a struct) stored as orc location '/tmp/test_broken_2'");
res22: org.apache.spark.sql.DataFrame = []

scala> spark.sql("select * from test_broken_orc");
res23: org.apache.spark.sql.DataFrame = [a: struct]

scala> res23.show
20/04/20 10:46:23 ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 5)
java.lang.ArrayIndexOutOfBoundsException: 1 at 
org.apache.orc.mapred.OrcStruct.getFieldValue(OrcStruct.java:49) at 
org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:133)
 at 
org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$org$apache$spark$sql$execution$datasources$orc$OrcDeserializer$$newWriter$14.apply(OrcDeserializer.scala:123)
 at 
org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
 at 
org.apache.spark.sql.execution.datasources.orc.OrcDeserializer$$anonfun$2$$anonfun$apply$1.apply(OrcDeserializer.scala:51)
 at 
org.apache.spark.sql.execution.datasources.orc.OrcDeserializer.deserialize(OrcDeserializer.scala:64)
 at 
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$7.apply(OrcFileFormat.scala:230)
 at 
org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anonfun$buildReaderWithPartitionValues$2$$anonfun$apply$7.apply(OrcFileFormat.scala:230)
 at scala.collection.Iterator$$anon$11.next(Iterator.scala:410) at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:104)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source) at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255) 
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247) 
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
 at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:310) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346) at 
org.apache.spark.rdd.RDD.iterator(RDD.scala:310) at 
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at 
org.apache.spark.scheduler.Task.run(Task.scala:123) at 
org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 

[jira] [Commented] (SPARK-27913) Spark SQL's native ORC reader implements its own schema evolution

2020-04-20 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087740#comment-17087740
 ] 

Hyukjin Kwon commented on SPARK-27913:
--

[~giri.dandu] can you try the {{mergeSchema}} option implemented in SPARK-11412?
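
A minimal sketch of how that option can be passed, assuming Spark 3.0+ (where SPARK-11412 landed) and reading the ORC files directly rather than through the Hive table; the path is the one from the reproduction above:
{code:scala}
// Sketch only: enable ORC schema merging when reading the files directly (Spark 3.0+).
// The same behaviour can also be switched on globally via spark.sql.orc.mergeSchema.
val merged = spark.read
  .option("mergeSchema", "true")   // merge the schemas of all ORC part files
  .orc("/tmp/test_broken_2")       // location used in the reproduction above
merged.printSchema()
merged.show()
{code}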

> Spark SQL's native ORC reader implements its own schema evolution
> -
>
> Key: SPARK-27913
> URL: https://issues.apache.org/jira/browse/SPARK-27913
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.3
>Reporter: Owen O'Malley
>Priority: Major
>
> ORC's reader handles a wide range of schema evolution, but the Spark SQL 
> native ORC bindings do not provide the desired schema to the ORC reader. This 
> causes a regression when moving spark.sql.orc.impl from 'hive' to 'native'.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31499) Add more column entries in built-in function tables of SQL references

2020-04-20 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-31499:
-
Summary: Add more column entries in built-in function tables of SQL 
references  (was: Add more field entries in built-in function tables of SQL 
references)

> Add more column entries in built-in function tables of SQL references
> -
>
> Key: SPARK-31499
> URL: https://issues.apache.org/jira/browse/SPARK-31499
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.1.0
>Reporter: Takeshi Yamamuro
>Priority: Major
>
> In SPARK-31429, we are planning to automatically generate a list of built-in 
> functions in SQL references. In the PR, only function names and descriptions 
> are shown in a function table. So, we could improve the table by adding more 
> fields (e.g., input types, arguments, ...) to the table.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31499) Add more field entries in built-in function tables in SQL references

2020-04-20 Thread Takeshi Yamamuro (Jira)
Takeshi Yamamuro created SPARK-31499:


 Summary: Add more field entries in built-in function tables in SQL 
references
 Key: SPARK-31499
 URL: https://issues.apache.org/jira/browse/SPARK-31499
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, SQL
Affects Versions: 3.1.0
Reporter: Takeshi Yamamuro


In SPARK-31429, we are planning to automatically generate a list of built-in 
functions in SQL references. In the PR, only function names and descriptions 
are shown in a function table. So, we could improve the table by adding more 
fields (e.g., input types, arguments, ...) to the table.
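
As a point of reference, a small sketch of where per-function metadata (usage, arguments, examples) is already exposed at runtime, which such a generated table could draw on; the exact output shape depends on the Spark version:
{code:scala}
// Sketch: built-in functions can be listed from the catalog, and DESCRIBE FUNCTION
// EXTENDED already surfaces usage, arguments and examples for each of them.
spark.catalog.listFunctions().show(5, false)
spark.sql("DESCRIBE FUNCTION EXTENDED abs").show(false)
{code}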



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31499) Add more field entries in built-in function tables of SQL references

2020-04-20 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-31499:
-
Summary: Add more field entries in built-in function tables of SQL 
references  (was: Add more field entries in built-in function tables in SQL 
references)

> Add more field entries in built-in function tables of SQL references
> 
>
> Key: SPARK-31499
> URL: https://issues.apache.org/jira/browse/SPARK-31499
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SQL
>Affects Versions: 3.1.0
>Reporter: Takeshi Yamamuro
>Priority: Major
>
> In SPARK-31429, we are planning to automatically generate a list of built-in 
> functions in SQL references. In the PR, only function names and descriptions 
> are shown in a function table. So, we could improve the table by adding more 
> fields (e.g., input types, arguments, ...) to the table.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31498) Dump public static sql configurations through doc generation

2020-04-20 Thread Kent Yao (Jira)
Kent Yao created SPARK-31498:


 Summary: Dump public static sql configurations through doc 
generation
 Key: SPARK-31498
 URL: https://issues.apache.org/jira/browse/SPARK-31498
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, SQL
Affects Versions: 3.0.0, 3.1.0
Reporter: Kent Yao


Currently, only the non-static public SQL configurations are dumped to the public 
docs; we'd better also add the static public ones, as the {{SET -v}} command does.
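
For reference, a minimal sketch of the command the description refers to; {{SET -v}} lists the documented SQL configurations (key, value, meaning) of a running session:
{code:scala}
// Sketch: SET -v returns the documented SQL configurations of the current session;
// the ticket proposes that the generated docs also cover the static public ones.
spark.sql("SET -v").show(5, false)
{code}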



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31497) Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot save and load model

2020-04-20 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu reassigned SPARK-31497:
--

Assignee: Weichen Xu

> Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot 
> save and load model
> --
>
> Key: SPARK-31497
> URL: https://issues.apache.org/jira/browse/SPARK-31497
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.5
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
>
> Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot 
> save and load model.
> Reproduce code run in pyspark shell:
> 1) Train model and save model in pyspark:
> {code:python}
> from pyspark.ml import Pipeline
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.ml.feature import HashingTF, Tokenizer
> from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, 
> ParamGridBuilder
> training = spark.createDataFrame([
> (0, "a b c d e spark", 1.0),
> (1, "b d", 0.0),
> (2, "spark f g h", 1.0),
> (3, "hadoop mapreduce", 0.0),
> (4, "b spark who", 1.0),
> (5, "g d a y", 0.0),
> (6, "spark fly", 1.0),
> (7, "was mapreduce", 0.0),
> (8, "e spark program", 1.0),
> (9, "a e c l", 0.0),
> (10, "spark compile", 1.0),
> (11, "hadoop software", 0.0)
> ], ["id", "text", "label"])
> # Configure an ML pipeline, which consists of three stages: tokenizer, 
> hashingTF, and lr.
> tokenizer = Tokenizer(inputCol="text", outputCol="words")
> hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
> lr = LogisticRegression(maxIter=10)
> pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
> paramGrid = ParamGridBuilder() \
> .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
> .addGrid(lr.regParam, [0.1, 0.01]) \
> .build()
> crossval = CrossValidator(estimator=pipeline,
>   estimatorParamMaps=paramGrid,
>   evaluator=BinaryClassificationEvaluator(),
>   numFolds=2)  # use 3+ folds in practice
> # Run cross-validation, and choose the best set of parameters.
> cvModel = crossval.fit(training)
> cvModel.save('/tmp/cv_model001') # save model failed. Raise error.
> {code}
> 2): Train crossvalidation model in scala with similar code above, and save to 
> '/tmp/model_cv_scala001', run following code in pyspark:
> {code:python}
> from pyspark.ml.tuning import CrossValidatorModel
> CrossValidatorModel.load('/tmp/model_cv_scala001') # raise error
> {code}
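
Step 2 above refers to "similar code in scala" without showing it; a hypothetical sketch of what that Scala side could look like (mirroring the PySpark code above; only the save path is taken from the report):
{code:scala}
// Sketch only: Scala counterpart of the PySpark pipeline above, producing the model
// at /tmp/model_cv_scala001 that is then loaded from PySpark.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val training = spark.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0),
  (4L, "b spark who", 1.0),
  (5L, "g d a y", 0.0),
  (6L, "spark fly", 1.0),
  (7L, "was mapreduce", 0.0)
)).toDF("id", "text", "label")

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()

val crossval = new CrossValidator()
  .setEstimator(pipeline)
  .setEstimatorParamMaps(paramGrid)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setNumFolds(2)  // use 3+ folds in practice

val cvModel = crossval.fit(training)
cvModel.save("/tmp/model_cv_scala001")  // later: CrossValidatorModel.load(...) from PySpark
{code}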



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31497) Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot save and load model

2020-04-20 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-31497:
---
Description: 
Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot save 
and load model.

Reproduce code run in pyspark shell:

1) Train model and save model in pyspark:
{code:python}

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, 
ParamGridBuilder

training = spark.createDataFrame([
(0, "a b c d e spark", 1.0),
(1, "b d", 0.0),
(2, "spark f g h", 1.0),
(3, "hadoop mapreduce", 0.0),
(4, "b spark who", 1.0),
(5, "g d a y", 0.0),
(6, "spark fly", 1.0),
(7, "was mapreduce", 0.0),
(8, "e spark program", 1.0),
(9, "a e c l", 0.0),
(10, "spark compile", 1.0),
(11, "hadoop software", 0.0)
], ["id", "text", "label"])

# Configure an ML pipeline, which consists of three stages: tokenizer, 
hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

paramGrid = ParamGridBuilder() \
.addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
.addGrid(lr.regParam, [0.1, 0.01]) \
.build()
crossval = CrossValidator(estimator=pipeline,
  estimatorParamMaps=paramGrid,
  evaluator=BinaryClassificationEvaluator(),
  numFolds=2)  # use 3+ folds in practice

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(training)

cvModel.save('/tmp/cv_model001') # save model failed. Raise error.
{code}

2): Train crossvalidation model in scala with similar code above, and save to 
'/tmp/model_cv_scala001', run following code in pyspark:
{code: python}
from pyspark.ml.tuning import CrossValidatorModel
CrossValidatorModel.load('/tmp/model_cv_scala001') # raise error
{code}


  was:
Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot save 
and load model.

Reproduce code run in pyspark shell:

1) Train model and save model in pyspark:
{code:python}

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, 
ParamGridBuilder

training = spark.createDataFrame([
(0, "a b c d e spark", 1.0),
(1, "b d", 0.0),
(2, "spark f g h", 1.0),
(3, "hadoop mapreduce", 0.0),
(4, "b spark who", 1.0),
(5, "g d a y", 0.0),
(6, "spark fly", 1.0),
(7, "was mapreduce", 0.0),
(8, "e spark program", 1.0),
(9, "a e c l", 0.0),
(10, "spark compile", 1.0),
(11, "hadoop software", 0.0)
], ["id", "text", "label"])

# Configure an ML pipeline, which consists of three stages: tokenizer, 
hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

paramGrid = ParamGridBuilder() \
.addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
.addGrid(lr.regParam, [0.1, 0.01]) \
.build()
crossval = CrossValidator(estimator=pipeline,
  estimatorParamMaps=paramGrid,
  evaluator=BinaryClassificationEvaluator(),
  numFolds=2)  # use 3+ folds in practice

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(training)

cvModel.save('/tmp/cv_model001') # save model failed. Raise error.
{python}

2): Train crossvalidation model in scala with similar code above, and save to 
'/tmp/model_cv_scala001', run following code in pyspark:
{code: python}
from pyspark.ml.tuning import CrossValidatorModel
CrossValidatorModel.load('/tmp/model_cv_scala001') # raise error
{code}



> Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot 
> save and load model
> --
>
> Key: SPARK-31497
> URL: https://issues.apache.org/jira/browse/SPARK-31497
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.5
>Reporter: Weichen Xu
>Priority: Major
>
> Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot 
> save and load model.
> Reproduce code run in pyspark shell:
> 1) Train model and save model in pyspark:
> 

[jira] [Updated] (SPARK-31497) Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot save and load model

2020-04-20 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-31497:
---
Description: 
Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot save 
and load model.

Reproduce code run in pyspark shell:

1) Train model and save model in pyspark:
{code:python}

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, 
ParamGridBuilder

training = spark.createDataFrame([
(0, "a b c d e spark", 1.0),
(1, "b d", 0.0),
(2, "spark f g h", 1.0),
(3, "hadoop mapreduce", 0.0),
(4, "b spark who", 1.0),
(5, "g d a y", 0.0),
(6, "spark fly", 1.0),
(7, "was mapreduce", 0.0),
(8, "e spark program", 1.0),
(9, "a e c l", 0.0),
(10, "spark compile", 1.0),
(11, "hadoop software", 0.0)
], ["id", "text", "label"])

# Configure an ML pipeline, which consists of three stages: tokenizer, 
hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

paramGrid = ParamGridBuilder() \
.addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
.addGrid(lr.regParam, [0.1, 0.01]) \
.build()
crossval = CrossValidator(estimator=pipeline,
  estimatorParamMaps=paramGrid,
  evaluator=BinaryClassificationEvaluator(),
  numFolds=2)  # use 3+ folds in practice

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(training)

cvModel.save('/tmp/cv_model001') # save model failed. Raise error.
{code}

2): Train crossvalidation model in scala with similar code above, and save to 
'/tmp/model_cv_scala001', run following code in pyspark:
{code:python}
from pyspark.ml.tuning import CrossValidatorModel
CrossValidatorModel.load('/tmp/model_cv_scala001') # raise error
{code}


  was:
Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot save 
and load model.

Reproduce code run in pyspark shell:

1) Train model and save model in pyspark:
{code:python}

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, 
ParamGridBuilder

training = spark.createDataFrame([
(0, "a b c d e spark", 1.0),
(1, "b d", 0.0),
(2, "spark f g h", 1.0),
(3, "hadoop mapreduce", 0.0),
(4, "b spark who", 1.0),
(5, "g d a y", 0.0),
(6, "spark fly", 1.0),
(7, "was mapreduce", 0.0),
(8, "e spark program", 1.0),
(9, "a e c l", 0.0),
(10, "spark compile", 1.0),
(11, "hadoop software", 0.0)
], ["id", "text", "label"])

# Configure an ML pipeline, which consists of three stages: tokenizer, 
hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

paramGrid = ParamGridBuilder() \
.addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
.addGrid(lr.regParam, [0.1, 0.01]) \
.build()
crossval = CrossValidator(estimator=pipeline,
  estimatorParamMaps=paramGrid,
  evaluator=BinaryClassificationEvaluator(),
  numFolds=2)  # use 3+ folds in practice

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(training)

cvModel.save('/tmp/cv_model001') # save model failed. Raise error.
{code}

2): Train crossvalidation model in scala with similar code above, and save to 
'/tmp/model_cv_scala001', run following code in pyspark:
{code: python}
from pyspark.ml.tuning import CrossValidatorModel
CrossValidatorModel.load('/tmp/model_cv_scala001') # raise error
{code}



> Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot 
> save and load model
> --
>
> Key: SPARK-31497
> URL: https://issues.apache.org/jira/browse/SPARK-31497
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.5
>Reporter: Weichen Xu
>Priority: Major
>
> Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot 
> save and load model.
> Reproduce code run in pyspark shell:
> 1) Train model and save model in pyspark:
> 

[jira] [Updated] (SPARK-31497) Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot save and load model

2020-04-20 Thread Weichen Xu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Weichen Xu updated SPARK-31497:
---
Description: 
Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot save 
and load model.

Reproduce code run in pyspark shell:

1) Train model and save model in pyspark:
{code:python}

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, 
ParamGridBuilder

training = spark.createDataFrame([
(0, "a b c d e spark", 1.0),
(1, "b d", 0.0),
(2, "spark f g h", 1.0),
(3, "hadoop mapreduce", 0.0),
(4, "b spark who", 1.0),
(5, "g d a y", 0.0),
(6, "spark fly", 1.0),
(7, "was mapreduce", 0.0),
(8, "e spark program", 1.0),
(9, "a e c l", 0.0),
(10, "spark compile", 1.0),
(11, "hadoop software", 0.0)
], ["id", "text", "label"])

# Configure an ML pipeline, which consists of three stages: tokenizer, 
hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

paramGrid = ParamGridBuilder() \
.addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
.addGrid(lr.regParam, [0.1, 0.01]) \
.build()
crossval = CrossValidator(estimator=pipeline,
  estimatorParamMaps=paramGrid,
  evaluator=BinaryClassificationEvaluator(),
  numFolds=2)  # use 3+ folds in practice

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(training)

cvModel.save('/tmp/cv_model001') # save model failed. Raise error.
{python}

2): Train crossvalidation model in scala with similar code above, and save to 
'/tmp/model_cv_scala001', run following code in pyspark:
{code: python}
from pyspark.ml.tuning import CrossValidatorModel
CrossValidatorModel.load('/tmp/model_cv_scala001') # raise error
{code}


  was:Pyspark CrossValidator/TrainValidationSplit with pipeline estimator 
cannot save and load model.


> Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot 
> save and load model
> --
>
> Key: SPARK-31497
> URL: https://issues.apache.org/jira/browse/SPARK-31497
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.4.5
>Reporter: Weichen Xu
>Priority: Major
>
> Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot 
> save and load model.
> Reproduce code run in pyspark shell:
> 1) Train model and save model in pyspark:
> {code:python}
> from pyspark.ml import Pipeline
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.evaluation import BinaryClassificationEvaluator
> from pyspark.ml.feature import HashingTF, Tokenizer
> from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, 
> ParamGridBuilder
> training = spark.createDataFrame([
> (0, "a b c d e spark", 1.0),
> (1, "b d", 0.0),
> (2, "spark f g h", 1.0),
> (3, "hadoop mapreduce", 0.0),
> (4, "b spark who", 1.0),
> (5, "g d a y", 0.0),
> (6, "spark fly", 1.0),
> (7, "was mapreduce", 0.0),
> (8, "e spark program", 1.0),
> (9, "a e c l", 0.0),
> (10, "spark compile", 1.0),
> (11, "hadoop software", 0.0)
> ], ["id", "text", "label"])
> # Configure an ML pipeline, which consists of three stages: tokenizer, 
> hashingTF, and lr.
> tokenizer = Tokenizer(inputCol="text", outputCol="words")
> hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
> lr = LogisticRegression(maxIter=10)
> pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
> paramGrid = ParamGridBuilder() \
> .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
> .addGrid(lr.regParam, [0.1, 0.01]) \
> .build()
> crossval = CrossValidator(estimator=pipeline,
>   estimatorParamMaps=paramGrid,
>   evaluator=BinaryClassificationEvaluator(),
>   numFolds=2)  # use 3+ folds in practice
> # Run cross-validation, and choose the best set of parameters.
> cvModel = crossval.fit(training)
> cvModel.save('/tmp/cv_model001') # save model failed. Raise error.
> {python}
> 2): Train crossvalidation model in scala with similar code above, and save to 
> '/tmp/model_cv_scala001', run following code in pyspark:
> {code: python}
> from pyspark.ml.tuning import CrossValidatorModel
> CrossValidatorModel.load('/tmp/model_cv_scala001') # raise error
> {code}



--

[jira] [Created] (SPARK-31497) Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot save and load model

2020-04-20 Thread Weichen Xu (Jira)
Weichen Xu created SPARK-31497:
--

 Summary: Pyspark CrossValidator/TrainValidationSplit with pipeline 
estimator cannot save and load model
 Key: SPARK-31497
 URL: https://issues.apache.org/jira/browse/SPARK-31497
 Project: Spark
  Issue Type: Bug
  Components: ML, PySpark
Affects Versions: 2.4.5
Reporter: Weichen Xu


Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot save 
and load model.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31496) Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError

2020-04-20 Thread Tomas Shestakov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomas Shestakov updated SPARK-31496:

Description: 
Local Spark with one core (local[1]) causes an OOM while trying to save a Dataset 
to a local parquet file.
{code:java}
SparkSession sparkSession = SparkSession.builder()
.appName("Loader impl test")
.master("local[1]")
.config("spark.ui.enabled", false)
.config("spark.sql.datetime.java8API.enabled", true)
.config("spark.serializer", 
"org.apache.spark.serializer.KryoSerializer")
.config("spark.kryoserializer.buffer.max", "1g")
.config("spark.executor.memory", "4g")
.config("spark.driver.memory", "8g")
.getOrCreate();

{code}
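
The OutOfMemoryError is raised while an in-memory byte array is grown during task serialization; the fuller copies of the trace in this thread also show ParallelCollectionPartition.writeObject and Utils.serializeViaNestedStream, i.e. the shipping of a driver-side collection into the single task of local[1]. The reporter's Dataset-building code is not shown, so the following is only a hypothetical sketch of the kind of code path that leads into those frames:
{code:scala}
// Hypothetical sketch (not the reporter's code): a Dataset built from a large
// driver-side collection ends up in a ParallelCollectionRDD, and the whole
// partition is serialized with Kryo into the task; the serialized bytes are
// accumulated in a ByteArrayOutputStream, which is where the OutOfMemoryError
// in the trace below is thrown.
import spark.implicits._

case class Rec(id: Long, payload: String)

val bigLocalSeq = (0L until 20000000L).map(i => Rec(i, "x" * 64))  // held on the driver
val ds = bigLocalSeq.toDS()

// Submitting the save serializes the single local[1] partition, collection included.
ds.write.mode("overwrite").parquet("/tmp/oom-repro")
{code}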
{noformat}
[20-Apr-2020 11:42:27.877]  INFO [boundedElastic-2 
o.a.s.s.e.datasources.parquet.ParquetFileFormat:57] q: - Using default output 
committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
[20-Apr-2020 11:42:27.877]  INFO [boundedElastic-2 
o.a.s.s.e.datasources.parquet.ParquetFileFormat:57] q: - Using default output 
committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
[20-Apr-2020 11:42:27.967]  INFO [boundedElastic-2 
o.a.h.mapreduce.lib.output.FileOutputCommitter:108] q: - File Output Committer 
Algorithm version is 1
[20-Apr-2020 11:42:27.969]  INFO [boundedElastic-2 
o.a.s.s.e.d.SQLHadoopMapReduceCommitProtocol:57] q: - Using user defined output 
committer class org.apache.parquet.hadoop.ParquetOutputCommitter
[20-Apr-2020 11:42:27.970]  INFO [boundedElastic-2 
o.a.h.mapreduce.lib.output.FileOutputCommitter:108] q: - File Output Committer 
Algorithm version is 1
[20-Apr-2020 11:42:27.973]  INFO [boundedElastic-2 
o.a.s.s.e.d.SQLHadoopMapReduceCommitProtocol:57] q: - Using output committer 
class org.apache.parquet.hadoop.ParquetOutputCommitter
[20-Apr-2020 11:42:34.371]  INFO [boundedElastic-2 
org.apache.spark.SparkContext:57] q: - Starting job: save at LoaderImpl.java:305
[20-Apr-2020 11:42:34.389]  INFO [dag-scheduler-event-loop 
org.apache.spark.scheduler.DAGScheduler:57] q: - Got job 0 (save at 
LoaderImpl.java:305) with 1 output partitions
[20-Apr-2020 11:42:34.390]  INFO [dag-scheduler-event-loop 
org.apache.spark.scheduler.DAGScheduler:57] q: - Final stage: ResultStage 0 
(save at LoaderImpl.java:305)
[20-Apr-2020 11:42:34.390]  INFO [dag-scheduler-event-loop 
org.apache.spark.scheduler.DAGScheduler:57] q: - Parents of final stage: List()
[20-Apr-2020 11:42:34.392]  INFO [dag-scheduler-event-loop 
org.apache.spark.scheduler.DAGScheduler:57] q: - Missing parents: 
List()
[20-Apr-2020 11:42:34.398]  INFO [dag-scheduler-event-loop 
org.apache.spark.scheduler.DAGScheduler:57] q: - Submitting ResultStage 0 
(MapPartitionsRDD[6] at save at LoaderImpl.java:305), which has no missing 
parents
[20-Apr-2020 11:42:34.634]  INFO [dag-scheduler-event-loop 
org.apache.spark.storage.memory.MemoryStore:57] q: - Block broadcast_0 stored 
as values in memory (estimated size 166.1 KiB, free 18.4 GiB)
[20-Apr-2020 11:42:34.945]  INFO [dag-scheduler-event-loop 
org.apache.spark.storage.memory.MemoryStore:57] q: - Block broadcast_0_piece0 
stored as bytes in memory (estimated size 58.0 KiB, free 18.4 GiB)
[20-Apr-2020 11:42:34.949]  INFO [dispatcher-BlockManagerMaster 
org.apache.spark.storage.BlockManagerInfo:57] q: - Added broadcast_0_piece0 in 
memory on DESKTOP-A1:58276 (size: 58.0 KiB, free: 18.4 GiB)
[20-Apr-2020 11:42:34.953]  INFO [dag-scheduler-event-loop 
org.apache.spark.SparkContext:57] q: - Created broadcast 0 from broadcast at 
DAGScheduler.scala:1206
[20-Apr-2020 11:42:34.980]  INFO [dag-scheduler-event-loop 
org.apache.spark.scheduler.DAGScheduler:57] q: - Submitting 1 missing tasks 
from ResultStage 0 (MapPartitionsRDD[6] at save at LoaderImpl.java:305) (first 
15 tasks are for partitions Vector(0))
[20-Apr-2020 11:42:34.981]  INFO [dag-scheduler-event-loop 
org.apache.spark.scheduler.TaskSchedulerImpl:57] q: - Adding task set 0.0 with 
1 tasks
Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError at 
java.base/java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:125)
 at 
java.base/java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:119) at 
java.base/java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:95)
 at 
java.base/java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:156) 
at 
org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
 at 
java.base/java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1859)
 at java.base/java.io.ObjectOutputStream.write(ObjectOutputStream.java:712) at 
org.apache.spark.util.Utils$$anon$2.write(Utils.scala:153) at 
com.esotericsoftware.kryo.io.Output.flush(Output.java:185) at 
com.esotericsoftware.kryo.io.Output.close(Output.java:196) at 

[jira] [Updated] (SPARK-31496) Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError

2020-04-20 Thread Tomas Shestakov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomas Shestakov updated SPARK-31496:

Description: 
Local Spark with one core (local[1]) causes an OOM while trying to save a Dataset 
to a local parquet file.
{code:java}
SparkSession sparkSession = SparkSession.builder()
.appName("Loader impl test")
.master("local[1]")
.config("spark.ui.enabled", false)
.config("spark.sql.datetime.java8API.enabled", true)
.config("spark.serializer", 
"org.apache.spark.serializer.KryoSerializer")
.config("spark.kryoserializer.buffer.max", "1g")
.config("spark.executor.memory", "4g")
.config("spark.driver.memory", "8g")
.getOrCreate();
{code}
{noformat}
[20-Apr-2020 11:42:27.877]  INFO [boundedElastic-2 
o.a.s.s.e.datasources.parquet.ParquetFileFormat:57] q: - Using default output 
committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
[20-Apr-2020 11:42:27.877]  INFO [boundedElastic-2 
o.a.s.s.e.datasources.parquet.ParquetFileFormat:57] q: - Using default output 
committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
[20-Apr-2020 11:42:27.967]  INFO [boundedElastic-2 
o.a.h.mapreduce.lib.output.FileOutputCommitter:108] q: - File Output Committer 
Algorithm version is 1
[20-Apr-2020 11:42:27.969]  INFO [boundedElastic-2 
o.a.s.s.e.d.SQLHadoopMapReduceCommitProtocol:57] q: - Using user defined output 
committer class org.apache.parquet.hadoop.ParquetOutputCommitter
[20-Apr-2020 11:42:27.970]  INFO [boundedElastic-2 
o.a.h.mapreduce.lib.output.FileOutputCommitter:108] q: - File Output Committer 
Algorithm version is 1
[20-Apr-2020 11:42:27.973]  INFO [boundedElastic-2 
o.a.s.s.e.d.SQLHadoopMapReduceCommitProtocol:57] q: - Using output committer 
class org.apache.parquet.hadoop.ParquetOutputCommitter
[20-Apr-2020 11:42:34.371]  INFO [boundedElastic-2 
org.apache.spark.SparkContext:57] q: - Starting job: save at LoaderImpl.java:305
[20-Apr-2020 11:42:34.389]  INFO [dag-scheduler-event-loop 
org.apache.spark.scheduler.DAGScheduler:57] q: - Got job 0 (save at 
LoaderImpl.java:305) with 1 output partitions
[20-Apr-2020 11:42:34.390]  INFO [dag-scheduler-event-loop 
org.apache.spark.scheduler.DAGScheduler:57] q: - Final stage: ResultStage 0 
(save at LoaderImpl.java:305)
[20-Apr-2020 11:42:34.390]  INFO [dag-scheduler-event-loop 
org.apache.spark.scheduler.DAGScheduler:57] q: - Parents of final stage: List()
[20-Apr-2020 11:42:34.392]  INFO [dag-scheduler-event-loop 
org.apache.spark.scheduler.DAGScheduler:57] q: - Missing parents: 
List()
[20-Apr-2020 11:42:34.398]  INFO [dag-scheduler-event-loop 
org.apache.spark.scheduler.DAGScheduler:57] q: - Submitting ResultStage 0 
(MapPartitionsRDD[6] at save at LoaderImpl.java:305), which has no missing 
parents
[20-Apr-2020 11:42:34.634]  INFO [dag-scheduler-event-loop 
org.apache.spark.storage.memory.MemoryStore:57] q: - Block broadcast_0 stored 
as values in memory (estimated size 166.1 KiB, free 18.4 GiB)
[20-Apr-2020 11:42:34.945]  INFO [dag-scheduler-event-loop 
org.apache.spark.storage.memory.MemoryStore:57] q: - Block broadcast_0_piece0 
stored as bytes in memory (estimated size 58.0 KiB, free 18.4 GiB)
[20-Apr-2020 11:42:34.949]  INFO [dispatcher-BlockManagerMaster 
org.apache.spark.storage.BlockManagerInfo:57] q: - Added broadcast_0_piece0 in 
memory on DESKTOP-A1:58276 (size: 58.0 KiB, free: 18.4 GiB)
[20-Apr-2020 11:42:34.953]  INFO [dag-scheduler-event-loop 
org.apache.spark.SparkContext:57] q: - Created broadcast 0 from broadcast at 
DAGScheduler.scala:1206
[20-Apr-2020 11:42:34.980]  INFO [dag-scheduler-event-loop 
org.apache.spark.scheduler.DAGScheduler:57] q: - Submitting 1 missing tasks 
from ResultStage 0 (MapPartitionsRDD[6] at save at LoaderImpl.java:305) (first 
15 tasks are for partitions Vector(0))
[20-Apr-2020 11:42:34.981]  INFO [dag-scheduler-event-loop 
org.apache.spark.scheduler.TaskSchedulerImpl:57] q: - Adding task set 0.0 with 
1 tasks
Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError at 
java.base/java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:125)
 at 
java.base/java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:119) at 
java.base/java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:95)
 at 
java.base/java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:156) 
at 
org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
 at 
java.base/java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1859)
 at java.base/java.io.ObjectOutputStream.write(ObjectOutputStream.java:712) at 
org.apache.spark.util.Utils$$anon$2.write(Utils.scala:153) at 
com.esotericsoftware.kryo.io.Output.flush(Output.java:185) at 
com.esotericsoftware.kryo.io.Output.close(Output.java:196) at 

[jira] [Updated] (SPARK-31496) Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError

2020-04-20 Thread Tomas Shestakov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomas Shestakov updated SPARK-31496:

Description: 
Local Spark with one core (local[1]) causes an OOM while trying to save a Dataset 
to a local parquet file.
{noformat}
[20-Apr-2020 11:42:27.877]  INFO [boundedElastic-2 
o.a.s.s.e.datasources.parquet.ParquetFileFormat:57] q: - Using default output 
committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
[20-Apr-2020 11:42:27.877]  INFO [boundedElastic-2 
o.a.s.s.e.datasources.parquet.ParquetFileFormat:57] q: - Using default output 
committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
[20-Apr-2020 11:42:27.967]  INFO [boundedElastic-2 
o.a.h.mapreduce.lib.output.FileOutputCommitter:108] q: - File Output Committer 
Algorithm version is 1
[20-Apr-2020 11:42:27.969]  INFO [boundedElastic-2 
o.a.s.s.e.d.SQLHadoopMapReduceCommitProtocol:57] q: - Using user defined output 
committer class org.apache.parquet.hadoop.ParquetOutputCommitter
[20-Apr-2020 11:42:27.970]  INFO [boundedElastic-2 
o.a.h.mapreduce.lib.output.FileOutputCommitter:108] q: - File Output Committer 
Algorithm version is 1
[20-Apr-2020 11:42:27.973]  INFO [boundedElastic-2 
o.a.s.s.e.d.SQLHadoopMapReduceCommitProtocol:57] q: - Using output committer 
class org.apache.parquet.hadoop.ParquetOutputCommitter
[20-Apr-2020 11:42:34.371]  INFO [boundedElastic-2 
org.apache.spark.SparkContext:57] q: - Starting job: save at LoaderImpl.java:305
[20-Apr-2020 11:42:34.389]  INFO [dag-scheduler-event-loop 
org.apache.spark.scheduler.DAGScheduler:57] q: - Got job 0 (save at 
LoaderImpl.java:305) with 1 output partitions
[20-Apr-2020 11:42:34.390]  INFO [dag-scheduler-event-loop 
org.apache.spark.scheduler.DAGScheduler:57] q: - Final stage: ResultStage 0 
(save at LoaderImpl.java:305)
[20-Apr-2020 11:42:34.390]  INFO [dag-scheduler-event-loop 
org.apache.spark.scheduler.DAGScheduler:57] q: - Parents of final stage: List()
[20-Apr-2020 11:42:34.392]  INFO [dag-scheduler-event-loop 
org.apache.spark.scheduler.DAGScheduler:57] q: - Missing parents: 
List()
[20-Apr-2020 11:42:34.398]  INFO [dag-scheduler-event-loop 
org.apache.spark.scheduler.DAGScheduler:57] q: - Submitting ResultStage 0 
(MapPartitionsRDD[6] at save at LoaderImpl.java:305), which has no missing 
parents
[20-Apr-2020 11:42:34.634]  INFO [dag-scheduler-event-loop 
org.apache.spark.storage.memory.MemoryStore:57] q: - Block broadcast_0 stored 
as values in memory (estimated size 166.1 KiB, free 18.4 GiB)
[20-Apr-2020 11:42:34.945]  INFO [dag-scheduler-event-loop 
org.apache.spark.storage.memory.MemoryStore:57] q: - Block broadcast_0_piece0 
stored as bytes in memory (estimated size 58.0 KiB, free 18.4 GiB)
[20-Apr-2020 11:42:34.949]  INFO [dispatcher-BlockManagerMaster 
org.apache.spark.storage.BlockManagerInfo:57] q: - Added broadcast_0_piece0 in 
memory on DESKTOP-A1:58276 (size: 58.0 KiB, free: 18.4 GiB)
[20-Apr-2020 11:42:34.953]  INFO [dag-scheduler-event-loop 
org.apache.spark.SparkContext:57] q: - Created broadcast 0 from broadcast at 
DAGScheduler.scala:1206
[20-Apr-2020 11:42:34.980]  INFO [dag-scheduler-event-loop 
org.apache.spark.scheduler.DAGScheduler:57] q: - Submitting 1 missing tasks 
from ResultStage 0 (MapPartitionsRDD[6] at save at LoaderImpl.java:305) (first 
15 tasks are for partitions Vector(0))
[20-Apr-2020 11:42:34.981]  INFO [dag-scheduler-event-loop 
org.apache.spark.scheduler.TaskSchedulerImpl:57] q: - Adding task set 0.0 with 
1 tasks
Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError at 
java.base/java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:125)
 at 
java.base/java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:119) at 
java.base/java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:95)
 at 
java.base/java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:156) 
at 
org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
 at 
java.base/java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1859)
 at java.base/java.io.ObjectOutputStream.write(ObjectOutputStream.java:712) at 
org.apache.spark.util.Utils$$anon$2.write(Utils.scala:153) at 
com.esotericsoftware.kryo.io.Output.flush(Output.java:185) at 
com.esotericsoftware.kryo.io.Output.close(Output.java:196) at 
org.apache.spark.serializer.KryoSerializationStream.close(KryoSerializer.scala:273)
 at org.apache.spark.util.Utils$.serializeViaNestedStream(Utils.scala:158) at 
org.apache.spark.rdd.ParallelCollectionPartition.$anonfun$writeObject$1(ParallelCollectionRDD.scala:65)
 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1343) at 
org.apache.spark.rdd.ParallelCollectionPartition.writeObject(ParallelCollectionRDD.scala:51)
 at 

[jira] [Updated] (SPARK-31496) Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError

2020-04-20 Thread Tomas Shestakov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomas Shestakov updated SPARK-31496:

Description: 
Local Spark with one core (local[1]) causes an OOM while trying to save a Dataset 
to a local parquet file.
{noformat}
[20-Apr-2020 11:42:27.877]  INFO [boundedElastic-2 
o.a.s.s.e.datasources.parquet.ParquetFileFormat:57] q: - Using default output 
committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
[20-Apr-2020 11:42:27.877]  INFO [boundedElastic-2 
o.a.s.s.e.datasources.parquet.ParquetFileFormat:57] q: - Using default output 
committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
[20-Apr-2020 11:42:27.967]  INFO [boundedElastic-2 
o.a.h.mapreduce.lib.output.FileOutputCommitter:108] q: - File Output Committer 
Algorithm version is 1
[20-Apr-2020 11:42:27.969]  INFO [boundedElastic-2 
o.a.s.s.e.d.SQLHadoopMapReduceCommitProtocol:57] q: - Using user defined output 
committer class org.apache.parquet.hadoop.ParquetOutputCommitter
[20-Apr-2020 11:42:27.970]  INFO [boundedElastic-2 
o.a.h.mapreduce.lib.output.FileOutputCommitter:108] q: - File Output Committer 
Algorithm version is 1
[20-Apr-2020 11:42:27.973]  INFO [boundedElastic-2 
o.a.s.s.e.d.SQLHadoopMapReduceCommitProtocol:57] q: - Using output committer 
class org.apache.parquet.hadoop.ParquetOutputCommitter
[20-Apr-2020 11:42:34.371]  INFO [boundedElastic-2 
org.apache.spark.SparkContext:57] q: - Starting job: save at LoaderImpl.java:305
[20-Apr-2020 11:42:34.389]  INFO [dag-scheduler-event-loop 
org.apache.spark.scheduler.DAGScheduler:57] q: - Got job 0 (save at 
LoaderImpl.java:305) with 1 output partitions
[20-Apr-2020 11:42:34.390]  INFO [dag-scheduler-event-loop 
org.apache.spark.scheduler.DAGScheduler:57] q: - Final stage: ResultStage 0 
(save at LoaderImpl.java:305)
[20-Apr-2020 11:42:34.390]  INFO [dag-scheduler-event-loop 
org.apache.spark.scheduler.DAGScheduler:57] q: - Parents of final stage: List()
[20-Apr-2020 11:42:34.392]  INFO [dag-scheduler-event-loop 
org.apache.spark.scheduler.DAGScheduler:57] q: - Missing parents: 
List()
[20-Apr-2020 11:42:34.398]  INFO [dag-scheduler-event-loop 
org.apache.spark.scheduler.DAGScheduler:57] q: - Submitting ResultStage 0 
(MapPartitionsRDD[6] at save at LoaderImpl.java:305), which has no missing 
parents
[20-Apr-2020 11:42:34.634]  INFO [dag-scheduler-event-loop 
org.apache.spark.storage.memory.MemoryStore:57] q: - Block broadcast_0 stored 
as values in memory (estimated size 166.1 KiB, free 18.4 GiB)
[20-Apr-2020 11:42:34.945]  INFO [dag-scheduler-event-loop 
org.apache.spark.storage.memory.MemoryStore:57] q: - Block broadcast_0_piece0 
stored as bytes in memory (estimated size 58.0 KiB, free 18.4 GiB)
[20-Apr-2020 11:42:34.949]  INFO [dispatcher-BlockManagerMaster 
org.apache.spark.storage.BlockManagerInfo:57] q: - Added broadcast_0_piece0 in 
memory on DESKTOP-A1:58276 (size: 58.0 KiB, free: 18.4 GiB)
[20-Apr-2020 11:42:34.953]  INFO [dag-scheduler-event-loop 
org.apache.spark.SparkContext:57] q: - Created broadcast 0 from broadcast at 
DAGScheduler.scala:1206
[20-Apr-2020 11:42:34.980]  INFO [dag-scheduler-event-loop 
org.apache.spark.scheduler.DAGScheduler:57] q: - Submitting 1 missing tasks 
from ResultStage 0 (MapPartitionsRDD[6] at save at LoaderImpl.java:305) (first 
15 tasks are for partitions Vector(0))
[20-Apr-2020 11:42:34.981]  INFO [dag-scheduler-event-loop 
org.apache.spark.scheduler.TaskSchedulerImpl:57] q: - Adding task set 0.0 with 
1 tasks
Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError at 
java.base/java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:125)
 at 
java.base/java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:119) at 
java.base/java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:95)
 at 
java.base/java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:156) 
at 
org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
 at 
java.base/java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1859)
 at java.base/java.io.ObjectOutputStream.write(ObjectOutputStream.java:712) at 
org.apache.spark.util.Utils$$anon$2.write(Utils.scala:153) at 
com.esotericsoftware.kryo.io.Output.flush(Output.java:185) at 
com.esotericsoftware.kryo.io.Output.close(Output.java:196) at 
org.apache.spark.serializer.KryoSerializationStream.close(KryoSerializer.scala:273)
 at org.apache.spark.util.Utils$.serializeViaNestedStream(Utils.scala:158) at 
org.apache.spark.rdd.ParallelCollectionPartition.$anonfun$writeObject$1(ParallelCollectionRDD.scala:65)
 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1343) at 
org.apache.spark.rdd.ParallelCollectionPartition.writeObject(ParallelCollectionRDD.scala:51)
 at 

[jira] [Created] (SPARK-31496) Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError

2020-04-20 Thread Tomas Shestakov (Jira)
Tomas Shestakov created SPARK-31496:
---

 Summary: Exception in thread "dispatcher-event-loop-1" 
java.lang.OutOfMemoryError
 Key: SPARK-31496
 URL: https://issues.apache.org/jira/browse/SPARK-31496
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
 Environment: Windows 10 (1909)

JDK 11.0.6

spark-3.0.0-preview2-bin-hadoop3.2

local[1]

 

 
Reporter: Tomas Shestakov


Local Spark with one core (local[1]) causes an OOM while trying to save a Dataset 
to a local parquet file.
{noformat}
[20-Apr-2020 11:42:27.877]  INFO [boundedElastic-2 o.a.s.s.e.datasources.parquet.ParquetFileFormat:57] q: - Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
[20-Apr-2020 11:42:27.877]  INFO [boundedElastic-2 o.a.s.s.e.datasources.parquet.ParquetFileFormat:57] q: - Using default output committer for Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
[20-Apr-2020 11:42:27.967]  INFO [boundedElastic-2 o.a.h.mapreduce.lib.output.FileOutputCommitter:108] q: - File Output Committer Algorithm version is 1
[20-Apr-2020 11:42:27.969]  INFO [boundedElastic-2 o.a.s.s.e.d.SQLHadoopMapReduceCommitProtocol:57] q: - Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
[20-Apr-2020 11:42:27.970]  INFO [boundedElastic-2 o.a.h.mapreduce.lib.output.FileOutputCommitter:108] q: - File Output Committer Algorithm version is 1
[20-Apr-2020 11:42:27.973]  INFO [boundedElastic-2 o.a.s.s.e.d.SQLHadoopMapReduceCommitProtocol:57] q: - Using output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
[20-Apr-2020 11:42:34.371]  INFO [boundedElastic-2 org.apache.spark.SparkContext:57] q: - Starting job: save at LoaderImpl.java:305
[20-Apr-2020 11:42:34.389]  INFO [dag-scheduler-event-loop org.apache.spark.scheduler.DAGScheduler:57] q: - Got job 0 (save at LoaderImpl.java:305) with 1 output partitions
[20-Apr-2020 11:42:34.390]  INFO [dag-scheduler-event-loop org.apache.spark.scheduler.DAGScheduler:57] q: - Final stage: ResultStage 0 (save at LoaderImpl.java:305)
[20-Apr-2020 11:42:34.390]  INFO [dag-scheduler-event-loop org.apache.spark.scheduler.DAGScheduler:57] q: - Parents of final stage: List()
[20-Apr-2020 11:42:34.392]  INFO [dag-scheduler-event-loop org.apache.spark.scheduler.DAGScheduler:57] q: - Missing parents: List()
[20-Apr-2020 11:42:34.398]  INFO [dag-scheduler-event-loop org.apache.spark.scheduler.DAGScheduler:57] q: - Submitting ResultStage 0 (MapPartitionsRDD[6] at save at LoaderImpl.java:305), which has no missing parents
[20-Apr-2020 11:42:34.634]  INFO [dag-scheduler-event-loop org.apache.spark.storage.memory.MemoryStore:57] q: - Block broadcast_0 stored as values in memory (estimated size 166.1 KiB, free 18.4 GiB)
[20-Apr-2020 11:42:34.945]  INFO [dag-scheduler-event-loop org.apache.spark.storage.memory.MemoryStore:57] q: - Block broadcast_0_piece0 stored as bytes in memory (estimated size 58.0 KiB, free 18.4 GiB)
[20-Apr-2020 11:42:34.949]  INFO [dispatcher-BlockManagerMaster org.apache.spark.storage.BlockManagerInfo:57] q: - Added broadcast_0_piece0 in memory on DESKTOP-A1:58276 (size: 58.0 KiB, free: 18.4 GiB)
[20-Apr-2020 11:42:34.953]  INFO [dag-scheduler-event-loop org.apache.spark.SparkContext:57] q: - Created broadcast 0 from broadcast at DAGScheduler.scala:1206
[20-Apr-2020 11:42:34.980]  INFO [dag-scheduler-event-loop org.apache.spark.scheduler.DAGScheduler:57] q: - Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[6] at save at LoaderImpl.java:305) (first 15 tasks are for partitions Vector(0))
[20-Apr-2020 11:42:34.981]  INFO [dag-scheduler-event-loop org.apache.spark.scheduler.TaskSchedulerImpl:57] q: - Adding task set 0.0 with 1 tasks
Exception in thread "dispatcher-event-loop-1" java.lang.OutOfMemoryError
 at java.base/java.io.ByteArrayOutputStream.hugeCapacity(ByteArrayOutputStream.java:125)
 at java.base/java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:119)
 at java.base/java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:95)
 at java.base/java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:156)
 at org.apache.spark.util.ByteBufferOutputStream.write(ByteBufferOutputStream.scala:41)
 at java.base/java.io.ObjectOutputStream$BlockDataOutputStream.write(ObjectOutputStream.java:1859)
 at java.base/java.io.ObjectOutputStream.write(ObjectOutputStream.java:712)
 at org.apache.spark.util.Utils$$anon$2.write(Utils.scala:153)
 at com.esotericsoftware.kryo.io.Output.flush(Output.java:185)
 at com.esotericsoftware.kryo.io.Output.close(Output.java:196)
 at org.apache.spark.serializer.KryoSerializationStream.close(KryoSerializer.scala:273)
 at org.apache.spark.util.Utils$.serializeViaNestedStream(Utils.scala:158)
 at 

[jira] [Created] (SPARK-31495) Support formatted explain for Adaptive Query Execution

2020-04-20 Thread wuyi (Jira)
wuyi created SPARK-31495:


 Summary: Support formatted explain for Adaptive Query Execution
 Key: SPARK-31495
 URL: https://issues.apache.org/jira/browse/SPARK-31495
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: wuyi


Support formatted explain for AQE
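
For reference, a minimal sketch of the explain mode in question (Dataset.explain(mode) is available in 3.0; whether the adaptive plan shows up in this mode is exactly what the ticket asks for):
{code:scala}
// Sketch: request the formatted explain output for a query that AQE would re-optimize.
import org.apache.spark.sql.functions.col

spark.conf.set("spark.sql.adaptive.enabled", "true")
val df = spark.range(1000).groupBy((col("id") % 10).as("k")).count()
df.explain("formatted")  // modes: simple, extended, codegen, cost, formatted
{code}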



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-28367) Kafka connector infinite wait because metadata never updated

2020-04-20 Thread Gabor Somogyi (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087498#comment-17087498
 ] 

Gabor Somogyi commented on SPARK-28367:
---

It is unblocked and I've started to work on this...

> Kafka connector infinite wait because metadata never updated
> 
>
> Key: SPARK-28367
> URL: https://issues.apache.org/jira/browse/SPARK-28367
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.3, 2.2.3, 2.3.3, 2.4.3, 3.0.0
>Reporter: Gabor Somogyi
>Priority: Critical
>
> Spark uses an old and deprecated API named poll(long) which never returns and 
> stays in a livelock if the metadata is not updated (for instance, when the 
> broker disappears at consumer creation).
> I've created a small standalone application to test it and the alternatives: 
> https://github.com/gaborgsomogyi/kafka-get-assignment
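
For context, a sketch of the API difference the description refers to, using a standalone KafkaConsumer (not Spark's internals; broker address and topic are placeholders):
{code:scala}
// Sketch: the deprecated poll(long) can block indefinitely while waiting for
// metadata, whereas poll(Duration) (Kafka 2.0+) bounds the total wait and simply
// returns an empty batch on timeout.
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")  // placeholder broker
props.put("group.id", "probe")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("topic"))  // placeholder topic

// consumer.poll(0L)                                // deprecated; may never return
val records = consumer.poll(Duration.ofSeconds(5))  // bounded wait
consumer.close()
{code}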



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24432) Add support for dynamic resource allocation

2020-04-20 Thread Shuai Ma (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17087451#comment-17087451
 ] 

Shuai Ma commented on SPARK-24432:
--

looking forward to seeing this feature come true...

> Add support for dynamic resource allocation
> ---
>
> Key: SPARK-24432
> URL: https://issues.apache.org/jira/browse/SPARK-24432
> Project: Spark
>  Issue Type: New Feature
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Yinan Li
>Priority: Major
>
> This is an umbrella ticket for work on adding support for dynamic resource 
> allocation into the Kubernetes mode. This requires a Kubernetes-specific 
> external shuffle service. The feature is available in our fork at 
> github.com/apache-spark-on-k8s/spark.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31490) Benchmark conversions to/from Java 8 date-time types

2020-04-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31490.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28263
[https://github.com/apache/spark/pull/28263]

> Benchmark conversions to/from Java 8 date-time types
> 
>
> Key: SPARK-31490
> URL: https://issues.apache.org/jira/browse/SPARK-31490
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> DATE and TIMESTAMP column values can be converted to java.sql.Date and 
> java.sql.Timestamp (by default), or to Java 8 date-time types 
> java.time.LocalDate and java.time.Instant when 
> spark.sql.datetime.java8API.enabled is set to true. DateTimeBenchmarks misses 
> benchmarks of Java 8 date/timestamps. The ticket aims to fix that. 
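
For reference, a small sketch of the toggle being benchmarked (the collected value types in the comments reflect the behaviour described above):
{code:scala}
// Sketch: the same DATE value collects as different Java types depending on the flag.
spark.conf.set("spark.sql.datetime.java8API.enabled", "false")
spark.sql("SELECT DATE '2020-04-20' AS d").collect().head.get(0)  // java.sql.Date

spark.conf.set("spark.sql.datetime.java8API.enabled", "true")
spark.sql("SELECT DATE '2020-04-20' AS d").collect().head.get(0)  // java.time.LocalDate
{code}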



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31490) Benchmark conversions to/from Java 8 date-time types

2020-04-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31490:
---

Assignee: Maxim Gekk

> Benchmark conversions to/from Java 8 date-time types
> 
>
> Key: SPARK-31490
> URL: https://issues.apache.org/jira/browse/SPARK-31490
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 3.0.0, 3.1.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Minor
>
> DATE and TIMESTAMP column values can be converted to java.sql.Date and 
> java.sql.Timestamp (by default), or to Java 8 date-time types 
> java.time.LocalDate and java.time.Instant when 
> spark.sql.datetime.java8API.enabled is set to true. DateTimeBenchmarks misses 
> benchmarks of Java 8 date/timestamps. The ticket aims to fix that. 






[jira] [Assigned] (SPARK-31385) Results of Julian-Gregorian rebasing don't match to Gregorian-Julian rebasing

2020-04-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31385:
---

Assignee: Maxim Gekk

> Results of Julian-Gregorian rebasing don't match to Gregorian-Julian rebasing
> -
>
> Key: SPARK-31385
> URL: https://issues.apache.org/jira/browse/SPARK-31385
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> Microseconds rebasing from the hybrid calendar (Julian + Gregorian) to the 
> Proleptic Gregorian calendar is not symmetric to the opposite conversion for the 
> following time zones:
>  # Asia/Tehran
>  # Iran
>  # Africa/Casablanca
>  # Africa/El_Aaiun
> Here are the results from https://github.com/apache/spark/pull/28119:
> Julian -> Gregorian:
> {code:json}
> , {
>   "tz" : "Asia/Tehran",
>   "switches" : [ -62135782200, -59006460600, -55850700600, -52694940600, 
> -46383420600, -43227660600, -40071900600, -33760380600, -30604620600, 
> -27448860600, -21137340600, -17981580600, -14825820600, -12219305400, 
> -2208988800, 2547315000, 2547401400 ],
>   "diffs" : [ 173056, 86656, 256, -86144, -172544, -258944, -345344, -431744, 
> -518144, -604544, -690944, -777344, -863744, 256, 0, -3600, 0 ]
> }, {
>   "tz" : "Iran",
>   "switches" : [ -62135782200, -59006460600, -55850700600, -52694940600, 
> -46383420600, -43227660600, -40071900600, -33760380600, -30604620600, 
> -27448860600, -21137340600, -17981580600, -14825820600, -12219305400, 
> -2208988800, 2547315000, 2547401400 ],
>   "diffs" : [ 173056, 86656, 256, -86144, -172544, -258944, -345344, -431744, 
> -518144, -604544, -690944, -777344, -863744, 256, 0, -3600, 0 ]
> }, {
>   "tz" : "Africa/Casablanca",
>   "switches" : [ -62135769600, -59006448000, -55850688000, -52694928000, 
> -46383408000, -43227648000, -40071888000, -33760368000, -30604608000, 
> -27448848000, -21137328000, -17981568000, -14825808000, -12219292800, 
> -2208988800, 2141866800, 2169079200, 2172106800, 2199924000, 2202951600, 
> 2230164000, 2233796400, 2261008800, 2264036400, 2291248800, 2294881200, 
> 2322093600, 2325121200, 2352938400, 2355966000, 2383178400, 2386810800, 
> 2414023200, 2417050800, 2444868000, 2447895600, 2475108000, 2478740400, 
> 2505952800, 2508980400, 2536192800, 2539825200, 2567037600, 2570065200, 
> 2597882400, 260091, 2628122400, 2631754800, 2658967200, 2661994800, 
> 2689812000, 2692839600, 2720052000, 2723684400, 2750896800, 2753924400, 
> 2781136800, 2784769200, 2811981600, 2815009200, 2842826400, 2845854000, 
> 2873066400, 2876698800, 2903911200, 2906938800, 2934756000, 2937783600, 
> 2964996000, 2968023600, 2995840800, 2998868400, 3026080800, 3029713200, 
> 3056925600, 3059953200, 3087770400, 3090798000, 3118010400, 3121642800, 
> 3148855200, 3151882800, 317970, 3182727600, 320994, 3212967600, 
> 3240784800, 3243812400, 3271024800, 3274657200, 3301869600, 3304897200, 
> 3332714400, 3335742000, 3362954400, 3366586800, 3393799200, 3396826800, 
> 3424644000, 3427671600, 3454884000, 3457911600, 3485728800, 3488756400, 
> 3515968800, 3519601200, 3546813600, 3549841200, 3577658400, 3580686000, 
> 3607898400, 3611530800, 3638743200, 3641770800, 3669588000, 3672615600, 
> 3699828000, 3702855600 ],
>   "diffs" : [ 174620, 88220, 1820, -84580, -170980, -257380, -343780, 
> -430180, -516580, -602980, -689380, -775780, -862180, 1820, 0, -3600, 0, 
> -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 
> 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, 
> -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 
> 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, 
> -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 
> 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, 
> -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600 ]
> }, {
>   "tz" : "Africa/El_Aaiun",
>   "switches" : [ -62135769600, -59006448000, -55850688000, -52694928000, 
> -46383408000, -43227648000, -40071888000, -33760368000, -30604608000, 
> -27448848000, -21137328000, -17981568000, -14825808000, -12219292800, 
> -2208988800, 2141866800, 2169079200, 2172106800, 2199924000, 2202951600, 
> 2230164000, 2233796400, 2261008800, 2264036400, 2291248800, 2294881200, 
> 2322093600, 2325121200, 2352938400, 2355966000, 2383178400, 2386810800, 
> 2414023200, 2417050800, 2444868000, 2447895600, 2475108000, 2478740400, 
> 2505952800, 2508980400, 2536192800, 2539825200, 2567037600, 2570065200, 
> 2597882400, 260091, 2628122400, 2631754800, 2658967200, 2661994800, 
> 2689812000, 2692839600, 2720052000, 2723684400, 2750896800, 2753924400, 
> 2781136800, 
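
A minimal Scala sketch of the underlying mismatch (not Spark's actual rebasing code; the zone and sample timestamp are chosen only for illustration): the same wall-clock time maps to different instants under the hybrid java.util.GregorianCalendar and the proleptic java.time calendar, which is the kind of per-zone gap tabulated in the "diffs" arrays above.

{code:scala}
import java.time.{LocalDateTime, ZoneId}
import java.util.{GregorianCalendar, TimeZone}

// Sketch only: compare the instant a wall-clock time maps to under the
// hybrid Julian+Gregorian calendar vs. the proleptic Gregorian calendar
// used by java.time.
val zone = ZoneId.of("Asia/Tehran")

def hybridMicros(ldt: LocalDateTime): Long = {
  val cal = new GregorianCalendar(TimeZone.getTimeZone(zone))
  cal.clear()
  cal.set(ldt.getYear, ldt.getMonthValue - 1, ldt.getDayOfMonth,
          ldt.getHour, ldt.getMinute, ldt.getSecond)
  cal.getTimeInMillis * 1000L
}

def prolepticMicros(ldt: LocalDateTime): Long =
  ldt.atZone(zone).toInstant.toEpochMilli * 1000L

// A date before the 1582-10-15 Gregorian cutover: the two calendars
// disagree here, so Julian -> Gregorian rebasing must apply a non-zero diff.
val ldt = LocalDateTime.of(1582, 10, 4, 0, 0, 0)
println(prolepticMicros(ldt) - hybridMicros(ldt))
{code}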

[jira] [Resolved] (SPARK-31385) Results of Julian-Gregorian rebasing don't match to Gregorian-Julian rebasing

2020-04-20 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31385.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28253
[https://github.com/apache/spark/pull/28253]

> Results of Julian-Gregorian rebasing don't match to Gregorian-Julian rebasing
> -
>
> Key: SPARK-31385
> URL: https://issues.apache.org/jira/browse/SPARK-31385
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> Microseconds rebasing from the hybrid calendar (Julian + Gregorian) to the 
> Proleptic Gregorian calendar is not symmetric to the opposite conversion for the 
> following time zones:
>  # Asia/Tehran
>  # Iran
>  # Africa/Casablanca
>  # Africa/El_Aaiun
> Here are the results from https://github.com/apache/spark/pull/28119:
> Julian -> Gregorian:
> {code:json}
> , {
>   "tz" : "Asia/Tehran",
>   "switches" : [ -62135782200, -59006460600, -55850700600, -52694940600, 
> -46383420600, -43227660600, -40071900600, -33760380600, -30604620600, 
> -27448860600, -21137340600, -17981580600, -14825820600, -12219305400, 
> -2208988800, 2547315000, 2547401400 ],
>   "diffs" : [ 173056, 86656, 256, -86144, -172544, -258944, -345344, -431744, 
> -518144, -604544, -690944, -777344, -863744, 256, 0, -3600, 0 ]
> }, {
>   "tz" : "Iran",
>   "switches" : [ -62135782200, -59006460600, -55850700600, -52694940600, 
> -46383420600, -43227660600, -40071900600, -33760380600, -30604620600, 
> -27448860600, -21137340600, -17981580600, -14825820600, -12219305400, 
> -2208988800, 2547315000, 2547401400 ],
>   "diffs" : [ 173056, 86656, 256, -86144, -172544, -258944, -345344, -431744, 
> -518144, -604544, -690944, -777344, -863744, 256, 0, -3600, 0 ]
> }, {
>   "tz" : "Africa/Casablanca",
>   "switches" : [ -62135769600, -59006448000, -55850688000, -52694928000, 
> -46383408000, -43227648000, -40071888000, -33760368000, -30604608000, 
> -27448848000, -21137328000, -17981568000, -14825808000, -12219292800, 
> -2208988800, 2141866800, 2169079200, 2172106800, 2199924000, 2202951600, 
> 2230164000, 2233796400, 2261008800, 2264036400, 2291248800, 2294881200, 
> 2322093600, 2325121200, 2352938400, 2355966000, 2383178400, 2386810800, 
> 2414023200, 2417050800, 2444868000, 2447895600, 2475108000, 2478740400, 
> 2505952800, 2508980400, 2536192800, 2539825200, 2567037600, 2570065200, 
> 2597882400, 260091, 2628122400, 2631754800, 2658967200, 2661994800, 
> 2689812000, 2692839600, 2720052000, 2723684400, 2750896800, 2753924400, 
> 2781136800, 2784769200, 2811981600, 2815009200, 2842826400, 2845854000, 
> 2873066400, 2876698800, 2903911200, 2906938800, 2934756000, 2937783600, 
> 2964996000, 2968023600, 2995840800, 2998868400, 3026080800, 3029713200, 
> 3056925600, 3059953200, 3087770400, 3090798000, 3118010400, 3121642800, 
> 3148855200, 3151882800, 317970, 3182727600, 320994, 3212967600, 
> 3240784800, 3243812400, 3271024800, 3274657200, 3301869600, 3304897200, 
> 3332714400, 3335742000, 3362954400, 3366586800, 3393799200, 3396826800, 
> 3424644000, 3427671600, 3454884000, 3457911600, 3485728800, 3488756400, 
> 3515968800, 3519601200, 3546813600, 3549841200, 3577658400, 3580686000, 
> 3607898400, 3611530800, 3638743200, 3641770800, 3669588000, 3672615600, 
> 3699828000, 3702855600 ],
>   "diffs" : [ 174620, 88220, 1820, -84580, -170980, -257380, -343780, 
> -430180, -516580, -602980, -689380, -775780, -862180, 1820, 0, -3600, 0, 
> -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 
> 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, 
> -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 
> 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, 
> -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 
> 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, 
> -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600, 0, -3600 ]
> }, {
>   "tz" : "Africa/El_Aaiun",
>   "switches" : [ -62135769600, -59006448000, -55850688000, -52694928000, 
> -46383408000, -43227648000, -40071888000, -33760368000, -30604608000, 
> -27448848000, -21137328000, -17981568000, -14825808000, -12219292800, 
> -2208988800, 2141866800, 2169079200, 2172106800, 2199924000, 2202951600, 
> 2230164000, 2233796400, 2261008800, 2264036400, 2291248800, 2294881200, 
> 2322093600, 2325121200, 2352938400, 2355966000, 2383178400, 2386810800, 
> 2414023200, 2417050800, 2444868000, 2447895600, 2475108000, 2478740400, 
> 2505952800, 2508980400, 2536192800, 2539825200, 2567037600, 2570065200, 
> 2597882400, 260091,