[jira] [Resolved] (SPARK-31318) Split Parquet/Avro configs for rebasing dates/timestamps in read and in write

2020-03-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31318.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28082
[https://github.com/apache/spark/pull/28082]

> Split Parquet/Avro configs for rebasing dates/timestamps in read and in write
> -
>
> Key: SPARK-31318
> URL: https://issues.apache.org/jira/browse/SPARK-31318
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, Spark provides 2 SQL configs to control rebasing of 
> dates/timestamps in Parquet and Avro datasource: 
> spark.sql.legacy.parquet.rebaseDateTime.enabled
> spark.sql.legacy.avro.rebaseDateTime.enabled
> The configs control rebasing in read and in write. That can be inconvenient 
> for users who want to read files saved by Spark 2.4 and earlier versions, and 
> save dates/timestamps without rebasing.
> The ticket aims to split the configs, and introduce separate SQL configs for 
> read and for write. 






[jira] [Assigned] (SPARK-31318) Split Parquet/Avro configs for rebasing dates/timestamps in read and in write

2020-03-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31318:
---

Assignee: Maxim Gekk

> Split Parquet/Avro configs for rebasing dates/timestamps in read and in write
> -
>
> Key: SPARK-31318
> URL: https://issues.apache.org/jira/browse/SPARK-31318
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> Currently, Spark provides 2 SQL configs to control rebasing of 
> dates/timestamps in Parquet and Avro datasource: 
> spark.sql.legacy.parquet.rebaseDateTime.enabled
> spark.sql.legacy.avro.rebaseDateTime.enabled
> The configs control rebasing in read and in write. That can be inconvenient 
> for users who want to read files saved by Spark 2.4 and earlier versions, and 
> save dates/timestamps without rebasing.
> The ticket aims to split the configs, and introduce separate SQL configs for 
> read and for write. 






[jira] [Commented] (SPARK-31308) Make Python dependencies available for Non-PySpark applications

2020-03-31 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072386#comment-17072386
 ] 

Dongjoon Hyun commented on SPARK-31308:
---

Yeah, I fixed it. It was failing on my side at that time.

> Make Python dependencies available for Non-PySpark applications
> ---
>
> Key: SPARK-31308
> URL: https://issues.apache.org/jira/browse/SPARK-31308
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Submit
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.1.0
>
>







[jira] [Resolved] (SPARK-31308) Make Python dependencies available for Non-PySpark applications

2020-03-31 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31308.
---
Fix Version/s: 3.1.0
 Assignee: L. C. Hsieh
   Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/28077/

> Make Python dependencies available for Non-PySpark applications
> ---
>
> Key: SPARK-31308
> URL: https://issues.apache.org/jira/browse/SPARK-31308
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Submit
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 3.1.0
>
>







[jira] [Resolved] (SPARK-31313) Add `m01` node name to support Minikube 1.8.x

2020-03-31 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-31313.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28080
[https://github.com/apache/spark/pull/28080]

> Add `m01` node name to support Minikube 1.8.x
> -
>
> Key: SPARK-31313
> URL: https://issues.apache.org/jira/browse/SPARK-31313
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.1.0
>
>







[jira] [Created] (SPARK-31320) fix release script for 3.0.0

2020-03-31 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-31320:
---

 Summary: fix release script for 3.0.0
 Key: SPARK-31320
 URL: https://issues.apache.org/jira/browse/SPARK-31320
 Project: Spark
  Issue Type: Bug
  Components: Project Infra
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan









[jira] [Assigned] (SPARK-31313) Add `m01` node name to support Minikube 1.8.x

2020-03-31 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-31313:
---

Assignee: Dongjoon Hyun

> Add `m01` node name to support Minikube 1.8.x
> -
>
> Key: SPARK-31313
> URL: https://issues.apache.org/jira/browse/SPARK-31313
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes, Tests
>Affects Versions: 3.1.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
>







[jira] [Updated] (SPARK-31312) Transforming Hive simple UDF (using JAR) expression may incur CNFE in later evaluation

2020-03-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-31312:

Fix Version/s: 2.4.6

> Transforming Hive simple UDF (using JAR) expression may incur CNFE in later 
> evaluation
> --
>
> Key: SPARK-31312
> URL: https://issues.apache.org/jira/browse/SPARK-31312
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0, 3.1.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.0.0, 2.4.6
>
>
> In SPARK-26560, we ensured that Hive UDF using JAR is executed regardless of 
> current thread context classloader.
> [~cloud_fan] pointed out another potential issue in post-review of 
> SPARK-26560 - quoting the comment:
> {quote}
> Found a potential problem: here we call HiveSimpleUDF.dataType (which is a 
> lazy val) to force loading the class with the corrected class loader.
> However, if the expression gets transformed later, which copies 
> HiveSimpleUDF, then calling HiveSimpleUDF.dataType will re-trigger the class 
> loading, and at that time there is no guarantee that the corrected 
> classloader is used.
> I think we should materialize the loaded class in HiveSimpleUDF.
> {quote}
> This JIRA issue is to track the effort of verifying the potential issue and 
> fixing the issue.
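
A minimal sketch of the concern above (names are illustrative, not the actual HiveSimpleUDF code): a Scala lazy val is re-evaluated on every copy of a case class, so class loading can run again later under whatever context classloader happens to be current at that time.
{code:java}
case class FakeHiveUdf(name: String) {
  // Evaluated once per *instance*; a copy of the case class gets a fresh, unevaluated lazy val.
  lazy val dataType: String = {
    val cl = Thread.currentThread().getContextClassLoader
    s"class loaded with $cl"
  }
}

val udf = FakeHiveUdf("f")
udf.dataType             // first evaluation, under the corrected classloader
val copied = udf.copy()  // roughly what an expression transform does
copied.dataType          // evaluated again; the corrected classloader may no longer be set
{code}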






[jira] [Commented] (SPARK-31301) flatten the result dataframe of tests in stat

2020-03-31 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072319#comment-17072319
 ] 

zhengruifeng commented on SPARK-31301:
--

{code:java}
@Since("3.1.0")
def testChiSquare(
    dataset: Dataset[_],
    featuresCol: String,
    labelCol: String): Array[SelectionTestResult]
{code}
 

This method is newly added; what about changing the return type to a flattened 
dataframe to support filtering?

Also ping [~huaxingao]
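
For reference, a flattened view can already be approximated on top of the current one-row result with existing DataFrame functions; this is only a sketch (column aliases are illustrative, and {{chi}} is the result dataframe from the example in the description below), not the proposed API:
{code:java}
import org.apache.spark.ml.functions.vector_to_array
import org.apache.spark.sql.functions._

// Turn the single-row result into one row per feature so it can be filtered.
val arrs = chi.select(
  vector_to_array(col("pValues")).as("pValues"),
  col("degreesOfFreedom"),
  vector_to_array(col("statistics")).as("statistics"))

val flattened = arrs
  .select(posexplode(arrays_zip(col("pValues"), col("degreesOfFreedom"), col("statistics")))
    .as(Seq("featureIndex", "r")))
  .select(col("featureIndex"),
    col("r.pValues").as("pValue"),
    col("r.degreesOfFreedom").as("degreesOfFreedom"),
    col("r.statistics").as("statistic"))

flattened.filter(col("pValue") > 0.1).show()
{code}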

 

> flatten the result dataframe of tests in stat
> -
>
> Key: SPARK-31301
> URL: https://issues.apache.org/jira/browse/SPARK-31301
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Priority: Major
>
> {code:java}
> scala> import org.apache.spark.ml.linalg.{Vector, Vectors}
> import org.apache.spark.ml.linalg.{Vector, Vectors}
>
> scala> import org.apache.spark.ml.stat.ChiSquareTest
> import org.apache.spark.ml.stat.ChiSquareTest
>
> scala> val data = Seq(
>      |   (0.0, Vectors.dense(0.5, 10.0)),
>      |   (0.0, Vectors.dense(1.5, 20.0)),
>      |   (1.0, Vectors.dense(1.5, 30.0)),
>      |   (0.0, Vectors.dense(3.5, 30.0)),
>      |   (0.0, Vectors.dense(3.5, 40.0)),
>      |   (1.0, Vectors.dense(3.5, 40.0))
>      | )
> data: Seq[(Double, org.apache.spark.ml.linalg.Vector)] = List((0.0,[0.5,10.0]), (0.0,[1.5,20.0]), (1.0,[1.5,30.0]), (0.0,[3.5,30.0]), (0.0,[3.5,40.0]), (1.0,[3.5,40.0]))
>
> scala> val df = data.toDF("label", "features")
> df: org.apache.spark.sql.DataFrame = [label: double, features: vector]
>
> scala> val chi = ChiSquareTest.test(df, "features", "label")
> chi: org.apache.spark.sql.DataFrame = [pValues: vector, degreesOfFreedom: array ... 1 more field]
>
> scala> chi.show
> +--------------------+----------------+----------+
> |             pValues|degreesOfFreedom|statistics|
> +--------------------+----------------+----------+
> |[0.68728927879097...|          [2, 3]|[0.75,1.5]|
> +--------------------+----------------+----------+
> {code}
>  
> Current impls of {{ChiSquareTest}}, {{ANOVATest}}, {{FValueTest}}, 
> {{Correlation}} all return a df containing only one row.
> I think this is quite hard to use: suppose we have a dataset with dim=1000, 
> the only operation we can apply to the test result is to collect it by 
> {{head()}} or {{first()}}, and then use it in the driver.
> While what I really want to do is filter the df, like {{pValue > 0.1}} or 
> {{corr < 0.5}}, *so I suggest flattening the output df in those tests.*
>  
> note: {{ANOVATest}} and {{FValueTest}} are newly added in 3.1.0, but 
> ChiSquareTest and Correlation have been here for a long time.






[jira] [Resolved] (SPARK-31300) Migrate the implementation of algorithms from MLlib to ML

2020-03-31 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-31300.
--
Resolution: Not A Problem

> Migrate the implementation of algorithms from MLlib to ML
> -
>
> Key: SPARK-31300
> URL: https://issues.apache.org/jira/browse/SPARK-31300
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Priority: Major
>
> Migrate the implementation of algorithms from MLlib to ML and make MLlib a 
> wrapper of the ML side. Including:
> 1, clustering: KMeans, BiKMeans, LDA, PIC
> 2, regression: AFT, Isotonic,
> 3, stat: ChiSquareTest, Correlation, KolmogorovSmirnovTest,
> 4, tree: remove the usage of OldAlgo/OldStrategy/etc
> 5, feature: Word2Vec, PCA
> 6, evaluation:
> 7, fpm: FPGrowth, PrefixSpan
> 8, linalg, MLUtils, etc






[jira] [Resolved] (SPARK-31309) Migrate the ChiSquareTest from MLlib to ML

2020-03-31 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng resolved SPARK-31309.
--
Resolution: Not A Problem

> Migrate the ChiSquareTest from MLlib to ML
> --
>
> Key: SPARK-31309
> URL: https://issues.apache.org/jira/browse/SPARK-31309
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Priority: Minor
>
> Move the impl of ChiSq from .mllib to the .ml side, and make .mllib.ChiSq a 
> wrapper of the .ml side.






[jira] [Commented] (SPARK-31308) Make Python dependencies available for Non-PySpark applications

2020-03-31 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072303#comment-17072303
 ] 

L. C. Hsieh commented on SPARK-31308:
-

Not sure why this JIRA ticket is not closed automatically. [~dongjoon]

> Make Python dependencies available for Non-PySpark applications
> ---
>
> Key: SPARK-31308
> URL: https://issues.apache.org/jira/browse/SPARK-31308
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Submit
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Priority: Major
>







[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

2020-03-31 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072300#comment-17072300
 ] 

L. C. Hsieh commented on SPARK-29358:
-

> users might rely on the exception when the schema is mismatched

This is what concerned me as a behavior change. Maybe it is minor. I'm also good 
with this.

> Make unionByName optionally fill missing columns with nulls
> ---
>
> Key: SPARK-29358
> URL: https://issues.apache.org/jira/browse/SPARK-29358
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Mukul Murthy
>Priority: Major
>
> Currently, unionByName requires two DataFrames to have the same set of 
> columns (even though the order can be different). It would be good to add 
> either an option to unionByName or a new type of union which fills in missing 
> columns with nulls. 
> {code:java}
> val df1 = Seq(1, 2, 3).toDF("x")
> val df2 = Seq("a", "b", "c").toDF("y")
> df1.unionByName(df2){code}
> This currently throws 
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among 
> (y);
> {code}
> Ideally, there would be a way to make this return a DataFrame containing:
> {code:java}
> +----+----+
> |   x|   y|
> +----+----+
> |   1|null|
> |   2|null|
> |   3|null|
> |null|   a|
> |null|   b|
> |null|   c|
> +----+----+
> {code}
> Currently the workaround to make this possible is by using unionByName, but 
> this is clunky:
> {code:java}
> df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null)))
> {code}






[jira] [Resolved] (SPARK-31290) Add back the deprecated R APIs

2020-03-31 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-31290.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Fixed in 
https://github.com/apache/spark/commit/fd0b2281272daba590c6bb277688087d0b26053f

> Add back the deprecated R APIs
> --
>
> Key: SPARK-31290
> URL: https://issues.apache.org/jira/browse/SPARK-31290
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Priority: Major
> Fix For: 3.0.0
>
>
> Add back the deprecated R APIs removed by
> https://github.com/apache/spark/pull/22843/
> https://github.com/apache/spark/pull/22815






[jira] [Comment Edited] (SPARK-31290) Add back the deprecated R APIs

2020-03-31 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072299#comment-17072299
 ] 

Hyukjin Kwon edited comment on SPARK-31290 at 4/1/20, 1:43 AM:
---

Fixed in https://github.com/apache/spark/pull/28058


was (Author: hyukjin.kwon):
Fixed in 
https://github.com/apache/spark/commit/fd0b2281272daba590c6bb277688087d0b26053f

> Add back the deprecated R APIs
> --
>
> Key: SPARK-31290
> URL: https://issues.apache.org/jira/browse/SPARK-31290
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Priority: Major
> Fix For: 3.0.0
>
>
> Add back the deprecated R APIs removed by
> https://github.com/apache/spark/pull/22843/
> https://github.com/apache/spark/pull/22815






[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

2020-03-31 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072294#comment-17072294
 ] 

Hyukjin Kwon commented on SPARK-29358:
--

I am good with adding this behaviour if other committers think it's useful. 
Yeah, it's really extending rather than breaking.
My concern was:
 - users might rely on the exception when the schema is mismatched
 - workarounds might be possible with other combinations of APIs

After rethinking this, the former could be guarded by a parameter or 
switch. The latter would require extending the functionality of 
{{StructType.merge}}, so I am not so against it.
Let me reopen the ticket.
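
As a rough illustration, the null-filling behaviour can already be prototyped on top of the existing API; the helper below is hypothetical (its name and details are made up here) and ignores type-conflict handling, but it generalizes the workaround quoted in the description:
{code:java}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// Hypothetical helper: add each side's missing columns as nulls, then unionByName.
def unionByNameWithNullFill(left: DataFrame, right: DataFrame): DataFrame = {
  val leftCols = left.columns.toSet
  val rightCols = right.columns.toSet
  val leftFilled = rightCols.diff(leftCols).foldLeft(left)((df, c) => df.withColumn(c, lit(null)))
  val rightFilled = leftCols.diff(rightCols).foldLeft(right)((df, c) => df.withColumn(c, lit(null)))
  leftFilled.unionByName(rightFilled)
}
{code}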

> Make unionByName optionally fill missing columns with nulls
> ---
>
> Key: SPARK-29358
> URL: https://issues.apache.org/jira/browse/SPARK-29358
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Mukul Murthy
>Priority: Major
>
> Currently, unionByName requires two DataFrames to have the same set of 
> columns (even though the order can be different). It would be good to add 
> either an option to unionByName or a new type of union which fills in missing 
> columns with nulls. 
> {code:java}
> val df1 = Seq(1, 2, 3).toDF("x")
> val df2 = Seq("a", "b", "c").toDF("y")
> df1.unionByName(df2){code}
> This currently throws 
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among 
> (y);
> {code}
> Ideally, there would be a way to make this return a DataFrame containing:
> {code:java}
> +----+----+
> |   x|   y|
> +----+----+
> |   1|null|
> |   2|null|
> |   3|null|
> |null|   a|
> |null|   b|
> |null|   c|
> +----+----+
> {code}
> Currently the workaround to make this possible is by using unionByName, but 
> this is clunky:
> {code:java}
> df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null)))
> {code}






[jira] [Reopened] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

2020-03-31 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-29358:
--

> Make unionByName optionally fill missing columns with nulls
> ---
>
> Key: SPARK-29358
> URL: https://issues.apache.org/jira/browse/SPARK-29358
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Mukul Murthy
>Priority: Major
>
> Currently, unionByName requires two DataFrames to have the same set of 
> columns (even though the order can be different). It would be good to add 
> either an option to unionByName or a new type of union which fills in missing 
> columns with nulls. 
> {code:java}
> val df1 = Seq(1, 2, 3).toDF("x")
> val df2 = Seq("a", "b", "c").toDF("y")
> df1.unionByName(df2){code}
> This currently throws 
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among 
> (y);
> {code}
> Ideally, there would be a way to make this return a DataFrame containing:
> {code:java}
> +----+----+
> |   x|   y|
> +----+----+
> |   1|null|
> |   2|null|
> |   3|null|
> |null|   a|
> |null|   b|
> |null|   c|
> +----+----+
> {code}
> Currently the workaround to make this possible is by using unionByName, but 
> this is clunky:
> {code:java}
> df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null)))
> {code}






[jira] [Updated] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

2020-03-31 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29358:
-
Affects Version/s: (was: 3.0.0)
   3.1.0

> Make unionByName optionally fill missing columns with nulls
> ---
>
> Key: SPARK-29358
> URL: https://issues.apache.org/jira/browse/SPARK-29358
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Mukul Murthy
>Priority: Major
>
> Currently, unionByName requires two DataFrames to have the same set of 
> columns (even though the order can be different). It would be good to add 
> either an option to unionByName or a new type of union which fills in missing 
> columns with nulls. 
> {code:java}
> val df1 = Seq(1, 2, 3).toDF("x")
> val df2 = Seq("a", "b", "c").toDF("y")
> df1.unionByName(df2){code}
> This currently throws 
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among 
> (y);
> {code}
> Ideally, there would be a way to make this return a DataFrame containing:
> {code:java}
> +----+----+
> |   x|   y|
> +----+----+
> |   1|null|
> |   2|null|
> |   3|null|
> |null|   a|
> |null|   b|
> |null|   c|
> +----+----+
> {code}
> Currently the workaround to make this possible is by using unionByName, but 
> this is clunky:
> {code:java}
> df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null)))
> {code}






[jira] [Created] (SPARK-31319) Document UDF in SQL Reference

2020-03-31 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-31319:
--

 Summary: Document UDF in SQL Reference
 Key: SPARK-31319
 URL: https://issues.apache.org/jira/browse/SPARK-31319
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, SQL
Affects Versions: 3.0.0
Reporter: Huaxin Gao


Document UDF in SQL Reference






[jira] [Assigned] (SPARK-31248) Flaky Test: ExecutorAllocationManagerSuite.interleaving add and remove

2020-03-31 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-31248:
-

Assignee: wuyi

> Flaky Test: ExecutorAllocationManagerSuite.interleaving add and remove
> --
>
> Key: SPARK-31248
> URL: https://issues.apache.org/jira/browse/SPARK-31248
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Assignee: wuyi
>Priority: Major
> Fix For: 3.1.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120300/testReport/
> {code}
> sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 12 did 
> not equal 8
>   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
>   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503)
>   at 
> org.apache.spark.ExecutorAllocationManagerSuite.$anonfun$new$51(ExecutorAllocationManagerSuite.scala:864)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:151)
>   at 
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
> {code}






[jira] [Resolved] (SPARK-31248) Flaky Test: ExecutorAllocationManagerSuite.interleaving add and remove

2020-03-31 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31248.
---
Fix Version/s: 3.1.0
   Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/28084 .

> Flaky Test: ExecutorAllocationManagerSuite.interleaving add and remove
> --
>
> Key: SPARK-31248
> URL: https://issues.apache.org/jira/browse/SPARK-31248
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 3.1.0
>Reporter: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.0
>
>
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120300/testReport/
> {code}
> sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 12 did 
> not equal 8
>   at 
> org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
>   at 
> org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
>   at 
> org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
>   at 
> org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503)
>   at 
> org.apache.spark.ExecutorAllocationManagerSuite.$anonfun$new$51(ExecutorAllocationManagerSuite.scala:864)
>   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
>   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
>   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
>   at org.scalatest.Transformer.apply(Transformer.scala:22)
>   at org.scalatest.Transformer.apply(Transformer.scala:20)
>   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
>   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:151)
>   at 
> org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
> {code}






[jira] [Resolved] (SPARK-31305) Add a page to list all commands in SQL Reference

2020-03-31 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro resolved SPARK-31305.
--
Fix Version/s: 3.0.0
 Assignee: Huaxin Gao
   Resolution: Fixed

Resolved by [https://github.com/apache/spark/pull/28074]

> Add a page to list all commands in SQL Reference
> 
>
> Key: SPARK-31305
> URL: https://issues.apache.org/jira/browse/SPARK-31305
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 3.0.0
>
>
> Add a page to list all commands in SQL Reference, so it will be easier for 
> user to find a specific command.






[jira] [Resolved] (SPARK-17592) SQL: CAST string as INT inconsistent with Hive

2020-03-31 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-17592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-17592.
-
Resolution: Won't Fix

> SQL: CAST string as INT inconsistent with Hive
> --
>
> Key: SPARK-17592
> URL: https://issues.apache.org/jira/browse/SPARK-17592
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Furcy Pin
>Priority: Major
> Attachments: image-2018-05-24-17-10-24-515.png
>
>
> Hello,
> there seems to be an inconsistency between Spark and Hive when casting a 
> string into an Int. 
> With Hive:
> {code}
> select cast("0.4" as INT) ;
> > 0
> select cast("0.5" as INT) ;
> > 0
> select cast("0.6" as INT) ;
> > 0
> {code}
> With Spark-SQL:
> {code}
> select cast("0.4" as INT) ;
> > 0
> select cast("0.5" as INT) ;
> > 1
> select cast("0.6" as INT) ;
> > 1
> {code}
> Hive seems to perform a floor(string.toDouble), while Spark seems to perform 
> a round(string.toDouble)
> I'm not sure there is any ISO standard for this; MySQL has the same behavior 
> as Hive, while PostgreSQL performs a string.toInt and throws a 
> NumberFormatException.
> Personally I think Hive is right, hence my posting this here.
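
For users who want the floor-style semantics explicitly, independent of what CAST(string AS INT) does by default, the conversion can be spelled out; a small sketch:
{code:java}
// Make the floor behaviour explicit instead of relying on CAST(string AS INT).
spark.sql("SELECT CAST(FLOOR(CAST('0.5' AS DOUBLE)) AS INT)").show()  // 0
{code}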






[jira] [Commented] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures

2020-03-31 Thread fqaiser94 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072212#comment-17072212
 ] 

fqaiser94 commented on SPARK-22231:
---

Makes sense, thanks!

> Support of map, filter, withColumn, dropColumn in nested list of structures
> ---
>
> Key: SPARK-22231
> URL: https://issues.apache.org/jira/browse/SPARK-22231
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: DB Tsai
>Priority: Major
>
> At Netflix's algorithm team, we work on ranking problems to find the great 
> content to fulfill the unique tastes of our members. Before building a 
> recommendation algorithm, we need to prepare the training, testing, and 
> validation datasets in Apache Spark. Due to the nature of ranking problems, 
> we have a nested list of items to be ranked in one column, and the top level 
> is the contexts describing the setting for where a model is to be used (e.g. 
> profiles, country, time, device, etc.)  Here is a blog post describing the 
> details, [Distributed Time Travel for Feature 
> Generation|https://medium.com/netflix-techblog/distributed-time-travel-for-feature-generation-389cccdd3907].
>  
> To be more concrete, for the ranks of videos for a given profile_id at a 
> given country, our data schema can look like this,
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  ||-- element: struct (containsNull = false)
>  |||-- title_id: integer (nullable = true)
>  |||-- scores: double (nullable = true)
> ...
> {code}
> We oftentimes need to work on the nested list of structs by applying some 
> functions on them. Sometimes, we're dropping or adding new columns in the 
> nested list of structs. Currently, there is no easy solution in open source 
> Apache Spark to perform those operations using SQL primitives; many people 
> just convert the data into RDD to work on the nested level of data, and then 
> reconstruct the new dataframe as workaround. This is extremely inefficient 
> because all the optimizations like predicate pushdown in SQL can not be 
> performed, we cannot leverage the columnar format, and the serialization 
> and deserialization cost becomes really huge even when we just want to add a new 
> column in the nested level.
> We built a solution internally at Netflix which we're very happy with. We 
> plan to make it open source in Spark upstream. We would like to socialize the 
> API design to see if we miss any use-case.  
> The first API we added is *mapItems* on dataframe, which takes a function from 
> *Column* to *Column* and then applies the function on the nested dataframe. Here 
> is an example,
> {code:java}
> case class Data(foo: Int, bar: Double, items: Seq[Double])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(10.1, 10.2, 10.3, 10.4)),
>   Data(20, 20.0, Seq(20.1, 20.2, 20.3, 20.4))
> ))
> val result = df.mapItems("items") {
>   item => item * 2.0
> }
> result.printSchema()
> // root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // ||-- element: double (containsNull = true)
> result.show()
> // +---+----+--------------------+
> // |foo| bar|               items|
> // +---+----+--------------------+
> // | 10|10.0|[20.2, 20.4, 20.6...|
> // | 20|20.0|[40.2, 40.4, 40.6...|
> // +---+----+--------------------+
> {code}
> Now, with the ability of applying a function in the nested dataframe, we can 
> add a new function, *withColumn* in *Column* to add or replace the existing 
> column that has the same name in the nested list of struct. Here are two 
> examples demonstrating the API together with *mapItems*; the first one 
> replaces the existing column,
> {code:java}
> case class Item(a: Int, b: Double)
> case class Data(foo: Int, bar: Double, items: Seq[Item])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(Item(10, 10.0), Item(11, 11.0))),
>   Data(20, 20.0, Seq(Item(20, 20.0), Item(21, 21.0)))
> ))
> val result = df.mapItems("items") {
>   item => item.withColumn(item("b") + 1 as "b")
> }
> result.printSchema
> root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // ||-- element: struct (containsNull = true)
> // |||-- a: integer (nullable = true)
> // |||-- b: double (nullable = true)
> result.show(false)
> // +---+----+----------------------+
> // |foo|bar |items                 |
> // +---+----+----------------------+
> // |10 |10.0|[[10,11.0], [11,12.0]]|
> // |20 |20.0|[[20,21.0], [21,22.0]]|
> // +---+----+----------------------+
> {code}
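
As a side note, the replace-a-nested-field example above can be approximated today with existing Spark SQL primitives (the higher-order {{transform}} function plus {{named_struct}}, available since 2.4); this is only a sketch of a workaround, not the proposed mapItems/withColumn API:
{code:java}
import org.apache.spark.sql.functions.expr

// Rebuild each struct in the `items` array, bumping `b` by 1.
val result = df.withColumn("items",
  expr("transform(items, x -> named_struct('a', x.a, 'b', x.b + 1))"))
{code}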

[jira] [Resolved] (SPARK-31304) Add ANOVATest example

2020-03-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-31304.
--
Fix Version/s: 3.1.0
   Resolution: Fixed

Issue resolved by pull request 28073
[https://github.com/apache/spark/pull/28073]

> Add ANOVATest example
> -
>
> Key: SPARK-31304
> URL: https://issues.apache.org/jira/browse/SPARK-31304
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Assignee: kevin yu
>Priority: Minor
> Fix For: 3.1.0
>
>
> Add ANOVATest examples (Scala, Java and Python)
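
For the Scala side, such an example would presumably look roughly like the sketch below (the toy data is made up, and a SparkSession named {{spark}} is assumed):
{code:java}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.stat.ANOVATest

import spark.implicits._

val data = Seq(
  (1.0, Vectors.dense(1.7, 4.4)),
  (0.0, Vectors.dense(1.9, 5.1)),
  (1.0, Vectors.dense(2.2, 4.7)),
  (0.0, Vectors.dense(2.5, 5.0))
).toDF("label", "features")

// One ANOVA F-test per feature against the label column.
val result = ANOVATest.test(data, "features", "label")
result.show(false)
{code}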






[jira] [Assigned] (SPARK-31304) Add ANOVATest example

2020-03-31 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-31304:


Assignee: kevin yu

> Add ANOVATest example
> -
>
> Key: SPARK-31304
> URL: https://issues.apache.org/jira/browse/SPARK-31304
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Affects Versions: 3.1.0
>Reporter: Huaxin Gao
>Assignee: kevin yu
>Priority: Minor
>
> Add ANOVATest examples (Scala, Java and Python)






[jira] [Commented] (SPARK-17636) Parquet predicate pushdown for nested fields

2020-03-31 Thread DB Tsai (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-17636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072142#comment-17072142
 ] 

DB Tsai commented on SPARK-17636:
-

[~MasterDDT] we are not able to merge a new feature into an old release. That 
being said, we have been using this internally at Apple for two years in prod 
without issues. You can backport it if you want to use it in your env.

> Parquet predicate pushdown for nested fields
> 
>
> Key: SPARK-17636
> URL: https://issues.apache.org/jira/browse/SPARK-17636
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 1.6.2, 1.6.3, 2.0.2
>Reporter: Mitesh
>Assignee: DB Tsai
>Priority: Minor
> Fix For: 3.0.0
>
>
> There's a *PushedFilters* for a simple numeric field, but not for a numeric 
> field inside a struct. Not sure if this is a Spark limitation because of 
> Parquet, or only a Spark limitation.
> {noformat}
> scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", 
> "sale_id")
> res5: org.apache.spark.sql.DataFrame = [day_timestamp: 
> struct, sale_id: bigint]
> scala> res5.filter("sale_id > 4").queryExecution.executedPlan
> res9: org.apache.spark.sql.execution.SparkPlan =
> Filter[23814] [args=(sale_id#86324L > 
> 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
> s3a://some/parquet/file, PushedFilters: [GreaterThan(sale_id,4)]
> scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan
> res10: org.apache.spark.sql.execution.SparkPlan =
> Filter[23815] [args=(day_timestamp#86302.timestamp > 
> 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
> s3a://some/parquet/file
> {noformat}






[jira] [Resolved] (SPARK-31253) add metrics to shuffle reader

2020-03-31 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-31253.
-
Fix Version/s: 3.1.0
   Resolution: Fixed

> add metrics to shuffle reader
> -
>
> Key: SPARK-31253
> URL: https://issues.apache.org/jira/browse/SPARK-31253
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.1.0
>
>







[jira] [Commented] (SPARK-17636) Parquet predicate pushdown for nested fields

2020-03-31 Thread Mitesh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-17636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072119#comment-17072119
 ] 

Mitesh commented on SPARK-17636:


[~cloud_fan] [~dbtsai] thanks for fixing! Is there any plan to backport to 2.4x 
branch?

> Parquet predicate pushdown for nested fields
> 
>
> Key: SPARK-17636
> URL: https://issues.apache.org/jira/browse/SPARK-17636
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core, SQL
>Affects Versions: 1.6.2, 1.6.3, 2.0.2
>Reporter: Mitesh
>Assignee: DB Tsai
>Priority: Minor
> Fix For: 3.0.0
>
>
> There's a *PushedFilters* for a simple numeric field, but not for a numeric 
> field inside a struct. Not sure if this is a Spark limitation because of 
> Parquet, or only a Spark limitation.
> {noformat}
> scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", 
> "sale_id")
> res5: org.apache.spark.sql.DataFrame = [day_timestamp: 
> struct, sale_id: bigint]
> scala> res5.filter("sale_id > 4").queryExecution.executedPlan
> res9: org.apache.spark.sql.execution.SparkPlan =
> Filter[23814] [args=(sale_id#86324L > 
> 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
> s3a://some/parquet/file, PushedFilters: [GreaterThan(sale_id,4)]
> scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan
> res10: org.apache.spark.sql.execution.SparkPlan =
> Filter[23815] [args=(day_timestamp#86302.timestamp > 
> 4)][outPart=UnknownPartitioning(0)][outOrder=List()]
> +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: 
> s3a://some/parquet/file
> {noformat}






[jira] [Resolved] (SPARK-31265) Add -XX:MaxDirectMemorySize jvm options in yarn mode

2020-03-31 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-31265.
---
Resolution: Invalid

> Add -XX:MaxDirectMemorySize jvm options in yarn mode
> 
>
> Key: SPARK-31265
> URL: https://issues.apache.org/jira/browse/SPARK-31265
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 3.0.0
>Reporter: wangzhun
>Priority: Minor
>
> Current memory composition `amMemory` + `amMemoryOverhead`
> {code:java}
> val capability = Records.newRecord(classOf[Resource])
> capability.setMemory(amMemory + amMemoryOverhead)
> capability.setVirtualCores(amCores)
> if (amResources.nonEmpty) {
>  ResourceRequestHelper.setResourceRequests(amResources, capability)
> }
> logDebug(s"Created resource capability for AM request: $capability")
> {code}
> {code:java}
>  // Add Xmx for AM memory 
> javaOpts += "-Xmx" + amMemory + "m"
> {code}
> It is possible that the physical memory of the container exceeds the limit 
> and the container is killed by YARN.
> I suggest setting `-XX:MaxDirectMemorySize` here
> {code:java}
> // Add Xmx for AM memory
> javaOpts += "-Xmx" + amMemory + "m"
> javaOpts += s"-XX:MaxDirectMemorySize=${amMemoryOverhead}m"{code}
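
For what it's worth, a user can already cap direct memory for the AM/driver JVM through the existing extra-Java-options settings; a sketch (the 512m value is illustrative, and these are normally passed via spark-submit --conf):
{code:java}
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.yarn.am.extraJavaOptions", "-XX:MaxDirectMemorySize=512m")  // client-mode AM
  .set("spark.driver.extraJavaOptions", "-XX:MaxDirectMemorySize=512m")   // cluster-mode driver
{code}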






[jira] [Updated] (SPARK-31318) Split Parquet/Avro configs for rebasing dates/timestamps in read and in write

2020-03-31 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-31318:
---
Parent: SPARK-30951
Issue Type: Sub-task  (was: Improvement)

> Split Parquet/Avro configs for rebasing dates/timestamps in read and in write
> -
>
> Key: SPARK-31318
> URL: https://issues.apache.org/jira/browse/SPARK-31318
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> Currently, Spark provides 2 SQL configs to control rebasing of 
> dates/timestamps in Parquet and Avro datasource: 
> spark.sql.legacy.parquet.rebaseDateTime.enabled
> spark.sql.legacy.avro.rebaseDateTime.enabled
> The configs control rebasing in read and in write. That can be inconvenient 
> for users who want to read files saved by Spark 2.4 and earlier versions, and 
> save dates/timestamps without rebasing.
> The ticket aims to split the configs, and introduce separate SQL configs for 
> read and for write. 






[jira] [Created] (SPARK-31318) Split Parquet/Avro configs for rebasing dates/timestamps in read and in write

2020-03-31 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-31318:
--

 Summary: Split Parquet/Avro configs for rebasing dates/timestamps 
in read and in write
 Key: SPARK-31318
 URL: https://issues.apache.org/jira/browse/SPARK-31318
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


Currently, Spark provides 2 SQL configs to control rebasing of dates/timestamps 
in Parquet and Avro datasource: 

spark.sql.legacy.parquet.rebaseDateTime.enabled

spark.sql.legacy.avro.rebaseDateTime.enabled

The configs control rebasing in read and in write. That can be inconvenient 
for users who want to read files saved by Spark 2.4 and earlier versions, and 
save dates/timestamps without rebasing.

The ticket aims to split the configs, and introduce separate SQL configs for 
read and for write. 
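
As a rough illustration of the inconvenience (paths are made up, and the names of the future split configs are intentionally not guessed here), with a single flag the read side and the write side cannot behave differently:
{code:java}
// Sketch only: the single legacy flag applies to both directions.
spark.conf.set("spark.sql.legacy.parquet.rebaseDateTime.enabled", "true")

// Desired: rebase dates/timestamps while reading files written by Spark 2.4.
val legacy = spark.read.parquet("/data/written-by-spark-2.4")  // rebased on read

// Not desired: the same flag also forces rebasing when writing new files.
legacy.write.parquet("/data/rewritten-by-spark-3.0")           // rebased on write as well
{code}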






[jira] [Updated] (SPARK-30775) Improve DOC for executor metrics

2020-03-31 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-30775:
--
Fix Version/s: 3.0.1

> Improve DOC for executor metrics
> 
>
> Key: SPARK-30775
> URL: https://issues.apache.org/jira/browse/SPARK-30775
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.1.0
>Reporter: Luca Canali
>Assignee: Luca Canali
>Priority: Minor
> Fix For: 3.1.0, 3.0.1
>
>
> This aims to improve the description of the executor metrics in the 
> monitoring documentation. In particular by:
> - adding a reference to the Prometheus endpoint, as implemented in SPARK-29064
> - extending the list and description of executor metrics, following up from 
> the work in SPARK-27157






[jira] [Commented] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures

2020-03-31 Thread DB Tsai (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071969#comment-17071969
 ] 

DB Tsai commented on SPARK-22231:
-

[~fqaiser94] Thanks for continuing this work. We implemented this feature while 
I was at Netflix, and it's really useful for end users to manipulate nested 
dataframes. Currently, we try not to assign the ticket, to avoid it being 
assigned to someone who is not working on it. Therefore, I unassigned this JIRA.

In your PR, you implement the first part, and I created a sub-task for it: 
https://issues.apache.org/jira/browse/SPARK-31317 Can you change the Jira 
number to SPARK-31317 to have it properly linked?

 

> Support of map, filter, withColumn, dropColumn in nested list of structures
> ---
>
> Key: SPARK-22231
> URL: https://issues.apache.org/jira/browse/SPARK-22231
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: DB Tsai
>Priority: Major
>
> At Netflix's algorithm team, we work on ranking problems to find the great 
> content to fulfill the unique tastes of our members. Before building a 
> recommendation algorithm, we need to prepare the training, testing, and 
> validation datasets in Apache Spark. Due to the nature of ranking problems, 
> we have a nested list of items to be ranked in one column, and the top level 
> is the contexts describing the setting for where a model is to be used (e.g. 
> profiles, country, time, device, etc.)  Here is a blog post describing the 
> details, [Distributed Time Travel for Feature 
> Generation|https://medium.com/netflix-techblog/distributed-time-travel-for-feature-generation-389cccdd3907].
>  
> To be more concrete, for the ranks of videos for a given profile_id at a 
> given country, our data schema can look like this,
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  ||-- element: struct (containsNull = false)
>  |||-- title_id: integer (nullable = true)
>  |||-- scores: double (nullable = true)
> ...
> {code}
> We oftentimes need to work on the nested list of structs by applying some 
> functions on them. Sometimes, we're dropping or adding new columns in the 
> nested list of structs. Currently, there is no easy solution in open source 
> Apache Spark to perform those operations using SQL primitives; many people 
> just convert the data into RDD to work on the nested level of data, and then 
> reconstruct the new dataframe as workaround. This is extremely inefficient 
> because all the optimizations like predicate pushdown in SQL can not be 
> performed, we cannot leverage the columnar format, and the serialization 
> and deserialization cost becomes really huge even when we just want to add a new 
> column in the nested level.
> We built a solution internally at Netflix which we're very happy with. We 
> plan to make it open source in Spark upstream. We would like to socialize the 
> API design to see if we miss any use-case.  
> The first API we added is *mapItems* on dataframe, which takes a function from 
> *Column* to *Column* and then applies the function on the nested dataframe. Here 
> is an example,
> {code:java}
> case class Data(foo: Int, bar: Double, items: Seq[Double])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(10.1, 10.2, 10.3, 10.4)),
>   Data(20, 20.0, Seq(20.1, 20.2, 20.3, 20.4))
> ))
> val result = df.mapItems("items") {
>   item => item * 2.0
> }
> result.printSchema()
> // root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // ||-- element: double (containsNull = true)
> result.show()
> // +---+----+--------------------+
> // |foo| bar|               items|
> // +---+----+--------------------+
> // | 10|10.0|[20.2, 20.4, 20.6...|
> // | 20|20.0|[40.2, 40.4, 40.6...|
> // +---+----+--------------------+
> {code}
> Now, with the ability of applying a function in the nested dataframe, we can 
> add a new function, *withColumn* in *Column* to add or replace the existing 
> column that has the same name in the nested list of struct. Here are two 
> examples demonstrating the API together with *mapItems*; the first one 
> replaces the existing column,
> {code:java}
> case class Item(a: Int, b: Double)
> case class Data(foo: Int, bar: Double, items: Seq[Item])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(Item(10, 10.0), Item(11, 11.0))),
>   Data(20, 20.0, Seq(Item(20, 20.0), Item(21, 21.0)))
> ))
> val result = df.mapItems("items") {
>   item => item.withColumn(item("b") + 1 as "b")
> }
> result.printSchema
> root
> // |-- foo: integer (nullable = false)
> // 

[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

2020-03-31 Thread Michael Armbrust (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071968#comment-17071968
 ] 

Michael Armbrust commented on SPARK-29358:
--

I think we should reconsider closing this as won't fix:
 - I think the semantics of this operation make sense. We already have this 
with writing JSON or parquet data. It is just a really inefficient way to 
accomplish the end goal.
 - I don't think it is a problem to move "away from SQL union". This is a 
clearly named, different operation. IMO this one makes *more* sense than SQL 
union. It is much more likely that columns with the same name are semantically 
equivalent than columns at the same ordinal with different names.
 - We are not breaking the behavior of unionByName. Currently it throws an 
exception in these cases. We are making more data transformations possible, but 
anything that was working before will continue to work. You could add a boolean 
flag if you were really concerned, but I think I would skip that.

> Make unionByName optionally fill missing columns with nulls
> ---
>
> Key: SPARK-29358
> URL: https://issues.apache.org/jira/browse/SPARK-29358
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Mukul Murthy
>Priority: Major
>
> Currently, unionByName requires two DataFrames to have the same set of 
> columns (even though the order can be different). It would be good to add 
> either an option to unionByName or a new type of union which fills in missing 
> columns with nulls. 
> {code:java}
> val df1 = Seq(1, 2, 3).toDF("x")
> val df2 = Seq("a", "b", "c").toDF("y")
> df1.unionByName(df2){code}
> This currently throws 
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among 
> (y);
> {code}
> Ideally, there would be a way to make this return a DataFrame containing:
> {code:java}
> +++ 
> | x| y| 
> +++ 
> | 1|null| 
> | 2|null| 
> | 3|null| 
> |null| a| 
> |null| b| 
> |null| c| 
> +++
> {code}
> Currently the workaround to make this possible is by using unionByName, but 
> this is clunky:
> {code:java}
> df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null)))
> {code}






[jira] [Created] (SPARK-31317) Add withField method to Column class

2020-03-31 Thread DB Tsai (Jira)
DB Tsai created SPARK-31317:
---

 Summary: Add withField method to Column class
 Key: SPARK-31317
 URL: https://issues.apache.org/jira/browse/SPARK-31317
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3
Reporter: DB Tsai









[jira] [Assigned] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures

2020-03-31 Thread DB Tsai (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-22231:
---

Assignee: (was: Jeremy Smith)

> Support of map, filter, withColumn, dropColumn in nested list of structures
> ---
>
> Key: SPARK-22231
> URL: https://issues.apache.org/jira/browse/SPARK-22231
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: DB Tsai
>Priority: Major
>
> At Netflix's algorithm team, we work on ranking problems to find the great 
> content to fulfill the unique tastes of our members. Before building a 
> recommendation algorithm, we need to prepare the training, testing, and 
> validation datasets in Apache Spark. Due to the nature of ranking problems, 
> we have a nested list of items to be ranked in one column, and the top level 
> is the contexts describing the setting for where a model is to be used (e.g. 
> profiles, country, time, device, etc.)  Here is a blog post describing the 
> details, [Distributed Time Travel for Feature 
> Generation|https://medium.com/netflix-techblog/distributed-time-travel-for-feature-generation-389cccdd3907].
>  
> To be more concrete, for the ranks of videos for a given profile_id in a 
> given country, our data schema looks like this:
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  ||-- element: struct (containsNull = false)
>  |||-- title_id: integer (nullable = true)
>  |||-- scores: double (nullable = true)
> ...
> {code}
> We oftentimes need to work on the nested list of structs by applying 
> functions to them. Sometimes we drop or add columns in the nested list of 
> structs. Currently, there is no easy way in open source Apache Spark to 
> perform those operations using SQL primitives; many people just convert the 
> data into an RDD to work on the nested level of data, and then reconstruct 
> the new dataframe as a workaround. This is extremely inefficient because 
> optimizations like predicate pushdown in SQL cannot be performed, we cannot 
> leverage the columnar format, and the serialization and deserialization cost 
> becomes huge even when we just want to add a new column at the nested level.
> We built a solution internally at Netflix which we're very happy with. We 
> plan to open source it in Spark upstream, and we would like to socialize the 
> API design to see if we have missed any use cases.
> The first API we added is *mapItems* on dataframe, which takes a function 
> from *Column* to *Column* and applies it to the nested dataframe. Here is an 
> example,
> {code:java}
> case class Data(foo: Int, bar: Double, items: Seq[Double])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(10.1, 10.2, 10.3, 10.4)),
>   Data(20, 20.0, Seq(20.1, 20.2, 20.3, 20.4))
> ))
> val result = df.mapItems("items") {
>   item => item * 2.0
> }
> result.printSchema()
> // root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // ||-- element: double (containsNull = true)
> result.show()
> // +---+----+--------------------+
> // |foo| bar|               items|
> // +---+----+--------------------+
> // | 10|10.0|[20.2, 20.4, 20.6...|
> // | 20|20.0|[40.2, 40.4, 40.6...|
> // +---+----+--------------------+
> {code}
> Now, with the ability to apply a function to the nested dataframe, we can 
> add a new function, *withColumn* on *Column*, to add or replace the existing 
> column that has the same name in the nested list of structs. Here are two 
> examples demonstrating the API together with *mapItems*; the first one 
> replaces the existing column,
> {code:java}
> case class Item(a: Int, b: Double)
> case class Data(foo: Int, bar: Double, items: Seq[Item])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(Item(10, 10.0), Item(11, 11.0))),
>   Data(20, 20.0, Seq(Item(20, 20.0), Item(21, 21.0)))
> ))
> val result = df.mapItems("items") {
>   item => item.withColumn(item("b") + 1 as "b")
> }
> result.printSchema
> root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // ||-- element: struct (containsNull = true)
> // |||-- a: integer (nullable = true)
> // |||-- b: double (nullable = true)
> result.show(false)
> // +---+----+----------------------+
> // |foo|bar |items                 |
> // +---+----+----------------------+
> // |10 |10.0|[[10,11.0], [11,12.0]]|
> // |20 |20.0|[[20,21.0], [21,22.0]]|
> // +---+----+----------------------+
> {code}
> and the second one 

[jira] [Commented] (SPARK-30873) Handling Node Decommissioning for Yarn cluster manager in Spark

2020-03-31 Thread Thomas Graves (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071953#comment-17071953
 ] 

Thomas Graves commented on SPARK-30873:
---

so I think this is a dup of SPARK-30835.

> Handling Node Decommissioning for Yarn cluster manager in Spark
> --
>
> Key: SPARK-30873
> URL: https://issues.apache.org/jira/browse/SPARK-30873
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 3.1.0
>Reporter: Saurabh Chawla
>Priority: Major
>
> In many public cloud environments, node loss (in the case of AWS Spot 
> instances, Spot Blocks, and GCP preemptible VMs) is a planned and announced 
> event. The cloud provider notifies the cluster manager about the upcoming 
> loss of a node ahead of time. A few examples are listed here:
> a) Spot loss in AWS (2 minutes before the event)
> b) GCP preemptible VM loss (30 seconds before the event)
> c) AWS Spot Block loss with info on the termination time (generally a few 
> tens of minutes before decommission, as configured in YARN)
> This JIRA tries to make Spark leverage the advance knowledge of node loss 
> and adjust the scheduling of tasks to minimise the impact on the application.
> It is well known that when a host is lost, its executors, their running 
> tasks, their caches, and also the shuffle data are lost. This can result in 
> wasted compute and other resources.
> The focus here is to build a framework for YARN that can be extended for 
> other cluster managers to handle such scenarios.
> The framework must handle one or more of the following:
> 1) Prevent new tasks from starting on any executors on decommissioning nodes.
> 2) Decide to kill the running tasks so that they can be restarted elsewhere 
> (assuming they will not complete within the deadline), OR allow them to 
> continue in the hope that they will finish within the deadline.
> 3) Clear the shuffle data entries for the decommissioned node's hostname from 
> the MapOutputTracker to prevent shuffle FetchFailed exceptions. The most 
> significant advantage of unregistering the shuffle outputs is that when Spark 
> schedules the first re-attempt to compute the missing blocks, it notices all 
> of the missing blocks from decommissioned nodes and recovers in only one 
> attempt. This speeds up the recovery process significantly over the current 
> Spark behavior, where stages might be rescheduled multiple times to recompute 
> missing shuffles from all nodes, and prevents jobs from being stuck for hours 
> failing and recomputing.
> 4) Prevent the stage from aborting due to FetchFailed exceptions caused by 
> node decommissioning. Spark allows a number of consecutive stage attempts 
> before a stage is aborted; this is controlled by the config 
> spark.stage.maxConsecutiveAttempts. Not counting fetch failures caused by 
> decommissioning towards stage failure improves the reliability of the system.
> Main components of the change:
> 1) Get the ClusterInfo update from the Resource Manager -> Application Master 
> -> Spark Driver.
> 2) DecommissionTracker, which resides inside the driver, tracks all the 
> decommissioned nodes and takes the necessary actions and state transitions.
> 3) Based on the decommissioned-node list, add hooks in the code to achieve:
>  a) no new tasks on the node's executors
>  b) removal of the shuffle data mapping info for the node to be 
> decommissioned from the MapOutputTracker
>  c) not counting FetchFailures from decommissioned nodes towards stage failure
> On receiving the info that a node is to be decommissioned, the actions below 
> are performed by the DecommissionTracker on the driver:
>  * Add an entry for the node in the DecommissionTracker with its termination 
> time and the node state "DECOMMISSIONING".
>  * Stop assigning any new tasks to executors on the nodes that are candidates 
> for decommission. This makes sure that, as tasks finish, the usage of the 
> node slowly dies down.
>  * Kill all the executors on the decommissioning nodes after a configurable 
> period of time, say "spark.graceful.decommission.executor.leasetimePct". This 
> killing ensures two things. Firstly, the task failures will be accounted for 
> in the job failure count. Secondly, it avoids generating more shuffle data on 
> a node that will eventually be lost. The node state is set to 
> "EXECUTOR_DECOMMISSIONED".
>  * Mark the shuffle data on the node as unavailable after 
> "spark.qubole.graceful.decommission.shuffedata.leasetimePct" time. This 
> ensures that recomputation of the missing shuffle partitions is done early, 
> rather than reducers failing with a time-consuming FetchFailure. The node 
> state is set to "SHUFFLE_DECOMMISSIONED".
>  * Mark the node as terminated after the termination time. Now the state of 
> the node is "TERMINATED".
>  * Remove the node entry from the DecommissionTracker if the 
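
To make the state machine above concrete, here is an illustrative sketch only; the class and method names are hypothetical and are not Spark APIs or the proposal's actual code:

{code:scala}
// Illustrative sketch of the node states and transitions described above.
object NodeState extends Enumeration {
  val Decommissioning, ExecutorDecommissioned, ShuffleDecommissioned, Terminated = Value
}

case class NodeInfo(host: String, startMs: Long, terminationMs: Long,
    var state: NodeState.Value = NodeState.Decommissioning)

class DecommissionTrackerSketch(executorLeasePct: Double, shuffleLeasePct: Double) {
  private val nodes = scala.collection.mutable.Map.empty[String, NodeInfo]

  // Called when the cluster manager announces an upcoming node loss:
  // from this point on, no new tasks are scheduled on this host.
  def nodeDecommissioning(host: String, nowMs: Long, terminationMs: Long): Unit =
    nodes(host) = NodeInfo(host, nowMs, terminationMs)

  // Called periodically on the driver; advances each node through the states.
  def onTick(nowMs: Long): Unit = nodes.values.foreach { n =>
    val elapsedPct = (nowMs - n.startMs).toDouble / math.max(1L, n.terminationMs - n.startMs)
    if (nowMs >= n.terminationMs) {
      n.state = NodeState.Terminated             // node is gone
    } else if (elapsedPct >= shuffleLeasePct) {
      n.state = NodeState.ShuffleDecommissioned  // unregister shuffle outputs for this host
    } else if (elapsedPct >= executorLeasePct) {
      n.state = NodeState.ExecutorDecommissioned // kill executors to stop producing shuffle data
    }
  }

  def isDecommissioning(host: String): Boolean = nodes.contains(host)
}
{code}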

[jira] [Updated] (SPARK-31230) use statement plans in DataFrameWriter(V2)

2020-03-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-31230:

Fix Version/s: (was: 3.0.0)
   3.0.1

> use statement plans in DataFrameWriter(V2)
> --
>
> Key: SPARK-31230
> URL: https://issues.apache.org/jira/browse/SPARK-31230
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.1
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31312) Transforming Hive simple UDF (using JAR) expression may incur CNFE in later evaluation

2020-03-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31312.
-
Fix Version/s: 3.0.0
 Assignee: Jungtaek Lim
   Resolution: Fixed

> Transforming Hive simple UDF (using JAR) expression may incur CNFE in later 
> evaluation
> --
>
> Key: SPARK-31312
> URL: https://issues.apache.org/jira/browse/SPARK-31312
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.5, 3.0.0, 3.1.0
>Reporter: Jungtaek Lim
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.0.0
>
>
> In SPARK-26560, we ensured that a Hive UDF using a JAR is executed 
> regardless of the current thread context classloader.
> [~cloud_fan] pointed out another potential issue in post-review of 
> SPARK-26560 - quoting the comment:
> {quote}
> Found a potential problem: here we call HiveSimpleUDF.dataType (which is a 
> lazy val) to force loading the class with the corrected classloader.
> However, if the expression gets transformed later, which copies 
> HiveSimpleUDF, then calling HiveSimpleUDF.dataType will re-trigger the class 
> loading, and at that time there is no guarantee that the corrected 
> classloader is used.
> I think we should materialize the loaded class in HiveSimpleUDF.
> {quote}
> This JIRA issue is to track the effort of verifying the potential issue and 
> fixing it.
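
A sketch of the fix direction only (placeholder names, not Spark's actual Hive UDF wrapper): cache the instantiated UDF inside the wrapper object that the expression carries, so copies of the expression, which share the wrapper by reference, never re-trigger class loading:

{code:scala}
// Placeholder wrapper, not Spark's HiveFunctionWrapper.
class UdfWrapperSketch(className: String) extends Serializable {
  @transient private var instance: AnyRef = _

  // The first call loads the class with the corrected classloader; later calls
  // (including from copies of the enclosing expression) reuse the instance.
  def createFunction(loader: ClassLoader): AnyRef = synchronized {
    if (instance == null) {
      instance = loader.loadClass(className).getConstructor().newInstance().asInstanceOf[AnyRef]
    }
    instance
  }
}
{code}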



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures

2020-03-31 Thread fqaiser94 (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071945#comment-17071945
 ] 

fqaiser94 commented on SPARK-22231:
---

Excellent, thanks Reynold. 
If somebody could assign this ticket to me, that would be great as I have the 
code ready for all 3 methods. 
Pull request for the first one ({{withField}}) is ready here: 
[https://github.com/apache/spark/pull/27066]
Looking forward to the reviews. 
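
For context, a small sketch of what this is about, assuming a toy DataFrame with a struct column s(a, b); the withField call is the proposed API from the PR above and is not yet in Spark at the time of this comment:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.struct

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// Toy data: one struct column `s` with fields `a` and `b`.
val df = Seq((1, 2.0), (3, 4.0)).toDF("a", "b").select(struct($"a", $"b").as("s"))

// Today: replacing field `b` means rebuilding the struct and re-listing
// every field that should stay untouched.
val rebuilt = df.withColumn("s", struct($"s.a".as("a"), ($"s.b" + 1).as("b")))

// The proposed Column.withField would express the same update without
// enumerating the untouched fields, roughly:
//   df.withColumn("s", $"s".withField("b", $"s.b" + 1))
{code}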

> Support of map, filter, withColumn, dropColumn in nested list of structures
> ---
>
> Key: SPARK-22231
> URL: https://issues.apache.org/jira/browse/SPARK-22231
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: DB Tsai
>Assignee: Jeremy Smith
>Priority: Major
>
> At Netflix's algorithm team, we work on ranking problems to find great 
> content that fulfills the unique tastes of our members. Before building 
> recommendation algorithms, we need to prepare the training, testing, and 
> validation datasets in Apache Spark. Due to the nature of ranking problems, 
> we have a nested list of items to be ranked in one column, and the top level 
> is the contexts describing the setting for where a model is to be used (e.g. 
> profiles, country, time, device, etc.)  Here is a blog post describing the 
> details, [Distributed Time Travel for Feature 
> Generation|https://medium.com/netflix-techblog/distributed-time-travel-for-feature-generation-389cccdd3907].
>  
> To be more concrete, for the ranks of videos for a given profile_id in a 
> given country, our data schema looks like this:
> {code:java}
> root
>  |-- profile_id: long (nullable = true)
>  |-- country_iso_code: string (nullable = true)
>  |-- items: array (nullable = false)
>  ||-- element: struct (containsNull = false)
>  |||-- title_id: integer (nullable = true)
>  |||-- scores: double (nullable = true)
> ...
> {code}
> We oftentimes need to work on the nested list of structs by applying 
> functions to them. Sometimes we drop or add columns in the nested list of 
> structs. Currently, there is no easy way in open source Apache Spark to 
> perform those operations using SQL primitives; many people just convert the 
> data into an RDD to work on the nested level of data, and then reconstruct 
> the new dataframe as a workaround. This is extremely inefficient because 
> optimizations like predicate pushdown in SQL cannot be performed, we cannot 
> leverage the columnar format, and the serialization and deserialization cost 
> becomes huge even when we just want to add a new column at the nested level.
> We built a solution internally at Netflix which we're very happy with. We 
> plan to open source it in Spark upstream, and we would like to socialize the 
> API design to see if we have missed any use cases.
> The first API we added is *mapItems* on dataframe, which takes a function 
> from *Column* to *Column* and applies it to the nested dataframe. Here is an 
> example,
> {code:java}
> case class Data(foo: Int, bar: Double, items: Seq[Double])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(10.1, 10.2, 10.3, 10.4)),
>   Data(20, 20.0, Seq(20.1, 20.2, 20.3, 20.4))
> ))
> val result = df.mapItems("items") {
>   item => item * 2.0
> }
> result.printSchema()
> // root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // ||-- element: double (containsNull = true)
> result.show()
> // +---+----+--------------------+
> // |foo| bar|               items|
> // +---+----+--------------------+
> // | 10|10.0|[20.2, 20.4, 20.6...|
> // | 20|20.0|[40.2, 40.4, 40.6...|
> // +---+----+--------------------+
> {code}
> Now, with the ability to apply a function to the nested dataframe, we can 
> add a new function, *withColumn* on *Column*, to add or replace the existing 
> column that has the same name in the nested list of structs. Here are two 
> examples demonstrating the API together with *mapItems*; the first one 
> replaces the existing column,
> {code:java}
> case class Item(a: Int, b: Double)
> case class Data(foo: Int, bar: Double, items: Seq[Item])
> val df: Dataset[Data] = spark.createDataset(Seq(
>   Data(10, 10.0, Seq(Item(10, 10.0), Item(11, 11.0))),
>   Data(20, 20.0, Seq(Item(20, 20.0), Item(21, 21.0)))
> ))
> val result = df.mapItems("items") {
>   item => item.withColumn(item("b") + 1 as "b")
> }
> result.printSchema
> root
> // |-- foo: integer (nullable = false)
> // |-- bar: double (nullable = false)
> // |-- items: array (nullable = true)
> // ||-- element: struct (containsNull = true)
> // |||-- a: integer (nullable = true)
> // ||

[jira] [Commented] (SPARK-31308) Make Python dependencies available for Non-PySpark applications

2020-03-31 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071936#comment-17071936
 ] 

L. C. Hsieh commented on SPARK-31308:
-

Thanks [~dongjoon]. I changed the `Affected Version`.

> Make Python dependencies available for Non-PySpark applications
> ---
>
> Key: SPARK-31308
> URL: https://issues.apache.org/jira/browse/SPARK-31308
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Submit
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31308) Make Python dependencies available for Non-PySpark applications

2020-03-31 Thread L. C. Hsieh (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

L. C. Hsieh updated SPARK-31308:

Affects Version/s: (was: 3.0.0)
   3.1.0

> Make Python dependencies available for Non-PySpark applications
> ---
>
> Key: SPARK-31308
> URL: https://issues.apache.org/jira/browse/SPARK-31308
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Submit
>Affects Versions: 3.1.0
>Reporter: L. C. Hsieh
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31230) use statement plans in DataFrameWriter(V2)

2020-03-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31230.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27992
[https://github.com/apache/spark/pull/27992]

> use statement plans in DataFrameWriter(V2)
> --
>
> Key: SPARK-31230
> URL: https://issues.apache.org/jira/browse/SPARK-31230
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31308) Make Python dependencies available for Non-PySpark applications

2020-03-31 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071928#comment-17071928
 ] 

Dongjoon Hyun commented on SPARK-31308:
---

Yes. For `New Feature` or `Improvement`, we should use `master` branch version 
because we don't backport.
The current snapshot version of `master` branch is `3.1.0-SNAPSHOT`. So, you 
need to use `3.1.0`.

> Make Python dependencies available for Non-PySpark applications
> ---
>
> Key: SPARK-31308
> URL: https://issues.apache.org/jira/browse/SPARK-31308
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Submit
>Affects Versions: 3.0.0
>Reporter: L. C. Hsieh
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25102) Write Spark version to ORC/Parquet file metadata

2020-03-31 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071926#comment-17071926
 ] 

Dongjoon Hyun commented on SPARK-25102:
---

Thanks, [~cloud_fan].

> Write Spark version to ORC/Parquet file metadata
> 
>
> Key: SPARK-25102
> URL: https://issues.apache.org/jira/browse/SPARK-25102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zoltan Ivanfi
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, Spark writes Spark version number into Hive Table properties with 
> `spark.sql.create.version`.
> {code}
> parameters:{
>   spark.sql.sources.schema.part.0={
> "type":"struct",
> "fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]
>   },
>   transient_lastDdlTime=1541142761, 
>   spark.sql.sources.schema.numParts=1,
>   spark.sql.create.version=2.4.0
> }
> {code}
> This issue aims to write Spark versions to ORC/Parquet file metadata with 
> `org.apache.spark.sql.create.version`. It's different from Hive Table 
> property key `spark.sql.create.version`. It seems that we cannot change that 
> for backward compatibility (even in Apache Spark 3.0)
> *ORC*
> {code}
> User Metadata:
>   org.apache.spark.sql.create.version=3.0.0-SNAPSHOT
> {code}
> *PARQUET*
> {code}
> file:
> file:/tmp/p/part-7-9dc415fe-7773-49ba-9c59-4c151e16009a-c000.snappy.parquet
> creator: parquet-mr version 1.10.0 (build 
> 031a6654009e3b82020012a18434c582bd74c73a)
> extra:   org.apache.spark.sql.create.version = 3.0.0-SNAPSHOT
> extra:   org.apache.spark.sql.parquet.row.metadata = 
> {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
> {code}
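
For anyone who wants to verify the key in a written file, a sketch using the parquet-mr API directly (assuming parquet-hadoop is on the classpath; the path below is a placeholder):

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.hadoop.ParquetFileReader
import org.apache.parquet.hadoop.util.HadoopInputFile

// Placeholder path: point it at a part file written by Spark.
val path = new Path("/tmp/p/part-00000.snappy.parquet")
val reader = ParquetFileReader.open(HadoopInputFile.fromPath(path, new Configuration()))
try {
  // Key-value metadata from the file footer (a java.util.Map[String, String]).
  val kv = reader.getFooter.getFileMetaData.getKeyValueMetaData
  println(kv.get("org.apache.spark.sql.create.version")) // e.g. 3.0.0-SNAPSHOT
} finally {
  reader.close()
}
{code}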



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31072) Default to ParquetOutputCommitter even after configuring s3a committer as "partitioned"

2020-03-31 Thread Felix Kizhakkel Jose (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071819#comment-17071819
 ] 

Felix Kizhakkel Jose commented on SPARK-31072:
--

Hello,
Any updates or help would be much appreciated.

> Default to ParquetOutputCommitter even after configuring s3a committer as 
> "partitioned"
> ---
>
> Key: SPARK-31072
> URL: https://issues.apache.org/jira/browse/SPARK-31072
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.4.5
>Reporter: Felix Kizhakkel Jose
>Priority: Major
>
> My program's logs say it uses ParquetOutputCommitter when I use _*"Parquet"*_, 
> even after I configure it to use the "PartitionedStagingCommitter" with the 
> following configuration:
>  * 
> sparkSession.conf().set("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a",
>  "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory");
>  * sparkSession.conf().set("fs.s3a.committer.name", "partitioned");
>  * sparkSession.conf().set("fs.s3a.committer.staging.conflict-mode", 
> "append");
>  * sparkSession.conf().set("spark.hadoop.parquet.mergeSchema", "false");
>  * sparkSession.conf().set("spark.hadoop.parquet.enable.summary-metadata", 
> false);
> Application logs stacktrace:
> 20/03/06 10:15:17 INFO ParquetFileFormat: Using default output committer for 
> Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter
> 20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 2
> 20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> false
> 20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using user defined 
> output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
> 20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 2
> 20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> false
> 20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using output 
> committer class org.apache.parquet.hadoop.ParquetOutputCommitter
> But when I use _*ORC*_ as the file format, with the same configuration as 
> above, it correctly picks the "PartitionedStagingCommitter":
> 20/03/05 11:51:14 INFO FileOutputCommitter: File Output Committer Algorithm 
> version is 1
> 20/03/05 11:51:14 INFO FileOutputCommitter: FileOutputCommitter skip cleanup 
> _temporary folders under output directory:false, ignore cleanup failures: 
> false
> 20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using committer 
> partitioned to output data to s3a:
> 20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using Commmitter 
> PartitionedStagingCommitter**
> So I am wondering why Parquet and ORC have different behavior?
> How can I use the PartitionedStagingCommitter instead of the 
> ParquetOutputCommitter?
> I started this because, when I was trying to save data to S3 directly with 
> partitionBy() on two columns, I was getting FileNotFoundExceptions 
> intermittently.
> So how can I avoid this issue with *Parquet, writing from Spark to S3 via 
> s3a, without S3Guard?*
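
One likely explanation is that the Parquet write path goes through a Parquet-specific committer setting, so the s3a committer factory alone is not enough. Below is a sketch of the extra settings usually needed, assuming the spark-hadoop-cloud module (which provides these classes) and the S3A committers are on the classpath; they are best set when building the session, since Hadoop options set via conf().set() after startup may not take effect:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.hadoop.fs.s3a.committer.name", "partitioned")
  .config("spark.hadoop.fs.s3a.committer.staging.conflict-mode", "append")
  // Route the generic file commit protocol through the Hadoop PathOutputCommitter factory.
  .config("spark.sql.sources.commitProtocolClass",
    "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  // Parquet-specific: stop the Parquet writer from forcing ParquetOutputCommitter.
  .config("spark.sql.parquet.output.committer.class",
    "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
  .getOrCreate()
{code}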



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31301) flatten the result dataframe of tests in stat

2020-03-31 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071807#comment-17071807
 ] 

Sean R. Owen commented on SPARK-31301:
--

Hm, I'm not sure I'd add a separate method for this either. We'd have three 
types of return values then, just adding to the complexity.
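
For what it's worth, the flattening is already achievable on the user side; a sketch continuing from the quoted spark-shell example below (Spark 3.0+, where org.apache.spark.ml.functions.vector_to_array is available):

{code:scala}
import org.apache.spark.ml.functions.vector_to_array
import org.apache.spark.sql.functions.{arrays_zip, posexplode}

// `chi` is the one-row result of ChiSquareTest.test(df, "features", "label")
// from the example below; this turns it into one row per feature.
val flat = chi
  .select(
    vector_to_array($"pValues").as("pValues"),
    $"degreesOfFreedom",
    vector_to_array($"statistics").as("statistics"))
  .select(posexplode(arrays_zip($"pValues", $"degreesOfFreedom", $"statistics"))
    .as(Seq("featureIndex", "r")))
  .select(
    $"featureIndex",
    $"r.pValues".as("pValue"),
    $"r.degreesOfFreedom".as("degreesOfFreedom"),
    $"r.statistics".as("statistic"))

flat.filter($"pValue" > 0.1).show()
{code}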

> flatten the result dataframe of tests in stat
> -
>
> Key: SPARK-31301
> URL: https://issues.apache.org/jira/browse/SPARK-31301
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 3.1.0
>Reporter: zhengruifeng
>Priority: Major
>
> {code:java}
>  scala> import org.apache.spark.ml.linalg.{Vector, Vectors}
> import org.apache.spark.ml.linalg.{Vector, Vectors}scala> import 
> org.apache.spark.ml.stat.ChiSquareTest
> import org.apache.spark.ml.stat.ChiSquareTestscala> val data = Seq(
>  |   (0.0, Vectors.dense(0.5, 10.0)),
>  |   (0.0, Vectors.dense(1.5, 20.0)),
>  |   (1.0, Vectors.dense(1.5, 30.0)),
>  |   (0.0, Vectors.dense(3.5, 30.0)),
>  |   (0.0, Vectors.dense(3.5, 40.0)),
>  |   (1.0, Vectors.dense(3.5, 40.0))
>  | )
> data: Seq[(Double, org.apache.spark.ml.linalg.Vector)] = 
> List((0.0,[0.5,10.0]), (0.0,[1.5,20.0]), (1.0,[1.5,30.0]), (0.0,[3.5,30.0]), 
> (0.0,[3.5,40.0]), (1.0,[3.5,40.0]))scala> scala> scala> val df = 
> data.toDF("label", "features")
> df: org.apache.spark.sql.DataFrame = [label: double, features: vector]scala>  
>val chi = ChiSquareTest.test(df, "features", "label")
> chi: org.apache.spark.sql.DataFrame = [pValues: vector, degreesOfFreedom: 
> array ... 1 more field]scala> chi.show
> +--------------------+----------------+----------+
> |             pValues|degreesOfFreedom|statistics|
> +--------------------+----------------+----------+
> |[0.68728927879097...|          [2, 3]|[0.75,1.5]|
> +--------------------+----------------+----------+{code}
>  
> Current impls of {{ChiSquareTest}}, {{ANOVATest}}, {{FValueTest}}, and 
> {{Correlation}} all return a dataframe containing only one row.
> I think this is quite hard to use: suppose we have a dataset with dim=1000, 
> the only way to deal with the test result is to collect it with {{head()}} 
> or {{first()}} and then use it in the driver.
> While what I really want to do is filter the dataframe, like pValue > 0.1 or 
> corr < 0.5. *So I suggest flattening the output df in those tests.*
>  
> Note: {{ANOVATest}} and {{FValueTest}} are newly added in 3.1.0, but 
> {{ChiSquareTest}} and {{Correlation}} have been here for a long time.
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25102) Write Spark version to ORC/Parquet file metadata

2020-03-31 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071803#comment-17071803
 ] 

Wenchen Fan commented on SPARK-25102:
-

ok nvm, I checked ORC and it doesn't have the "createdBy" field. Let's keep 
using this consistent way to record spark version in parquet/orc.

> Write Spark version to ORC/Parquet file metadata
> 
>
> Key: SPARK-25102
> URL: https://issues.apache.org/jira/browse/SPARK-25102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zoltan Ivanfi
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, Spark writes Spark version number into Hive Table properties with 
> `spark.sql.create.version`.
> {code}
> parameters:{
>   spark.sql.sources.schema.part.0={
> "type":"struct",
> "fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]
>   },
>   transient_lastDdlTime=1541142761, 
>   spark.sql.sources.schema.numParts=1,
>   spark.sql.create.version=2.4.0
> }
> {code}
> This issue aims to write Spark versions to ORC/Parquet file metadata with 
> `org.apache.spark.sql.create.version`. It's different from Hive Table 
> property key `spark.sql.create.version`. It seems that we cannot change that 
> for backward compatibility (even in Apache Spark 3.0)
> *ORC*
> {code}
> User Metadata:
>   org.apache.spark.sql.create.version=3.0.0-SNAPSHOT
> {code}
> *PARQUET*
> {code}
> file:
> file:/tmp/p/part-7-9dc415fe-7773-49ba-9c59-4c151e16009a-c000.snappy.parquet
> creator: parquet-mr version 1.10.0 (build 
> 031a6654009e3b82020012a18434c582bd74c73a)
> extra:   org.apache.spark.sql.create.version = 3.0.0-SNAPSHOT
> extra:   org.apache.spark.sql.parquet.row.metadata = 
> {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31314) Revert SPARK-29285 to fix shuffle regression caused by creating temporary file eagerly

2020-03-31 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves updated SPARK-31314:
--
Issue Type: Bug  (was: Improvement)

> Revert SPARK-29285 to fix shuffle regression caused by creating temporary 
> file eagerly
> --
>
> Key: SPARK-31314
> URL: https://issues.apache.org/jira/browse/SPARK-31314
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 3.0.0
>
>
> In SPARK-29285, we changed to creating shuffle temporary files eagerly. This 
> helps avoid failing the entire task in the scenario of an occasional disk 
> failure.
> But for applications where many tasks don't actually create shuffle files, it 
> adds overhead. See the benchmark below:
> Env: Spark local-cluster[2, 4, 19968], each query runs 5 rounds, each round 5 
> times.
> Data: TPC-DS scale=99 generated by spark-tpcds-datagen
> Results:
> || ||Base||Revert||
> |Q20|Vector(4.096865667, 2.76231748, 2.722007606, 2.514433591, 2.400373579) 
> Median 2.722007606|Vector(3.763185446, 2.586498463, 2.593472842, 2.320522846, 
> 2.224627274) Median 2.586498463|
> |Q33|Vector(5.872176321, 4.854397586, 4.568787136, 4.393378146, 4.423996818) 
> Median 4.568787136|Vector(5.38746785, 4.361236877, 4.082311276, 3.867206824, 
> 3.783188024) Median 4.082311276|
> |Q52|Vector(3.978870321, 3.225437871, 3.282411608, 2.869674887, 2.644490664) 
> Median 3.225437871|Vector(4.000381522, 3.196025108, 3.248787619, 2.767444508, 
> 2.606163423) Median 3.196025108|
> |Q56|Vector(6.238045133, 4.820535173, 4.609965579, 4.313509894, 4.221256227) 
> Median 4.609965579|Vector(6.241611339, 4.225592467, 4.195202502, 3.757085755, 
> 3.657525982) Median 4.195202502|



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-31316) SQLQueryTestSuite: Display the total generate time for generated java code.

2020-03-31 Thread jiaan.geng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-31316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071711#comment-17071711
 ] 

jiaan.geng commented on SPARK-31316:


I'm working on it.

> SQLQueryTestSuite: Display the total generate time for generated java code.
> ---
>
> Key: SPARK-31316
> URL: https://issues.apache.org/jira/browse/SPARK-31316
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: jiaan.geng
>Priority: Major
>
> SQLQueryTestSuite spends a lot of time generating Java code when using 
> whole-stage codegen.
> We should display the total generation time for the generated Java code.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31316) SQLQueryTestSuite: Display the total generate time for generated java code.

2020-03-31 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-31316:
--

 Summary: SQLQueryTestSuite: Display the total generate time for 
generated java code.
 Key: SPARK-31316
 URL: https://issues.apache.org/jira/browse/SPARK-31316
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: jiaan.geng


SQLQueryTestSuite spends a lot of time generating Java code when using 
whole-stage codegen.
We should display the total generation time for the generated Java code.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29285) Temporary shuffle and local block should be able to handle disk failures

2020-03-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-29285:

Fix Version/s: (was: 3.0.0)

> Temporary shuffle and local block should be able to handle disk failures
> 
>
> Key: SPARK-29285
> URL: https://issues.apache.org/jira/browse/SPARK-29285
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Priority: Major
>
> {code:java}
> java.io.FileNotFoundException: 
> /mnt/dfs/4/yarn/local/usercache/da_haitao/appcache/application_1568691584183_1953115/blockmgr-cc4689f5-eddd-4b99-8af4-4166a86ec30b/10/temp_shuffle_79be5049-d1d5-4a81-8e67-4ef236d3834f
>  (No such file or directory)
>   at java.io.FileOutputStream.open0(Native Method)
>   at java.io.FileOutputStream.open(FileOutputStream.java:270)
>   at java.io.FileOutputStream.(FileOutputStream.java:213)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:103)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:249)
>   at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:209)
>   at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.closeAndGetSpills(ShuffleExternalSorter.java:416)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.closeAndWriteOutput(UnsafeShuffleWriter.java:230)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:190)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at org.apache.spark.scheduler.Task.run(Task.scala:109)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Local or temp shuffle files are initialized without checking, because the 
> getFile method in DiskBlockManager may return an existing subdirectory. 
> Sometimes, when a disk failure occurs, those files may become inaccessible 
> and throw FileNotFoundException later, which may fail the entire task. Task 
> re-running is a bit heavy-handed for these errors; we could at least give 
> another disk or two a try.
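
A rough illustration of the "give another disk a try" idea, outside of Spark's actual DiskBlockManager code; the names and structure are illustrative only:

{code:scala}
import java.io.{File, IOException}
import java.util.UUID

// Try each configured local dir in turn; a failed disk surfaces as an
// IOException from createNewFile, and we simply move on to the next dir.
def createTempShuffleFile(localDirs: Seq[File]): File = {
  val candidates = scala.util.Random.shuffle(localDirs)
  for (dir <- candidates) {
    try {
      val f = new File(dir, s"temp_shuffle_${UUID.randomUUID()}")
      if (f.createNewFile()) return f
    } catch {
      case _: IOException => // disk looks bad, try the next local dir
    }
  }
  throw new IOException(s"Failed to create a temp shuffle file in ${localDirs.size} local dirs")
}
{code}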



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29285) Temporary shuffle and local block should be able to handle disk failures

2020-03-31 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071699#comment-17071699
 ] 

Wenchen Fan commented on SPARK-29285:
-

This is reverted in https://github.com/apache/spark/pull/28072

> Temporary shuffle and local block should be able to handle disk failures
> 
>
> Key: SPARK-29285
> URL: https://issues.apache.org/jira/browse/SPARK-29285
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Assignee: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> {code:java}
> java.io.FileNotFoundException: 
> /mnt/dfs/4/yarn/local/usercache/da_haitao/appcache/application_1568691584183_1953115/blockmgr-cc4689f5-eddd-4b99-8af4-4166a86ec30b/10/temp_shuffle_79be5049-d1d5-4a81-8e67-4ef236d3834f
>  (No such file or directory)
>   at java.io.FileOutputStream.open0(Native Method)
>   at java.io.FileOutputStream.open(FileOutputStream.java:270)
>   at java.io.FileOutputStream.(FileOutputStream.java:213)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:103)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:249)
>   at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:209)
>   at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.closeAndGetSpills(ShuffleExternalSorter.java:416)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.closeAndWriteOutput(UnsafeShuffleWriter.java:230)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:190)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at org.apache.spark.scheduler.Task.run(Task.scala:109)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Local or temp shuffle files are initialized without checking, because the 
> getFile method in DiskBlockManager may return an existing subdirectory. 
> Sometimes, when a disk failure occurs, those files may become inaccessible 
> and throw FileNotFoundException later, which may fail the entire task. Task 
> re-running is a bit heavy-handed for these errors; we could at least give 
> another disk or two a try.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-29285) Temporary shuffle and local block should be able to handle disk failures

2020-03-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reopened SPARK-29285:
-
  Assignee: (was: Kent Yao)

> Temporary shuffle and local block should be able to handle disk failures
> 
>
> Key: SPARK-29285
> URL: https://issues.apache.org/jira/browse/SPARK-29285
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 3.0.0
>Reporter: Kent Yao
>Priority: Major
> Fix For: 3.0.0
>
>
> {code:java}
> java.io.FileNotFoundException: 
> /mnt/dfs/4/yarn/local/usercache/da_haitao/appcache/application_1568691584183_1953115/blockmgr-cc4689f5-eddd-4b99-8af4-4166a86ec30b/10/temp_shuffle_79be5049-d1d5-4a81-8e67-4ef236d3834f
>  (No such file or directory)
>   at java.io.FileOutputStream.open0(Native Method)
>   at java.io.FileOutputStream.open(FileOutputStream.java:270)
>   at java.io.FileOutputStream.(FileOutputStream.java:213)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:103)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116)
>   at 
> org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:249)
>   at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:209)
>   at 
> org.apache.spark.shuffle.sort.ShuffleExternalSorter.closeAndGetSpills(ShuffleExternalSorter.java:416)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.closeAndWriteOutput(UnsafeShuffleWriter.java:230)
>   at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:190)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at org.apache.spark.scheduler.Task.run(Task.scala:109)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> Local or temp shuffle files are initialized without checking, because the 
> getFile method in DiskBlockManager may return an existing subdirectory. 
> Sometimes, when a disk failure occurs, those files may become inaccessible 
> and throw FileNotFoundException later, which may fail the entire task. Task 
> re-running is a bit heavy-handed for these errors; we could at least give 
> another disk or two a try.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-31314) Revert SPARK-29285 to fix shuffle regression caused by creating temporary file eagerly

2020-03-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31314.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28072
[https://github.com/apache/spark/pull/28072]

> Revert SPARK-29285 to fix shuffle regression caused by creating temporary 
> file eagerly
> --
>
> Key: SPARK-31314
> URL: https://issues.apache.org/jira/browse/SPARK-31314
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
> Fix For: 3.0.0
>
>
> In SPARK-29285, we changed to creating shuffle temporary files eagerly. This 
> helps avoid failing the entire task in the scenario of an occasional disk 
> failure.
> But for applications where many tasks don't actually create shuffle files, it 
> adds overhead. See the benchmark below:
> Env: Spark local-cluster[2, 4, 19968], each query runs 5 rounds, each round 5 
> times.
> Data: TPC-DS scale=99 generated by spark-tpcds-datagen
> Results:
> || ||Base||Revert||
> |Q20|Vector(4.096865667, 2.76231748, 2.722007606, 2.514433591, 2.400373579) 
> Median 2.722007606|Vector(3.763185446, 2.586498463, 2.593472842, 2.320522846, 
> 2.224627274) Median 2.586498463|
> |Q33|Vector(5.872176321, 4.854397586, 4.568787136, 4.393378146, 4.423996818) 
> Median 4.568787136|Vector(5.38746785, 4.361236877, 4.082311276, 3.867206824, 
> 3.783188024) Median 4.082311276|
> |Q52|Vector(3.978870321, 3.225437871, 3.282411608, 2.869674887, 2.644490664) 
> Median 3.225437871|Vector(4.000381522, 3.196025108, 3.248787619, 2.767444508, 
> 2.606163423) Median 3.196025108|
> |Q56|Vector(6.238045133, 4.820535173, 4.609965579, 4.313509894, 4.221256227) 
> Median 4.609965579|Vector(6.241611339, 4.225592467, 4.195202502, 3.757085755, 
> 3.657525982) Median 4.195202502|



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31314) Revert SPARK-29285 to fix shuffle regression caused by creating temporary file eagerly

2020-03-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31314:
---

Assignee: Yuanjian Li

> Revert SPARK-29285 to fix shuffle regression caused by creating temporary 
> file eagerly
> --
>
> Key: SPARK-31314
> URL: https://issues.apache.org/jira/browse/SPARK-31314
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Major
>
> In SPARK-29285, we changed to creating shuffle temporary files eagerly. This 
> helps avoid failing the entire task in the scenario of an occasional disk 
> failure.
> But for applications where many tasks don't actually create shuffle files, it 
> adds overhead. See the benchmark below:
> Env: Spark local-cluster[2, 4, 19968], each query runs 5 rounds, each round 5 
> times.
> Data: TPC-DS scale=99 generated by spark-tpcds-datagen
> Results:
> || ||Base||Revert||
> |Q20|Vector(4.096865667, 2.76231748, 2.722007606, 2.514433591, 2.400373579) 
> Median 2.722007606|Vector(3.763185446, 2.586498463, 2.593472842, 2.320522846, 
> 2.224627274) Median 2.586498463|
> |Q33|Vector(5.872176321, 4.854397586, 4.568787136, 4.393378146, 4.423996818) 
> Median 4.568787136|Vector(5.38746785, 4.361236877, 4.082311276, 3.867206824, 
> 3.783188024) Median 4.082311276|
> |Q52|Vector(3.978870321, 3.225437871, 3.282411608, 2.869674887, 2.644490664) 
> Median 3.225437871|Vector(4.000381522, 3.196025108, 3.248787619, 2.767444508, 
> 2.606163423) Median 3.196025108|
> |Q56|Vector(6.238045133, 4.820535173, 4.609965579, 4.313509894, 4.221256227) 
> Median 4.609965579|Vector(6.241611339, 4.225592467, 4.195202502, 3.757085755, 
> 3.657525982) Median 4.195202502|



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-25102) Write Spark version to ORC/Parquet file metadata

2020-03-31 Thread Wenchen Fan (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071692#comment-17071692
 ] 

Wenchen Fan commented on SPARK-25102:
-

It's not completely orthogonal as we can merge these two. e.g. set the writer 
name as `spark-3.0.0` or `spark-2.4.0`.

> Write Spark version to ORC/Parquet file metadata
> 
>
> Key: SPARK-25102
> URL: https://issues.apache.org/jira/browse/SPARK-25102
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Zoltan Ivanfi
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, Spark writes Spark version number into Hive Table properties with 
> `spark.sql.create.version`.
> {code}
> parameters:{
>   spark.sql.sources.schema.part.0={
> "type":"struct",
> "fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}]
>   },
>   transient_lastDdlTime=1541142761, 
>   spark.sql.sources.schema.numParts=1,
>   spark.sql.create.version=2.4.0
> }
> {code}
> This issue aims to write Spark versions to ORC/Parquet file metadata with 
> `org.apache.spark.sql.create.version`. It's different from Hive Table 
> property key `spark.sql.create.version`. It seems that we cannot change that 
> for backward compatibility (even in Apache Spark 3.0)
> *ORC*
> {code}
> User Metadata:
>   org.apache.spark.sql.create.version=3.0.0-SNAPSHOT
> {code}
> *PARQUET*
> {code}
> file:
> file:/tmp/p/part-7-9dc415fe-7773-49ba-9c59-4c151e16009a-c000.snappy.parquet
> creator: parquet-mr version 1.10.0 (build 
> 031a6654009e3b82020012a18434c582bd74c73a)
> extra:   org.apache.spark.sql.create.version = 3.0.0-SNAPSHOT
> extra:   org.apache.spark.sql.parquet.row.metadata = 
> {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]}
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31315) SQLQueryTestSuite: Display the total compile time for generated java code.

2020-03-31 Thread jiaan.geng (Jira)
jiaan.geng created SPARK-31315:
--

 Summary: SQLQueryTestSuite: Display the total compile time for 
generated java code.
 Key: SPARK-31315
 URL: https://issues.apache.org/jira/browse/SPARK-31315
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: jiaan.geng


SQLQueryTestSuite spends a lot of time compiling the generated Java code.
We should display the total compile time for the generated Java code.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-31297) Speed-up date-time rebasing

2020-03-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-31297:
---

Assignee: Maxim Gekk

> Speed-up date-time rebasing
> ---
>
> Key: SPARK-31297
> URL: https://issues.apache.org/jira/browse/SPARK-31297
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
>
> I do believe it is possible to speed up date-time rebasing by building a map 
> from micros to the diffs between the original and rebased micros, and looking 
> the map up via binary search.
> For example, the *America/Los_Angeles* time zone has less than 100 points 
> when diff changes:
> {code:scala}
>   test("optimize rebasing") {
> val start = instantToMicros(LocalDateTime.of(1, 1, 1, 0, 0, 0)
>   .atZone(getZoneId("America/Los_Angeles"))
>   .toInstant)
> val end = instantToMicros(LocalDateTime.of(2030, 1, 1, 0, 0, 0)
>   .atZone(getZoneId("America/Los_Angeles"))
>   .toInstant)
> var micros = start
> var diff = Long.MaxValue
> var counter = 0
> while (micros < end) {
>   val rebased = rebaseGregorianToJulianMicros(micros)
>   val curDiff = rebased - micros
>   if (curDiff != diff) {
> counter += 1
> diff = curDiff
> val ldt = 
> microsToInstant(micros).atZone(getZoneId("America/Los_Angeles")).toLocalDateTime
> println(s"local date-time = $ldt diff = ${diff / MICROS_PER_MINUTE} 
> minutes")
>   }
>   micros += MICROS_PER_HOUR
> }
> println(s"counter = $counter")
>   }
> {code}
> {code:java}
> local date-time = 0001-01-01T00:00 diff = -2909 minutes
> local date-time = 0100-02-28T14:00 diff = -1469 minutes
> local date-time = 0200-02-28T14:00 diff = -29 minutes
> local date-time = 0300-02-28T14:00 diff = 1410 minutes
> local date-time = 0500-02-28T14:00 diff = 2850 minutes
> local date-time = 0600-02-28T14:00 diff = 4290 minutes
> local date-time = 0700-02-28T14:00 diff = 5730 minutes
> local date-time = 0900-02-28T14:00 diff = 7170 minutes
> local date-time = 1000-02-28T14:00 diff = 8610 minutes
> local date-time = 1100-02-28T14:00 diff = 10050 minutes
> local date-time = 1300-02-28T14:00 diff = 11490 minutes
> local date-time = 1400-02-28T14:00 diff = 12930 minutes
> local date-time = 1500-02-28T14:00 diff = 14370 minutes
> local date-time = 1582-10-14T14:00 diff = -29 minutes
> local date-time = 1899-12-31T16:52:58 diff = 0 minutes
> local date-time = 1917-12-27T11:52:58 diff = 60 minutes
> local date-time = 1917-12-27T12:52:58 diff = 0 minutes
> local date-time = 1918-09-15T12:52:58 diff = 60 minutes
> local date-time = 1918-09-15T13:52:58 diff = 0 minutes
> local date-time = 1919-06-30T16:52:58 diff = 31 minutes
> local date-time = 1919-06-30T17:52:58 diff = 0 minutes
> local date-time = 1919-08-15T12:52:58 diff = 60 minutes
> local date-time = 1919-08-15T13:52:58 diff = 0 minutes
> local date-time = 1921-08-31T10:52:58 diff = 60 minutes
> local date-time = 1921-08-31T11:52:58 diff = 0 minutes
> local date-time = 1921-09-30T11:52:58 diff = 60 minutes
> local date-time = 1921-09-30T12:52:58 diff = 0 minutes
> local date-time = 1922-09-30T12:52:58 diff = 60 minutes
> local date-time = 1922-09-30T13:52:58 diff = 0 minutes
> local date-time = 1981-09-30T12:52:58 diff = 60 minutes
> local date-time = 1981-09-30T13:52:58 diff = 0 minutes
> local date-time = 1982-09-30T12:52:58 diff = 60 minutes
> local date-time = 1982-09-30T13:52:58 diff = 0 minutes
> local date-time = 1983-09-30T12:52:58 diff = 60 minutes
> local date-time = 1983-09-30T13:52:58 diff = 0 minutes
> local date-time = 1984-09-29T15:52:58 diff = 60 minutes
> local date-time = 1984-09-29T16:52:58 diff = 0 minutes
> local date-time = 1985-09-28T15:52:58 diff = 60 minutes
> local date-time = 1985-09-28T16:52:58 diff = 0 minutes
> local date-time = 1986-09-27T15:52:58 diff = 60 minutes
> local date-time = 1986-09-27T16:52:58 diff = 0 minutes
> local date-time = 1987-09-26T15:52:58 diff = 60 minutes
> local date-time = 1987-09-26T16:52:58 diff = 0 minutes
> local date-time = 1988-09-24T15:52:58 diff = 60 minutes
> local date-time = 1988-09-24T16:52:58 diff = 0 minutes
> local date-time = 1989-09-23T15:52:58 diff = 60 minutes
> local date-time = 1989-09-23T16:52:58 diff = 0 minutes
> local date-time = 1990-09-29T15:52:58 diff = 60 minutes
> local date-time = 1990-09-29T16:52:58 diff = 0 minutes
> local date-time = 1991-09-28T16:52:58 diff = 60 minutes
> local date-time = 1991-09-28T17:52:58 diff = 0 minutes
> local date-time = 1992-09-26T15:52:58 diff = 60 minutes
> local date-time = 1992-09-26T16:52:58 diff = 0 minutes
> local date-time = 1993-09-25T15:52:58 diff = 60 minutes
> local date-time = 1993-09-25T16:52:58 diff = 0 minutes
> local date-time = 
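
The lookup itself is then a standard binary search over the switch points; a tiny sketch with toy values (not the real America/Los_Angeles table from the output above):

{code:scala}
import java.util.Arrays

// Toy data: sorted instants (in micros) where the diff changes, and the diff
// that applies from each switch point until the next one.
val switchMicros: Array[Long] = Array(-10000000L, -1000000L, 0L)
val diffMicros:   Array[Long] = Array(-300L, -200L, 0L)

def rebaseMicros(micros: Long): Long = {
  var i = Arrays.binarySearch(switchMicros, micros)
  if (i < 0) i = -i - 2            // insertion point - 1: the last switch point <= micros
  micros + diffMicros(math.max(i, 0))
}
{code}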

[jira] [Resolved] (SPARK-31297) Speed-up date-time rebasing

2020-03-31 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-31297.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 28067
[https://github.com/apache/spark/pull/28067]

> Speed-up date-time rebasing
> ---
>
> Key: SPARK-31297
> URL: https://issues.apache.org/jira/browse/SPARK-31297
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> I do believe it is possible to speed up date-time rebasing by building a map 
> from micros to the diffs between the original and rebased micros, and looking 
> the map up via binary search.
> For example, the *America/Los_Angeles* time zone has less than 100 points 
> when diff changes:
> {code:scala}
>   test("optimize rebasing") {
> val start = instantToMicros(LocalDateTime.of(1, 1, 1, 0, 0, 0)
>   .atZone(getZoneId("America/Los_Angeles"))
>   .toInstant)
> val end = instantToMicros(LocalDateTime.of(2030, 1, 1, 0, 0, 0)
>   .atZone(getZoneId("America/Los_Angeles"))
>   .toInstant)
> var micros = start
> var diff = Long.MaxValue
> var counter = 0
> while (micros < end) {
>   val rebased = rebaseGregorianToJulianMicros(micros)
>   val curDiff = rebased - micros
>   if (curDiff != diff) {
> counter += 1
> diff = curDiff
> val ldt = 
> microsToInstant(micros).atZone(getZoneId("America/Los_Angeles")).toLocalDateTime
> println(s"local date-time = $ldt diff = ${diff / MICROS_PER_MINUTE} 
> minutes")
>   }
>   micros += MICROS_PER_HOUR
> }
> println(s"counter = $counter")
>   }
> {code}
> {code:java}
> local date-time = 0001-01-01T00:00 diff = -2909 minutes
> local date-time = 0100-02-28T14:00 diff = -1469 minutes
> local date-time = 0200-02-28T14:00 diff = -29 minutes
> local date-time = 0300-02-28T14:00 diff = 1410 minutes
> local date-time = 0500-02-28T14:00 diff = 2850 minutes
> local date-time = 0600-02-28T14:00 diff = 4290 minutes
> local date-time = 0700-02-28T14:00 diff = 5730 minutes
> local date-time = 0900-02-28T14:00 diff = 7170 minutes
> local date-time = 1000-02-28T14:00 diff = 8610 minutes
> local date-time = 1100-02-28T14:00 diff = 10050 minutes
> local date-time = 1300-02-28T14:00 diff = 11490 minutes
> local date-time = 1400-02-28T14:00 diff = 12930 minutes
> local date-time = 1500-02-28T14:00 diff = 14370 minutes
> local date-time = 1582-10-14T14:00 diff = -29 minutes
> local date-time = 1899-12-31T16:52:58 diff = 0 minutes
> local date-time = 1917-12-27T11:52:58 diff = 60 minutes
> local date-time = 1917-12-27T12:52:58 diff = 0 minutes
> local date-time = 1918-09-15T12:52:58 diff = 60 minutes
> local date-time = 1918-09-15T13:52:58 diff = 0 minutes
> local date-time = 1919-06-30T16:52:58 diff = 31 minutes
> local date-time = 1919-06-30T17:52:58 diff = 0 minutes
> local date-time = 1919-08-15T12:52:58 diff = 60 minutes
> local date-time = 1919-08-15T13:52:58 diff = 0 minutes
> local date-time = 1921-08-31T10:52:58 diff = 60 minutes
> local date-time = 1921-08-31T11:52:58 diff = 0 minutes
> local date-time = 1921-09-30T11:52:58 diff = 60 minutes
> local date-time = 1921-09-30T12:52:58 diff = 0 minutes
> local date-time = 1922-09-30T12:52:58 diff = 60 minutes
> local date-time = 1922-09-30T13:52:58 diff = 0 minutes
> local date-time = 1981-09-30T12:52:58 diff = 60 minutes
> local date-time = 1981-09-30T13:52:58 diff = 0 minutes
> local date-time = 1982-09-30T12:52:58 diff = 60 minutes
> local date-time = 1982-09-30T13:52:58 diff = 0 minutes
> local date-time = 1983-09-30T12:52:58 diff = 60 minutes
> local date-time = 1983-09-30T13:52:58 diff = 0 minutes
> local date-time = 1984-09-29T15:52:58 diff = 60 minutes
> local date-time = 1984-09-29T16:52:58 diff = 0 minutes
> local date-time = 1985-09-28T15:52:58 diff = 60 minutes
> local date-time = 1985-09-28T16:52:58 diff = 0 minutes
> local date-time = 1986-09-27T15:52:58 diff = 60 minutes
> local date-time = 1986-09-27T16:52:58 diff = 0 minutes
> local date-time = 1987-09-26T15:52:58 diff = 60 minutes
> local date-time = 1987-09-26T16:52:58 diff = 0 minutes
> local date-time = 1988-09-24T15:52:58 diff = 60 minutes
> local date-time = 1988-09-24T16:52:58 diff = 0 minutes
> local date-time = 1989-09-23T15:52:58 diff = 60 minutes
> local date-time = 1989-09-23T16:52:58 diff = 0 minutes
> local date-time = 1990-09-29T15:52:58 diff = 60 minutes
> local date-time = 1990-09-29T16:52:58 diff = 0 minutes
> local date-time = 1991-09-28T16:52:58 diff = 60 minutes
> local date-time = 1991-09-28T17:52:58 diff = 0 minutes
> local date-time = 1992-09-26T15:52:58 diff = 60 minutes
> local date-time = 1992-09-26T16:52:58 diff = 0 minutes

[jira] [Created] (SPARK-31314) Revert SPARK-29285 to fix shuffle regression caused by creating temporary file eagerly

2020-03-31 Thread Yuanjian Li (Jira)
Yuanjian Li created SPARK-31314:
---

 Summary: Revert SPARK-29285 to fix shuffle regression caused by 
creating temporary file eagerly
 Key: SPARK-31314
 URL: https://issues.apache.org/jira/browse/SPARK-31314
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Yuanjian Li


In SPARK-29285, we changed to creating shuffle temporary files eagerly. This helps 
avoid failing the entire task in the scenario of an occasional disk failure.

But for applications in which many tasks don't actually create shuffle files, this 
caused overhead. See the benchmark below; a sketch of the lazy-creation behavior 
follows the results.

Env: Spark local-cluster[2, 4, 19968]; each query ran 5 rounds, each round 5 times.
Data: TPC-DS scale=99, generated by spark-tpcds-datagen

Results:
|| ||Base||Revert||
|Q20|Vector(4.096865667, 2.76231748, 2.722007606, 2.514433591, 2.400373579) Median 2.722007606|Vector(3.763185446, 2.586498463, 2.593472842, 2.320522846, 2.224627274) Median 2.586498463|
|Q33|Vector(5.872176321, 4.854397586, 4.568787136, 4.393378146, 4.423996818) Median 4.568787136|Vector(5.38746785, 4.361236877, 4.082311276, 3.867206824, 3.783188024) Median 4.082311276|
|Q52|Vector(3.978870321, 3.225437871, 3.282411608, 2.869674887, 2.644490664) Median 3.225437871|Vector(4.000381522, 3.196025108, 3.248787619, 2.767444508, 2.606163423) Median 3.196025108|
|Q56|Vector(6.238045133, 4.820535173, 4.609965579, 4.313509894, 4.221256227) Median 4.609965579|Vector(6.241611339, 4.225592467, 4.195202502, 3.757085755, 3.657525982) Median 4.195202502|
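
For context, a hypothetical Scala sketch of the lazy-creation behavior the revert 
restores (illustrative names only, not the actual SPARK-29285/SPARK-31314 code): the 
temporary file is created on first use, so tasks that never produce shuffle output 
never touch the disk.
{code:scala}
import java.io.File
import java.nio.file.Files

// Defers creating the shuffle temporary file until it is actually needed.
final class LazyTempFile(dir: File, prefix: String) {
  private var file: File = null

  def get(): File = {
    if (file == null) {
      // Created on first access, not at writer construction time.
      file = Files.createTempFile(dir.toPath, prefix, ".tmp").toFile
    }
    file
  }

  def isCreated: Boolean = file != null
}
{code}
Eager creation effectively moves this call into the writer's constructor, which is 
the per-task overhead measured above for queries whose tasks write no shuffle data.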



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31313) Add `m01` node name to support Minikube 1.8.x

2020-03-31 Thread Dongjoon Hyun (Jira)
Dongjoon Hyun created SPARK-31313:
-

 Summary: Add `m01` node name to support Minikube 1.8.x
 Key: SPARK-31313
 URL: https://issues.apache.org/jira/browse/SPARK-31313
 Project: Spark
  Issue Type: Improvement
  Components: Kubernetes, Tests
Affects Versions: 3.1.0
Reporter: Dongjoon Hyun






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31312) Transforming Hive simple UDF (using JAR) expression may incur CNFE in later evaluation

2020-03-31 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-31312:


 Summary: Transforming Hive simple UDF (using JAR) expression may 
incur CNFE in later evaluation
 Key: SPARK-31312
 URL: https://issues.apache.org/jira/browse/SPARK-31312
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.5, 3.0.0, 3.1.0
Reporter: Jungtaek Lim


In SPARK-26560, we ensured that a Hive UDF using a JAR is executed regardless of the 
current thread context classloader.

[~cloud_fan] pointed out another potential issue in a post-review of SPARK-26560; 
quoting the comment:

{quote}
Found a potential problem: here we call HiveSimpleUDF.dataType (which is a lazy 
val) to force loading the class with the corrected class loader.

However, if the expression gets transformed later, which copies HiveSimpleUDF, 
then calling HiveSimpleUDF.dataType will re-trigger the class loading, and at 
that time there is no guarantee that the corrected classloader is used.

I think we should materialize the loaded class in HiveSimpleUDF.
{quote}

This JIRA issue tracks the effort to verify the potential issue and fix it.
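
As a rough illustration only (a minimal sketch under assumptions; the names are 
illustrative and this is not the actual fix), the idea of materializing the loaded 
class is to resolve it with an explicitly captured classloader, so that later copies 
or transformations of the expression never depend on the calling thread's context 
classloader:
{code:scala}
// Illustrative stand-in for an expression that wraps a UDF loaded from a JAR.
case class SimpleUdfStub(className: String, udfClassLoader: ClassLoader) {
  // Resolved eagerly with the captured loader; even if the expression is
  // copied during a transformation and this val is evaluated again, the
  // resolution never falls back to the current thread's context classloader.
  val loadedClass: Class[_] = Class.forName(className, true, udfClassLoader)

  // Anything derived from the class (e.g. its return type) can now be
  // computed from `loadedClass` without further class loading.
  def loadedClassName: String = loadedClass.getName
}
{code}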



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-30879) Refine doc-building workflow

2020-03-31 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reopened SPARK-30879:
--
  Assignee: (was: Nicholas Chammas)

Reverted at 
https://github.com/apache/spark/commit/4d4c3e76f6d1d5ede511c3ff4036b0c458a0a4e3

> Refine doc-building workflow
> 
>
> Key: SPARK-30879
> URL: https://issues.apache.org/jira/browse/SPARK-30879
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> There are a few rough edges in the workflow for building docs that could be 
> refined:
>  * installing packages with sudo pip
>  * no pinned versions for any doc dependencies
>  * use of some deprecated options
>  * a race condition with jekyll serve



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30879) Refine doc-building workflow

2020-03-31 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-30879:
-
Fix Version/s: (was: 3.1.0)

> Refine doc-building workflow
> 
>
> Key: SPARK-30879
> URL: https://issues.apache.org/jira/browse/SPARK-30879
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 3.1.0
>Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
>Priority: Minor
>
> There are a few rough edges in the workflow for building docs that could be 
> refined:
>  * installing packages with sudo pip
>  * no pinned versions for any doc dependencies
>  * use of some deprecated options
>  * a race condition with jekyll serve



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-31311) Benchmark date-time rebasing in ORC datasource

2020-03-31 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-31311:
---
Description: 
* Benchmark saving dates/timestamps before and after 1582-10-15
 * Benchmark loading dates/timestamps

  was:
* Add benchmarks for saving dates/timestamps to Parquet when 
spark.sql.legacy.parquet.rebaseDateTime.enabled is set to true
* Add a benchmark for loading dates/timestamps from Parquet when rebasing is on


> Benchmark date-time rebasing in ORC datasource
> --
>
> Key: SPARK-31311
> URL: https://issues.apache.org/jira/browse/SPARK-31311
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> * Benchmark saving dates/timestamps before and after 1582-10-15
>  * Benchmark loading dates/timestamps



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31311) Benchmark date-time rebasing in ORC datasource

2020-03-31 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-31311:
--

 Summary: Benchmark date-time rebasing in ORC datasource
 Key: SPARK-31311
 URL: https://issues.apache.org/jira/browse/SPARK-31311
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk
Assignee: Maxim Gekk
 Fix For: 3.0.0


* Add benchmarks for saving dates/timestamps to Parquet when 
spark.sql.legacy.parquet.rebaseDateTime.enabled is set to true
* Add a benchmark for loading dates/timestamps from Parquet when rebasing is on



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6761) Approximate quantile

2020-03-31 Thread Shyam (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-6761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071490#comment-17071490
 ] 

Shyam commented on SPARK-6761:
--

[~mengxr] can you please advise on this bug... 
https://issues.apache.org/jira/browse/SPARK-31310

> Approximate quantile
> 
>
> Key: SPARK-6761
> URL: https://issues.apache.org/jira/browse/SPARK-6761
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: L. C. Hsieh
>Priority: Major
> Fix For: 2.0.0
>
>
> See mailing list discussion: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/Approximate-rank-based-statistics-median-95-th-percentile-etc-for-Spark-td11414.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31310) percentile_approx function not working as expected

2020-03-31 Thread Shyam (Jira)
Shyam created SPARK-31310:
-

 Summary: percentile_approx function not working as expected
 Key: SPARK-31310
 URL: https://issues.apache.org/jira/browse/SPARK-31310
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.4.3, 2.4.1, 2.4.0
 Environment: spark-sql-2.4.1v with Java 8
Reporter: Shyam
 Fix For: 2.4.3


I'm using spark-sql-2.4.1v with Java 8 and I'm trying to find quantiles, i.e. 
percentile 0, percentile 25, etc., on a given column of a dataframe.

The column's data set is as below: 23456.55, 34532.55, 23456.55

When I use the percentile_approx() function, the results do not match those of 
Excel's percentile_inc() function.

Ex: for the data set above, i.e. 23456.55, 34532.55, 23456.55, the values of 
percentile_0, percentile_10, percentile_25, percentile_50, percentile_75, 
percentile_90, and percentile_100 are, respectively:

using the percentile_approx() function:
23456.55, 23456.55, 23456.55, 23456.55, 23456.55, 23456.55, 23456.55

using Excel, i.e. percentile_inc():
23456.55, 23456.55, 23456.55, 23456.55, 28994.5503, 32317.3502, 34532.55

How can I get the correct percentiles, as in Excel, using the percentile_approx() 
function?

For the details please check it.
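
For what it's worth, a hedged Scala sketch of the difference (output not re-verified 
here): percentile_approx only ever returns values that occur in the column and does 
not interpolate, whereas Excel's percentile_inc interpolates between neighboring 
values; Spark's exact percentile aggregate also interpolates and should be much 
closer to Excel on small data sets.
{code:scala}
import org.apache.spark.sql.SparkSession

object PercentileComparison {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("percentiles").getOrCreate()
    import spark.implicits._

    val df = Seq(23456.55, 34532.55, 23456.55).toDF("value")

    // Approximate: returns members of the column, no interpolation.
    df.selectExpr(
      "percentile_approx(value, array(0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0)) AS approx"
    ).show(false)

    // Exact percentile aggregate: interpolates, closer to Excel's percentile_inc.
    df.selectExpr(
      "percentile(value, array(0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0)) AS exact"
    ).show(false)

    spark.stop()
  }
}
{code}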

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org