[jira] [Resolved] (SPARK-31318) Split Parquet/Avro configs for rebasing dates/timestamps in read and in write
[ https://issues.apache.org/jira/browse/SPARK-31318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31318. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28082 [https://github.com/apache/spark/pull/28082] > Split Parquet/Avro configs for rebasing dates/timestamps in read and in write > - > > Key: SPARK-31318 > URL: https://issues.apache.org/jira/browse/SPARK-31318 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > Currently, Spark provides 2 SQL configs to control rebasing of > dates/timestamps in Parquet and Avro datasource: > spark.sql.legacy.parquet.rebaseDateTime.enabled > spark.sql.legacy.avro.rebaseDateTime.enabled > The configs control rebasing in read and in write. That's can be inconvenient > for users who want to read files saved by Spark 2.4 and earlier versions, and > save dates/timestamps without rebasing. > The ticket aims to split the configs, and introduce separate SQL configs for > read and for write. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
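The split described above can be pictured with plain SQL SET commands. The pre-split config names are the ones quoted in the description; the post-split names below are only a sketch of the proposal (the exact read/write config names are an assumption here; PR 28082 has the final spelling):

```sql
-- Pre-split: one switch controls rebasing in both read and write
-- (name taken from the issue description above).
SET spark.sql.legacy.parquet.rebaseDateTime.enabled=true;

-- Post-split sketch: independent read and write switches, so files written
-- by Spark 2.4 can be read with rebasing while new files are written
-- without it. These two names are hypothetical; see PR 28082.
SET spark.sql.legacy.parquet.rebaseDateTimeInRead.enabled=true;
SET spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled=false;
```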
[jira] [Assigned] (SPARK-31318) Split Parquet/Avro configs for rebasing dates/timestamps in read and in write
[ https://issues.apache.org/jira/browse/SPARK-31318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31318: --- Assignee: Maxim Gekk > Split Parquet/Avro configs for rebasing dates/timestamps in read and in write > - > > Key: SPARK-31318 > URL: https://issues.apache.org/jira/browse/SPARK-31318 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > > Currently, Spark provides 2 SQL configs to control rebasing of > dates/timestamps in Parquet and Avro datasource: > spark.sql.legacy.parquet.rebaseDateTime.enabled > spark.sql.legacy.avro.rebaseDateTime.enabled > The configs control rebasing in read and in write. That's can be inconvenient > for users who want to read files saved by Spark 2.4 and earlier versions, and > save dates/timestamps without rebasing. > The ticket aims to split the configs, and introduce separate SQL configs for > read and for write. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31308) Make Python dependencies available for Non-PySpark applications
[ https://issues.apache.org/jira/browse/SPARK-31308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072386#comment-17072386 ] Dongjoon Hyun commented on SPARK-31308: --- Ya. I fixed it. It was failing on my side at that time. > Make Python dependencies available for Non-PySpark applications > --- > > Key: SPARK-31308 > URL: https://issues.apache.org/jira/browse/SPARK-31308 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Submit >Affects Versions: 3.1.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31308) Make Python dependencies available for Non-PySpark applications
[ https://issues.apache.org/jira/browse/SPARK-31308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31308. --- Fix Version/s: 3.1.0 Assignee: L. C. Hsieh Resolution: Fixed This is resolved via https://github.com/apache/spark/pull/28077/ > Make Python dependencies available for Non-PySpark applications > --- > > Key: SPARK-31308 > URL: https://issues.apache.org/jira/browse/SPARK-31308 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Submit >Affects Versions: 3.1.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31313) Add `m01` node name to support Minikube 1.8.x
[ https://issues.apache.org/jira/browse/SPARK-31313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai resolved SPARK-31313. - Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28080 [https://github.com/apache/spark/pull/28080] > Add `m01` node name to support Minikube 1.8.x > - > > Key: SPARK-31313 > URL: https://issues.apache.org/jira/browse/SPARK-31313 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Tests >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31320) fix release script for 3.0.0
Wenchen Fan created SPARK-31320: --- Summary: fix release script for 3.0.0 Key: SPARK-31320 URL: https://issues.apache.org/jira/browse/SPARK-31320 Project: Spark Issue Type: Bug Components: Project Infra Affects Versions: 3.0.0 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31313) Add `m01` node name to support Minikube 1.8.x
[ https://issues.apache.org/jira/browse/SPARK-31313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai reassigned SPARK-31313: --- Assignee: Dongjoon Hyun > Add `m01` node name to support Minikube 1.8.x > - > > Key: SPARK-31313 > URL: https://issues.apache.org/jira/browse/SPARK-31313 > Project: Spark > Issue Type: Improvement > Components: Kubernetes, Tests >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31312) Transforming Hive simple UDF (using JAR) expression may incur CNFE in later evaluation
[ https://issues.apache.org/jira/browse/SPARK-31312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-31312: Fix Version/s: 2.4.6 > Transforming Hive simple UDF (using JAR) expression may incur CNFE in later > evaluation > -- > > Key: SPARK-31312 > URL: https://issues.apache.org/jira/browse/SPARK-31312 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5, 3.0.0, 3.1.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.0.0, 2.4.6 > > > In SPARK-26560, we ensured that Hive UDF using JAR is executed regardless of > current thread context classloader. > [~cloud_fan] pointed out another potential issue in post-review of > SPARK-26560 - quoting the comment: > {quote} > Found a potential problem: here we call HiveSimpleUDF.dateType (which is a > lazy val), to force to load the class with the corrected class loader. > However, if the expression gets transformed later, which copies > HiveSimpleUDF, then calling HiveSimpleUDF.dataType will re-trigger the class > loading, and at that time there is no guarantee that the corrected > classloader is used. > I think we should materialize the loaded class in HiveSimpleUDF. > {quote} > This JIRA issue is to track the effort of verifying the potential issue and > fixing the issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31301) flatten the result dataframe of tests in stat
[ https://issues.apache.org/jira/browse/SPARK-31301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072319#comment-17072319 ] zhengruifeng commented on SPARK-31301: -- {code:java} @Since("3.1.0") def testChiSquare( dataset: Dataset[_], featuresCol: String, labelCol: String): Array[SelectionTestResult] {code} this method is newly added; what about changing the return type to a flattened dataframe to support filtering? Also ping [~huaxingao] > flatten the result dataframe of tests in stat > - > > Key: SPARK-31301 > URL: https://issues.apache.org/jira/browse/SPARK-31301 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Priority: Major > > {code:java} > scala> import org.apache.spark.ml.linalg.{Vector, Vectors} > import org.apache.spark.ml.linalg.{Vector, Vectors}scala> import > org.apache.spark.ml.stat.ChiSquareTest > import org.apache.spark.ml.stat.ChiSquareTestscala> val data = Seq( > | (0.0, Vectors.dense(0.5, 10.0)), > | (0.0, Vectors.dense(1.5, 20.0)), > | (1.0, Vectors.dense(1.5, 30.0)), > | (0.0, Vectors.dense(3.5, 30.0)), > | (0.0, Vectors.dense(3.5, 40.0)), > | (1.0, Vectors.dense(3.5, 40.0)) > | ) > data: Seq[(Double, org.apache.spark.ml.linalg.Vector)] = > List((0.0,[0.5,10.0]), (0.0,[1.5,20.0]), (1.0,[1.5,30.0]), (0.0,[3.5,30.0]), > (0.0,[3.5,40.0]), (1.0,[3.5,40.0]))scala> val df = > data.toDF("label", "features") > df: org.apache.spark.sql.DataFrame = [label: double, features: vector]scala> val chi = ChiSquareTest.test(df, "features", "label") > chi: org.apache.spark.sql.DataFrame = [pValues: vector, degreesOfFreedom: > array ... 1 more field]scala> chi.show > +--------------------+----------------+----------+ > | pValues|degreesOfFreedom|statistics| > +--------------------+----------------+----------+ > |[0.68728927879097...| [2, 3]|[0.75,1.5]| > +--------------------+----------------+----------+{code} > > Current impls of {{ChiSquareTest}}, {{ANOVATest}}, {{FValueTest}}, > {{Correlation}} all return a df only containing one row.
> I think this is quite hard to use. Suppose we have a dataset with dim=1000; > the only way we can deal with the test result is to collect it by > {{head()}} or {{first()}} and then use it in the driver. > While what I really want to do is filter the df like {{pValue>0.1}} or > {{corr<0.5}}, *so I suggest flattening the output df in those tests.* > > note: {{ANOVATest}} and {{FValueTest}} are newly added in 3.1.0, but > {{ChiSquareTest}} and {{Correlation}} have been here for a long time. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
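The argument above (a one-row "vector" result is awkward to filter, while a flattened per-feature layout supports plain filtering) can be sketched in plain Python, with hypothetical field names standing in for the real schema:

```python
# One-row vector layout, as ChiSquareTest.test currently returns:
# a single row holding all per-feature statistics.
vector_row = {
    "pValues": [0.687, 0.682],
    "degreesOfFreedom": [2, 3],
    "statistics": [0.75, 1.5],
}

# To select features by p-value, the whole row must first be collected
# to the driver and indexed manually.
selected = [i for i, p in enumerate(vector_row["pValues"]) if p > 0.1]

# Flattened layout proposed in this ticket: one row per feature,
# so an ordinary row-wise filter does the job.
flat_rows = [
    {"featureIndex": 0, "pValue": 0.687, "degreesOfFreedom": 2, "statistic": 0.75},
    {"featureIndex": 1, "pValue": 0.682, "degreesOfFreedom": 3, "statistic": 1.5},
]
filtered = [r for r in flat_rows if r["pValue"] > 0.1]
```

The column names above are illustrative, not the schema the ticket ends up with.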
[jira] [Resolved] (SPARK-31300) Migrate the implementation of algorithms from MLlib to ML
[ https://issues.apache.org/jira/browse/SPARK-31300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-31300. -- Resolution: Not A Problem > Migrate the implementation of algorithms from MLlib to ML > - > > Key: SPARK-31300 > URL: https://issues.apache.org/jira/browse/SPARK-31300 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Priority: Major > > Migrate the implementation of algorithms from MLlib to ML and make MLlib a > wrapper of the ML side. Including: > 1, clustering: KMeans, BiKMeans, LDA, PIC > 2, regression: AFT, Isotonic, > 3, stat: ChiSquareTest, Correlation, KolmogorovSmirnovTest, > 4, tree: remove the usage of OldAlgo/OldStrategy/etc > 5, feature: Word2Vec, PCA > 6, evaluation: > 7, fpm: FPGrowth, PrefixSpan > 8, linalg, MLUtils, etc -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31309) Migrate the ChiSquareTest from MLlib to ML
[ https://issues.apache.org/jira/browse/SPARK-31309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-31309. -- Resolution: Not A Problem > Migrate the ChiSquareTest from MLlib to ML > -- > > Key: SPARK-31309 > URL: https://issues.apache.org/jira/browse/SPARK-31309 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Priority: Minor > > Move the impl of ChiSq from .mllib to the .ml side, and make .mllib.ChiSq a > wrapper of the .ml side. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31308) Make Python dependencies available for Non-PySpark applications
[ https://issues.apache.org/jira/browse/SPARK-31308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072303#comment-17072303 ] L. C. Hsieh commented on SPARK-31308: - Not sure why this JIRA ticket is not closed automatically. [~dongjoon] > Make Python dependencies available for Non-PySpark applications > --- > > Key: SPARK-31308 > URL: https://issues.apache.org/jira/browse/SPARK-31308 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Submit >Affects Versions: 3.1.0 >Reporter: L. C. Hsieh >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls
[ https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072300#comment-17072300 ] L. C. Hsieh commented on SPARK-29358: - > users might rely on the exception when the schema is mismatched This is the behavior change I was concerned about. Maybe it is minor. I'm also good with this. > Make unionByName optionally fill missing columns with nulls > --- > > Key: SPARK-29358 > URL: https://issues.apache.org/jira/browse/SPARK-29358 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Mukul Murthy >Priority: Major > > Currently, unionByName requires two DataFrames to have the same set of > columns (even though the order can be different). It would be good to add > either an option to unionByName or a new type of union which fills in missing > columns with nulls. > {code:java} > val df1 = Seq(1, 2, 3).toDF("x") > val df2 = Seq("a", "b", "c").toDF("y") > df1.unionByName(df2){code} > This currently throws > {code:java} > org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among > (y); > {code} > Ideally, there would be a way to make this return a DataFrame containing: > {code:java} > +----+----+ > | x| y| > +----+----+ > | 1|null| > | 2|null| > | 3|null| > |null| a| > |null| b| > |null| c| > +----+----+ > {code} > Currently the workaround to make this possible is by using unionByName, but > this is clunky: > {code:java} > df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null))) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31290) Add back the deprecated R APIs
[ https://issues.apache.org/jira/browse/SPARK-31290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31290. -- Fix Version/s: 3.0.0 Resolution: Fixed Fixed in https://github.com/apache/spark/commit/fd0b2281272daba590c6bb277688087d0b26053f > Add back the deprecated R APIs > -- > > Key: SPARK-31290 > URL: https://issues.apache.org/jira/browse/SPARK-31290 > Project: Spark > Issue Type: Sub-task > Components: SparkR, SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Priority: Major > Fix For: 3.0.0 > > > Add back the deprecated R APIs removed by > https://github.com/apache/spark/pull/22843/ > https://github.com/apache/spark/pull/22815 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31290) Add back the deprecated R APIs
[ https://issues.apache.org/jira/browse/SPARK-31290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072299#comment-17072299 ] Hyukjin Kwon edited comment on SPARK-31290 at 4/1/20, 1:43 AM: --- Fixed in https://github.com/apache/spark/pull/28058 was (Author: hyukjin.kwon): Fixed in https://github.com/apache/spark/commit/fd0b2281272daba590c6bb277688087d0b26053f > Add back the deprecated R APIs > -- > > Key: SPARK-31290 > URL: https://issues.apache.org/jira/browse/SPARK-31290 > Project: Spark > Issue Type: Sub-task > Components: SparkR, SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Priority: Major > Fix For: 3.0.0 > > > Add back the deprecated R APIs removed by > https://github.com/apache/spark/pull/22843/ > https://github.com/apache/spark/pull/22815 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls
[ https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072294#comment-17072294 ] Hyukjin Kwon commented on SPARK-29358: -- I am good with adding this behaviour if other committers think it's useful. Yeah, it's really extending rather than breaking. My concern was: - users might rely on the exception when the schema is mismatched - workarounds might be possible with other combinations of APIs After rethinking about this, the former could be guarded by a parameter or switch. The latter concern would need to extend the functionality of {{StructType.merge}} so I am not so against it. Let me reopen the ticket. > Make unionByName optionally fill missing columns with nulls > --- > > Key: SPARK-29358 > URL: https://issues.apache.org/jira/browse/SPARK-29358 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Mukul Murthy >Priority: Major > > Currently, unionByName requires two DataFrames to have the same set of > columns (even though the order can be different). It would be good to add > either an option to unionByName or a new type of union which fills in missing > columns with nulls. 
> {code:java} > val df1 = Seq(1, 2, 3).toDF("x") > val df2 = Seq("a", "b", "c").toDF("y") > df1.unionByName(df2){code} > This currently throws > {code:java} > org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among > (y); > {code} > Ideally, there would be a way to make this return a DataFrame containing: > {code:java} > +++ > | x| y| > +++ > | 1|null| > | 2|null| > | 3|null| > |null| a| > |null| b| > |null| c| > +++ > {code} > Currently the workaround to make this possible is by using unionByName, but > this is clunky: > {code:java} > df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null))) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
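The proposed fill-with-nulls semantics can be modeled in plain Python, with lists of dicts standing in for DataFrames (a sketch of the requested behaviour only; `union_by_name_fill` is hypothetical, not Spark's implementation):

```python
def union_by_name_fill(rows1, rows2):
    """Union two row lists by column name, filling missing columns with None.

    A plain-Python model of the behaviour proposed in this ticket.
    """
    # Combined schema, preserving first-seen column order.
    columns = list(dict.fromkeys(c for r in rows1 + rows2 for c in r))
    # Emit each row with every column present, None where absent.
    return [{c: r.get(c) for c in columns} for r in rows1 + rows2]

df1 = [{"x": 1}, {"x": 2}, {"x": 3}]
df2 = [{"y": "a"}, {"y": "b"}, {"y": "c"}]
result = union_by_name_fill(df1, df2)
# result[0] == {"x": 1, "y": None}
# result[3] == {"x": None, "y": "a"}
```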
[jira] [Reopened] (SPARK-29358) Make unionByName optionally fill missing columns with nulls
[ https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-29358: -- > Make unionByName optionally fill missing columns with nulls > --- > > Key: SPARK-29358 > URL: https://issues.apache.org/jira/browse/SPARK-29358 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Mukul Murthy >Priority: Major > > Currently, unionByName requires two DataFrames to have the same set of > columns (even though the order can be different). It would be good to add > either an option to unionByName or a new type of union which fills in missing > columns with nulls. > {code:java} > val df1 = Seq(1, 2, 3).toDF("x") > val df2 = Seq("a", "b", "c").toDF("y") > df1.unionByName(df2){code} > This currently throws > {code:java} > org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among > (y); > {code} > Ideally, there would be a way to make this return a DataFrame containing: > {code:java} > +++ > | x| y| > +++ > | 1|null| > | 2|null| > | 3|null| > |null| a| > |null| b| > |null| c| > +++ > {code} > Currently the workaround to make this possible is by using unionByName, but > this is clunky: > {code:java} > df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null))) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29358) Make unionByName optionally fill missing columns with nulls
[ https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29358: - Affects Version/s: (was: 3.0.0) 3.1.0 > Make unionByName optionally fill missing columns with nulls > --- > > Key: SPARK-29358 > URL: https://issues.apache.org/jira/browse/SPARK-29358 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Mukul Murthy >Priority: Major > > Currently, unionByName requires two DataFrames to have the same set of > columns (even though the order can be different). It would be good to add > either an option to unionByName or a new type of union which fills in missing > columns with nulls. > {code:java} > val df1 = Seq(1, 2, 3).toDF("x") > val df2 = Seq("a", "b", "c").toDF("y") > df1.unionByName(df2){code} > This currently throws > {code:java} > org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among > (y); > {code} > Ideally, there would be a way to make this return a DataFrame containing: > {code:java} > +++ > | x| y| > +++ > | 1|null| > | 2|null| > | 3|null| > |null| a| > |null| b| > |null| c| > +++ > {code} > Currently the workaround to make this possible is by using unionByName, but > this is clunky: > {code:java} > df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null))) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31319) Document UDF in SQL Reference
Huaxin Gao created SPARK-31319: -- Summary: Document UDF in SQL Reference Key: SPARK-31319 URL: https://issues.apache.org/jira/browse/SPARK-31319 Project: Spark Issue Type: Sub-task Components: Documentation, SQL Affects Versions: 3.0.0 Reporter: Huaxin Gao Document UDF in SQL Reference -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31248) Flaky Test: ExecutorAllocationManagerSuite.interleaving add and remove
[ https://issues.apache.org/jira/browse/SPARK-31248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-31248: - Assignee: wuyi > Flaky Test: ExecutorAllocationManagerSuite.interleaving add and remove > -- > > Key: SPARK-31248 > URL: https://issues.apache.org/jira/browse/SPARK-31248 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Assignee: wuyi >Priority: Major > Fix For: 3.1.0 > > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120300/testReport/ > {code} > sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 12 did > not equal 8 > at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530) > at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503) > at > org.apache.spark.ExecutorAllocationManagerSuite.$anonfun$new$51(ExecutorAllocationManagerSuite.scala:864) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:151) > at > org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31248) Flaky Test: ExecutorAllocationManagerSuite.interleaving add and remove
[ https://issues.apache.org/jira/browse/SPARK-31248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31248. --- Fix Version/s: 3.1.0 Resolution: Fixed This is resolved via https://github.com/apache/spark/pull/28084 . > Flaky Test: ExecutorAllocationManagerSuite.interleaving add and remove > -- > > Key: SPARK-31248 > URL: https://issues.apache.org/jira/browse/SPARK-31248 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 3.1.0 >Reporter: Hyukjin Kwon >Priority: Major > Fix For: 3.1.0 > > > https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120300/testReport/ > {code} > sbt.ForkMain$ForkError: org.scalatest.exceptions.TestFailedException: 12 did > not equal 8 > at > org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530) > at > org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529) > at > org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560) > at > org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503) > at > org.apache.spark.ExecutorAllocationManagerSuite.$anonfun$new$51(ExecutorAllocationManagerSuite.scala:864) > at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > at org.scalatest.Transformer.apply(Transformer.scala:22) > at org.scalatest.Transformer.apply(Transformer.scala:20) > at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186) > at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:151) > at > org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31305) Add a page to list all commands in SQL Reference
[ https://issues.apache.org/jira/browse/SPARK-31305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-31305. -- Fix Version/s: 3.0.0 Assignee: Huaxin Gao Resolution: Fixed Resolved by [https://github.com/apache/spark/pull/28074] > Add a page to list all commands in SQL Reference > > > Key: SPARK-31305 > URL: https://issues.apache.org/jira/browse/SPARK-31305 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.0.0 > > > Add a page to list all commands in SQL Reference, so it will be easier for > user to find a specific command. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-17592) SQL: CAST string as INT inconsistent with Hive
[ https://issues.apache.org/jira/browse/SPARK-17592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-17592. - Resolution: Won't Fix > SQL: CAST string as INT inconsistent with Hive > -- > > Key: SPARK-17592 > URL: https://issues.apache.org/jira/browse/SPARK-17592 > Project: Spark > Issue Type: Bug >Affects Versions: 2.0.0 >Reporter: Furcy Pin >Priority: Major > Attachments: image-2018-05-24-17-10-24-515.png > > > Hello, > there seems to be an inconsistency between Spark and Hive when casting a > string into an Int. > With Hive: > {code} > select cast("0.4" as INT) ; > > 0 > select cast("0.5" as INT) ; > > 0 > select cast("0.6" as INT) ; > > 0 > {code} > With Spark-SQL: > {code} > select cast("0.4" as INT) ; > > 0 > select cast("0.5" as INT) ; > > 1 > select cast("0.6" as INT) ; > > 1 > {code} > Hive seems to perform a floor(string.toDouble), while Spark seems to perform > a round(string.toDouble) > I'm not sure there is any ISO standard for this; mysql has the same behavior > as Hive, while postgresql performs a string.toInt and throws a > NumberFormatException > Personally I think Hive is right, hence my posting this here. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
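The reported difference can be modeled in plain Python (an illustration of the behaviours described in the ticket for positive inputs, not the actual Spark or Hive code path):

```python
import math

def hive_style_cast(s):
    # Hive/MySQL behaviour reported above: truncate toward the floor.
    return math.floor(float(s))

def spark2_style_cast(s):
    # Spark 2.0 behaviour reported above: round half up (positive inputs only).
    return int(float(s) + 0.5)

hive = [hive_style_cast(s) for s in ("0.4", "0.5", "0.6")]
spark = [spark2_style_cast(s) for s in ("0.4", "0.5", "0.6")]
# hive  -> [0, 0, 0], matching the Hive transcript in the ticket
# spark -> [0, 1, 1], matching the Spark-SQL transcript in the ticket
```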
[jira] [Commented] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures
[ https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072212#comment-17072212 ] fqaiser94 commented on SPARK-22231: --- Makes sense, thanks! > Support of map, filter, withColumn, dropColumn in nested list of structures > --- > > Key: SPARK-22231 > URL: https://issues.apache.org/jira/browse/SPARK-22231 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.2.0 >Reporter: DB Tsai >Priority: Major > > At Netflix's algorithm team, we work on ranking problems to find the great > content to fulfill the unique tastes of our members. Before building > recommendation algorithms, we need to prepare the training, testing, and > validation datasets in Apache Spark. Due to the nature of ranking problems, > we have a nested list of items to be ranked in one column, and the top level > holds the contexts describing the setting where a model is to be used (e.g. > profiles, country, time, device, etc.) Here is a blog post describing the > details, [Distributed Time Travel for Feature > Generation|https://medium.com/netflix-techblog/distributed-time-travel-for-feature-generation-389cccdd3907]. > > To be more concrete, for the ranks of videos for a given profile_id at a > given country, our data schema can look like this, > {code:java} > root > |-- profile_id: long (nullable = true) > |-- country_iso_code: string (nullable = true) > |-- items: array (nullable = false) > ||-- element: struct (containsNull = false) > |||-- title_id: integer (nullable = true) > |||-- scores: double (nullable = true) > ... > {code} > We oftentimes need to work on the nested list of structs by applying some > functions on them. Sometimes, we're dropping or adding new columns in the > nested list of structs.
Currently, there is no easy solution in open source > Apache Spark to perform those operations using SQL primitives; many people > just convert the data into RDD to work on the nested level of data, and then > reconstruct the new dataframe as a workaround. This is extremely inefficient > because all the optimizations like predicate pushdown in SQL cannot be > performed, we cannot leverage the columnar format, and the serialization > and deserialization cost becomes really huge even if we just want to add a new > column in the nested level. > We built a solution internally at Netflix which we're very happy with. We > plan to make it open source in Spark upstream. We would like to socialize the > API design to see if we miss any use-case. > The first API we added is *mapItems* on dataframe which takes a function from > *Column* to *Column*, and then applies the function on the nested dataframe. Here > is an example, > {code:java} > case class Data(foo: Int, bar: Double, items: Seq[Double]) > val df: Dataset[Data] = spark.createDataset(Seq( > Data(10, 10.0, Seq(10.1, 10.2, 10.3, 10.4)), > Data(20, 20.0, Seq(20.1, 20.2, 20.3, 20.4)) > )) > val result = df.mapItems("items") { > item => item * 2.0 > } > result.printSchema() > // root > // |-- foo: integer (nullable = false) > // |-- bar: double (nullable = false) > // |-- items: array (nullable = true) > // ||-- element: double (containsNull = true) > result.show() > // +---+----+--------------------+ > // |foo| bar| items| > // +---+----+--------------------+ > // | 10|10.0|[20.2, 20.4, 20.6...| > // | 20|20.0|[40.2, 40.4, 40.6...| > // +---+----+--------------------+ > {code} > Now, with the ability of applying a function in the nested dataframe, we can > add a new function, *withColumn* in *Column* to add or replace the existing > column that has the same name in the nested list of struct.
Here are two > examples demonstrating the API together with *mapItems*; the first one > replaces the existing column, > {code:java} > case class Item(a: Int, b: Double) > case class Data(foo: Int, bar: Double, items: Seq[Item]) > val df: Dataset[Data] = spark.createDataset(Seq( > Data(10, 10.0, Seq(Item(10, 10.0), Item(11, 11.0))), > Data(20, 20.0, Seq(Item(20, 20.0), Item(21, 21.0))) > )) > val result = df.mapItems("items") { > item => item.withColumn(item("b") + 1 as "b") > } > result.printSchema > // root > // |-- foo: integer (nullable = false) > // |-- bar: double (nullable = false) > // |-- items: array (nullable = true) > // |    |-- element: struct (containsNull = true) > // |    |    |-- a: integer (nullable = true) > // |    |    |-- b: double (nullable = true) > result.show(false) > // +---+----+----------------------+ > // |foo|bar |items                 | > // +---+----+----------------------+ > // |10 |10.0|[[10,11.0], [11,12.0]]| > // |20 |20.0|[[20,21.0], [21,22.0]]| > // +---+----+----------------------+ > {code}
[jira] [Resolved] (SPARK-31304) Add ANOVATest example
[ https://issues.apache.org/jira/browse/SPARK-31304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-31304. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28073 [https://github.com/apache/spark/pull/28073] > Add ANOVATest example > - > > Key: SPARK-31304 > URL: https://issues.apache.org/jira/browse/SPARK-31304 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: kevin yu >Priority: Minor > Fix For: 3.1.0 > > > Add ANOVATest examples (Scala, Java and Python) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31304) Add ANOVATest example
[ https://issues.apache.org/jira/browse/SPARK-31304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-31304: Assignee: kevin yu > Add ANOVATest example > - > > Key: SPARK-31304 > URL: https://issues.apache.org/jira/browse/SPARK-31304 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: kevin yu >Priority: Minor > > Add ANOVATest examples (Scala, Java and Python) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17636) Parquet predicate pushdown for nested fields
[ https://issues.apache.org/jira/browse/SPARK-17636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072142#comment-17072142 ] DB Tsai commented on SPARK-17636: - [~MasterDDT] we are not able to merge a new feature into an old release. That being said, we at Apple have used this internally in prod for two years without issues. You can backport it if you want to use it in your env. > Parquet predicate pushdown for nested fields > > > Key: SPARK-17636 > URL: https://issues.apache.org/jira/browse/SPARK-17636 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 1.6.2, 1.6.3, 2.0.2 >Reporter: Mitesh >Assignee: DB Tsai >Priority: Minor > Fix For: 3.0.0 > > > There's a *PushedFilters* for a simple numeric field, but not for a numeric > field inside a struct. Not sure if this is a limitation inherited from > Parquet, or a Spark-only limitation. > {noformat} > scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", > "sale_id") > res5: org.apache.spark.sql.DataFrame = [day_timestamp: > struct, sale_id: bigint] > scala> res5.filter("sale_id > 4").queryExecution.executedPlan > res9: org.apache.spark.sql.execution.SparkPlan = > Filter[23814] [args=(sale_id#86324L > > 4)][outPart=UnknownPartitioning(0)][outOrder=List()] > +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: > s3a://some/parquet/file, PushedFilters: [GreaterThan(sale_id,4)] > scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan > res10: org.apache.spark.sql.execution.SparkPlan = > Filter[23815] [args=(day_timestamp#86302.timestamp > > 4)][outPart=UnknownPartitioning(0)][outOrder=List()] > +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: > s3a://some/parquet/file > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31253) add metrics to shuffle reader
[ https://issues.apache.org/jira/browse/SPARK-31253?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-31253. - Fix Version/s: 3.1.0 Resolution: Fixed > add metrics to shuffle reader > - > > Key: SPARK-31253 > URL: https://issues.apache.org/jira/browse/SPARK-31253 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.1.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17636) Parquet predicate pushdown for nested fields
[ https://issues.apache.org/jira/browse/SPARK-17636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072119#comment-17072119 ] Mitesh commented on SPARK-17636: [~cloud_fan] [~dbtsai] thanks for fixing! Is there any plan to backport to 2.4x branch? > Parquet predicate pushdown for nested fields > > > Key: SPARK-17636 > URL: https://issues.apache.org/jira/browse/SPARK-17636 > Project: Spark > Issue Type: Sub-task > Components: Spark Core, SQL >Affects Versions: 1.6.2, 1.6.3, 2.0.2 >Reporter: Mitesh >Assignee: DB Tsai >Priority: Minor > Fix For: 3.0.0 > > > There's a *PushedFilters* for a simple numeric field, but not for a numeric > field inside a struct. Not sure if this is a Spark limitation because of > Parquet, or only a Spark limitation. > {noformat} > scala> hc.read.parquet("s3a://some/parquet/file").select("day_timestamp", > "sale_id") > res5: org.apache.spark.sql.DataFrame = [day_timestamp: > struct, sale_id: bigint] > scala> res5.filter("sale_id > 4").queryExecution.executedPlan > res9: org.apache.spark.sql.execution.SparkPlan = > Filter[23814] [args=(sale_id#86324L > > 4)][outPart=UnknownPartitioning(0)][outOrder=List()] > +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: > s3a://some/parquet/file, PushedFilters: [GreaterThan(sale_id,4)] > scala> res5.filter("day_timestamp.timestamp > 4").queryExecution.executedPlan > res10: org.apache.spark.sql.execution.SparkPlan = > Filter[23815] [args=(day_timestamp#86302.timestamp > > 4)][outPart=UnknownPartitioning(0)][outOrder=List()] > +- Scan ParquetRelation[day_timestamp#86302,sale_id#86324L] InputPaths: > s3a://some/parquet/file > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31265) Add -XX:MaxDirectMemorySize jvm options in yarn mode
[ https://issues.apache.org/jira/browse/SPARK-31265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-31265. --- Resolution: Invalid > Add -XX:MaxDirectMemorySize jvm options in yarn mode > > > Key: SPARK-31265 > URL: https://issues.apache.org/jira/browse/SPARK-31265 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 3.0.0 >Reporter: wangzhun >Priority: Minor > > Current memory composition `amMemory` + `amMemoryOverhead` > {code:java} > val capability = Records.newRecord(classOf[Resource]) > capability.setMemory(amMemory + amMemoryOverhead) > capability.setVirtualCores(amCores) > if (amResources.nonEmpty) { > ResourceRequestHelper.setResourceRequests(amResources, capability) > } > logDebug(s"Created resource capability for AM request: $capability") > {code} > {code:java} > // Add Xmx for AM memory > javaOpts += "-Xmx" + amMemory + "m" > {code} > It is possible that the physical memory of the container exceeds the limit > and is killed by yarn. > I suggest setting `-XX:MaxDirectMemorySize` here > {code:java} > // Add Xmx for AM memory > javaOpts += "-Xmx" + amMemory + "m" > javaOpts += s"-XX:MaxDirectMemorySize=${amMemoryOverhead}m"{code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31318) Split Parquet/Avro configs for rebasing dates/timestamps in read and in write
[ https://issues.apache.org/jira/browse/SPARK-31318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-31318: --- Parent: SPARK-30951 Issue Type: Sub-task (was: Improvement) > Split Parquet/Avro configs for rebasing dates/timestamps in read and in write > - > > Key: SPARK-31318 > URL: https://issues.apache.org/jira/browse/SPARK-31318 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > Currently, Spark provides 2 SQL configs to control rebasing of > dates/timestamps in Parquet and Avro datasource: > spark.sql.legacy.parquet.rebaseDateTime.enabled > spark.sql.legacy.avro.rebaseDateTime.enabled > The configs control rebasing in read and in write. That's can be inconvenient > for users who want to read files saved by Spark 2.4 and earlier versions, and > save dates/timestamps without rebasing. > The ticket aims to split the configs, and introduce separate SQL configs for > read and for write. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31318) Split Parquet/Avro configs for rebasing dates/timestamps in read and in write
Maxim Gekk created SPARK-31318: -- Summary: Split Parquet/Avro configs for rebasing dates/timestamps in read and in write Key: SPARK-31318 URL: https://issues.apache.org/jira/browse/SPARK-31318 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Currently, Spark provides 2 SQL configs to control rebasing of dates/timestamps in Parquet and Avro datasource: spark.sql.legacy.parquet.rebaseDateTime.enabled spark.sql.legacy.avro.rebaseDateTime.enabled The configs control rebasing in read and in write. That's can be inconvenient for users who want to read files saved by Spark 2.4 and earlier versions, and save dates/timestamps without rebasing. The ticket aims to split the configs, and introduce separate SQL configs for read and for write. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
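For context on why rebasing is needed at all: Spark 2.4 and earlier based date/timestamp handling on the hybrid Julian-Gregorian calendar (java.util.GregorianCalendar), while Spark 3.0 moved to the proleptic Gregorian calendar (java.time). The two calendars assign different day counts to the same local date for old dates, which is observable with the Java standard library alone. A minimal sketch (class and method names are illustrative, not Spark code):

```java
import java.time.LocalDate;
import java.util.GregorianCalendar;
import java.util.TimeZone;

// Illustrative only: compares the raw "days since 1970-01-01" that the two
// calendar systems assign to the same local date.
public class RebaseDemo {
    static final long MILLIS_PER_DAY = 24L * 60 * 60 * 1000;

    // Epoch day for a local date interpreted in the hybrid Julian-Gregorian
    // calendar (what Spark 2.4 effectively wrote into Parquet/Avro files).
    static long hybridEpochDay(int year, int month, int day) {
        GregorianCalendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.clear();
        cal.set(year, month - 1, day); // Calendar months are 0-based
        return Math.floorDiv(cal.getTimeInMillis(), MILLIS_PER_DAY);
    }

    // Epoch day for the same local date in the proleptic Gregorian calendar
    // (java.time, what Spark 3.0 uses).
    static long prolepticEpochDay(int year, int month, int day) {
        return LocalDate.of(year, month, day).toEpochDay();
    }

    public static void main(String[] args) {
        // Modern dates agree, so no rebasing is needed for them...
        System.out.println(prolepticEpochDay(2020, 3, 31) - hybridEpochDay(2020, 3, 31)); // 0
        // ...but dates before the 1582 Gregorian cutover do not: the stored
        // day count differs, so reading old files without rebasing shifts the date.
        System.out.println(prolepticEpochDay(1500, 1, 1) - hybridEpochDay(1500, 1, 1)); // non-zero
    }
}
```

Because the physical day counts in files written by Spark 2.4 differ from what Spark 3.0's calendar expects for such dates, values must be rebased on read and on write; the ticket's point is that the two directions deserve independent configs.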
[jira] [Updated] (SPARK-30775) Improve DOC for executor metrics
[ https://issues.apache.org/jira/browse/SPARK-30775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-30775: -- Fix Version/s: 3.0.1 > Improve DOC for executor metrics > > > Key: SPARK-30775 > URL: https://issues.apache.org/jira/browse/SPARK-30775 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.1.0 >Reporter: Luca Canali >Assignee: Luca Canali >Priority: Minor > Fix For: 3.1.0, 3.0.1 > > > This aims to improve the description of the executor metrics in the > monitoring documentation. In particular by: > - adding reference to the Prometheus end point, as implemented in SPARK-29064 > - extending the list and descripion of executor metrics, following up from > the work in SPARK-27157 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures
[ https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071969#comment-17071969 ] DB Tsai commented on SPARK-22231: - [~fqaiser94] Thanks for continuing this work. We implemented this feature while I was at Netflix, and it's really useful for end users manipulating nested dataframes. Currently, we try not to assign tickets, to prevent someone being assigned but not working on it. Therefore, I unassigned this JIRA. Your PR implements the first part, and I created a sub-task for it: https://issues.apache.org/jira/browse/SPARK-31317 Can you change the Jira number to SPARK-31317 to have it properly linked? > Support of map, filter, withColumn, dropColumn in nested list of structures > --- > > Key: SPARK-22231 > URL: https://issues.apache.org/jira/browse/SPARK-22231 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.2.0 >Reporter: DB Tsai >Priority: Major > > At Netflix's algorithm team, we work on ranking problems to find great > content to fulfill the unique tastes of our members. Before building > recommendation algorithms, we need to prepare the training, testing, and > validation datasets in Apache Spark. Due to the nature of ranking problems, > we have a nested list of items to be ranked in one column, and the top level > is the contexts describing the setting in which a model is to be used (e.g. > profiles, country, time, device, etc.) Here is a blog post describing the > details, [Distributed Time Travel for Feature > Generation|https://medium.com/netflix-techblog/distributed-time-travel-for-feature-generation-389cccdd3907].
> > To be more concrete, for the ranks of videos for a given profile_id at a > given country, our data schema can be looked like this, > {code:java} > root > |-- profile_id: long (nullable = true) > |-- country_iso_code: string (nullable = true) > |-- items: array (nullable = false) > ||-- element: struct (containsNull = false) > |||-- title_id: integer (nullable = true) > |||-- scores: double (nullable = true) > ... > {code} > We oftentimes need to work on the nested list of structs by applying some > functions on them. Sometimes, we're dropping or adding new columns in the > nested list of structs. Currently, there is no easy solution in open source > Apache Spark to perform those operations using SQL primitives; many people > just convert the data into RDD to work on the nested level of data, and then > reconstruct the new dataframe as workaround. This is extremely inefficient > because all the optimizations like predicate pushdown in SQL can not be > performed, we can not leverage on the columnar format, and the serialization > and deserialization cost becomes really huge even we just want to add a new > column in the nested level. > We built a solution internally at Netflix which we're very happy with. We > plan to make it open source in Spark upstream. We would like to socialize the > API design to see if we miss any use-case. > The first API we added is *mapItems* on dataframe which take a function from > *Column* to *Column*, and then apply the function on nested dataframe. 
Here > is an example, > {code:java} > case class Data(foo: Int, bar: Double, items: Seq[Double]) > val df: Dataset[Data] = spark.createDataset(Seq( > Data(10, 10.0, Seq(10.1, 10.2, 10.3, 10.4)), > Data(20, 20.0, Seq(20.1, 20.2, 20.3, 20.4)) > )) > val result = df.mapItems("items") { > item => item * 2.0 > } > result.printSchema() > // root > // |-- foo: integer (nullable = false) > // |-- bar: double (nullable = false) > // |-- items: array (nullable = true) > // ||-- element: double (containsNull = true) > result.show() > // +---+++ > // |foo| bar| items| > // +---+++ > // | 10|10.0|[20.2, 20.4, 20.6...| > // | 20|20.0|[40.2, 40.4, 40.6...| > // +---+++ > {code} > Now, with the ability of applying a function in the nested dataframe, we can > add a new function, *withColumn* in *Column* to add or replace the existing > column that has the same name in the nested list of struct. Here is two > examples demonstrating the API together with *mapItems*; the first one > replaces the existing column, > {code:java} > case class Item(a: Int, b: Double) > case class Data(foo: Int, bar: Double, items: Seq[Item]) > val df: Dataset[Data] = spark.createDataset(Seq( > Data(10, 10.0, Seq(Item(10, 10.0), Item(11, 11.0))), > Data(20, 20.0, Seq(Item(20, 20.0), Item(21, 21.0))) > )) > val result = df.mapItems("items") { > item => item.withColumn(item("b") + 1 as "b") > } > result.printSchema > root > // |-- foo: integer (nullable = false) > //
[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls
[ https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071968#comment-17071968 ] Michael Armbrust commented on SPARK-29358: -- I think we should reconsider closing this as won't fix: - I think the semantics of this operation make sense. We already have this with writing JSON or parquet data. It is just a really inefficient way to accomplish the end goal. - I don't think it is a problem to move "away from SQL union". This is a clearly named, different operation. IMO this one makes *more* sense than SQL union. It is much more likely that columns with the same name are semantically equivalent than columns at the same ordinal with different names. - We are not breaking the behavior of unionByName. Currently it throws an exception in these cases. We are making more data transformations possible, but anything that was working before will continue to work. You could add a boolean flag if you were really concerned, but I think I would skip that. > Make unionByName optionally fill missing columns with nulls > --- > > Key: SPARK-29358 > URL: https://issues.apache.org/jira/browse/SPARK-29358 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Mukul Murthy >Priority: Major > > Currently, unionByName requires two DataFrames to have the same set of > columns (even though the order can be different). It would be good to add > either an option to unionByName or a new type of union which fills in missing > columns with nulls. 
> {code:java} > val df1 = Seq(1, 2, 3).toDF("x") > val df2 = Seq("a", "b", "c").toDF("y") > df1.unionByName(df2){code} > This currently throws > {code:java} > org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among > (y); > {code} > Ideally, there would be a way to make this return a DataFrame containing: > {code:java} > +----+----+ > |   x|   y| > +----+----+ > |   1|null| > |   2|null| > |   3|null| > |null|   a| > |null|   b| > |null|   c| > +----+----+ > {code} > Currently the workaround is to add the missing columns as null literals before > calling unionByName, but this is clunky: > {code:java} > df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null))) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
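The requested semantics can be sketched without Spark: treat rows as maps from column name to value, union by name, and fill columns absent on one side with nulls. A minimal sketch in plain Java (the name `unionByNameFillNulls` is hypothetical, not a Spark API):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the proposed behavior: the result keeps every column
// seen on either side, matching by name rather than by position, and fills
// missing values with null.
public class UnionByNameDemo {
    static List<Map<String, Object>> unionByNameFillNulls(
            List<Map<String, Object>> left, List<Map<String, Object>> right) {
        // Combined schema: left columns first, then any new right columns.
        Set<String> schema = new LinkedHashSet<>();
        left.forEach(row -> schema.addAll(row.keySet()));
        right.forEach(row -> schema.addAll(row.keySet()));

        List<Map<String, Object>> out = new ArrayList<>();
        for (List<Map<String, Object>> side : List.of(left, right)) {
            for (Map<String, Object> row : side) {
                Map<String, Object> filled = new LinkedHashMap<>();
                for (String col : schema) {
                    filled.put(col, row.get(col)); // absent columns become null
                }
                out.add(filled);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> df1 =
                List.of(Map.<String, Object>of("x", 1), Map.<String, Object>of("x", 2));
        List<Map<String, Object>> df2 =
                List.of(Map.<String, Object>of("y", "a"), Map.<String, Object>of("y", "b"));
        System.out.println(unionByNameFillNulls(df1, df2));
    }
}
```

This mirrors the desired output in the ticket: rows from df1 carry y = null and rows from df2 carry x = null, with no exception thrown.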
[jira] [Created] (SPARK-31317) Add withField method to Column class
DB Tsai created SPARK-31317: --- Summary: Add withField method to Column class Key: SPARK-31317 URL: https://issues.apache.org/jira/browse/SPARK-31317 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3 Reporter: DB Tsai -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures
[ https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] DB Tsai reassigned SPARK-22231: --- Assignee: (was: Jeremy Smith) > Support of map, filter, withColumn, dropColumn in nested list of structures > --- > > Key: SPARK-22231 > URL: https://issues.apache.org/jira/browse/SPARK-22231 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.2.0 >Reporter: DB Tsai >Priority: Major > > At Netflix's algorithm team, we work on ranking problems to find the great > content to fulfill the unique tastes of our members. Before building a > recommendation algorithms, we need to prepare the training, testing, and > validation datasets in Apache Spark. Due to the nature of ranking problems, > we have a nested list of items to be ranked in one column, and the top level > is the contexts describing the setting for where a model is to be used (e.g. > profiles, country, time, device, etc.) Here is a blog post describing the > details, [Distributed Time Travel for Feature > Generation|https://medium.com/netflix-techblog/distributed-time-travel-for-feature-generation-389cccdd3907]. > > To be more concrete, for the ranks of videos for a given profile_id at a > given country, our data schema can be looked like this, > {code:java} > root > |-- profile_id: long (nullable = true) > |-- country_iso_code: string (nullable = true) > |-- items: array (nullable = false) > ||-- element: struct (containsNull = false) > |||-- title_id: integer (nullable = true) > |||-- scores: double (nullable = true) > ... > {code} > We oftentimes need to work on the nested list of structs by applying some > functions on them. Sometimes, we're dropping or adding new columns in the > nested list of structs. 
Currently, there is no easy solution in open source > Apache Spark to perform those operations using SQL primitives; many people > just convert the data into RDD to work on the nested level of data, and then > reconstruct the new dataframe as workaround. This is extremely inefficient > because all the optimizations like predicate pushdown in SQL can not be > performed, we can not leverage on the columnar format, and the serialization > and deserialization cost becomes really huge even we just want to add a new > column in the nested level. > We built a solution internally at Netflix which we're very happy with. We > plan to make it open source in Spark upstream. We would like to socialize the > API design to see if we miss any use-case. > The first API we added is *mapItems* on dataframe which take a function from > *Column* to *Column*, and then apply the function on nested dataframe. Here > is an example, > {code:java} > case class Data(foo: Int, bar: Double, items: Seq[Double]) > val df: Dataset[Data] = spark.createDataset(Seq( > Data(10, 10.0, Seq(10.1, 10.2, 10.3, 10.4)), > Data(20, 20.0, Seq(20.1, 20.2, 20.3, 20.4)) > )) > val result = df.mapItems("items") { > item => item * 2.0 > } > result.printSchema() > // root > // |-- foo: integer (nullable = false) > // |-- bar: double (nullable = false) > // |-- items: array (nullable = true) > // ||-- element: double (containsNull = true) > result.show() > // +---+++ > // |foo| bar| items| > // +---+++ > // | 10|10.0|[20.2, 20.4, 20.6...| > // | 20|20.0|[40.2, 40.4, 40.6...| > // +---+++ > {code} > Now, with the ability of applying a function in the nested dataframe, we can > add a new function, *withColumn* in *Column* to add or replace the existing > column that has the same name in the nested list of struct. 
Here is two > examples demonstrating the API together with *mapItems*; the first one > replaces the existing column, > {code:java} > case class Item(a: Int, b: Double) > case class Data(foo: Int, bar: Double, items: Seq[Item]) > val df: Dataset[Data] = spark.createDataset(Seq( > Data(10, 10.0, Seq(Item(10, 10.0), Item(11, 11.0))), > Data(20, 20.0, Seq(Item(20, 20.0), Item(21, 21.0))) > )) > val result = df.mapItems("items") { > item => item.withColumn(item("b") + 1 as "b") > } > result.printSchema > root > // |-- foo: integer (nullable = false) > // |-- bar: double (nullable = false) > // |-- items: array (nullable = true) > // ||-- element: struct (containsNull = true) > // |||-- a: integer (nullable = true) > // |||-- b: double (nullable = true) > result.show(false) > // +---++--+ > // |foo|bar |items | > // +---++--+ > // |10 |10.0|[[10,11.0], [11,12.0]]| > // |20 |20.0|[[20,21.0], [21,22.0]]| > // +---++--+ > {code} > and the second one
[jira] [Commented] (SPARK-30873) Handling Node Decommissioning for YARN cluster manager in Spark
[ https://issues.apache.org/jira/browse/SPARK-30873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071953#comment-17071953 ] Thomas Graves commented on SPARK-30873: --- so I think this is a dup of SPARK-30835. > Handling Node Decommissioning for YARN cluster manager in Spark > -- > > Key: SPARK-30873 > URL: https://issues.apache.org/jira/browse/SPARK-30873 > Project: Spark > Issue Type: Improvement > Components: Spark Core, YARN >Affects Versions: 3.1.0 >Reporter: Saurabh Chawla >Priority: Major > > In many public cloud environments, node loss (in the case of AWS > Spot loss, Spot blocks and GCP preemptible VMs) is a planned and informed > activity. > The cloud provider notifies the cluster manager about the possible loss of a > node ahead of time. A few examples are listed here: > a) Spot loss in AWS (2 min before the event) > b) GCP preemptible VM loss (30 seconds before the event) > c) AWS Spot block loss with info on termination time (generally a few tens of > minutes before decommission, as configured in YARN) > This JIRA tries to make Spark leverage the advance knowledge of node loss and > adjust the scheduling of tasks to minimise the impact on > the application. > It is well known that when a host is lost, the executors, their running tasks, > their caches and also shuffle data are lost. This can result in waste of > compute and other resources. > The focus here is to build a framework for YARN that can be extended to > other cluster managers to handle such scenarios. > The framework must handle one or more of the following: > 1) Prevent new tasks from starting on any executors on decommissioning nodes. > 2) Decide to kill the running tasks so that they can be restarted elsewhere > (assuming they will not complete within the deadline), OR allow them to > continue, hoping they will finish within the deadline.
> 3) Clear the shuffle data entry from the MapOutputTracker for the decommissioned > node's hostname to prevent shuffle FetchFailed exceptions. The most significant > advantage of unregistering shuffle outputs is that when Spark schedules the first > re-attempt to compute the missing blocks, it notices all of the missing > blocks from decommissioned nodes and recovers them in only one attempt. This > speeds up the recovery process significantly over the standard Spark > implementation, where stages might be rescheduled multiple times to recompute > missing shuffles from all nodes, and prevents jobs from being stuck for hours > failing and recomputing. > 4) Prevent the stage from aborting due to the FetchFailed exception in case of > decommissioning of a node. In Spark there is a limit on the number of consecutive stage > attempts allowed before a stage is aborted. This is controlled by the config > spark.stage.maxConsecutiveAttempts. Not counting fetch failures due to > decommissioning of nodes towards stage failure improves the reliability of > the system. > Main components of change > 1) Get the ClusterInfo update from the Resource Manager -> Application Master > -> Spark Driver. > 2) DecommissionTracker, which resides inside the driver, tracks all the decommissioned > nodes and takes the necessary actions and state transitions. > 3) Based on the decommissioned node list, add hooks in the code to achieve: > a) No new tasks on executors > b) Remove shuffle data mapping info for the node to be decommissioned from > the mapOutputTracker > c) Do not count fetchFailure from decommissioned nodes towards stage failure > On receiving info that a node is to be decommissioned, the below actions need > to be performed by the DecommissionTracker on the driver: > * Add the entry of the node in the DecommissionTracker with its termination time and > node state as "DECOMMISSIONING". > * Stop assigning any new tasks on executors on the nodes which are candidates > for decommissioning. This makes sure that, slowly, as the tasks finish, the usage of > this node dies down.
> * Kill all the executors for the decommissioning nodes after configurable > period of time, say "spark.graceful.decommission.executor.leasetimePct". This > killing ensures two things. Firstly, the task failure will be attributed in > job failure count. Second, avoid generation on more shuffle data on the node > that will eventually be lost. The node state is set to > "EXECUTOR_DECOMMISSIONED". > * Mark Shuffle data on the node as unavailable after > "spark.qubole.graceful.decommission.shuffedata.leasetimePct" time. This will > ensure that recomputation of missing shuffle partition is done early, rather > than reducers failing with a time-consuming FetchFailure. The node state is > set to "SHUFFLE_DECOMMISSIONED". > * Mark Node as Terminated after the termination time. Now the state of the > node is "TERMINATED". > * Remove the node entry from Decommission Tracker if the
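The node lifecycle described in the comments above can be sketched as a small state machine. This is a hypothetical illustration of the proposed DecommissionTracker states and ordering, not actual Spark code; all names are illustrative:

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the DecommissionTracker state machine described in
// the ticket: stop scheduling, then kill executors, then unregister shuffle
// data, then mark the node terminated.
public class DecommissionTrackerSketch {
    public enum NodeState {
        DECOMMISSIONING, EXECUTOR_DECOMMISSIONED, SHUFFLE_DECOMMISSIONED, TERMINATED
    }

    // Legal forward transitions, in the order the ticket describes.
    private static final Map<NodeState, NodeState> NEXT = new EnumMap<>(NodeState.class);
    static {
        NEXT.put(NodeState.DECOMMISSIONING, NodeState.EXECUTOR_DECOMMISSIONED);
        NEXT.put(NodeState.EXECUTOR_DECOMMISSIONED, NodeState.SHUFFLE_DECOMMISSIONED);
        NEXT.put(NodeState.SHUFFLE_DECOMMISSIONED, NodeState.TERMINATED);
    }

    private final Map<String, NodeState> nodes = new HashMap<>();

    // Called when the cluster manager reports an upcoming node loss.
    public void addDecommissioningNode(String host) {
        nodes.putIfAbsent(host, NodeState.DECOMMISSIONING);
    }

    // Advance one lifecycle step; a TERMINATED node stays TERMINATED.
    public NodeState advance(String host) {
        NodeState next = NEXT.get(nodes.get(host));
        if (next != null) {
            nodes.put(host, next);
        }
        return nodes.get(host);
    }

    // The scheduler would consult this before placing new tasks on a host.
    public boolean canScheduleTasks(String host) {
        return !nodes.containsKey(host);
    }

    public static void main(String[] args) {
        DecommissionTrackerSketch t = new DecommissionTrackerSketch();
        t.addDecommissioningNode("node1");
        System.out.println(t.canScheduleTasks("node1")); // false: no new tasks
        System.out.println(t.advance("node1"));          // EXECUTOR_DECOMMISSIONED
        System.out.println(t.advance("node1"));          // SHUFFLE_DECOMMISSIONED
        System.out.println(t.advance("node1"));          // TERMINATED
    }
}
```

In the actual proposal the transitions would be driven by timers derived from the termination time (the lease-time configs mentioned above) rather than explicit advance() calls.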
[jira] [Updated] (SPARK-31230) use statement plans in DataFrameWriter(V2)
[ https://issues.apache.org/jira/browse/SPARK-31230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-31230: Fix Version/s: (was: 3.0.0) 3.0.1 > use statement plans in DataFrameWriter(V2) > -- > > Key: SPARK-31230 > URL: https://issues.apache.org/jira/browse/SPARK-31230 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.1 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31312) Transforming Hive simple UDF (using JAR) expression may incur CNFE in later evaluation
[ https://issues.apache.org/jira/browse/SPARK-31312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31312. - Fix Version/s: 3.0.0 Assignee: Jungtaek Lim Resolution: Fixed > Transforming Hive simple UDF (using JAR) expression may incur CNFE in later > evaluation > -- > > Key: SPARK-31312 > URL: https://issues.apache.org/jira/browse/SPARK-31312 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5, 3.0.0, 3.1.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.0.0 > > > In SPARK-26560, we ensured that Hive UDF using JAR is executed regardless of > current thread context classloader. > [~cloud_fan] pointed out another potential issue in post-review of > SPARK-26560 - quoting the comment: > {quote} > Found a potential problem: here we call HiveSimpleUDF.dateType (which is a > lazy val), to force to load the class with the corrected class loader. > However, if the expression gets transformed later, which copies > HiveSimpleUDF, then calling HiveSimpleUDF.dataType will re-trigger the class > loading, and at that time there is no guarantee that the corrected > classloader is used. > I think we should materialize the loaded class in HiveSimpleUDF. > {quote} > This JIRA issue is to track the effort of verifying the potential issue and > fixing the issue. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22231) Support of map, filter, withColumn, dropColumn in nested list of structures
[ https://issues.apache.org/jira/browse/SPARK-22231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071945#comment-17071945 ] fqaiser94 commented on SPARK-22231: --- Excellent, thanks Reynold. If somebody could assign this ticket to me, that would be great as I have the code ready for all 3 methods. Pull request for the first one ({{withField}}) is ready here: [https://github.com/apache/spark/pull/27066] Looking forward to the reviews. > Support of map, filter, withColumn, dropColumn in nested list of structures > --- > > Key: SPARK-22231 > URL: https://issues.apache.org/jira/browse/SPARK-22231 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.2.0 >Reporter: DB Tsai >Assignee: Jeremy Smith >Priority: Major > > At Netflix's algorithm team, we work on ranking problems to find the great > content to fulfill the unique tastes of our members. Before building a > recommendation algorithms, we need to prepare the training, testing, and > validation datasets in Apache Spark. Due to the nature of ranking problems, > we have a nested list of items to be ranked in one column, and the top level > is the contexts describing the setting for where a model is to be used (e.g. > profiles, country, time, device, etc.) Here is a blog post describing the > details, [Distributed Time Travel for Feature > Generation|https://medium.com/netflix-techblog/distributed-time-travel-for-feature-generation-389cccdd3907]. > > To be more concrete, for the ranks of videos for a given profile_id at a > given country, our data schema can be looked like this, > {code:java} > root > |-- profile_id: long (nullable = true) > |-- country_iso_code: string (nullable = true) > |-- items: array (nullable = false) > ||-- element: struct (containsNull = false) > |||-- title_id: integer (nullable = true) > |||-- scores: double (nullable = true) > ... > {code} > We oftentimes need to work on the nested list of structs by applying some > functions on them. 
Sometimes, we're dropping or adding new columns in the > nested list of structs. Currently, there is no easy solution in open source > Apache Spark to perform those operations using SQL primitives; many people > just convert the data into an RDD to work on the nested level of data, and then > reconstruct the new dataframe as a workaround. This is extremely inefficient > because all the optimizations like predicate pushdown in SQL cannot be > performed, we cannot leverage the columnar format, and the serialization > and deserialization cost becomes really huge even if we just want to add a new > column at the nested level. > We built a solution internally at Netflix which we're very happy with. We > plan to make it open source in Spark upstream. We would like to socialize the > API design to see if we missed any use-case. > The first API we added is *mapItems* on dataframe, which takes a function from > *Column* to *Column* and then applies the function on the nested dataframe. Here > is an example, > {code:java} > case class Data(foo: Int, bar: Double, items: Seq[Double]) > val df: Dataset[Data] = spark.createDataset(Seq( > Data(10, 10.0, Seq(10.1, 10.2, 10.3, 10.4)), > Data(20, 20.0, Seq(20.1, 20.2, 20.3, 20.4)) > )) > val result = df.mapItems("items") { > item => item * 2.0 > } > result.printSchema() > // root > // |-- foo: integer (nullable = false) > // |-- bar: double (nullable = false) > // |-- items: array (nullable = true) > // ||-- element: double (containsNull = true) > result.show() > // +---+----+--------------------+ > // |foo| bar| items| > // +---+----+--------------------+ > // | 10|10.0|[20.2, 20.4, 20.6...| > // | 20|20.0|[40.2, 40.4, 40.6...| > // +---+----+--------------------+ > {code} > Now, with the ability to apply a function on the nested dataframe, we can > add a new function, *withColumn* in *Column*, to add or replace an existing > column that has the same name in the nested list of structs. 
Here is two > examples demonstrating the API together with *mapItems*; the first one > replaces the existing column, > {code:java} > case class Item(a: Int, b: Double) > case class Data(foo: Int, bar: Double, items: Seq[Item]) > val df: Dataset[Data] = spark.createDataset(Seq( > Data(10, 10.0, Seq(Item(10, 10.0), Item(11, 11.0))), > Data(20, 20.0, Seq(Item(20, 20.0), Item(21, 21.0))) > )) > val result = df.mapItems("items") { > item => item.withColumn(item("b") + 1 as "b") > } > result.printSchema > root > // |-- foo: integer (nullable = false) > // |-- bar: double (nullable = false) > // |-- items: array (nullable = true) > // ||-- element: struct (containsNull = true) > // |||-- a: integer (nullable = true) > // ||
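For context on the thread above: the {{withField}} method from pull request 27066 was eventually merged as {{Column.withField}} in the Spark 3.1 line. As a hedged sketch (not the proposed {{mapItems}} API), the replace-a-nested-column example can be approximated with the built-in higher-order {{transform}} function plus {{withField}}; this assumes a running local SparkSession, and the app name is illustrative:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, transform}

case class Item(a: Int, b: Double)
case class Data(foo: Int, bar: Double, items: Seq[Item])

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("withField-sketch") // illustrative name
  .getOrCreate()
import spark.implicits._

val df = Seq(
  Data(10, 10.0, Seq(Item(10, 10.0), Item(11, 11.0))),
  Data(20, 20.0, Seq(Item(20, 20.0), Item(21, 21.0)))
).toDS()

// transform maps over the array elements; withField replaces the struct
// field "b" in each element, mirroring item.withColumn(item("b") + 1 as "b")
// from the proposal.
val result = df.withColumn(
  "items",
  transform(col("items"), item => item.withField("b", item.getField("b") + 1))
)
result.printSchema()
```

This avoids the RDD round-trip the description complains about, since the whole rewrite stays inside Catalyst expressions.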
[jira] [Commented] (SPARK-31308) Make Python dependencies available for Non-PySpark applications
[ https://issues.apache.org/jira/browse/SPARK-31308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071936#comment-17071936 ] L. C. Hsieh commented on SPARK-31308: - Thanks [~dongjoon]. I changed the `Affected Version`. > Make Python dependencies available for Non-PySpark applications > --- > > Key: SPARK-31308 > URL: https://issues.apache.org/jira/browse/SPARK-31308 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Submit >Affects Versions: 3.1.0 >Reporter: L. C. Hsieh >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31308) Make Python dependencies available for Non-PySpark applications
[ https://issues.apache.org/jira/browse/SPARK-31308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh updated SPARK-31308: Affects Version/s: (was: 3.0.0) 3.1.0 > Make Python dependencies available for Non-PySpark applications > --- > > Key: SPARK-31308 > URL: https://issues.apache.org/jira/browse/SPARK-31308 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Submit >Affects Versions: 3.1.0 >Reporter: L. C. Hsieh >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31230) use statement plans in DataFrameWriter(V2)
[ https://issues.apache.org/jira/browse/SPARK-31230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31230. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 27992 [https://github.com/apache/spark/pull/27992] > use statement plans in DataFrameWriter(V2) > -- > > Key: SPARK-31230 > URL: https://issues.apache.org/jira/browse/SPARK-31230 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31308) Make Python dependencies available for Non-PySpark applications
[ https://issues.apache.org/jira/browse/SPARK-31308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071928#comment-17071928 ] Dongjoon Hyun commented on SPARK-31308: --- Yes. For `New Feature` or `Improvement`, we should use `master` branch version because we don't backport. The current snapshot version of `master` branch is `3.1.0-SNAPSHOT`. So, you need to use `3.1.0`. > Make Python dependencies available for Non-PySpark applications > --- > > Key: SPARK-31308 > URL: https://issues.apache.org/jira/browse/SPARK-31308 > Project: Spark > Issue Type: Improvement > Components: PySpark, Spark Submit >Affects Versions: 3.0.0 >Reporter: L. C. Hsieh >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25102) Write Spark version to ORC/Parquet file metadata
[ https://issues.apache.org/jira/browse/SPARK-25102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071926#comment-17071926 ] Dongjoon Hyun commented on SPARK-25102: --- Thanks, [~cloud_fan]. > Write Spark version to ORC/Parquet file metadata > > > Key: SPARK-25102 > URL: https://issues.apache.org/jira/browse/SPARK-25102 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zoltan Ivanfi >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > > Currently, Spark writes Spark version number into Hive Table properties with > `spark.sql.create.version`. > {code} > parameters:{ > spark.sql.sources.schema.part.0={ > "type":"struct", > "fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}] > }, > transient_lastDdlTime=1541142761, > spark.sql.sources.schema.numParts=1, > spark.sql.create.version=2.4.0 > } > {code} > This issue aims to write Spark versions to ORC/Parquet file metadata with > `org.apache.spark.sql.create.version`. It's different from Hive Table > property key `spark.sql.create.version`. It seems that we cannot change that > for backward compatibility (even in Apache Spark 3.0) > *ORC* > {code} > User Metadata: > org.apache.spark.sql.create.version=3.0.0-SNAPSHOT > {code} > *PARQUET* > {code} > file: > file:/tmp/p/part-7-9dc415fe-7773-49ba-9c59-4c151e16009a-c000.snappy.parquet > creator: parquet-mr version 1.10.0 (build > 031a6654009e3b82020012a18434c582bd74c73a) > extra: org.apache.spark.sql.create.version = 3.0.0-SNAPSHOT > extra: org.apache.spark.sql.parquet.row.metadata = > {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]} > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31072) Default to ParquetOutputCommitter even after configuring s3a committer as "partitioned"
[ https://issues.apache.org/jira/browse/SPARK-31072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071819#comment-17071819 ] Felix Kizhakkel Jose commented on SPARK-31072: -- Hello, Any updates or helps will be much appreciated. > Default to ParquetOutputCommitter even after configuring s3a committer as > "partitioned" > --- > > Key: SPARK-31072 > URL: https://issues.apache.org/jira/browse/SPARK-31072 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.4.5 >Reporter: Felix Kizhakkel Jose >Priority: Major > > My program logs says it uses ParquetOutputCommitter when I use _*"Parquet"*_ > even after I configure to use "PartitionedStagingCommitter" with the > following configuration: > * > sparkSession.conf().set("spark.hadoop.mapreduce.outputcommitter.factory.scheme.s3a", > "org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory"); > * sparkSession.conf().set("fs.s3a.committer.name", "partitioned"); > * sparkSession.conf().set("fs.s3a.committer.staging.conflict-mode", > "append"); > * sparkSession.conf().set("spark.hadoop.parquet.mergeSchema", "false"); > * sparkSession.conf().set("spark.hadoop.parquet.enable.summary-metadata", > false); > Application logs stacktrace: > 20/03/06 10:15:17 INFO ParquetFileFormat: Using default output committer for > Parquet: org.apache.parquet.hadoop.ParquetOutputCommitter > 20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm > version is 2 > 20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup > _temporary folders under output directory:false, ignore cleanup failures: > false > 20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using user defined > output committer class org.apache.parquet.hadoop.ParquetOutputCommitter > 20/03/06 10:15:17 INFO FileOutputCommitter: File Output Committer Algorithm > version is 2 > 20/03/06 10:15:17 INFO FileOutputCommitter: FileOutputCommitter skip cleanup > _temporary folders under output 
directory:false, ignore cleanup failures: > false > 20/03/06 10:15:17 INFO SQLHadoopMapReduceCommitProtocol: Using output > committer class org.apache.parquet.hadoop.ParquetOutputCommitter > But when I use _*ORC*_ as the file format, with the same configuration as > above, it correctly picks "PartitionedStagingCommitter": > 20/03/05 11:51:14 INFO FileOutputCommitter: File Output Committer Algorithm > version is 1 > 20/03/05 11:51:14 INFO FileOutputCommitter: FileOutputCommitter skip cleanup > _temporary folders under output directory:false, ignore cleanup failures: > false > 20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using committer > partitioned to output data to s3a: > 20/03/05 11:51:14 INFO AbstractS3ACommitterFactory: Using Commmitter > PartitionedStagingCommitter > So I am wondering why Parquet and ORC have different behavior? > How can I use PartitionedStagingCommitter instead of ParquetOutputCommitter? > I started this because when I was trying to save data to S3 directly with > partitionBy() on two columns, I was getting file-not-found exceptions > intermittently. > So how could I avoid this issue with *Parquet using Spark to S3 via s3a, > without S3Guard?* -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
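For what it's worth, the Spark cloud-integration documentation suggests the asymmetry reported above is expected: {{ParquetFileFormat}} insists on a committer that subclasses {{ParquetOutputCommitter}}, so the s3a committer factory alone is not enough for Parquet output. A hedged sketch of the two extra settings the docs describe (both classes live in the optional {{spark-hadoop-cloud}} module; verify the names against your Spark build):

```scala
// Route Spark's commit protocol through Hadoop's PathOutputCommitter
// factory, which honours fs.s3a.committer.name ("partitioned" here).
sparkSession.conf.set("spark.sql.sources.commitProtocolClass",
  "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")

// Parquet-specific: a ParquetOutputCommitter subclass that binds to the
// committer chosen by the s3a factory instead of the default file committer.
sparkSession.conf.set("spark.sql.parquet.output.committer.class",
  "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
```

With these set in addition to the configuration already shown in the report, the Parquet path should log the s3a committer the same way the ORC path does.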
[jira] [Commented] (SPARK-31301) flatten the result dataframe of tests in stat
[ https://issues.apache.org/jira/browse/SPARK-31301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071807#comment-17071807 ] Sean R. Owen commented on SPARK-31301: -- Hm, I'm not sure I'd add a separate method for this either. We'd have three types of return values then, just adding to the complexity. > flatten the result dataframe of tests in stat > - > > Key: SPARK-31301 > URL: https://issues.apache.org/jira/browse/SPARK-31301 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Priority: Major > > {code:java} > scala> import org.apache.spark.ml.linalg.{Vector, Vectors} > import org.apache.spark.ml.linalg.{Vector, Vectors}scala> import > org.apache.spark.ml.stat.ChiSquareTest > import org.apache.spark.ml.stat.ChiSquareTestscala> val data = Seq( > | (0.0, Vectors.dense(0.5, 10.0)), > | (0.0, Vectors.dense(1.5, 20.0)), > | (1.0, Vectors.dense(1.5, 30.0)), > | (0.0, Vectors.dense(3.5, 30.0)), > | (0.0, Vectors.dense(3.5, 40.0)), > | (1.0, Vectors.dense(3.5, 40.0)) > | ) > data: Seq[(Double, org.apache.spark.ml.linalg.Vector)] = > List((0.0,[0.5,10.0]), (0.0,[1.5,20.0]), (1.0,[1.5,30.0]), (0.0,[3.5,30.0]), > (0.0,[3.5,40.0]), (1.0,[3.5,40.0]))scala> scala> scala> val df = > data.toDF("label", "features") > df: org.apache.spark.sql.DataFrame = [label: double, features: vector]scala> >val chi = ChiSquareTest.test(df, "features", "label") > chi: org.apache.spark.sql.DataFrame = [pValues: vector, degreesOfFreedom: > array ... 1 more field]scala> chi.show > +++--+ > | pValues|degreesOfFreedom|statistics| > +++--+ > |[0.68728927879097...| [2, 3]|[0.75,1.5]| > +++--+{code} > > Current impls of {{ChiSquareTest}}, {{ANOVATest}}, {{FValueTest}}, > {{Correlation}} all return a df only containing one row. 
> I think this is quite hard to use. Suppose we have a dataset with dim=1000; > the only way we can deal with the test result is to collect it by > {{head()}} or {{first()}}, and then use it on the driver. > While what I really want to do is filter the df like {{pValue > 0.1}} or > {{corr < 0.5}}. *So I suggest to flatten the output df in those tests.* > > note: {{ANOVATest}} and {{FValueTest}} are newly added in 3.1.0, but > ChiSquareTest and Correlation have been here for a long time. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
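Until such flattening exists upstream, a hedged workaround sketch for the filtering use-case in the description: convert the vector column to an array and explode it, giving one row per feature. {{vector_to_array}} has been in {{org.apache.spark.ml.functions}} since Spark 3.0; {{chi}} here is the one-row result of {{ChiSquareTest.test(df, "features", "label")}} from the example, and the alias names are illustrative:

```scala
import org.apache.spark.ml.functions.vector_to_array
import org.apache.spark.sql.functions.{col, posexplode}

// One row per feature index, with its p-value as a plain double column.
val flat = chi.select(
  posexplode(vector_to_array(col("pValues"))).as(Seq("featureIndex", "pValue"))
)

// Ordinary SQL filters now apply, e.g. keep features with pValue > 0.1:
flat.filter(col("pValue") > 0.1).show()
```

The same explode trick works on the {{statistics}} vector; joining the exploded columns on {{featureIndex}} recovers a fully flat per-feature table.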
[jira] [Commented] (SPARK-25102) Write Spark version to ORC/Parquet file metadata
[ https://issues.apache.org/jira/browse/SPARK-25102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071803#comment-17071803 ] Wenchen Fan commented on SPARK-25102: - ok nvm, I checked ORC and it doesn't have the "createdBy" field. Let's keep using this consistent way to record spark version in parquet/orc. > Write Spark version to ORC/Parquet file metadata > > > Key: SPARK-25102 > URL: https://issues.apache.org/jira/browse/SPARK-25102 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zoltan Ivanfi >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > > Currently, Spark writes Spark version number into Hive Table properties with > `spark.sql.create.version`. > {code} > parameters:{ > spark.sql.sources.schema.part.0={ > "type":"struct", > "fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}] > }, > transient_lastDdlTime=1541142761, > spark.sql.sources.schema.numParts=1, > spark.sql.create.version=2.4.0 > } > {code} > This issue aims to write Spark versions to ORC/Parquet file metadata with > `org.apache.spark.sql.create.version`. It's different from Hive Table > property key `spark.sql.create.version`. It seems that we cannot change that > for backward compatibility (even in Apache Spark 3.0) > *ORC* > {code} > User Metadata: > org.apache.spark.sql.create.version=3.0.0-SNAPSHOT > {code} > *PARQUET* > {code} > file: > file:/tmp/p/part-7-9dc415fe-7773-49ba-9c59-4c151e16009a-c000.snappy.parquet > creator: parquet-mr version 1.10.0 (build > 031a6654009e3b82020012a18434c582bd74c73a) > extra: org.apache.spark.sql.create.version = 3.0.0-SNAPSHOT > extra: org.apache.spark.sql.parquet.row.metadata = > {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]} > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31314) Revert SPARK-29285 to fix shuffle regression caused by creating temporary file eagerly
[ https://issues.apache.org/jira/browse/SPARK-31314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves updated SPARK-31314: -- Issue Type: Bug (was: Improvement) > Revert SPARK-29285 to fix shuffle regression caused by creating temporary > file eagerly > -- > > Key: SPARK-31314 > URL: https://issues.apache.org/jira/browse/SPARK-31314 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Yuanjian Li >Assignee: Yuanjian Li >Priority: Major > Fix For: 3.0.0 > > > In SPARK-29285, we change to create shuffle temporary eagerly. This is > helpful for not to fail the entire task in the scenario of occasional disk > failure. > But for the applications that many tasks don't actually create shuffle files, > it caused overhead. See the below benchmark: > Env: Spark local-cluster[2, 4, 19968], each queries run 5 round, each round 5 > times. > Data: TPC-DS scale=99 generate by spark-tpcds-datagen > Results: > || ||Base||Revert|| > |Q20|Vector(4.096865667, 2.76231748, 2.722007606, 2.514433591, 2.400373579) > Median 2.722007606|Vector(3.763185446, 2.586498463, 2.593472842, 2.320522846, > 2.224627274) Median 2.586498463| > |Q33|Vector(5.872176321, 4.854397586, 4.568787136, 4.393378146, 4.423996818) > Median 4.568787136|Vector(5.38746785, 4.361236877, 4.082311276, 3.867206824, > 3.783188024) Median 4.082311276| > |Q52|Vector(3.978870321, 3.225437871, 3.282411608, 2.869674887, 2.644490664) > Median 3.225437871|Vector(4.000381522, 3.196025108, 3.248787619, 2.767444508, > 2.606163423) Median 3.196025108| > |Q56|Vector(6.238045133, 4.820535173, 4.609965579, 4.313509894, 4.221256227) > Median 4.609965579|Vector(6.241611339, 4.225592467, 4.195202502, 3.757085755, > 3.657525982) Median 4.195202502| -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31316) SQLQueryTestSuite: Display the total generate time for generated java code.
[ https://issues.apache.org/jira/browse/SPARK-31316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071711#comment-17071711 ] jiaan.geng commented on SPARK-31316: I'm working on it. > SQLQueryTestSuite: Display the total generate time for generated java code. > --- > > Key: SPARK-31316 > URL: https://issues.apache.org/jira/browse/SPARK-31316 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: jiaan.geng >Priority: Major > > SQLQueryTestSuite spends a lot of time generating Java code when using > whole-stage codegen. > We should display the total generation time for the generated Java code. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31316) SQLQueryTestSuite: Display the total generate time for generated java code.
jiaan.geng created SPARK-31316: -- Summary: SQLQueryTestSuite: Display the total generate time for generated java code. Key: SPARK-31316 URL: https://issues.apache.org/jira/browse/SPARK-31316 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: jiaan.geng SQLQueryTestSuite spends a lot of time generating Java code when using whole-stage codegen. We should display the total generation time for the generated Java code. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29285) Temporary shuffle and local block should be able to handle disk failures
[ https://issues.apache.org/jira/browse/SPARK-29285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-29285: Fix Version/s: (was: 3.0.0) > Temporary shuffle and local block should be able to handle disk failures > > > Key: SPARK-29285 > URL: https://issues.apache.org/jira/browse/SPARK-29285 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.0.0 >Reporter: Kent Yao >Priority: Major > > {code:java} > java.io.FileNotFoundException: > /mnt/dfs/4/yarn/local/usercache/da_haitao/appcache/application_1568691584183_1953115/blockmgr-cc4689f5-eddd-4b99-8af4-4166a86ec30b/10/temp_shuffle_79be5049-d1d5-4a81-8e67-4ef236d3834f > (No such file or directory) > at java.io.FileOutputStream.open0(Native Method) > at java.io.FileOutputStream.open(FileOutputStream.java:270) > at java.io.FileOutputStream.(FileOutputStream.java:213) > at > org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:103) > at > org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116) > at > org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:249) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:209) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.closeAndGetSpills(ShuffleExternalSorter.java:416) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.closeAndWriteOutput(UnsafeShuffleWriter.java:230) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:190) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > Local or temp shuffle files are initialized without checking because the > getFile method in DiskBlockManager probably return an existing subdirectory. > Sometimes, when a disk failure occurs, those files may become inaccessible > and throw FileNotFoundException later, which may fail the entire task. Task > re-running is a bit heavy for these errors, we may give another or more disks > a try at least. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29285) Temporary shuffle and local block should be able to handle disk failures
[ https://issues.apache.org/jira/browse/SPARK-29285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071699#comment-17071699 ] Wenchen Fan commented on SPARK-29285: - This is reverted in https://github.com/apache/spark/pull/28072 > Temporary shuffle and local block should be able to handle disk failures > > > Key: SPARK-29285 > URL: https://issues.apache.org/jira/browse/SPARK-29285 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.0.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > {code:java} > java.io.FileNotFoundException: > /mnt/dfs/4/yarn/local/usercache/da_haitao/appcache/application_1568691584183_1953115/blockmgr-cc4689f5-eddd-4b99-8af4-4166a86ec30b/10/temp_shuffle_79be5049-d1d5-4a81-8e67-4ef236d3834f > (No such file or directory) > at java.io.FileOutputStream.open0(Native Method) > at java.io.FileOutputStream.open(FileOutputStream.java:270) > at java.io.FileOutputStream.(FileOutputStream.java:213) > at > org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:103) > at > org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116) > at > org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:249) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:209) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.closeAndGetSpills(ShuffleExternalSorter.java:416) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.closeAndWriteOutput(UnsafeShuffleWriter.java:230) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:190) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > Local or temp shuffle files are initialized without checking because the > getFile method in DiskBlockManager probably return an existing subdirectory. > Sometimes, when a disk failure occurs, those files may become inaccessible > and throw FileNotFoundException later, which may fail the entire task. Task > re-running is a bit heavy for these errors, we may give another or more disks > a try at least. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-29285) Temporary shuffle and local block should be able to handle disk failures
[ https://issues.apache.org/jira/browse/SPARK-29285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reopened SPARK-29285: - Assignee: (was: Kent Yao) > Temporary shuffle and local block should be able to handle disk failures > > > Key: SPARK-29285 > URL: https://issues.apache.org/jira/browse/SPARK-29285 > Project: Spark > Issue Type: Improvement > Components: Shuffle >Affects Versions: 3.0.0 >Reporter: Kent Yao >Priority: Major > Fix For: 3.0.0 > > > {code:java} > java.io.FileNotFoundException: > /mnt/dfs/4/yarn/local/usercache/da_haitao/appcache/application_1568691584183_1953115/blockmgr-cc4689f5-eddd-4b99-8af4-4166a86ec30b/10/temp_shuffle_79be5049-d1d5-4a81-8e67-4ef236d3834f > (No such file or directory) > at java.io.FileOutputStream.open0(Native Method) > at java.io.FileOutputStream.open(FileOutputStream.java:270) > at java.io.FileOutputStream.(FileOutputStream.java:213) > at > org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:103) > at > org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116) > at > org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:249) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:209) > at > org.apache.spark.shuffle.sort.ShuffleExternalSorter.closeAndGetSpills(ShuffleExternalSorter.java:416) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.closeAndWriteOutput(UnsafeShuffleWriter.java:230) > at > org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:190) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:109) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > Local or temp shuffle files are initialized without checking because the > getFile method in DiskBlockManager probably return an existing subdirectory. > Sometimes, when a disk failure occurs, those files may become inaccessible > and throw FileNotFoundException later, which may fail the entire task. Task > re-running is a bit heavy for these errors, we may give another or more disks > a try at least. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31314) Revert SPARK-29285 to fix shuffle regression caused by creating temporary file eagerly
[ https://issues.apache.org/jira/browse/SPARK-31314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31314. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28072 [https://github.com/apache/spark/pull/28072] > Revert SPARK-29285 to fix shuffle regression caused by creating temporary > file eagerly > -- > > Key: SPARK-31314 > URL: https://issues.apache.org/jira/browse/SPARK-31314 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Yuanjian Li >Assignee: Yuanjian Li >Priority: Major > Fix For: 3.0.0 > > > In SPARK-29285, we changed to create shuffle temporary files eagerly. This is > helpful for not failing the entire task in the scenario of an occasional disk > failure. > But for applications in which many tasks don't actually create shuffle files, > it caused overhead. See the benchmark below: > Env: Spark local-cluster[2, 4, 19968], each query run for 5 rounds, each round 5 > times. 
> Data: TPC-DS scale=99 generate by spark-tpcds-datagen > Results: > || ||Base||Revert|| > |Q20|Vector(4.096865667, 2.76231748, 2.722007606, 2.514433591, 2.400373579) > Median 2.722007606|Vector(3.763185446, 2.586498463, 2.593472842, 2.320522846, > 2.224627274) Median 2.586498463| > |Q33|Vector(5.872176321, 4.854397586, 4.568787136, 4.393378146, 4.423996818) > Median 4.568787136|Vector(5.38746785, 4.361236877, 4.082311276, 3.867206824, > 3.783188024) Median 4.082311276| > |Q52|Vector(3.978870321, 3.225437871, 3.282411608, 2.869674887, 2.644490664) > Median 3.225437871|Vector(4.000381522, 3.196025108, 3.248787619, 2.767444508, > 2.606163423) Median 3.196025108| > |Q56|Vector(6.238045133, 4.820535173, 4.609965579, 4.313509894, 4.221256227) > Median 4.609965579|Vector(6.241611339, 4.225592467, 4.195202502, 3.757085755, > 3.657525982) Median 4.195202502| -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31314) Revert SPARK-29285 to fix shuffle regression caused by creating temporary file eagerly
[ https://issues.apache.org/jira/browse/SPARK-31314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31314: --- Assignee: Yuanjian Li > Revert SPARK-29285 to fix shuffle regression caused by creating temporary > file eagerly > -- > > Key: SPARK-31314 > URL: https://issues.apache.org/jira/browse/SPARK-31314 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: Yuanjian Li >Assignee: Yuanjian Li >Priority: Major > > In SPARK-29285, we change to create shuffle temporary eagerly. This is > helpful for not to fail the entire task in the scenario of occasional disk > failure. > But for the applications that many tasks don't actually create shuffle files, > it caused overhead. See the below benchmark: > Env: Spark local-cluster[2, 4, 19968], each queries run 5 round, each round 5 > times. > Data: TPC-DS scale=99 generate by spark-tpcds-datagen > Results: > || ||Base||Revert|| > |Q20|Vector(4.096865667, 2.76231748, 2.722007606, 2.514433591, 2.400373579) > Median 2.722007606|Vector(3.763185446, 2.586498463, 2.593472842, 2.320522846, > 2.224627274) Median 2.586498463| > |Q33|Vector(5.872176321, 4.854397586, 4.568787136, 4.393378146, 4.423996818) > Median 4.568787136|Vector(5.38746785, 4.361236877, 4.082311276, 3.867206824, > 3.783188024) Median 4.082311276| > |Q52|Vector(3.978870321, 3.225437871, 3.282411608, 2.869674887, 2.644490664) > Median 3.225437871|Vector(4.000381522, 3.196025108, 3.248787619, 2.767444508, > 2.606163423) Median 3.196025108| > |Q56|Vector(6.238045133, 4.820535173, 4.609965579, 4.313509894, 4.221256227) > Median 4.609965579|Vector(6.241611339, 4.225592467, 4.195202502, 3.757085755, > 3.657525982) Median 4.195202502| -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25102) Write Spark version to ORC/Parquet file metadata
[ https://issues.apache.org/jira/browse/SPARK-25102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071692#comment-17071692 ] Wenchen Fan commented on SPARK-25102: - It's not completely orthogonal as we can merge these two, e.g. set the writer name as `spark-3.0.0` or `spark-2.4.0`. > Write Spark version to ORC/Parquet file metadata > > > Key: SPARK-25102 > URL: https://issues.apache.org/jira/browse/SPARK-25102 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Zoltan Ivanfi >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.0.0 > > > Currently, Spark writes the Spark version number into Hive Table properties with > `spark.sql.create.version`. > {code} > parameters:{ > spark.sql.sources.schema.part.0={ > "type":"struct", > "fields":[{"name":"a","type":"integer","nullable":true,"metadata":{}}] > }, > transient_lastDdlTime=1541142761, > spark.sql.sources.schema.numParts=1, > spark.sql.create.version=2.4.0 > } > {code} > This issue aims to write Spark versions to ORC/Parquet file metadata with > `org.apache.spark.sql.create.version`. It's different from the Hive Table > property key `spark.sql.create.version`. It seems that we cannot change that > for backward compatibility (even in Apache Spark 3.0). > *ORC* > {code} > User Metadata: > org.apache.spark.sql.create.version=3.0.0-SNAPSHOT > {code} > *PARQUET* > {code} > file: > file:/tmp/p/part-7-9dc415fe-7773-49ba-9c59-4c151e16009a-c000.snappy.parquet > creator: parquet-mr version 1.10.0 (build > 031a6654009e3b82020012a18434c582bd74c73a) > extra: org.apache.spark.sql.create.version = 3.0.0-SNAPSHOT > extra: org.apache.spark.sql.parquet.row.metadata = > {"type":"struct","fields":[{"name":"id","type":"long","nullable":false,"metadata":{}}]} > {code}
[jira] [Created] (SPARK-31315) SQLQueryTestSuite: Display the total compile time for generated java code.
jiaan.geng created SPARK-31315: -- Summary: SQLQueryTestSuite: Display the total compile time for generated java code. Key: SPARK-31315 URL: https://issues.apache.org/jira/browse/SPARK-31315 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.1.0 Reporter: jiaan.geng SQLQueryTestSuite spends a lot of time compiling the generated Java code. We should display the total compile time for the generated Java code.
[jira] [Assigned] (SPARK-31297) Speed-up date-time rebasing
[ https://issues.apache.org/jira/browse/SPARK-31297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31297: --- Assignee: Maxim Gekk > Speed-up date-time rebasing > --- > > Key: SPARK-31297 > URL: https://issues.apache.org/jira/browse/SPARK-31297 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > > I do believe it is possible to speed up date-time rebasing by building a map > of micros to diffs between original and rebased micros. And look up at the > map via binary search. > For example, the *America/Los_Angeles* time zone has less than 100 points > when diff changes: > {code:scala} > test("optimize rebasing") { > val start = instantToMicros(LocalDateTime.of(1, 1, 1, 0, 0, 0) > .atZone(getZoneId("America/Los_Angeles")) > .toInstant) > val end = instantToMicros(LocalDateTime.of(2030, 1, 1, 0, 0, 0) > .atZone(getZoneId("America/Los_Angeles")) > .toInstant) > var micros = start > var diff = Long.MaxValue > var counter = 0 > while (micros < end) { > val rebased = rebaseGregorianToJulianMicros(micros) > val curDiff = rebased - micros > if (curDiff != diff) { > counter += 1 > diff = curDiff > val ldt = > microsToInstant(micros).atZone(getZoneId("America/Los_Angeles")).toLocalDateTime > println(s"local date-time = $ldt diff = ${diff / MICROS_PER_MINUTE} > minutes") > } > micros += MICROS_PER_HOUR > } > println(s"counter = $counter") > } > {code} > {code:java} > local date-time = 0001-01-01T00:00 diff = -2909 minutes > local date-time = 0100-02-28T14:00 diff = -1469 minutes > local date-time = 0200-02-28T14:00 diff = -29 minutes > local date-time = 0300-02-28T14:00 diff = 1410 minutes > local date-time = 0500-02-28T14:00 diff = 2850 minutes > local date-time = 0600-02-28T14:00 diff = 4290 minutes > local date-time = 0700-02-28T14:00 diff = 5730 minutes > local date-time = 0900-02-28T14:00 diff = 7170 minutes > local date-time = 
1000-02-28T14:00 diff = 8610 minutes > local date-time = 1100-02-28T14:00 diff = 10050 minutes > local date-time = 1300-02-28T14:00 diff = 11490 minutes > local date-time = 1400-02-28T14:00 diff = 12930 minutes > local date-time = 1500-02-28T14:00 diff = 14370 minutes > local date-time = 1582-10-14T14:00 diff = -29 minutes > local date-time = 1899-12-31T16:52:58 diff = 0 minutes > local date-time = 1917-12-27T11:52:58 diff = 60 minutes > local date-time = 1917-12-27T12:52:58 diff = 0 minutes > local date-time = 1918-09-15T12:52:58 diff = 60 minutes > local date-time = 1918-09-15T13:52:58 diff = 0 minutes > local date-time = 1919-06-30T16:52:58 diff = 31 minutes > local date-time = 1919-06-30T17:52:58 diff = 0 minutes > local date-time = 1919-08-15T12:52:58 diff = 60 minutes > local date-time = 1919-08-15T13:52:58 diff = 0 minutes > local date-time = 1921-08-31T10:52:58 diff = 60 minutes > local date-time = 1921-08-31T11:52:58 diff = 0 minutes > local date-time = 1921-09-30T11:52:58 diff = 60 minutes > local date-time = 1921-09-30T12:52:58 diff = 0 minutes > local date-time = 1922-09-30T12:52:58 diff = 60 minutes > local date-time = 1922-09-30T13:52:58 diff = 0 minutes > local date-time = 1981-09-30T12:52:58 diff = 60 minutes > local date-time = 1981-09-30T13:52:58 diff = 0 minutes > local date-time = 1982-09-30T12:52:58 diff = 60 minutes > local date-time = 1982-09-30T13:52:58 diff = 0 minutes > local date-time = 1983-09-30T12:52:58 diff = 60 minutes > local date-time = 1983-09-30T13:52:58 diff = 0 minutes > local date-time = 1984-09-29T15:52:58 diff = 60 minutes > local date-time = 1984-09-29T16:52:58 diff = 0 minutes > local date-time = 1985-09-28T15:52:58 diff = 60 minutes > local date-time = 1985-09-28T16:52:58 diff = 0 minutes > local date-time = 1986-09-27T15:52:58 diff = 60 minutes > local date-time = 1986-09-27T16:52:58 diff = 0 minutes > local date-time = 1987-09-26T15:52:58 diff = 60 minutes > local date-time = 1987-09-26T16:52:58 diff = 0 minutes > local 
date-time = 1988-09-24T15:52:58 diff = 60 minutes > local date-time = 1988-09-24T16:52:58 diff = 0 minutes > local date-time = 1989-09-23T15:52:58 diff = 60 minutes > local date-time = 1989-09-23T16:52:58 diff = 0 minutes > local date-time = 1990-09-29T15:52:58 diff = 60 minutes > local date-time = 1990-09-29T16:52:58 diff = 0 minutes > local date-time = 1991-09-28T16:52:58 diff = 60 minutes > local date-time = 1991-09-28T17:52:58 diff = 0 minutes > local date-time = 1992-09-26T15:52:58 diff = 60 minutes > local date-time = 1992-09-26T16:52:58 diff = 0 minutes > local date-time = 1993-09-25T15:52:58 diff = 60 minutes > local date-time = 1993-09-25T16:52:58 diff = 0 minutes > local date-time =
[jira] [Resolved] (SPARK-31297) Speed-up date-time rebasing
[ https://issues.apache.org/jira/browse/SPARK-31297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31297. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28067 [https://github.com/apache/spark/pull/28067] > Speed-up date-time rebasing > --- > > Key: SPARK-31297 > URL: https://issues.apache.org/jira/browse/SPARK-31297 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > I do believe it is possible to speed up date-time rebasing by building a map > of micros to diffs between original and rebased micros. And look up at the > map via binary search. > For example, the *America/Los_Angeles* time zone has less than 100 points > when diff changes: > {code:scala} > test("optimize rebasing") { > val start = instantToMicros(LocalDateTime.of(1, 1, 1, 0, 0, 0) > .atZone(getZoneId("America/Los_Angeles")) > .toInstant) > val end = instantToMicros(LocalDateTime.of(2030, 1, 1, 0, 0, 0) > .atZone(getZoneId("America/Los_Angeles")) > .toInstant) > var micros = start > var diff = Long.MaxValue > var counter = 0 > while (micros < end) { > val rebased = rebaseGregorianToJulianMicros(micros) > val curDiff = rebased - micros > if (curDiff != diff) { > counter += 1 > diff = curDiff > val ldt = > microsToInstant(micros).atZone(getZoneId("America/Los_Angeles")).toLocalDateTime > println(s"local date-time = $ldt diff = ${diff / MICROS_PER_MINUTE} > minutes") > } > micros += MICROS_PER_HOUR > } > println(s"counter = $counter") > } > {code} > {code:java} > local date-time = 0001-01-01T00:00 diff = -2909 minutes > local date-time = 0100-02-28T14:00 diff = -1469 minutes > local date-time = 0200-02-28T14:00 diff = -29 minutes > local date-time = 0300-02-28T14:00 diff = 1410 minutes > local date-time = 0500-02-28T14:00 diff = 2850 minutes > local date-time = 0600-02-28T14:00 diff = 4290 minutes > local date-time = 
0700-02-28T14:00 diff = 5730 minutes > local date-time = 0900-02-28T14:00 diff = 7170 minutes > local date-time = 1000-02-28T14:00 diff = 8610 minutes > local date-time = 1100-02-28T14:00 diff = 10050 minutes > local date-time = 1300-02-28T14:00 diff = 11490 minutes > local date-time = 1400-02-28T14:00 diff = 12930 minutes > local date-time = 1500-02-28T14:00 diff = 14370 minutes > local date-time = 1582-10-14T14:00 diff = -29 minutes > local date-time = 1899-12-31T16:52:58 diff = 0 minutes > local date-time = 1917-12-27T11:52:58 diff = 60 minutes > local date-time = 1917-12-27T12:52:58 diff = 0 minutes > local date-time = 1918-09-15T12:52:58 diff = 60 minutes > local date-time = 1918-09-15T13:52:58 diff = 0 minutes > local date-time = 1919-06-30T16:52:58 diff = 31 minutes > local date-time = 1919-06-30T17:52:58 diff = 0 minutes > local date-time = 1919-08-15T12:52:58 diff = 60 minutes > local date-time = 1919-08-15T13:52:58 diff = 0 minutes > local date-time = 1921-08-31T10:52:58 diff = 60 minutes > local date-time = 1921-08-31T11:52:58 diff = 0 minutes > local date-time = 1921-09-30T11:52:58 diff = 60 minutes > local date-time = 1921-09-30T12:52:58 diff = 0 minutes > local date-time = 1922-09-30T12:52:58 diff = 60 minutes > local date-time = 1922-09-30T13:52:58 diff = 0 minutes > local date-time = 1981-09-30T12:52:58 diff = 60 minutes > local date-time = 1981-09-30T13:52:58 diff = 0 minutes > local date-time = 1982-09-30T12:52:58 diff = 60 minutes > local date-time = 1982-09-30T13:52:58 diff = 0 minutes > local date-time = 1983-09-30T12:52:58 diff = 60 minutes > local date-time = 1983-09-30T13:52:58 diff = 0 minutes > local date-time = 1984-09-29T15:52:58 diff = 60 minutes > local date-time = 1984-09-29T16:52:58 diff = 0 minutes > local date-time = 1985-09-28T15:52:58 diff = 60 minutes > local date-time = 1985-09-28T16:52:58 diff = 0 minutes > local date-time = 1986-09-27T15:52:58 diff = 60 minutes > local date-time = 1986-09-27T16:52:58 diff = 0 minutes > local 
date-time = 1987-09-26T15:52:58 diff = 60 minutes > local date-time = 1987-09-26T16:52:58 diff = 0 minutes > local date-time = 1988-09-24T15:52:58 diff = 60 minutes > local date-time = 1988-09-24T16:52:58 diff = 0 minutes > local date-time = 1989-09-23T15:52:58 diff = 60 minutes > local date-time = 1989-09-23T16:52:58 diff = 0 minutes > local date-time = 1990-09-29T15:52:58 diff = 60 minutes > local date-time = 1990-09-29T16:52:58 diff = 0 minutes > local date-time = 1991-09-28T16:52:58 diff = 60 minutes > local date-time = 1991-09-28T17:52:58 diff = 0 minutes > local date-time = 1992-09-26T15:52:58 diff = 60 minutes > local date-time = 1992-09-26T16:52:58 diff = 0 minutes
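The switch-point idea from this ticket can be sketched outside of Spark: precompute the sorted timestamps at which the rebase diff changes, then resolve any timestamp with a binary search over that small table. A minimal Python sketch follows; the table values are illustrative placeholders, not the real America/Los_Angeles data, and the function names are not Spark's.

```python
import bisect

# Illustrative switch table (placeholder values): sorted timestamps (micros)
# at which the rebase diff changes, and the diff (micros) that applies from
# that point onward. Per the ticket, real zones have fewer than ~100 entries.
switch_micros = [0, 1_000_000_000, 5_000_000_000]
diffs = [-174_540_000_000, -1_740_000_000, 0]

def rebase(micros: int) -> int:
    """Apply the diff of the last switch point at or before `micros`.

    Assumes micros >= switch_micros[0]; earlier inputs would need a
    sentinel entry at the front of the table.
    """
    i = bisect.bisect_right(switch_micros, micros) - 1
    return micros + diffs[i]
```

Compared with recomputing the Julian/Gregorian calendar conversion per value, each lookup is O(log n) over a tiny table, which is what makes the speed-up plausible.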
[jira] [Created] (SPARK-31314) Revert SPARK-29285 to fix shuffle regression caused by creating temporary file eagerly
Yuanjian Li created SPARK-31314: --- Summary: Revert SPARK-29285 to fix shuffle regression caused by creating temporary file eagerly Key: SPARK-31314 URL: https://issues.apache.org/jira/browse/SPARK-31314 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: Yuanjian Li In SPARK-29285, we changed to create shuffle temporary files eagerly. This helps avoid failing the entire task in the scenario of occasional disk failure. But for applications in which many tasks don't actually create shuffle files, it adds overhead. See the benchmark below: Env: Spark local-cluster[2, 4, 19968]; each query was run for 5 rounds, 5 times per round. Data: TPC-DS scale=99 generated by spark-tpcds-datagen Results: || ||Base||Revert|| |Q20|Vector(4.096865667, 2.76231748, 2.722007606, 2.514433591, 2.400373579) Median 2.722007606|Vector(3.763185446, 2.586498463, 2.593472842, 2.320522846, 2.224627274) Median 2.586498463| |Q33|Vector(5.872176321, 4.854397586, 4.568787136, 4.393378146, 4.423996818) Median 4.568787136|Vector(5.38746785, 4.361236877, 4.082311276, 3.867206824, 3.783188024) Median 4.082311276| |Q52|Vector(3.978870321, 3.225437871, 3.282411608, 2.869674887, 2.644490664) Median 3.225437871|Vector(4.000381522, 3.196025108, 3.248787619, 2.767444508, 2.606163423) Median 3.196025108| |Q56|Vector(6.238045133, 4.820535173, 4.609965579, 4.313509894, 4.221256227) Median 4.609965579|Vector(6.241611339, 4.225592467, 4.195202502, 3.757085755, 3.657525982) Median 4.195202502|
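The revert restores lazy temp-file creation. A minimal Python sketch of the pattern (the class and names are illustrative, not Spark's internals): defer the file-system call until the first byte is written, so tasks that never produce shuffle output pay no file-creation cost.

```python
import os
import tempfile

class LazyShuffleWriter:
    """Sketch: create the temporary file on first write, not at construction."""

    def __init__(self, directory: str):
        self.directory = directory
        self._file = None  # no file-system side effect yet

    def write(self, data: bytes) -> None:
        if self._file is None:  # first write triggers creation
            fd, self.path = tempfile.mkstemp(dir=self.directory)
            self._file = os.fdopen(fd, "wb")
        self._file.write(data)

    def close(self) -> None:
        if self._file is not None:
            self._file.close()
```

The trade-off debated in the ticket is exactly this: eager creation surfaces occasional disk failures before any work is done, while lazy creation avoids the overhead for the many tasks that write nothing.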
[jira] [Created] (SPARK-31313) Add `m01` node name to support Minikube 1.8.x
Dongjoon Hyun created SPARK-31313: - Summary: Add `m01` node name to support Minikube 1.8.x Key: SPARK-31313 URL: https://issues.apache.org/jira/browse/SPARK-31313 Project: Spark Issue Type: Improvement Components: Kubernetes, Tests Affects Versions: 3.1.0 Reporter: Dongjoon Hyun
[jira] [Created] (SPARK-31312) Transforming Hive simple UDF (using JAR) expression may incur CNFE in later evaluation
Jungtaek Lim created SPARK-31312: Summary: Transforming Hive simple UDF (using JAR) expression may incur CNFE in later evaluation Key: SPARK-31312 URL: https://issues.apache.org/jira/browse/SPARK-31312 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.5, 3.0.0, 3.1.0 Reporter: Jungtaek Lim In SPARK-26560, we ensured that a Hive UDF using a JAR is executed regardless of the current thread context classloader. [~cloud_fan] pointed out another potential issue in post-review of SPARK-26560 - quoting the comment: {quote} Found a potential problem: here we call HiveSimpleUDF.dataType (which is a lazy val), to force loading the class with the corrected class loader. However, if the expression gets transformed later, which copies HiveSimpleUDF, then calling HiveSimpleUDF.dataType will re-trigger the class loading, and at that time there is no guarantee that the corrected classloader is used. I think we should materialize the loaded class in HiveSimpleUDF. {quote} This JIRA issue is to track the effort of verifying and fixing the potential issue.
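The hazard described, a lazily computed field that gets re-evaluated after a copy under whatever ambient context happens to be current, can be illustrated in plain Python. The names and the `CURRENT_LOADER` list below are stand-ins for the thread context classloader, not Spark's API.

```python
# Stand-in for the ambient thread context classloader.
CURRENT_LOADER = ["corrected-loader"]

class LazyUdf:
    """A lazy field captured from ambient state; copies lose the cache."""

    def __init__(self):
        self._dtype = None

    @property
    def data_type(self):
        if self._dtype is None:            # lazy: reads ambient state on
            self._dtype = CURRENT_LOADER[0]  # first access of each copy
        return self._dtype

    def transformed(self):
        return LazyUdf()                   # transformation drops the cache

class MaterializedUdf(LazyUdf):
    """The proposed fix: materialize eagerly so copies inherit the value."""

    def __init__(self, dtype=None):
        super().__init__()
        self._dtype = dtype if dtype is not None else CURRENT_LOADER[0]

    def transformed(self):
        return MaterializedUdf(self._dtype)  # copy carries the value along
```

With `LazyUdf`, forcing `data_type` once does not help: a transformed copy re-reads the ambient state, which may have changed. Materializing the value at construction makes copies immune to that.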
[jira] [Reopened] (SPARK-30879) Refine doc-building workflow
[ https://issues.apache.org/jira/browse/SPARK-30879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reopened SPARK-30879: -- Assignee: (was: Nicholas Chammas) Reverted at https://github.com/apache/spark/commit/4d4c3e76f6d1d5ede511c3ff4036b0c458a0a4e3 > Refine doc-building workflow > > > Key: SPARK-30879 > URL: https://issues.apache.org/jira/browse/SPARK-30879 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Priority: Minor > > There are a few rough edges in the workflow for building docs that could be > refined: > * sudo pip installing stuff > * no pinned versions of any doc dependencies > * using some deprecated options > * race condition with jekyll serve
[jira] [Updated] (SPARK-30879) Refine doc-building workflow
[ https://issues.apache.org/jira/browse/SPARK-30879?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-30879: - Fix Version/s: (was: 3.1.0) > Refine doc-building workflow > > > Key: SPARK-30879 > URL: https://issues.apache.org/jira/browse/SPARK-30879 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas >Priority: Minor > > There are a few rough edges in the workflow for building docs that could be > refined: > * sudo pip installing stuff > * no pinned versions of any doc dependencies > * using some deprecated options > * race condition with jekyll serve
[jira] [Updated] (SPARK-31311) Benchmark date-time rebasing in ORC datasource
[ https://issues.apache.org/jira/browse/SPARK-31311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-31311: --- Description: * Benchmark saving dates/timestamps before and after 1582-10-15 * Benchmark loading dates/timestamps was: * Add benchmarks for saving dates/timestamps to parquet when spark.sql.legacy.parquet.rebaseDateTime.enabled is set to true * Add benchmark for loading dates/timestamps from parquet when rebasing is on > Benchmark date-time rebasing in ORC datasource > -- > > Key: SPARK-31311 > URL: https://issues.apache.org/jira/browse/SPARK-31311 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > * Benchmark saving dates/timestamps before and after 1582-10-15 > * Benchmark loading dates/timestamps
[jira] [Created] (SPARK-31311) Benchmark date-time rebasing in ORC datasource
Maxim Gekk created SPARK-31311: -- Summary: Benchmark date-time rebasing in ORC datasource Key: SPARK-31311 URL: https://issues.apache.org/jira/browse/SPARK-31311 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk Assignee: Maxim Gekk Fix For: 3.0.0 * Add benchmarks for saving dates/timestamps to parquet when spark.sql.legacy.parquet.rebaseDateTime.enabled is set to true * Add benchmark for loading dates/timestamps from parquet when rebasing is on
[jira] [Commented] (SPARK-6761) Approximate quantile
[ https://issues.apache.org/jira/browse/SPARK-6761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17071490#comment-17071490 ] Shyam commented on SPARK-6761: -- [~mengxr] can you please advise on this bug: https://issues.apache.org/jira/browse/SPARK-31310 > Approximate quantile > > > Key: SPARK-6761 > URL: https://issues.apache.org/jira/browse/SPARK-6761 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: L. C. Hsieh >Priority: Major > Fix For: 2.0.0 > > > See mailing list discussion: > http://apache-spark-developers-list.1001551.n3.nabble.com/Approximate-rank-based-statistics-median-95-th-percentile-etc-for-Spark-td11414.html
[jira] [Created] (SPARK-31310) percentile_approx function not working as expected
Shyam created SPARK-31310: - Summary: percentile_approx function not working as expected Key: SPARK-31310 URL: https://issues.apache.org/jira/browse/SPARK-31310 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 2.4.3, 2.4.1, 2.4.0 Environment: spark-sql-2.4.1v with Java 8 Reporter: Shyam Fix For: 2.4.3 I'm using spark-sql-2.4.1v with Java 8 and I'm trying to find quantiles, i.e. percentile 0, percentile 25, etc., on a given column of a dataframe. The column's data set is as below: 23456.55, 34532.55, 23456.55 When I use the percentile_approx() function, the results do not match those of Excel's percentile_inc() function. Ex: for the above data set, i.e. 23456.55, 34532.55, 23456.55, the values of percentile_0, percentile_10, percentile_25, percentile_50, percentile_75, percentile_90, percentile_100 respectively are: using the percentile_approx() function: 23456.55, 23456.55, 23456.55, 23456.55, 23456.55, 23456.55, 23456.55 using Excel, i.e. percentile_inc(): 23456.55, 23456.55, 23456.55, 23456.55, 28994.5503, 32317.3502, 34532.55 How can I get correct percentiles, as in Excel, using the percentile_approx() function?
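For context on the mismatch in this report: Excel's PERCENTILE.INC interpolates linearly between the two closest ranks, whereas Spark's percentile_approx returns an approximate answer drawn from the observed values, so the two only tend to agree when the requested percentile falls exactly on an observed value. A stand-alone sketch of the Excel-style computation (plain Python, not Spark code) reproduces the reporter's expected column:

```python
def percentile_inc(values, p):
    """Excel-style PERCENTILE.INC: linear interpolation at rank p * (n - 1)."""
    xs = sorted(values)
    rank = p * (len(xs) - 1)       # fractional rank in [0, n-1]
    lo = int(rank)
    frac = rank - lo
    if lo + 1 < len(xs):
        return xs[lo] + frac * (xs[lo + 1] - xs[lo])
    return xs[lo]

data = [23456.55, 34532.55, 23456.55]
# For p=0.75 the sorted data is [23456.55, 23456.55, 34532.55] and the rank
# is 1.5, so the result interpolates halfway between the top two values,
# matching the Excel column in the report.
```

percentile_approx here selects from {23456.55, 34532.55}, so it cannot produce an interpolated value like Excel's 28994.55; that is expected behaviour, not an accuracy bug.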