[jira] [Updated] (SPARK-29361) Enable streaming source support on DSv1
[ https://issues.apache.org/jira/browse/SPARK-29361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-29361: - Component/s: (was: Spark Core) SQL > Enable streaming source support on DSv1 > > > Key: SPARK-29361 > URL: https://issues.apache.org/jira/browse/SPARK-29361 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.0.0 > Reporter: Jungtaek Lim > Priority: Major > > DSv2 has diverged heavily between Spark 2.x and 3.x, and for some time now the > Spark community has suggested not investing in the old DSv2 and waiting for the > new DSv2 instead. > The only option consistent between Spark 2.x and 3.x is DSv1, but streaming > source support is missing from DSv1. This issue tracks the effort to add > streaming source support to DSv1. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29361) Enable DataFrame with streaming source support on DSv1
[ https://issues.apache.org/jira/browse/SPARK-29361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim updated SPARK-29361: - Summary: Enable DataFrame with streaming source support on DSv1 (was: Enable streaming source support on DSv1) > Enable DataFrame with streaming source support on DSv1 > > > Key: SPARK-29361 > URL: https://issues.apache.org/jira/browse/SPARK-29361 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.0.0 > Reporter: Jungtaek Lim > Priority: Major > > DSv2 has diverged heavily between Spark 2.x and 3.x, and for some time now the > Spark community has suggested not investing in the old DSv2 and waiting for the > new DSv2 instead. > The only option consistent between Spark 2.x and 3.x is DSv1, but streaming > source support is missing from DSv1. This issue tracks the effort to add > streaming source support to DSv1. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29361) Enable streaming source support on DSv1
[ https://issues.apache.org/jira/browse/SPARK-29361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944987#comment-16944987 ] Jungtaek Lim commented on SPARK-29361: -- The plan for now is to overload the methods below (marked as "DeveloperApi") with a boolean parameter "isStreaming", like SQLContext.internalCreateDataFrame, which is not a public API. > SQLContext {code} def createDataFrame(rowRDD: RDD[Row], schema: StructType, isStreaming: Boolean): DataFrame def createDataFrame(rowRDD: JavaRDD[Row], schema: StructType, isStreaming: Boolean): DataFrame {code} > SparkSession {code} def createDataFrame(rowRDD: RDD[Row], schema: StructType, isStreaming: Boolean): DataFrame def createDataFrame(rowRDD: JavaRDD[Row], schema: StructType, isStreaming: Boolean): DataFrame {code} Since these ultimately call SparkSession.internalCreateDataFrame, which already has an isStreaming field, it is just a matter of passing one additional parameter through. Given that we don't allow default parameters in developer APIs (to support interop with Java), 4 new methods should be introduced instead of changing the 4 existing ones. > Enable streaming source support on DSv1 > > > Key: SPARK-29361 > URL: https://issues.apache.org/jira/browse/SPARK-29361 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 3.0.0 > Reporter: Jungtaek Lim > Priority: Major > > DSv2 has diverged heavily between Spark 2.x and 3.x, and for some time now the > Spark community has suggested not investing in the old DSv2 and waiting for the > new DSv2 instead. > The only option consistent between Spark 2.x and 3.x is DSv1, but streaming > source support is missing from DSv1. This issue tracks the effort to add > streaming source support to DSv1. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
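For illustration, a minimal sketch of how a DSv1 streaming source might use the proposed overload once it lands. The three-argument createDataFrame does not exist yet; the comment marks where the proposed isStreaming parameter would go, and everything else uses today's public API:

{code:scala}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

object ProposedOverloadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    val schema = StructType(Seq(StructField("value", LongType)))
    val rows: RDD[Row] = spark.sparkContext.parallelize(Seq(Row(1L), Row(2L)))

    // Today's two-argument public API; the proposed overload would add a third
    // parameter here, e.g. createDataFrame(rows, schema, isStreaming = true),
    // which currently only the internal internalCreateDataFrame accepts.
    val df: DataFrame = spark.createDataFrame(rows, schema)
    println(df.isStreaming) // prints false today; the overload would let sources mark this true
  }
}
{code}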
[jira] [Created] (SPARK-29361) Enable streaming source support on DSv1
Jungtaek Lim created SPARK-29361: Summary: Enable streaming source support on DSv1 Key: SPARK-29361 URL: https://issues.apache.org/jira/browse/SPARK-29361 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: Jungtaek Lim DSv2 has diverged heavily between Spark 2.x and 3.x, and for some time now the Spark community has suggested not investing in the old DSv2 and waiting for the new DSv2 instead. The only option consistent between Spark 2.x and 3.x is DSv1, but streaming source support is missing from DSv1. This issue tracks the effort to add streaming source support to DSv1. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29267) rdd.countApprox should stop when 'timeout'
[ https://issues.apache.org/jira/browse/SPARK-29267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944984#comment-16944984 ] Kangtian commented on SPARK-29267: -- [~hyukjin.kwon] Just finish when the timeout is reached; we don't need the final value. In my case, I used *only part of the partitions* to get an approximate count, without the timeout param (in my case, more than 100 partitions). !image-2019-10-05-12-38-26-867.png! !image-2019-10-05-12-38-52-039.png! > rdd.countApprox should stop when 'timeout' > -- > > Key: SPARK-29267 > URL: https://issues.apache.org/jira/browse/SPARK-29267 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 3.0.0 > Reporter: Kangtian > Priority: Minor > Attachments: image-2019-10-05-12-37-22-927.png, > image-2019-10-05-12-38-26-867.png, image-2019-10-05-12-38-52-039.png > > > {{The API for approximate counting: org.apache.spark.rdd.RDD#countApprox}} > +countApprox(timeout: Long, confidence: Double = 0.95)+ > > But: > when the timeout is reached, the job continues running until it actually finishes. > > We want: > *when the timeout is reached, the job should finish immediately*, without the final value > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
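For context, a small sketch of the API under discussion. initialValue returns the approximate result once the timeout elapses, but the underlying job keeps running to completion, which is the behavior this ticket asks to change; the data and partition count are illustrative:

{code:scala}
import org.apache.spark.sql.SparkSession

object CountApproxSketch {
  def main(args: Array[String]): Unit = {
    val sc = SparkSession.builder().master("local[*]").getOrCreate().sparkContext
    val rdd = sc.parallelize(1L to 10000000L, 200)

    // Returns a PartialResult; initialValue blocks for at most `timeout` ms
    // and then yields the approximate count with a confidence interval.
    val partial = rdd.countApprox(timeout = 1000L, confidence = 0.95)
    val approx = partial.initialValue
    println(s"approx count: ${approx.mean} in [${approx.low}, ${approx.high}]")
    // The job itself still runs until all partitions finish, even after
    // the timeout -- that is what this ticket wants to stop.
  }
}
{code}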
[jira] [Updated] (SPARK-29267) rdd.countApprox should stop when 'timeout'
[ https://issues.apache.org/jira/browse/SPARK-29267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kangtian updated SPARK-29267: - Attachment: image-2019-10-05-12-38-52-039.png > rdd.countApprox should stop when 'timeout' > -- > > Key: SPARK-29267 > URL: https://issues.apache.org/jira/browse/SPARK-29267 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 3.0.0 > Reporter: Kangtian > Priority: Minor > Attachments: image-2019-10-05-12-37-22-927.png, > image-2019-10-05-12-38-26-867.png, image-2019-10-05-12-38-52-039.png > > > {{The API for approximate counting: org.apache.spark.rdd.RDD#countApprox}} > +countApprox(timeout: Long, confidence: Double = 0.95)+ > > But: > when the timeout is reached, the job continues running until it actually finishes. > > We want: > *when the timeout is reached, the job should finish immediately*, without the final value > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29267) rdd.countApprox should stop when 'timeout'
[ https://issues.apache.org/jira/browse/SPARK-29267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kangtian updated SPARK-29267: - Attachment: image-2019-10-05-12-37-22-927.png > rdd.countApprox should stop when 'timeout' > -- > > Key: SPARK-29267 > URL: https://issues.apache.org/jira/browse/SPARK-29267 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 3.0.0 > Reporter: Kangtian > Priority: Minor > Attachments: image-2019-10-05-12-37-22-927.png > > > {{The API for approximate counting: org.apache.spark.rdd.RDD#countApprox}} > +countApprox(timeout: Long, confidence: Double = 0.95)+ > > But: > when the timeout is reached, the job continues running until it actually finishes. > > We want: > *when the timeout is reached, the job should finish immediately*, without the final value > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29267) rdd.countApprox should stop when 'timeout'
[ https://issues.apache.org/jira/browse/SPARK-29267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kangtian updated SPARK-29267: - Attachment: image-2019-10-05-12-38-26-867.png > rdd.countApprox should stop when 'timeout' > -- > > Key: SPARK-29267 > URL: https://issues.apache.org/jira/browse/SPARK-29267 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 3.0.0 > Reporter: Kangtian > Priority: Minor > Attachments: image-2019-10-05-12-37-22-927.png, > image-2019-10-05-12-38-26-867.png, image-2019-10-05-12-38-52-039.png > > > {{The API for approximate counting: org.apache.spark.rdd.RDD#countApprox}} > +countApprox(timeout: Long, confidence: Double = 0.95)+ > > But: > when the timeout is reached, the job continues running until it actually finishes. > > We want: > *when the timeout is reached, the job should finish immediately*, without the final value > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls
[ https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944927#comment-16944927 ] L. C. Hsieh commented on SPARK-29358: - I am also concerned that it moves even further away from SQL union. > Make unionByName optionally fill missing columns with nulls > --- > > Key: SPARK-29358 > URL: https://issues.apache.org/jira/browse/SPARK-29358 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.0.0 > Reporter: Mukul Murthy > Priority: Major > > Currently, unionByName requires two DataFrames to have the same set of > columns (even though the order can be different). It would be good to add > either an option to unionByName or a new type of union which fills in missing > columns with nulls. > {code:java} > val df1 = Seq(1, 2, 3).toDF("x") > val df2 = Seq("a", "b", "c").toDF("y") > df1.unionByName(df2){code} > This currently throws > {code:java} > org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among > (y); > {code} > Ideally, there would be a way to make this return a DataFrame containing: > {code:java} > +----+----+ > | x| y| > +----+----+ > | 1|null| > | 2|null| > | 3|null| > |null| a| > |null| b| > |null| c| > +----+----+ > {code} > Currently the workaround to make this possible still uses unionByName, but it > is clunky: > {code:java} > df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null))) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
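One way to make the workaround in the description reusable, as a hypothetical helper (not part of any Spark API): add each side's missing columns as null literals, then call the existing unionByName. Spark's union type coercion widens the NullType columns to the other side's types:

{code:scala}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// Hypothetical helper: fill each DataFrame's missing columns with nulls,
// then union the two by column name.
def unionByNameFillNulls(left: DataFrame, right: DataFrame): DataFrame = {
  val leftCols = left.columns.toSet
  val rightCols = right.columns.toSet
  val leftFilled = rightCols.diff(leftCols)
    .foldLeft(left)((df, c) => df.withColumn(c, lit(null)))
  val rightFilled = leftCols.diff(rightCols)
    .foldLeft(right)((df, c) => df.withColumn(c, lit(null)))
  leftFilled.unionByName(rightFilled)
}
{code}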
[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls
[ https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944899#comment-16944899 ] L. C. Hsieh commented on SPARK-29358: - My concern is that it breaks the current behavior of unionByName. > Make unionByName optionally fill missing columns with nulls > --- > > Key: SPARK-29358 > URL: https://issues.apache.org/jira/browse/SPARK-29358 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.0.0 > Reporter: Mukul Murthy > Priority: Major > > Currently, unionByName requires two DataFrames to have the same set of > columns (even though the order can be different). It would be good to add > either an option to unionByName or a new type of union which fills in missing > columns with nulls. > {code:java} > val df1 = Seq(1, 2, 3).toDF("x") > val df2 = Seq("a", "b", "c").toDF("y") > df1.unionByName(df2){code} > This currently throws > {code:java} > org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among > (y); > {code} > Ideally, there would be a way to make this return a DataFrame containing: > {code:java} > +----+----+ > | x| y| > +----+----+ > | 1|null| > | 2|null| > | 3|null| > |null| a| > |null| b| > |null| c| > +----+----+ > {code} > Currently the workaround to make this possible still uses unionByName, but it > is clunky: > {code:java} > df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null))) > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-27681) Use scala.collection.Seq explicitly instead of scala.Seq alias
[ https://issues.apache.org/jira/browse/SPARK-27681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944879#comment-16944879 ] Sean R. Owen edited comment on SPARK-27681 at 10/4/19 10:33 PM: I looked at this one more time today after clearing out some earlier 2.13 related issues. I'm pretty sure this is what we should do, all in all, which is much more in line with [~vanzin]'s take: - Generally, don't change all the Spark {{Seq}} usages in methods and return values. Just too much change. - Definitely fix all the compile errors within Spark that result in 2.13, by adding {{.toSeq}} or {{.toMap}} where applicable to get immutable versions from mutable Seqs, and Maps from MapViews (similar but different change in 2.13). This is what SPARK-29292 covers. - ... and that may be it. Maybe we find a few corner cases where a public API method really does need to fix its Seq type to make sense, but I hadn't found it yet after fixing most of core Yes, this means that user apps will experience many of the same compile errors when moving to 2.13. But that's true of any app at all moving from 2.12 to 2.13, and they're relatively easy to fix into a form that works on 2.12 and 2.13 by explicitly calling {{.toSeq}} etc. I don't think we need to fix that for users. was (Author: srowen): I looked at this one more time today after clearing out some earlier 2.13 related issues. I'm pretty sure this is what we should do, all in all, which is much more in line with [~vanzin]'s take: - Generally, don't change all the Spark {{Seq}} usages in methods and return values. Just too much change. - Definitely fix all the compile errors within Spark that result in 2.13, by adding {{.toSeq}} or {{.toMap}} where applicable to get immutable versions from mutable Seqs, and Maps from MapViews (similar but different change in 2.13). This is what SPARK-29292 covers. - ... and that may be it. Maybe we find a few corner cases where a public API method really does need to fix its Seq type to make sense, but I hadn't found it yet after fixing most of core > Use scala.collection.Seq explicitly instead of scala.Seq alias > -- > > Key: SPARK-27681 > URL: https://issues.apache.org/jira/browse/SPARK-27681 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib, Spark Core, SQL, Structured Streaming >Affects Versions: 3.0.0 >Reporter: Sean R. Owen >Assignee: Sean R. Owen >Priority: Major > > {{scala.Seq}} is widely used in the code, and is an alias for > {{scala.collection.Seq}} in Scala 2.12. It will become an alias for > {{scala.collection.immutable.Seq}} in Scala 2.13. In many cases, this will be > fine, as Spark users using Scala 2.13 will also have this changed alias. In > some cases it may be undesirable, as it will cause some code to compile in > 2.12 but not in 2.13. In some cases, making the type {{scala.collection.Seq}} > explicit so that it doesn't vary can help avoid this, so that Spark apps > might cross-compile for 2.12 and 2.13 with the same source. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
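For illustration, a minimal sketch of the cross-compiling fix described in the comment above: an explicit .toSeq makes a mutable collection conform to scala.Seq under both 2.12 (where it aliases collection.Seq) and 2.13 (where it aliases immutable.Seq):

{code:scala}
import scala.collection.mutable.ArrayBuffer

object SeqCrossCompileSketch {
  // On 2.12, Seq here means scala.collection.Seq; on 2.13 it means
  // scala.collection.immutable.Seq.
  def takesSeq(xs: Seq[Int]): Int = xs.sum

  def main(args: Array[String]): Unit = {
    val buf = ArrayBuffer(1, 2, 3)
    // takesSeq(buf) compiles on 2.12 but not on 2.13, because ArrayBuffer is
    // not an immutable.Seq; the explicit .toSeq compiles on both versions.
    println(takesSeq(buf.toSeq))
  }
}
{code}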
[jira] [Commented] (SPARK-27681) Use scala.collection.Seq explicitly instead of scala.Seq alias
[ https://issues.apache.org/jira/browse/SPARK-27681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944879#comment-16944879 ] Sean R. Owen commented on SPARK-27681: -- I looked at this one more time today after clearing out some earlier 2.13 related issues. I'm pretty sure this is what we should do, all in all, which is much more in line with [~vanzin]'s take: - Generally, don't change all the Spark {{Seq}} usages in methods and return values. Just too much change. - Definitely fix all the compile errors within Spark that result in 2.13, by adding {{.toSeq}} or {{.toMap}} where applicable to get immutable versions from mutable Seqs, and Maps from MapViews (similar but different change in 2.13). This is what SPARK-29292 covers. - ... and that may be it. Maybe we find a few corner cases where a public API method really does need to fix its Seq type to make sense, but I hadn't found it yet after fixing most of core > Use scala.collection.Seq explicitly instead of scala.Seq alias > -- > > Key: SPARK-27681 > URL: https://issues.apache.org/jira/browse/SPARK-27681 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib, Spark Core, SQL, Structured Streaming >Affects Versions: 3.0.0 >Reporter: Sean R. Owen >Assignee: Sean R. Owen >Priority: Major > > {{scala.Seq}} is widely used in the code, and is an alias for > {{scala.collection.Seq}} in Scala 2.12. It will become an alias for > {{scala.collection.immutable.Seq}} in Scala 2.13. In many cases, this will be > fine, as Spark users using Scala 2.13 will also have this changed alias. In > some cases it may be undesirable, as it will cause some code to compile in > 2.12 but not in 2.13. In some cases, making the type {{scala.collection.Seq}} > explicit so that it doesn't vary can help avoid this, so that Spark apps > might cross-compile for 2.12 and 2.13 with the same source. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-28813) Document SHOW CREATE TABLE in SQL Reference.
[ https://issues.apache.org/jira/browse/SPARK-28813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-28813: Assignee: Huaxin Gao > Document SHOW CREATE TABLE in SQL Reference. > > > Key: SPARK-28813 > URL: https://issues.apache.org/jira/browse/SPARK-28813 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 2.4.3 >Reporter: Dilip Biswal >Assignee: Huaxin Gao >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-28813) Document SHOW CREATE TABLE in SQL Reference.
[ https://issues.apache.org/jira/browse/SPARK-28813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-28813. -- Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25885 [https://github.com/apache/spark/pull/25885] > Document SHOW CREATE TABLE in SQL Reference. > > > Key: SPARK-28813 > URL: https://issues.apache.org/jira/browse/SPARK-28813 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 2.4.3 >Reporter: Dilip Biswal >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28813) Document SHOW CREATE TABLE in SQL Reference.
[ https://issues.apache.org/jira/browse/SPARK-28813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-28813: - Priority: Minor (was: Major) > Document SHOW CREATE TABLE in SQL Reference. > > > Key: SPARK-28813 > URL: https://issues.apache.org/jira/browse/SPARK-28813 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SQL >Affects Versions: 2.4.3 >Reporter: Dilip Biswal >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29108) Add window.sql - Part 2
[ https://issues.apache.org/jira/browse/SPARK-29108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29108: -- Fix Version/s: (was: 3.0.0) > Add window.sql - Part 2 > --- > > Key: SPARK-29108 > URL: https://issues.apache.org/jira/browse/SPARK-29108 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Dylan Guedes >Priority: Major > > In this ticket, we plan to add the regression test cases of > [https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L320-L562|https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L320-L562] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29109) Add window.sql - Part 3
[ https://issues.apache.org/jira/browse/SPARK-29109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29109: -- Fix Version/s: (was: 3.0.0) > Add window.sql - Part 3 > --- > > Key: SPARK-29109 > URL: https://issues.apache.org/jira/browse/SPARK-29109 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Dylan Guedes >Priority: Major > > In this ticket, we plan to add the regression test cases of > [https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L553-L911|https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L553-L911] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29110) Add window.sql - Part 4
[ https://issues.apache.org/jira/browse/SPARK-29110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29110: -- Fix Version/s: (was: 3.0.0) > Add window.sql - Part 4 > --- > > Key: SPARK-29110 > URL: https://issues.apache.org/jira/browse/SPARK-29110 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Dylan Guedes >Priority: Major > > In this ticket, we plan to add the regression test cases of > [https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L912-L1259|https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L912-L1259] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29107) Add window.sql - Part 1
[ https://issues.apache.org/jira/browse/SPARK-29107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29107: -- Fix Version/s: (was: 3.0.0) > Add window.sql - Part 1 > --- > > Key: SPARK-29107 > URL: https://issues.apache.org/jira/browse/SPARK-29107 > Project: Spark > Issue Type: Sub-task > Components: SQL, Tests >Affects Versions: 3.0.0 >Reporter: Dylan Guedes >Priority: Major > > In this ticket, we plan to add the regression test cases of > https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L1-L319 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29323) Add tooltip for The Executors Tab's column names in the Spark history server Page
[ https://issues.apache.org/jira/browse/SPARK-29323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29323: -- Affects Version/s: (was: 2.4.4) 3.0.0 > Add tooltip for The Executors Tab's column names in the Spark history server > Page > - > > Key: SPARK-29323 > URL: https://issues.apache.org/jira/browse/SPARK-29323 > Project: Spark > Issue Type: New Feature > Components: Web UI > Affects Versions: 3.0.0 > Reporter: liucht-inspur > Priority: Major > Attachments: image-2019-10-04-09-42-14-174.png > > > On the Executors tab of the Spark history server page, the Summary section > shows the column titles, but the formatting is irregular. > Some column names have a tooltip, such as Storage Memory, Task Time (GC Time), > Input, Shuffle Read, Shuffle Write and Blacklisted, but some column names > still have no tooltip: RDD Blocks, Disk Used, Cores, Active Tasks, Failed > Tasks, Complete Tasks and Total Tasks. Oddly, in the Executors section below, > all of these column names do have tooltips. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29323) Add tooltip for The Executors Tab's column names in the Spark history server Page
[ https://issues.apache.org/jira/browse/SPARK-29323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944827#comment-16944827 ] Dongjoon Hyun commented on SPARK-29323: --- Hi, [~liucht-inspur]. Thank you for creating a JIRA, but please don't use `Fixed Version` because this is not fixed yet. For the details, please see http://spark.apache.org/contributing.html . > Add tooltip for The Executors Tab's column names in the Spark history server > Page > - > > Key: SPARK-29323 > URL: https://issues.apache.org/jira/browse/SPARK-29323 > Project: Spark > Issue Type: New Feature > Components: Web UI > Affects Versions: 2.4.4 > Reporter: liucht-inspur > Priority: Major > Fix For: 2.4.4 > > Attachments: image-2019-10-04-09-42-14-174.png > > > On the Executors tab of the Spark history server page, the Summary section > shows the column titles, but the formatting is irregular. > Some column names have a tooltip, such as Storage Memory, Task Time (GC Time), > Input, Shuffle Read, Shuffle Write and Blacklisted, but some column names > still have no tooltip: RDD Blocks, Disk Used, Cores, Active Tasks, Failed > Tasks, Complete Tasks and Total Tasks. Oddly, in the Executors section below, > all of these column names do have tooltips. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29323) Add tooltip for The Executors Tab's column names in the Spark history server Page
[ https://issues.apache.org/jira/browse/SPARK-29323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29323: -- Fix Version/s: (was: 2.4.4) > Add tooltip for The Executors Tab's column names in the Spark history server > Page > - > > Key: SPARK-29323 > URL: https://issues.apache.org/jira/browse/SPARK-29323 > Project: Spark > Issue Type: New Feature > Components: Web UI > Affects Versions: 2.4.4 > Reporter: liucht-inspur > Priority: Major > Attachments: image-2019-10-04-09-42-14-174.png > > > On the Executors tab of the Spark history server page, the Summary section > shows the column titles, but the formatting is irregular. > Some column names have a tooltip, such as Storage Memory, Task Time (GC Time), > Input, Shuffle Read, Shuffle Write and Blacklisted, but some column names > still have no tooltip: RDD Blocks, Disk Used, Cores, Active Tasks, Failed > Tasks, Complete Tasks and Total Tasks. Oddly, in the Executors section below, > all of these column names do have tooltips. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29225) Spark SQL 'DESC FORMATTED TABLE' show different format with hive
[ https://issues.apache.org/jira/browse/SPARK-29225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944826#comment-16944826 ] Dongjoon Hyun commented on SPARK-29225: --- Hi, [~angerszhuuu]. Please use `3.0.0` for the improvement issue. > Spark SQL 'DESC FORMATTED TABLE' shows different format from Hive > > > Key: SPARK-29225 > URL: https://issues.apache.org/jira/browse/SPARK-29225 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.0.0 > Reporter: angerszhu > Priority: Minor > Attachments: Screen Shot 2019-09-24 at 9.14.39 PM.png, Screen Shot > 2019-09-24 at 9.26.14 PM.png, current saprk.jpg > > > Currently, `DESC FORMATTED TABLE` shows a different table description format; > this problem means HUE can't parse column information correctly: > *spark* > !current saprk.jpg! > *HIVE* > !Screen Shot 2019-09-24 at 9.26.14 PM.png! > *Spark SQL* *expected* > !Screen Shot 2019-09-24 at 9.14.39 PM.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
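To reproduce the format comparison locally, something like the following; the table name and columns are made up for illustration:

{code:scala}
// Create a throwaway table and inspect the DESC FORMATTED layout that
// downstream tools such as HUE have to parse.
spark.sql("CREATE TABLE IF NOT EXISTS desc_demo (id INT, name STRING) USING parquet")
spark.sql("DESC FORMATTED desc_demo").show(100, truncate = false)
{code}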
[jira] [Updated] (SPARK-29225) Spark SQL 'DESC FORMATTED TABLE' show different format with hive
[ https://issues.apache.org/jira/browse/SPARK-29225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29225: -- Priority: Minor (was: Major) > Spark SQL 'DESC FORMATTED TABLE' shows different format from Hive > > > Key: SPARK-29225 > URL: https://issues.apache.org/jira/browse/SPARK-29225 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.4.0, 3.0.0 > Reporter: angerszhu > Priority: Minor > Attachments: Screen Shot 2019-09-24 at 9.14.39 PM.png, Screen Shot > 2019-09-24 at 9.26.14 PM.png, current saprk.jpg > > > Currently, `DESC FORMATTED TABLE` shows a different table description format; > this problem means HUE can't parse column information correctly: > *spark* > !current saprk.jpg! > *HIVE* > !Screen Shot 2019-09-24 at 9.26.14 PM.png! > *Spark SQL* *expected* > !Screen Shot 2019-09-24 at 9.14.39 PM.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29225) Spark SQL 'DESC FORMATTED TABLE' show different format with hive
[ https://issues.apache.org/jira/browse/SPARK-29225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29225: -- Affects Version/s: (was: 2.4.0) > Spark SQL 'DESC FORMATTED TABLE' shows different format from Hive > > > Key: SPARK-29225 > URL: https://issues.apache.org/jira/browse/SPARK-29225 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.0.0 > Reporter: angerszhu > Priority: Minor > Attachments: Screen Shot 2019-09-24 at 9.14.39 PM.png, Screen Shot > 2019-09-24 at 9.26.14 PM.png, current saprk.jpg > > > Currently, `DESC FORMATTED TABLE` shows a different table description format; > this problem means HUE can't parse column information correctly: > *spark* > !current saprk.jpg! > *HIVE* > !Screen Shot 2019-09-24 at 9.26.14 PM.png! > *Spark SQL* *expected* > !Screen Shot 2019-09-24 at 9.14.39 PM.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25440) Dump query execution info to a file
[ https://issues.apache.org/jira/browse/SPARK-25440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944822#comment-16944822 ] Maxim Gekk commented on SPARK-25440: [~shashidha...@gmail.com] Call *df.queryExecution.debug.toFile()*. You can see examples in this test suite: https://github.com/apache/spark/blob/97dc4c0bfc3a15d364a376c6f87cb921d8d6980d/sql/core/src/test/scala/org/apache/spark/sql/execution/QueryExecutionSuite.scala > Dump query execution info to a file > --- > > Key: SPARK-25440 > URL: https://issues.apache.org/jira/browse/SPARK-25440 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.3.1 > Reporter: Maxim Gekk > Priority: Minor > Fix For: 3.0.0 > > > The output of explain() doesn't contain full information and in some cases > can be truncated. Besides that, it saves the info to a string in memory, > which can cause OOM. This ticket aims to solve the problem by dumping info > about query execution to a file. We need to add a new method to > queryExecution.debug which accepts a path to a file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
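For reference, a minimal sketch of the suggested call, assuming a Spark 3.0 build where debug.toFile is available; the query and output path are illustrative:

{code:scala}
val df = spark.range(100).selectExpr("id", "id * 2 AS doubled").filter("doubled > 10")
// Writes the full, untruncated plan and execution info to the given path
// instead of building the whole string in memory.
df.queryExecution.debug.toFile("/tmp/query-execution-debug.txt")
{code}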
[jira] [Commented] (SPARK-29360) PySpark FPGrowthModel supports getter/setter
[ https://issues.apache.org/jira/browse/SPARK-29360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944821#comment-16944821 ] Huaxin Gao commented on SPARK-29360: I will submit a PR soon. > PySpark FPGrowthModel supports getter/setter > > > Key: SPARK-29360 > URL: https://issues.apache.org/jira/browse/SPARK-29360 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: Huaxin Gao >Priority: Minor > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29360) PySpark FPGrowthModel supports getter/setter
Huaxin Gao created SPARK-29360: -- Summary: PySpark FPGrowthModel supports getter/setter Key: SPARK-29360 URL: https://issues.apache.org/jira/browse/SPARK-29360 Project: Spark Issue Type: Sub-task Components: ML, PySpark Affects Versions: 3.0.0 Reporter: Huaxin Gao -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-25753) binaryFiles broken for small files
[ https://issues.apache.org/jira/browse/SPARK-25753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944817#comment-16944817 ] Dongjoon Hyun edited comment on SPARK-25753 at 10/4/19 8:51 PM: This was merged to master via https://github.com/apache/spark/pull/22725 and backported to branch-2.4 via https://github.com/apache/spark/pull/26026 . was (Author: dongjoon): This is backported to branch-2.4 via https://github.com/apache/spark/pull/26026 . > binaryFiles broken for small files > -- > > Key: SPARK-25753 > URL: https://issues.apache.org/jira/browse/SPARK-25753 > Project: Spark > Issue Type: Bug > Components: Input/Output > Affects Versions: 2.4.4, 3.0.0 > Reporter: liuxian > Assignee: liuxian > Priority: Minor > Fix For: 2.4.5, 3.0.0 > > > _{{StreamFileInputFormat}}_ and > {{_WholeTextFileInputFormat_(https://issues.apache.org/jira/browse/SPARK-24610)}} > have the same problem: for small files, the maxSplitSize computed by > _{{StreamFileInputFormat}}_ is far smaller than the default or commonly used > split size of 64/128 MB, and Spark throws an exception while trying to read > them. > {{Exception info:}} > _{{Minimum split size pernode 5123456 cannot be larger than maximum split > size 4194304 java.io.IOException: Minimum split size pernode 5123456 cannot > be larger than maximum split size 4194304 at > org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java: > 201) at > org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:52) at > org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:254) at > org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at > scala.Option.getOrElse(Option.scala:121) at > org.apache.spark.rdd.RDD.partitions(RDD.scala:252) at > org.apache.spark.SparkContext.runJob(SparkContext.scala:2138)}}_ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
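A small repro sketch of the failure mode described below, assuming a directory containing many tiny files (the path is illustrative):

{code:scala}
// Before the fix, reading many small files could fail with
// "Minimum split size pernode ... cannot be larger than maximum split size ..."
// because the maxSplitSize computed from (total bytes / minPartitions) could
// fall below the configured minimum split sizes.
val files = spark.sparkContext.binaryFiles("/tmp/many-small-files", minPartitions = 2)
println(files.keys.count())
{code}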
[jira] [Updated] (SPARK-25753) binaryFiles broken for small files
[ https://issues.apache.org/jira/browse/SPARK-25753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25753: -- Fix Version/s: 2.4.5 > binaryFiles broken for small files > -- > > Key: SPARK-25753 > URL: https://issues.apache.org/jira/browse/SPARK-25753 > Project: Spark > Issue Type: Bug > Components: Input/Output > Affects Versions: 2.4.4, 3.0.0 > Reporter: liuxian > Assignee: liuxian > Priority: Minor > Fix For: 2.4.5, 3.0.0 > > > _{{StreamFileInputFormat}}_ and > {{_WholeTextFileInputFormat_(https://issues.apache.org/jira/browse/SPARK-24610)}} > have the same problem: for small files, the maxSplitSize computed by > _{{StreamFileInputFormat}}_ is far smaller than the default or commonly used > split size of 64/128 MB, and Spark throws an exception while trying to read > them. > {{Exception info:}} > _{{Minimum split size pernode 5123456 cannot be larger than maximum split > size 4194304 java.io.IOException: Minimum split size pernode 5123456 cannot > be larger than maximum split size 4194304 at > org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java: > 201) at > org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:52) at > org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:254) at > org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at > scala.Option.getOrElse(Option.scala:121) at > org.apache.spark.rdd.RDD.partitions(RDD.scala:252) at > org.apache.spark.SparkContext.runJob(SparkContext.scala:2138)}}_ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-25753) binaryFiles broken for small files
[ https://issues.apache.org/jira/browse/SPARK-25753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944817#comment-16944817 ] Dongjoon Hyun commented on SPARK-25753: --- This is backported to branch-2.4 via https://github.com/apache/spark/pull/26026 . > binaryFiles broken for small files > -- > > Key: SPARK-25753 > URL: https://issues.apache.org/jira/browse/SPARK-25753 > Project: Spark > Issue Type: Bug > Components: Input/Output > Affects Versions: 2.4.4, 3.0.0 > Reporter: liuxian > Assignee: liuxian > Priority: Minor > Fix For: 2.4.5, 3.0.0 > > > _{{StreamFileInputFormat}}_ and > {{_WholeTextFileInputFormat_(https://issues.apache.org/jira/browse/SPARK-24610)}} > have the same problem: for small files, the maxSplitSize computed by > _{{StreamFileInputFormat}}_ is far smaller than the default or commonly used > split size of 64/128 MB, and Spark throws an exception while trying to read > them. > {{Exception info:}} > _{{Minimum split size pernode 5123456 cannot be larger than maximum split > size 4194304 java.io.IOException: Minimum split size pernode 5123456 cannot > be larger than maximum split size 4194304 at > org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java: > 201) at > org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:52) at > org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:254) at > org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at > scala.Option.getOrElse(Option.scala:121) at > org.apache.spark.rdd.RDD.partitions(RDD.scala:252) at > org.apache.spark.SparkContext.runJob(SparkContext.scala:2138)}}_ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-25753) binaryFiles broken for small files
[ https://issues.apache.org/jira/browse/SPARK-25753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-25753: -- Affects Version/s: 2.4.4 > binaryFiles broken for small files > -- > > Key: SPARK-25753 > URL: https://issues.apache.org/jira/browse/SPARK-25753 > Project: Spark > Issue Type: Bug > Components: Input/Output > Affects Versions: 2.4.4, 3.0.0 > Reporter: liuxian > Assignee: liuxian > Priority: Minor > Fix For: 3.0.0 > > > _{{StreamFileInputFormat}}_ and > {{_WholeTextFileInputFormat_(https://issues.apache.org/jira/browse/SPARK-24610)}} > have the same problem: for small files, the maxSplitSize computed by > _{{StreamFileInputFormat}}_ is far smaller than the default or commonly used > split size of 64/128 MB, and Spark throws an exception while trying to read > them. > {{Exception info:}} > _{{Minimum split size pernode 5123456 cannot be larger than maximum split > size 4194304 java.io.IOException: Minimum split size pernode 5123456 cannot > be larger than maximum split size 4194304 at > org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java: > 201) at > org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:52) at > org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:254) at > org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at > scala.Option.getOrElse(Option.scala:121) at > org.apache.spark.rdd.RDD.partitions(RDD.scala:252) at > org.apache.spark.SparkContext.runJob(SparkContext.scala:2138)}}_ -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24540) Support for multiple character delimiter in Spark CSV read
[ https://issues.apache.org/jira/browse/SPARK-24540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jeff Evans updated SPARK-24540: --- Summary: Support for multiple character delimiter in Spark CSV read (was: Support for multiple delimiter in Spark CSV read) > Support for multiple character delimiter in Spark CSV read > -- > > Key: SPARK-24540 > URL: https://issues.apache.org/jira/browse/SPARK-24540 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.3.1 > Reporter: Ashwin K > Priority: Major > > Currently, the delimiter option used by Spark 2.0 to read and split CSV > files/data only supports a single-character delimiter. If we try to provide > multiple-character delimiters, we observe the following error message. > eg: Dataset<Row> df = spark.read().option("inferSchema", "true") > .option("header", "false") > .option("delimiter", ", ") > .csv("C:\test.txt"); > Exception in thread "main" java.lang.IllegalArgumentException: Delimiter > cannot be more than one character: , > at > org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:111) > at > org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:83) > at > org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:39) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:55) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227) > at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596) > at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473) > > Generally, the data to be processed contains multiple delimiters, and > presently we need to do a manual data clean-up on the source/input file, > which doesn't work well in large applications that consume numerous files. > There seem to be workarounds like reading the data as text and using the > split option, but in my opinion this defeats the purpose, advantage and > efficiency of a direct read from a CSV file. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
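Once the linked change lands (it targets Spark 3.0), a multi-character delimiter should be accepted directly; a sketch, with the delimiter and file path chosen for illustration:

{code:scala}
// With multi-character delimiter support, "||" no longer raises
// IllegalArgumentException ("Delimiter cannot be more than one character").
val df = spark.read
  .option("header", "false")
  .option("inferSchema", "true")
  .option("delimiter", "||")
  .csv("/tmp/double-pipe-delimited.txt")
df.show()
{code}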
[jira] [Commented] (SPARK-24540) Support for multiple delimiter in Spark CSV read
[ https://issues.apache.org/jira/browse/SPARK-24540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944783#comment-16944783 ] Jeff Evans commented on SPARK-24540: I created a pull request to support this (which was linked above). I'm not entirely clear on why SPARK-17967 would be a blocker, though. > Support for multiple delimiter in Spark CSV read > > > Key: SPARK-24540 > URL: https://issues.apache.org/jira/browse/SPARK-24540 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 2.3.1 > Reporter: Ashwin K > Priority: Major > > Currently, the delimiter option used by Spark 2.0 to read and split CSV > files/data only supports a single-character delimiter. If we try to provide > multiple-character delimiters, we observe the following error message. > eg: Dataset<Row> df = spark.read().option("inferSchema", "true") > .option("header", "false") > .option("delimiter", ", ") > .csv("C:\test.txt"); > Exception in thread "main" java.lang.IllegalArgumentException: Delimiter > cannot be more than one character: , > at > org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:111) > at > org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:83) > at > org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:39) > at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:55) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202) > at > org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202) > at scala.Option.orElse(Option.scala:289) > at > org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201) > at > org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392) > at > org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239) > at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227) > at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596) > at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473) > > Generally, the data to be processed contains multiple delimiters, and > presently we need to do a manual data clean-up on the source/input file, > which doesn't work well in large applications that consume numerous files. > There seem to be workarounds like reading the data as text and using the > split option, but in my opinion this defeats the purpose, advantage and > efficiency of a direct read from a CSV file. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29359) Better exception handling in SQLQueryTestSuite and ThriftServerQueryTestSuite
Peter Toth created SPARK-29359: -- Summary: Better exception handling in SQLQueryTestSuite and ThriftServerQueryTestSuite Key: SPARK-29359 URL: https://issues.apache.org/jira/browse/SPARK-29359 Project: Spark Issue Type: Bug Components: Tests Affects Versions: 3.0.0 Reporter: Peter Toth SQLQueryTestSuite and ThriftServerQueryTestSuite should have the same exception handling to avoid issues like this: {noformat} Expected "[Recursion level limit 100 reached but query has not exhausted, try increasing spark.sql.cte.recursion.level.limit org.apache.spark.SparkException]", but got "[org.apache.spark.SparkException Recursion level limit 100 reached but query has not exhausted, try increasing spark.sql.cte.recursion.level.limit]" {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29340) Spark Sql executions do not use thread local jobgroup
[ https://issues.apache.org/jira/browse/SPARK-29340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Navdeep Poonia resolved SPARK-29340. Resolution: Not A Problem Hi [~hyukjin.kwon], thanks for your response. You made me question our internal codebase, and after a few hours of debugging it turns out that one of our Scala collections used .par iteration over Spark actions, which creates new threads, and hence the thread-local Spark configs were not propagated to the task scheduler. Job groups with Spark SQL are working perfectly as expected. > Spark Sql executions do not use thread local jobgroup > - > > Key: SPARK-29340 > URL: https://issues.apache.org/jira/browse/SPARK-29340 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 2.4.4 > Reporter: Navdeep Poonia > Priority: Major > > val sparkThreadLocal: SparkSession = DataCurator.spark.newSession() > sparkThreadLocal.sparkContext.setJobGroup("", "") > OR > sparkThreadLocal.sparkContext.setLocalProperty("spark.job.description", "") > sparkThreadLocal.sparkContext.setLocalProperty("spark.jobGroup.id", "") > > The jobgroup property works fine for Spark jobs/stages created by Spark > DataFrame operations, but in the case of Spark SQL the jobgroup is randomly > assigned to stages or is sometimes null. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
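For future readers, a minimal sketch of the pitfall found here, under the assumption that the job group is a thread-local property that .par worker threads may not inherit; the collection and job sizes are illustrative:

{code:scala}
val sc = spark.sparkContext
sc.setJobGroup("my-group", "jobs from the calling thread carry this group")
sc.parallelize(1 to 10).count() // submitted from this thread: tagged "my-group"

// .par submits work from ForkJoinPool worker threads; the thread-local job
// group set above may not be propagated to jobs launched from those threads.
(1 to 3).par.foreach { _ =>
  sc.parallelize(1 to 10).count() // may run with no job group at all
}
{code}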
[jira] [Updated] (SPARK-11150) Dynamic partition pruning
[ https://issues.apache.org/jira/browse/SPARK-11150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-11150: Labels: release-notes (was: ) > Dynamic partition pruning > - > > Key: SPARK-11150 > URL: https://issues.apache.org/jira/browse/SPARK-11150 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.5.1, 1.6.0, 2.0.0, 2.1.2, 2.2.1, 2.3.0 >Reporter: Younes >Assignee: Wei Xue >Priority: Major > Labels: release-notes > Fix For: 3.0.0 > > Attachments: image-2019-10-04-11-20-02-616.png > > > Implements dynamic partition pruning by adding a dynamic-partition-pruning > filter if there is a partitioned table and a filter on the dimension table. > The filter is then planned using a heuristic approach: > # As a broadcast relation if it is a broadcast hash join. The broadcast > relation will then be transformed into a reused broadcast exchange by the > {{ReuseExchange}} rule; or > # As a subquery duplicate if the estimated benefit of partition table scan > being saved is greater than the estimated cost of the extra scan of the > duplicated subquery; otherwise > # As a bypassed condition ({{true}}). > Below shows a basic example of DPP. > !image-2019-10-04-11-20-02-616.png|width=521,height=225! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
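A minimal SQL illustration of when DPP kicks in; the table and column names are assumptions made for the example. The selective filter on the dimension table is turned into a runtime pruning filter on the fact table's partition column:

{code:scala}
// fact is partitioned by part_col; dim is a small dimension table.
spark.range(1000)
  .selectExpr("id", "id % 10 AS part_col")
  .write.partitionBy("part_col").saveAsTable("fact")
spark.range(10)
  .selectExpr("id AS part_col", "id = 3 AS flag")
  .write.saveAsTable("dim")

// With dynamic partition pruning, only the fact partitions whose part_col
// matches dim rows with flag = true are scanned.
spark.sql("""
  SELECT f.id
  FROM fact f JOIN dim d ON f.part_col = d.part_col
  WHERE d.flag
""").show()
{code}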
[jira] [Updated] (SPARK-11150) Dynamic partition pruning
[ https://issues.apache.org/jira/browse/SPARK-11150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-11150: Description: Implements dynamic partition pruning by adding a dynamic-partition-pruning filter if there is a partitioned table and a filter on the dimension table. The filter is then planned using a heuristic approach: # As a broadcast relation if it is a broadcast hash join. The broadcast relation will then be transformed into a reused broadcast exchange by the {{ReuseExchange}} rule; or # As a subquery duplicate if the estimated benefit of partition table scan being saved is greater than the estimated cost of the extra scan of the duplicated subquery; otherwise # As a bypassed condition ({{true}}). Below shows a basic example of DPP. !image-2019-10-04-11-20-02-616.png|width=521,height=225! was: Implements dynamic partition pruning by adding a dynamic-partition-pruning filter if there is a partitioned table and a filter on the dimension table. The filter is then planned using a heuristic approach: # As a broadcast relation if it is a broadcast hash join. The broadcast relation will then be transformed into a reused broadcast exchange by the {{ReuseExchange}} rule; or # As a subquery duplicate if the estimated benefit of partition table scan being saved is greater than the estimated cost of the extra scan of the duplicated subquery; otherwise # As a bypassed condition ({{true}}). Below is an example to show how it takes an effect !image-2019-10-04-11-20-02-616.png|width=521,height=225! > Dynamic partition pruning > - > > Key: SPARK-11150 > URL: https://issues.apache.org/jira/browse/SPARK-11150 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.5.1, 1.6.0, 2.0.0, 2.1.2, 2.2.1, 2.3.0 >Reporter: Younes >Assignee: Wei Xue >Priority: Major > Fix For: 3.0.0 > > Attachments: image-2019-10-04-11-20-02-616.png > > > Implements dynamic partition pruning by adding a dynamic-partition-pruning > filter if there is a partitioned table and a filter on the dimension table. > The filter is then planned using a heuristic approach: > # As a broadcast relation if it is a broadcast hash join. The broadcast > relation will then be transformed into a reused broadcast exchange by the > {{ReuseExchange}} rule; or > # As a subquery duplicate if the estimated benefit of partition table scan > being saved is greater than the estimated cost of the extra scan of the > duplicated subquery; otherwise > # As a bypassed condition ({{true}}). > Below shows a basic example of DPP. > !image-2019-10-04-11-20-02-616.png|width=521,height=225! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11150) Dynamic partition pruning
[ https://issues.apache.org/jira/browse/SPARK-11150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-11150: Description: Implements dynamic partition pruning by adding a dynamic-partition-pruning filter if there is a partitioned table and a filter on the dimension table. The filter is then planned using a heuristic approach: # As a broadcast relation if it is a broadcast hash join. The broadcast relation will then be transformed into a reused broadcast exchange by the {{ReuseExchange}} rule; or # As a subquery duplicate if the estimated benefit of partition table scan being saved is greater than the estimated cost of the extra scan of the duplicated subquery; otherwise # As a bypassed condition ({{true}}). Below is an example to show how it takes an effect !image-2019-10-04-11-20-02-616.png|width=521,height=225! was: Implements dynamic partition pruning by adding a dynamic-partition-pruning filter if there is a partitioned table and a filter on the dimension table. The filter is then planned using a heuristic approach: # As a broadcast relation if it is a broadcast hash join. The broadcast relation will then be transformed into a reused broadcast exchange by the {{ReuseExchange}} rule; or # As a subquery duplicate if the estimated benefit of partition table scan being saved is greater than the estimated cost of the extra scan of the duplicated subquery; otherwise # As a bypassed condition ({{true}}). !image-2019-10-04-11-20-02-616.png|width=521,height=225! > Dynamic partition pruning > - > > Key: SPARK-11150 > URL: https://issues.apache.org/jira/browse/SPARK-11150 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.5.1, 1.6.0, 2.0.0, 2.1.2, 2.2.1, 2.3.0 >Reporter: Younes >Assignee: Wei Xue >Priority: Major > Fix For: 3.0.0 > > Attachments: image-2019-10-04-11-20-02-616.png > > > Implements dynamic partition pruning by adding a dynamic-partition-pruning > filter if there is a partitioned table and a filter on the dimension table. > The filter is then planned using a heuristic approach: > # As a broadcast relation if it is a broadcast hash join. The broadcast > relation will then be transformed into a reused broadcast exchange by the > {{ReuseExchange}} rule; or > # As a subquery duplicate if the estimated benefit of partition table scan > being saved is greater than the estimated cost of the extra scan of the > duplicated subquery; otherwise > # As a bypassed condition ({{true}}). > Below is an example to show how it takes an effect > !image-2019-10-04-11-20-02-616.png|width=521,height=225! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11150) Dynamic partition pruning
[ https://issues.apache.org/jira/browse/SPARK-11150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-11150: Attachment: image-2019-10-04-11-20-02-616.png > Dynamic partition pruning > - > > Key: SPARK-11150 > URL: https://issues.apache.org/jira/browse/SPARK-11150 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.5.1, 1.6.0, 2.0.0, 2.1.2, 2.2.1, 2.3.0 >Reporter: Younes >Assignee: Wei Xue >Priority: Major > Fix For: 3.0.0 > > Attachments: image-2019-10-04-11-20-02-616.png > > > Partitions are not pruned when joined on the partition columns. > This is the same issue as HIVE-9152. > Ex: > Select from tab where partcol=1 will prune on value 1 > Select from tab join dim on (dim.partcol=tab.partcol) where > dim.partcol=1 will scan all partitions. > Tables are based on parquets. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11150) Dynamic partition pruning
[ https://issues.apache.org/jira/browse/SPARK-11150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-11150: Description: Implements dynamic partition pruning by adding a dynamic-partition-pruning filter if there is a partitioned table and a filter on the dimension table. The filter is then planned using a heuristic approach: # As a broadcast relation if it is a broadcast hash join. The broadcast relation will then be transformed into a reused broadcast exchange by the {{ReuseExchange}} rule; or # As a subquery duplicate if the estimated benefit of partition table scan being saved is greater than the estimated cost of the extra scan of the duplicated subquery; otherwise # As a bypassed condition ({{true}}). !image-2019-10-04-11-20-02-616.png! was: Partitions are not pruned when joined on the partition columns. This is the same issue as HIVE-9152. Ex: Select from tab where partcol=1 will prune on value 1 Select from tab join dim on (dim.partcol=tab.partcol) where dim.partcol=1 will scan all partitions. Tables are based on parquets. > Dynamic partition pruning > - > > Key: SPARK-11150 > URL: https://issues.apache.org/jira/browse/SPARK-11150 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.5.1, 1.6.0, 2.0.0, 2.1.2, 2.2.1, 2.3.0 >Reporter: Younes >Assignee: Wei Xue >Priority: Major > Fix For: 3.0.0 > > Attachments: image-2019-10-04-11-20-02-616.png > > > Implements dynamic partition pruning by adding a dynamic-partition-pruning > filter if there is a partitioned table and a filter on the dimension table. > The filter is then planned using a heuristic approach: > # As a broadcast relation if it is a broadcast hash join. The broadcast > relation will then be transformed into a reused broadcast exchange by the > {{ReuseExchange}} rule; or > # As a subquery duplicate if the estimated benefit of partition table scan > being saved is greater than the estimated cost of the extra scan of the > duplicated subquery; otherwise > # As a bypassed condition ({{true}}). > > !image-2019-10-04-11-20-02-616.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11150) Dynamic partition pruning
[ https://issues.apache.org/jira/browse/SPARK-11150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-11150: Description: Implements dynamic partition pruning by adding a dynamic-partition-pruning filter if there is a partitioned table and a filter on the dimension table. The filter is then planned using a heuristic approach: # As a broadcast relation if it is a broadcast hash join. The broadcast relation will then be transformed into a reused broadcast exchange by the {{ReuseExchange}} rule; or # As a subquery duplicate if the estimated benefit of partition table scan being saved is greater than the estimated cost of the extra scan of the duplicated subquery; otherwise # As a bypassed condition ({{true}}). !image-2019-10-04-11-20-02-616.png|width=521,height=225! was: Implements dynamic partition pruning by adding a dynamic-partition-pruning filter if there is a partitioned table and a filter on the dimension table. The filter is then planned using a heuristic approach: # As a broadcast relation if it is a broadcast hash join. The broadcast relation will then be transformed into a reused broadcast exchange by the {{ReuseExchange}} rule; or # As a subquery duplicate if the estimated benefit of partition table scan being saved is greater than the estimated cost of the extra scan of the duplicated subquery; otherwise # As a bypassed condition ({{true}}). !image-2019-10-04-11-20-02-616.png! > Dynamic partition pruning > - > > Key: SPARK-11150 > URL: https://issues.apache.org/jira/browse/SPARK-11150 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 1.5.1, 1.6.0, 2.0.0, 2.1.2, 2.2.1, 2.3.0 >Reporter: Younes >Assignee: Wei Xue >Priority: Major > Fix For: 3.0.0 > > Attachments: image-2019-10-04-11-20-02-616.png > > > Implements dynamic partition pruning by adding a dynamic-partition-pruning > filter if there is a partitioned table and a filter on the dimension table. > The filter is then planned using a heuristic approach: > # As a broadcast relation if it is a broadcast hash join. The broadcast > relation will then be transformed into a reused broadcast exchange by the > {{ReuseExchange}} rule; or > # As a subquery duplicate if the estimated benefit of partition table scan > being saved is greater than the estimated cost of the extra scan of the > duplicated subquery; otherwise > # As a bypassed condition ({{true}}). > > !image-2019-10-04-11-20-02-616.png|width=521,height=225! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
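A minimal sketch of the kind of query the description above targets, assuming a fact table {{sales}} partitioned by {{sale_date}} and a dimension table {{dates}} (both names are illustrative, not from the ticket), with an active SparkSession {{spark}}:
{code:java}
// With DPP, the filter on the dimension table is pushed across the join,
// so only the matching partitions of the partitioned fact table are scanned
// instead of the whole table.
val pruned = spark.sql("""
  SELECT s.*
  FROM sales s
  JOIN dates d ON s.sale_date = d.sale_date
  WHERE d.year = 2019
""")
pruned.explain(true) // the scan of `sales` should carry a dynamic pruning filter
{code}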
[jira] [Commented] (SPARK-29337) How to Cache Table and Pin it in Memory and should not Spill to Disk on Thrift Server
[ https://issues.apache.org/jira/browse/SPARK-29337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944736#comment-16944736 ] Srini E commented on SPARK-29337: - Hi Wang, We don't have cache options when we are caching a table through spark-sql. We can just use {{CACHE TABLE <table name>}}, and there are no options such as storage level. > How to Cache Table and Pin it in Memory and should not Spill to Disk on > Thrift Server > -- > > Key: SPARK-29337 > URL: https://issues.apache.org/jira/browse/SPARK-29337 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 2.3.0 >Reporter: Srini E >Priority: Major > Attachments: Cache+Image.png > > > Hi Team, > How can we pin a table in cache so it does not swap out of memory? > Situation: We are using MicroStrategy BI reporting. A semantic layer is built. > We wanted to cache highly used tables using Spark SQL {{CACHE TABLE <table name>}}; we did cache them for the Spark context (Thrift server). Please see the > attached snapshot: the cached table went to disk over time. Initially it was all > in cache; now some is in cache and some on disk. That disk may be a local disk, > which is relatively more expensive to read from than S3. Queries may take longer > and show inconsistent times from a user-experience perspective. If more queries > run using cached tables, copies of the cached table images are made, and the > copies do not stay in memory, causing reports to run longer. So how do we pin a > table so it will not swap to disk? Spark memory management uses dynamic > allocation; how can we pin those few tables in memory? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
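A possible workaround for the question above (a sketch, not from the ticket): the catalog API accepts an explicit storage level, unlike the bare {{CACHE TABLE}} statement in Spark 2.3. Note that {{MEMORY_ONLY}} does not truly pin blocks; blocks that do not fit are evicted and recomputed rather than spilled to disk. The table name is illustrative:
{code:java}
import org.apache.spark.storage.StorageLevel

// Cache with an explicit storage level; Catalog.cacheTable with a
// StorageLevel argument is available since Spark 2.3.
spark.catalog.cacheTable("highly_used_table", StorageLevel.MEMORY_ONLY)
{code}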
[jira] [Created] (SPARK-29358) Make unionByName optionally fill missing columns with nulls
Mukul Murthy created SPARK-29358: Summary: Make unionByName optionally fill missing columns with nulls Key: SPARK-29358 URL: https://issues.apache.org/jira/browse/SPARK-29358 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.4 Reporter: Mukul Murthy Currently, unionByName requires two DataFrames to have the same set of columns (even though the order can be different). It would be good to add either an option to unionByName or a new type of union which fills in missing columns with nulls.
{code:java}
val df1 = Seq(1, 2, 3).toDF("x")
val df2 = Seq("a", "b", "c").toDF("y")
df1.unionByName(df2)
{code}
This currently throws
{code:java}
org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among (y);
{code}
Ideally, there would be a way to make this return a DataFrame containing:
{code:java}
+----+----+
|   x|   y|
+----+----+
|   1|null|
|   2|null|
|   3|null|
|null|   a|
|null|   b|
|null|   c|
+----+----+
{code}
Currently, the workaround is to add the missing columns with null literals before calling unionByName, but this is clunky:
{code:java}
df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null)))
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
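A hedged sketch of a helper that generalizes the workaround above; the helper name and shape are illustrative, not a proposed API:
{code:java}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

// Align both DataFrames on the union of their column names, filling the
// columns missing on either side with null literals, then union by name.
def unionByNameWithNulls(left: DataFrame, right: DataFrame): DataFrame = {
  val allColumns = (left.columns ++ right.columns).distinct
  def align(df: DataFrame): DataFrame =
    df.select(allColumns.map { c =>
      if (df.columns.contains(c)) col(c) else lit(null).as(c)
    }: _*)
  align(left).unionByName(align(right))
}

// Produces the DataFrame shown in the ticket's expected output.
unionByNameWithNulls(df1, df2).show()
{code}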
[jira] [Updated] (SPARK-29357) Fix the flaky test in DataFrameSuite
[ https://issues.apache.org/jira/browse/SPARK-29357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29357: -- Priority: Minor (was: Major) > Fix the flaky test in DataFrameSuite > > > Key: SPARK-29357 > URL: https://issues.apache.org/jira/browse/SPARK-29357 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0 >Reporter: Yuanjian Li >Priority: Minor > > Fix the test `SPARK-25159: json schema inference should only trigger one job` > by using an AtomicLong instead of a var that is not always updated. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29357) Fix the flaky test in DataFrameSuite
[ https://issues.apache.org/jira/browse/SPARK-29357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29357: -- Issue Type: Bug (was: Improvement) > Fix the flaky test in DataFrameSuite > > > Key: SPARK-29357 > URL: https://issues.apache.org/jira/browse/SPARK-29357 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0 >Reporter: Yuanjian Li >Priority: Major > > Fix the test `SPARK-25159: json schema inference should only trigger one job` > by using an AtomicLong instead of a var that is not always updated. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29357) Fix the flaky test in DataFrameSuite
[ https://issues.apache.org/jira/browse/SPARK-29357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29357. --- Fix Version/s: 3.0.0 Assignee: Yuanjian Li Resolution: Fixed This is resolved via https://github.com/apache/spark/pull/26020 > Fix the flaky test in DataFrameSuite > > > Key: SPARK-29357 > URL: https://issues.apache.org/jira/browse/SPARK-29357 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0 >Reporter: Yuanjian Li >Assignee: Yuanjian Li >Priority: Minor > Fix For: 3.0.0 > > > Fix the test `SPARK-25159: json schema inference should only trigger one job` > by using an AtomicLong instead of a var that is not always updated. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
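For context, a sketch of the pattern the fix describes, assuming a listener-based job counter like the one in the test and an active SparkSession {{spark}} (names are illustrative):
{code:java}
import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

// Job-start callbacks fire on the listener-bus thread, so updates to a
// plain `var` counter may be missed; an AtomicLong makes the count safe.
val numJobs = new AtomicLong(0)
spark.sparkContext.addSparkListener(new SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    numJobs.incrementAndGet()
  }
})
{code}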
[jira] [Updated] (SPARK-29286) UnicodeDecodeError raised when running python tests on arm instance
[ https://issues.apache.org/jira/browse/SPARK-29286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29286: -- Affects Version/s: 2.4.4 > UnicodeDecodeError raised when running python tests on arm instance > --- > > Key: SPARK-29286 > URL: https://issues.apache.org/jira/browse/SPARK-29286 > Project: Spark > Issue Type: Test > Components: PySpark >Affects Versions: 2.4.4, 3.0.0 >Reporter: huangtianhua >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > Run the command 'python/run-tests --python-executables=python2.7,python3.6' on an > arm instance, and a UnicodeDecodeError is raised: > > Starting test(python2.7): pyspark.broadcast > Got an exception while trying to store skipped test output: > Traceback (most recent call last): > File "./python/run-tests.py", line 137, in run_individual_python_test > decoded_lines = map(lambda line: line.decode(), iter(per_test_output)) > File "./python/run-tests.py", line 137, in <lambda> > decoded_lines = map(lambda line: line.decode(), iter(per_test_output)) > UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 51: > ordinal not in range(128) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29286) UnicodeDecodeError raised when running python tests on arm instance
[ https://issues.apache.org/jira/browse/SPARK-29286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-29286: -- Fix Version/s: 2.4.5 > UnicodeDecodeError raised when running python tests on arm instance > --- > > Key: SPARK-29286 > URL: https://issues.apache.org/jira/browse/SPARK-29286 > Project: Spark > Issue Type: Test > Components: PySpark >Affects Versions: 3.0.0 >Reporter: huangtianhua >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 2.4.5, 3.0.0 > > > Run the command 'python/run-tests --python-executables=python2.7,python3.6' on an > arm instance, and a UnicodeDecodeError is raised: > > Starting test(python2.7): pyspark.broadcast > Got an exception while trying to store skipped test output: > Traceback (most recent call last): > File "./python/run-tests.py", line 137, in run_individual_python_test > decoded_lines = map(lambda line: line.decode(), iter(per_test_output)) > File "./python/run-tests.py", line 137, in <lambda> > decoded_lines = map(lambda line: line.decode(), iter(per_test_output)) > UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 51: > ordinal not in range(128) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29286) UnicodeDecodeError raised when running python tests on arm instance
[ https://issues.apache.org/jira/browse/SPARK-29286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29286. --- Fix Version/s: 3.0.0 Assignee: Hyukjin Kwon Resolution: Fixed This is resolved via https://github.com/apache/spark/pull/26021 > UnicodeDecodeError raised when running python tests on arm instance > --- > > Key: SPARK-29286 > URL: https://issues.apache.org/jira/browse/SPARK-29286 > Project: Spark > Issue Type: Test > Components: PySpark >Affects Versions: 3.0.0 >Reporter: huangtianhua >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.0 > > > Run the command 'python/run-tests --python-executables=python2.7,python3.6' on an > arm instance, and a UnicodeDecodeError is raised: > > Starting test(python2.7): pyspark.broadcast > Got an exception while trying to store skipped test output: > Traceback (most recent call last): > File "./python/run-tests.py", line 137, in run_individual_python_test > decoded_lines = map(lambda line: line.decode(), iter(per_test_output)) > File "./python/run-tests.py", line 137, in <lambda> > decoded_lines = map(lambda line: line.decode(), iter(per_test_output)) > UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 51: > ordinal not in range(128) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29355) Support timestamps subtraction
[ https://issues.apache.org/jira/browse/SPARK-29355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-29355. --- Fix Version/s: 3.0.0 Assignee: Maxim Gekk Resolution: Fixed This is resolved via https://github.com/apache/spark/pull/26022 > Support timestamps subtraction > -- > > Key: SPARK-29355 > URL: https://issues.apache.org/jira/browse/SPARK-29355 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.0.0 > > > ||Operator||Example||Result|| > |{{-}}|{{timestamp '2001-09-29 03:00' - timestamp '2001-09-27 > 12:00'}}|{{interval '1 day 15:00:00'}}| > https://www.postgresql.org/docs/11/functions-datetime.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
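A quick illustration of the resolved behavior (a sketch; the exact interval rendering may differ by Spark version, and {{spark}} is an active SparkSession):
{code:java}
// Subtracting two timestamps yields an interval, mirroring the
// PostgreSQL behavior referenced in the ticket.
spark.sql(
  "SELECT timestamp '2001-09-29 03:00:00' - timestamp '2001-09-27 12:00:00' AS diff"
).show(truncate = false)
// Expected, per the table above: an interval of 1 day 15 hours.
{code}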
[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark
[ https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944564#comment-16944564 ] Furcy Pin commented on SPARK-13587: --- Hello, I don't know where to ask this, but we have been using this feature on HDInsight 2.6.5 and we sometimes have a concurrency issue with pip. Basically, it looks like on rare occasions several executors set up the virtualenv simultaneously, which ends up in a kind of deadlock. When running the pip install command used by the executor manually, it suddenly hangs, and when cancelled it throws this error:
{code:java}
File "/mnt/resource/hadoop/yarn/local/usercache/livy/appcache/application_XXX/container_XXX/virtualenv_application_XXX/lib/python3.5/site-packages/pip/_vendor/lockfile/linklockfile.py", line 31, in acquire
    os.link(self.unique_name, self.lock_file)
FileExistsError: [Errno 17] File exists: '/home/yarn/-' -> '/home/yarn/selfcheck.json.lock'
{code}
This happens with "spark.pyspark.virtualenv.type=native". We haven't tried with conda yet. It is pretty bad because when it happens:
- some executors of the spark job just get stuck, and the spark job gets stuck
- even if the job is restarted, the lock file stays there and renders the whole YARN host useless.
Any suggestion or workaround would be appreciated. One idea would be to remove the "--cache-dir /home/yarn" option which is currently used in the pip install command, but it doesn't seem to be configurable right now. > Support virtualenv in PySpark > - > > Key: SPARK-13587 > URL: https://issues.apache.org/jira/browse/SPARK-13587 > Project: Spark > Issue Type: New Feature > Components: PySpark >Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.1, 2.3.0 >Reporter: Jeff Zhang >Priority: Major > > Currently, it's not easy for users to add third-party python packages in > pyspark. > * One way is using --py-files (suitable for simple dependencies, but not > suitable for complicated dependencies, especially those with transitive dependencies) > * Another way is to install packages manually on each node (time-wasting, and > not easy to switch to a different environment) > Python now has 2 different virtualenv implementations: one is native > virtualenv, the other is through conda. This jira is trying to bring these 2 > tools to the distributed environment. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29276) Spark job fails because of timeout to Driver
[ https://issues.apache.org/jira/browse/SPARK-29276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944526#comment-16944526 ] Jochen Hebbrecht commented on SPARK-29276: -- Thanks, I've just sent out a mail on the mailing list :-) > Spark job fails because of timeout to Driver > > > Key: SPARK-29276 > URL: https://issues.apache.org/jira/browse/SPARK-29276 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.4.2 >Reporter: Jochen Hebbrecht >Priority: Major > > Hi, > I'm using Spark 2.4.2 on AWS EMR 5.24.0. I'm trying to submit a Spark job > to the cluster. The job gets accepted, but the YARN application fails > with:
> {code}
> 19/09/27 14:33:35 ERROR ApplicationMaster: Uncaught exception: java.util.concurrent.TimeoutException: Futures timed out after [10 milliseconds]
> at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
> at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
> at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
> at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
> at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
> at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
> at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
> at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
> at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
> at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
> at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
> at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
> at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
> 19/09/27 14:33:35 INFO ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: Uncaught exception: java.util.concurrent.TimeoutException: Futures timed out after [10 milliseconds]
> at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
> at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
> at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
> at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
> at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
> at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
> at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
> at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
> at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
> at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
> at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
> at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
> at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
> {code}
> It actually goes wrong at this line: > https://github.com/apache/spark/blob/v2.4.2/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L468 > Now, I'm 100% sure Spark is OK and there's no bug, but there must be > something wrong with my setup. I don't understand the code of the > ApplicationMaster, so could somebody explain to me what it is trying to reach? > Where exactly does the connection time out? So at least I can debug it further, > because I don't have a clue what it is doing :-) > Thanks for any help! > Jochen -- This message was sent by Atlassian Jira (v8.3.4#803005) --
[jira] [Commented] (SPARK-25440) Dump query execution info to a file
[ https://issues.apache.org/jira/browse/SPARK-25440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944434#comment-16944434 ] Shasidhar E S commented on SPARK-25440: --- [~maxgekk] How do we use this feature? Is there an example of how to configure the file path to dump the query plan into a file? > Dump query execution info to a file > --- > > Key: SPARK-25440 > URL: https://issues.apache.org/jira/browse/SPARK-25440 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Maxim Gekk >Priority: Minor > Fix For: 3.0.0 > > > The output of explain() doesn't contain full information and in some cases > can be truncated. Besides that, it saves the info to a string in memory, which > can cause OOM. The ticket aims to solve the problem and dump info about query > execution to a file. Need to add a new method to queryExecution.debug which > accepts a path to a file. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
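In answer to the question above, a hedged sketch of the method this ticket describes on {{queryExecution.debug}} (assuming Spark 3.0, where the fix version points; the output path is illustrative):
{code:java}
val df = spark.range(10).selectExpr("id", "id * 2 AS doubled")
// Writes the full, untruncated query plans for this Dataset to the file.
df.queryExecution.debug.toFile("/tmp/query-plans.txt")
{code}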
[jira] [Created] (SPARK-29357) Fix the flaky test in DataFrameSuite
Yuanjian Li created SPARK-29357: --- Summary: Fix the flaky test in DataFrameSuite Key: SPARK-29357 URL: https://issues.apache.org/jira/browse/SPARK-29357 Project: Spark Issue Type: Improvement Components: Tests Affects Versions: 3.0.0 Reporter: Yuanjian Li Fix the test `SPARK-25159: json schema inference should only trigger one job` by using an AtomicLong instead of a var that is not always updated. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29356) Stopping Spark doesn't shut down all network connections
[ https://issues.apache.org/jira/browse/SPARK-29356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Malthe Borch updated SPARK-29356: - Description: The Spark session's gateway client still has an open network connection after a call to `spark.stop()`. This is unexpected and for example in a test suite, this triggers a resource warning when tearing down the test case. (was: The Spark session's gateway client still has an open network connection after a call to `spark.stop()`. This is unexpected and in for example a test suite, this triggers a resource warning when tearing down the test case.) > Stopping Spark doesn't shut down all network connections > > > Key: SPARK-29356 > URL: https://issues.apache.org/jira/browse/SPARK-29356 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.4.4 >Reporter: Malthe Borch >Priority: Minor > > The Spark session's gateway client still has an open network connection after > a call to `spark.stop()`. This is unexpected and for example in a test suite, > this triggers a resource warning when tearing down the test case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29356) Stopping Spark doesn't shut down all network connections
Malthe Borch created SPARK-29356: Summary: Stopping Spark doesn't shut down all network connections Key: SPARK-29356 URL: https://issues.apache.org/jira/browse/SPARK-29356 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 2.4.4 Reporter: Malthe Borch The Spark session's gateway client still has an open network connection after a call to `spark.stop()`. This is unexpected and in for example a test suite, this triggers a resource warning when tearing down the test case. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-28425) Add more Date/Time Operators
[ https://issues.apache.org/jira/browse/SPARK-28425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-28425: --- Description: ||Operator||Example||Result|| |{{+}}|{{date '2001-09-28' + interval '1 hour'}}|{{timestamp '2001-09-28 01:00:00'}}| |{{-}}|{{date '2001-09-28' - interval '1 hour'}}|{{timestamp '2001-09-27 23:00:00'}}| |{{*}}|{{900 * interval '1 second'}}|{{interval '00:15:00'}}| |{{*}}|{{21 * interval '1 day'}}|{{interval '21 days'}}| |{{*}}|{{double precision '3.5' * interval '1 hour'}}|{{interval '03:30:00'}}| |{{/}}|{{interval '1 hour' / double precision '1.5'}}|{{interval '00:40:00'}}| https://www.postgresql.org/docs/11/functions-datetime.html was: ||Operator||Example||Result|| |{{+}}|{{date '2001-09-28' + interval '1 hour'}}|{{timestamp '2001-09-28 01:00:00'}}| |{{-}}|{{date '2001-09-28' - interval '1 hour'}}|{{timestamp '2001-09-27 23:00:00'}}| |{{-}}|{{timestamp '2001-09-29 03:00' - timestamp '2001-09-27 12:00'}}|{{interval '1 day 15:00:00'}}| |{{*}}|{{900 * interval '1 second'}}|{{interval '00:15:00'}}| |{{*}}|{{21 * interval '1 day'}}|{{interval '21 days'}}| |{{*}}|{{double precision '3.5' * interval '1 hour'}}|{{interval '03:30:00'}}| |{{/}}|{{interval '1 hour' / double precision '1.5'}}|{{interval '00:40:00'}}| https://www.postgresql.org/docs/11/functions-datetime.html > Add more Date/Time Operators > > > Key: SPARK-28425 > URL: https://issues.apache.org/jira/browse/SPARK-28425 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > ||Operator||Example||Result|| > |{{+}}|{{date '2001-09-28' + interval '1 hour'}}|{{timestamp '2001-09-28 > 01:00:00'}}| > |{{-}}|{{date '2001-09-28' - interval '1 hour'}}|{{timestamp '2001-09-27 > 23:00:00'}}| > |{{*}}|{{900 * interval '1 second'}}|{{interval '00:15:00'}}| > |{{*}}|{{21 * interval '1 day'}}|{{interval '21 days'}}| > |{{*}}|{{double precision '3.5' * interval '1 hour'}}|{{interval '03:30:00'}}| > |{{/}}|{{interval '1 hour' / double precision '1.5'}}|{{interval '00:40:00'}}| > https://www.postgresql.org/docs/11/functions-datetime.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
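A sketch of how the operators listed above would read in SQL, with the expected results taken from the ticket's table; this assumes a build where the proposed operators are implemented, which is exactly what this ticket tracks:
{code:java}
spark.sql("SELECT date '2001-09-28' + interval '1 hour' AS r").show()  // timestamp '2001-09-28 01:00:00'
spark.sql("SELECT 21 * interval '1 day' AS r").show()                  // interval '21 days'
{code}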
[jira] [Updated] (SPARK-29355) Support timestamps subtraction
[ https://issues.apache.org/jira/browse/SPARK-29355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-29355: --- Description: ||Operator||Example||Result|| |{{-}}|{{timestamp '2001-09-29 03:00' - timestamp '2001-09-27 12:00'}}|{{interval '1 day 15:00:00'}}| https://www.postgresql.org/docs/11/functions-datetime.html was: ||Operator||Example||Result|| |{{+}}|{{date '2001-09-28' + interval '1 hour'}}|{{timestamp '2001-09-28 01:00:00'}}| |{{-}}|{{date '2001-09-28' - interval '1 hour'}}|{{timestamp '2001-09-27 23:00:00'}}| |{{-}}|{{timestamp '2001-09-29 03:00' - timestamp '2001-09-27 12:00'}}|{{interval '1 day 15:00:00'}}| |{{*}}|{{900 * interval '1 second'}}|{{interval '00:15:00'}}| |{{*}}|{{21 * interval '1 day'}}|{{interval '21 days'}}| |{{*}}|{{double precision '3.5' * interval '1 hour'}}|{{interval '03:30:00'}}| |{{/}}|{{interval '1 hour' / double precision '1.5'}}|{{interval '00:40:00'}}| https://www.postgresql.org/docs/11/functions-datetime.html > Support timestamps subtraction > -- > > Key: SPARK-29355 > URL: https://issues.apache.org/jira/browse/SPARK-29355 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Maxim Gekk >Priority: Major > > ||Operator||Example||Result|| > |{{-}}|{{timestamp '2001-09-29 03:00' - timestamp '2001-09-27 > 12:00'}}|{{interval '1 day 15:00:00'}}| > https://www.postgresql.org/docs/11/functions-datetime.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-29355) Support timestamps subtraction
Maxim Gekk created SPARK-29355: -- Summary: Support timestamps subtraction Key: SPARK-29355 URL: https://issues.apache.org/jira/browse/SPARK-29355 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.0.0 Reporter: Maxim Gekk ||Operator||Example||Result|| |{{+}}|{{date '2001-09-28' + interval '1 hour'}}|{{timestamp '2001-09-28 01:00:00'}}| |{{-}}|{{date '2001-09-28' - interval '1 hour'}}|{{timestamp '2001-09-27 23:00:00'}}| |{{-}}|{{timestamp '2001-09-29 03:00' - timestamp '2001-09-27 12:00'}}|{{interval '1 day 15:00:00'}}| |{{*}}|{{900 * interval '1 second'}}|{{interval '00:15:00'}}| |{{*}}|{{21 * interval '1 day'}}|{{interval '21 days'}}| |{{*}}|{{double precision '3.5' * interval '1 hour'}}|{{interval '03:30:00'}}| |{{/}}|{{interval '1 hour' / double precision '1.5'}}|{{interval '00:40:00'}}| https://www.postgresql.org/docs/11/functions-datetime.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29276) Spark job fails because of timeout to Driver
[ https://issues.apache.org/jira/browse/SPARK-29276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29276. -- Resolution: Invalid Let's ask questions on the mailing list or Stack Overflow. You would be able to get a better answer there. > Spark job fails because of timeout to Driver > > > Key: SPARK-29276 > URL: https://issues.apache.org/jira/browse/SPARK-29276 > Project: Spark > Issue Type: Question > Components: Spark Core >Affects Versions: 2.4.2 >Reporter: Jochen Hebbrecht >Priority: Major > > Hi, > I'm using Spark 2.4.2 on AWS EMR 5.24.0. I'm trying to submit a Spark job > to the cluster. The job gets accepted, but the YARN application fails > with:
> {code}
> 19/09/27 14:33:35 ERROR ApplicationMaster: Uncaught exception: java.util.concurrent.TimeoutException: Futures timed out after [10 milliseconds]
> at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
> at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
> at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
> at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
> at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
> at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
> at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
> at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
> at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
> at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
> at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
> at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
> at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
> 19/09/27 14:33:35 INFO ApplicationMaster: Final app status: FAILED, exitCode: 13, (reason: Uncaught exception: java.util.concurrent.TimeoutException: Futures timed out after [10 milliseconds]
> at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
> at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
> at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
> at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
> at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
> at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
> at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
> at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
> at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
> at org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
> at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
> at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
> at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
> {code}
> It actually goes wrong at this line: > https://github.com/apache/spark/blob/v2.4.2/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L468 > Now, I'm 100% sure Spark is OK and there's no bug, but there must be > something wrong with my setup. I don't understand the code of the > ApplicationMaster, so could somebody explain to me what it is trying to reach? > Where exactly does the connection time out? So at least I can debug it further, > because I don't have a clue what it is doing :-) > Thanks for any help! > Jochen -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (SPARK-29302) dynamic partition overwrite with speculation enabled
[ https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29302. -- Resolution: Invalid > dynamic partition overwrite with speculation enabled > > > Key: SPARK-29302 > URL: https://issues.apache.org/jira/browse/SPARK-29302 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: feiwang >Priority: Major > > Now, for a dynamic partition overwrite operation, the filename of a task's > output is deterministic. > So, if speculation is enabled, would a task conflict with its corresponding > speculative task? > Would the two tasks concurrently write the same file? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-29304) Input Bytes Metric for Datasource v2 is absent
[ https://issues.apache.org/jira/browse/SPARK-29304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-29304: - Target Version/s: (was: 3.0.0) > Input Bytes Metric for Datasource v2 is absent > -- > > Key: SPARK-29304 > URL: https://issues.apache.org/jira/browse/SPARK-29304 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 3.0.0 >Reporter: Kuhu Shukla >Priority: Major > Attachments: jira-spark.png > > > Input metrics while reading a simple CSV file with > {code} > spark.read.csv() > {code} > are absent. > Switching to the v1 data source works. Adding the inputMetrics calculation from > FileScanRDD to DataSourceRDD helps get the values. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
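A hedged sketch of the fallback mentioned above, i.e. forcing the v1 CSV source; the conf name assumes Spark 3.0's {{spark.sql.sources.useV1SourceList}} setting, and the file path is illustrative:
{code:java}
// Force the v1 (FileScanRDD-based) CSV source, which does report input metrics.
spark.conf.set("spark.sql.sources.useV1SourceList", "csv")
val df = spark.read.option("header", "true").csv("/path/data.csv")
{code}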
[jira] [Commented] (SPARK-29309) StreamingContext.binaryRecordsStream() is useless
[ https://issues.apache.org/jira/browse/SPARK-29309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944318#comment-16944318 ] Hyukjin Kwon commented on SPARK-29309: -- So what do you propose? > StreamingContext.binaryRecordsStream() is useless > - > > Key: SPARK-29309 > URL: https://issues.apache.org/jira/browse/SPARK-29309 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 2.4.4 >Reporter: Alberto Andreotti >Priority: Major > > Supporting only fixed-length binary records makes this function really > difficult to use. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29316) CLONE - schemaInference option not to convert strings with leading zeros to int/long
[ https://issues.apache.org/jira/browse/SPARK-29316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29316. -- Resolution: Won't Fix > CLONE - schemaInference option not to convert strings with leading zeros to > int/long > - > > Key: SPARK-29316 > URL: https://issues.apache.org/jira/browse/SPARK-29316 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.1.0, 2.1.1, 2.2.0, 2.3.0 >Reporter: Ambar Raghuvanshi >Priority: Critical > Labels: csv, csvparser, easy-fix, inference, ramp-up, schema > > It would be great to have an option in Spark's schema inference *not* to > convert a column that has leading zeros to an int/long datatype. Think zip > codes, for example.
> {code:java}
> df = (sqlc.read.format('csv')
>       .option('inferSchema', True)
>       .option('header', True)
>       .option('delimiter', '|')
>       .option('leadingZeros', 'KEEP')  # this is the new proposed option
>       .option('mode', 'FAILFAST')
>       .load('csvfile_withzipcodes_to_ingest.csv')
>      )
> {code}
> The general usage of data with leading zeros is for identifiers. Converting > them to int/long defeats the purpose of inferSchema. The conversion should be > controlled by a flag indicating whether the data should be converted to > int/long or not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
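A workaround available today without the proposed option, sketched here for completeness (column and file names are illustrative, translated from the ticket's PySpark snippet): supply an explicit schema so zip-code-like columns stay strings instead of going through inference:
{code:java}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// With an explicit schema, inference never runs on the column,
// so leading zeros survive as-is.
val schema = StructType(Seq(StructField("zip_code", StringType)))
val df = spark.read.format("csv")
  .option("header", "true")
  .option("delimiter", "|")
  .schema(schema)
  .load("csvfile_withzipcodes_to_ingest.csv")
{code}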
[jira] [Resolved] (SPARK-29039) centralize the catalog and table lookup logic
[ https://issues.apache.org/jira/browse/SPARK-29039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29039. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25747 [https://github.com/apache/spark/pull/25747] > centralize the catalog and table lookup logic > - > > Key: SPARK-29039 > URL: https://issues.apache.org/jira/browse/SPARK-29039 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29337) How to Cache Table and Pin it in Memory and should not Spill to Disk on Thrift Server
[ https://issues.apache.org/jira/browse/SPARK-29337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29337. -- Resolution: Invalid Questions should go to Stack Overflow or the mailing list. > How to Cache Table and Pin it in Memory and should not Spill to Disk on > Thrift Server > -- > > Key: SPARK-29337 > URL: https://issues.apache.org/jira/browse/SPARK-29337 > Project: Spark > Issue Type: Question > Components: SQL >Affects Versions: 2.3.0 >Reporter: Srini E >Priority: Major > Attachments: Cache+Image.png > > > Hi Team, > How can we pin a table in cache so it does not swap out of memory? > Situation: We are using MicroStrategy BI reporting. A semantic layer is built. > We wanted to cache highly used tables using Spark SQL {{CACHE TABLE <table name>}}; we did cache them for the Spark context (Thrift server). Please see the > attached snapshot: the cached table went to disk over time. Initially it was all > in cache; now some is in cache and some on disk. That disk may be a local disk, > which is relatively more expensive to read from than S3. Queries may take longer > and show inconsistent times from a user-experience perspective. If more queries > run using cached tables, copies of the cached table images are made, and the > copies do not stay in memory, causing reports to run longer. So how do we pin a > table so it will not swap to disk? Spark memory management uses dynamic > allocation; how can we pin those few tables in memory? -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29340) Spark Sql executions do not use thread local jobgroup
[ https://issues.apache.org/jira/browse/SPARK-29340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944314#comment-16944314 ] Hyukjin Kwon commented on SPARK-29340: -- [~navdeepniku], is it possible to show the full reproducer? > Spark Sql executions do not use thread local jobgroup > - > > Key: SPARK-29340 > URL: https://issues.apache.org/jira/browse/SPARK-29340 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.4 >Reporter: Navdeep Poonia >Priority: Major > > val sparkThreadLocal: SparkSession = DataCurator.spark.newSession() > sparkThreadLocal.sparkContext.setJobGroup("", "") > OR > sparkThreadLocal.sparkContext.setLocalProperty("spark.job.description", > "") > sparkThreadLocal.sparkContext.setLocalProperty("spark.jobGroup.id", > "") > > The job group property works fine for Spark jobs/stages created by Spark > DataFrame operations, but in the case of Spark SQL the job group is randomly > assigned to stages, or is sometimes null. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
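For reference, a sketch of the intended per-thread usage that the report says misbehaves for SQL (group id and description are illustrative):
{code:java}
// Local properties such as the job group are thread-local on the shared
// SparkContext, so each worker thread is expected to set its own group.
val session = spark.newSession()
session.sparkContext.setJobGroup("etl-group-1", "nightly ETL", interruptOnCancel = true)
session.range(1000000L).count() // jobs triggered from this thread should carry the group
{code}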
[jira] [Resolved] (SPARK-29344) Spark application hang
[ https://issues.apache.org/jira/browse/SPARK-29344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-29344. -- Resolution: Cannot Reproduce I haven't seen such things. Please specify the steps to reproduce and the environment. > Spark application hang > -- > > Key: SPARK-29344 > URL: https://issues.apache.org/jira/browse/SPARK-29344 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.1 >Reporter: Kitti >Priority: Major > Attachments: stderr > > > We found an issue where the Spark application hangs and stops working sometimes, > without any log in the Spark driver, until we kill the application. > > 19/10/03 06:07:03 INFO spark.ContextCleaner: Cleaned accumulator 117 19/10/03 > 06:07:03 INFO spark.ContextCleaner: Cleaned accumulator 80 19/10/03 06:07:03 > INFO spark.ContextCleaner: Cleaned accumulator 105 19/10/03 06:07:03 INFO > spark.ContextCleaner: Cleaned accumulator 88 19/10/03 10:36:59 ERROR > yarn.ApplicationMaster: RECEIVED SIGNAL TERM -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-29326) ANSI store assignment policy: throw exception on insertion failure
[ https://issues.apache.org/jira/browse/SPARK-29326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-29326: --- Assignee: Gengliang Wang > ANSI store assignment policy: throw exception on insertion failure > -- > > Key: SPARK-29326 > URL: https://issues.apache.org/jira/browse/SPARK-29326 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Major > Fix For: 3.0.0 > > > As per the ANSI SQL standard, the ANSI store assignment policy should throw an > exception on insertion failure, such as inserting an out-of-range value into a > numeric field. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-29326) ANSI store assignment policy: throw exception on insertion failure
[ https://issues.apache.org/jira/browse/SPARK-29326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-29326. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 25997 [https://github.com/apache/spark/pull/25997] > ANSI store assignment policy: throw exception on insertion failure > -- > > Key: SPARK-29326 > URL: https://issues.apache.org/jira/browse/SPARK-29326 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Gengliang Wang >Priority: Major > Fix For: 3.0.0 > > > As per the ANSI SQL standard, the ANSI store assignment policy should throw an > exception on insertion failure, such as inserting an out-of-range value into a > numeric field. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
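A small sketch of the behavior this change introduces (the table name is illustrative; assumes Spark 3.0 with the {{spark.sql.storeAssignmentPolicy}} conf set to ANSI):
{code:java}
spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")
spark.sql("CREATE TABLE tiny (v TINYINT) USING parquet")
// Expected to throw at runtime under the ANSI policy:
// 12345 is out of range for TINYINT.
spark.sql("INSERT INTO tiny VALUES (12345)")
{code}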