[jira] [Updated] (SPARK-29361) Enable streaming source support on DSv1

2019-10-04 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-29361:
-
Component/s: (was: Spark Core)
 SQL

> Enable streaming source support on DSv1 
> 
>
> Key: SPARK-29361
> URL: https://issues.apache.org/jira/browse/SPARK-29361
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> DSv2 has diverged heavily between Spark 2.x and 3.x, and for some time the 
> Spark community has recommended not investing in the old DSv2 and waiting 
> for the new DSv2 instead. 
> The only option consistent between Spark 2.x and 3.x is DSv1, but DSv1 lacks 
> streaming source support. This issue tracks the effort to add support for 
> streaming sources on DSv1.






[jira] [Updated] (SPARK-29361) Enable DataFrame with streaming source support on DSv1

2019-10-04 Thread Jungtaek Lim (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jungtaek Lim updated SPARK-29361:
-
Summary: Enable DataFrame with streaming source support on DSv1  (was: 
Enable streaming source support on DSv1)

> Enable DataFrame with streaming source support on DSv1  
> 
>
> Key: SPARK-29361
> URL: https://issues.apache.org/jira/browse/SPARK-29361
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> DSv2 has diverged heavily between Spark 2.x and 3.x, and for some time the 
> Spark community has recommended not investing in the old DSv2 and waiting 
> for the new DSv2 instead. 
> The only option consistent between Spark 2.x and 3.x is DSv1, but DSv1 lacks 
> streaming source support. This issue tracks the effort to add support for 
> streaming sources on DSv1.






[jira] [Commented] (SPARK-29361) Enable streaming source support on DSv1

2019-10-04 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944987#comment-16944987
 ] 

Jungtaek Lim commented on SPARK-29361:
--

The plan for now is to overload the methods below (marked as "DeveloperApi") with a 
boolean parameter "isStreaming", similar to SQLContext.internalCreateDataFrame, which 
is not a public API.

> SQLContext

{code}
def createDataFrame(rowRDD: RDD[Row], schema: StructType, isStreaming: Boolean): DataFrame
def createDataFrame(rowRDD: JavaRDD[Row], schema: StructType, isStreaming: Boolean): DataFrame
{code}

> SparkSession

{code}
def createDataFrame(rowRDD: RDD[Row], schema: StructType, isStreaming: Boolean): DataFrame
def createDataFrame(rowRDD: JavaRDD[Row], schema: StructType, isStreaming: Boolean): DataFrame
{code}

Since these ultimately call SparkSession.internalCreateDataFrame, which already has an 
isStreaming parameter, the change is just a matter of passing the additional parameter 
through. Given that we don't allow default parameter values in developer APIs (to keep 
Java interop), 4 new methods should be introduced instead of changing the existing 4 
methods.
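
To make the shape of the change concrete, here is a minimal sketch (not the actual 
patch) of one of the SparkSession overloads. It assumes internalCreateDataFrame already 
takes the isStreaming flag as described above, and the toCatalystRows helper is purely 
hypothetical, standing in for the Row-to-InternalRow conversion the existing overloads 
already perform.

{code:scala}
// Sketch only: internalCreateDataFrame with an isStreaming flag is assumed,
// and toCatalystRows is a hypothetical stand-in for the existing Row ->
// InternalRow conversion.
@DeveloperApi
def createDataFrame(rowRDD: RDD[Row], schema: StructType, isStreaming: Boolean): DataFrame = {
  val catalystRows: RDD[InternalRow] = toCatalystRows(rowRDD, schema) // hypothetical helper
  internalCreateDataFrame(catalystRows, schema, isStreaming)
}
{code}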

> Enable streaming source support on DSv1 
> 
>
> Key: SPARK-29361
> URL: https://issues.apache.org/jira/browse/SPARK-29361
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> DSv2 has diverged heavily between Spark 2.x and 3.x, and for some time the 
> Spark community has recommended not investing in the old DSv2 and waiting 
> for the new DSv2 instead. 
> The only option consistent between Spark 2.x and 3.x is DSv1, but DSv1 lacks 
> streaming source support. This issue tracks the effort to add support for 
> streaming sources on DSv1.






[jira] [Created] (SPARK-29361) Enable streaming source support on DSv1

2019-10-04 Thread Jungtaek Lim (Jira)
Jungtaek Lim created SPARK-29361:


 Summary: Enable streaming source support on DSv1 
 Key: SPARK-29361
 URL: https://issues.apache.org/jira/browse/SPARK-29361
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Jungtaek Lim


DSv2 has diverged heavily between Spark 2.x and 3.x, and for some time the Spark 
community has recommended not investing in the old DSv2 and waiting for the new 
DSv2 instead. 

The only option consistent between Spark 2.x and 3.x is DSv1, but DSv1 lacks 
streaming source support. This issue tracks the effort to add support for streaming 
sources on DSv1.






[jira] [Commented] (SPARK-29267) rdd.countApprox should stop when 'timeout'

2019-10-04 Thread Kangtian (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944984#comment-16944984
 ] 

Kangtian commented on SPARK-29267:
--

[~hyukjin.kwon]

Just finish when the timeout is reached; we don't need the final value.

In my case, I used only *some of the partitions* to get an approximate count, without 
the timeout param (more than 100 partitions in my case).

 

!image-2019-10-05-12-38-26-867.png!

!image-2019-10-05-12-38-52-039.png!
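
For reference, a small self-contained usage sketch of the API being discussed (the data 
size and app name are illustrative): countApprox returns a PartialResult whose initial 
value is available once the timeout elapses, but, as reported here, the underlying job 
currently keeps running to completion.

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("countApproxDemo").getOrCreate()
val rdd = spark.sparkContext.parallelize(1 to 10000000, 200)

// Ask for an approximate count, waiting at most 2000 ms at 95% confidence.
val partial = rdd.countApprox(timeout = 2000L, confidence = 0.95)
println(s"approximate count after timeout: ${partial.initialValue}")
{code}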

> rdd.countApprox should stop when 'timeout'
> --
>
> Key: SPARK-29267
> URL: https://issues.apache.org/jira/browse/SPARK-29267
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Kangtian
>Priority: Minor
> Attachments: image-2019-10-05-12-37-22-927.png, 
> image-2019-10-05-12-38-26-867.png, image-2019-10-05-12-38-52-039.png
>
>
> {{The way to do approximate counting: org.apache.spark.rdd.RDD#countApprox}}
> +countApprox(timeout: Long, confidence: Double = 0.95)+
>  
> But: 
> when the timeout is reached, the job keeps running until it actually finishes.
>  
> We want:
> *When the timeout is reached, the job should finish{color:#FF} immediately{color}*, 
> without computing the final value
>  






[jira] [Updated] (SPARK-29267) rdd.countApprox should stop when 'timeout'

2019-10-04 Thread Kangtian (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kangtian updated SPARK-29267:
-
Attachment: image-2019-10-05-12-38-52-039.png

> rdd.countApprox should stop when 'timeout'
> --
>
> Key: SPARK-29267
> URL: https://issues.apache.org/jira/browse/SPARK-29267
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Kangtian
>Priority: Minor
> Attachments: image-2019-10-05-12-37-22-927.png, 
> image-2019-10-05-12-38-26-867.png, image-2019-10-05-12-38-52-039.png
>
>
> {{The way to do approximate counting: org.apache.spark.rdd.RDD#countApprox}}
> +countApprox(timeout: Long, confidence: Double = 0.95)+
>  
> But: 
> when the timeout is reached, the job keeps running until it actually finishes.
>  
> We want:
> *When the timeout is reached, the job should finish{color:#FF} immediately{color}*, 
> without computing the final value
>  






[jira] [Updated] (SPARK-29267) rdd.countApprox should stop when 'timeout'

2019-10-04 Thread Kangtian (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kangtian updated SPARK-29267:
-
Attachment: image-2019-10-05-12-37-22-927.png

> rdd.countApprox should stop when 'timeout'
> --
>
> Key: SPARK-29267
> URL: https://issues.apache.org/jira/browse/SPARK-29267
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Kangtian
>Priority: Minor
> Attachments: image-2019-10-05-12-37-22-927.png
>
>
> {{The way to do approximate counting: org.apache.spark.rdd.RDD#countApprox}}
> +countApprox(timeout: Long, confidence: Double = 0.95)+
>  
> But: 
> when the timeout is reached, the job keeps running until it actually finishes.
>  
> We want:
> *When the timeout is reached, the job should finish{color:#FF} immediately{color}*, 
> without computing the final value
>  






[jira] [Updated] (SPARK-29267) rdd.countApprox should stop when 'timeout'

2019-10-04 Thread Kangtian (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29267?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kangtian updated SPARK-29267:
-
Attachment: image-2019-10-05-12-38-26-867.png

> rdd.countApprox should stop when 'timeout'
> --
>
> Key: SPARK-29267
> URL: https://issues.apache.org/jira/browse/SPARK-29267
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Kangtian
>Priority: Minor
> Attachments: image-2019-10-05-12-37-22-927.png, 
> image-2019-10-05-12-38-26-867.png, image-2019-10-05-12-38-52-039.png
>
>
> {{The way to do approximate counting: org.apache.spark.rdd.RDD#countApprox}}
> +countApprox(timeout: Long, confidence: Double = 0.95)+
>  
> But: 
> when the timeout is reached, the job keeps running until it actually finishes.
>  
> We want:
> *When the timeout is reached, the job should finish{color:#FF} immediately{color}*, 
> without computing the final value
>  






[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

2019-10-04 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944927#comment-16944927
 ] 

L. C. Hsieh commented on SPARK-29358:
-

And I'm also concerned that it moves even further away from SQL union semantics.

> Make unionByName optionally fill missing columns with nulls
> ---
>
> Key: SPARK-29358
> URL: https://issues.apache.org/jira/browse/SPARK-29358
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Mukul Murthy
>Priority: Major
>
> Currently, unionByName requires two DataFrames to have the same set of 
> columns (even though the order can be different). It would be good to add 
> either an option to unionByName or a new type of union which fills in missing 
> columns with nulls. 
> {code:java}
> val df1 = Seq(1, 2, 3).toDF("x")
> val df2 = Seq("a", "b", "c").toDF("y")
> df1.unionByName(df2){code}
> This currently throws 
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among 
> (y);
> {code}
> Ideally, there would be a way to make this return a DataFrame containing:
> {code:java}
> +----+----+
> |   x|   y|
> +----+----+
> |   1|null|
> |   2|null|
> |   3|null|
> |null|   a|
> |null|   b|
> |null|   c|
> +----+----+
> {code}
> Currently the workaround to make this possible is by using unionByName, but 
> this is clunky:
> {code:java}
> df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null)))
> {code}
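
For reference, here is a small sketch of the kind of helper that workaround implies 
(unionByNameFillNulls is a made-up name, not a proposed Spark API): pad each side with 
the columns it is missing as typed nulls, then unionByName.

{code:scala}
import org.apache.spark.sql.{DataFrame, functions => F}

def unionByNameFillNulls(left: DataFrame, right: DataFrame): DataFrame = {
  val leftCols = left.columns.toSet
  val rightCols = right.columns.toSet
  // Add each missing column as a null literal cast to the other side's type.
  val leftPadded = rightCols.diff(leftCols).foldLeft(left) { (df, c) =>
    df.withColumn(c, F.lit(null).cast(right.schema(c).dataType))
  }
  val rightPadded = leftCols.diff(rightCols).foldLeft(right) { (df, c) =>
    df.withColumn(c, F.lit(null).cast(left.schema(c).dataType))
  }
  leftPadded.unionByName(rightPadded)
}

// e.g. unionByNameFillNulls(df1, df2) would produce the x/y table shown above.
{code}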






[jira] [Commented] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

2019-10-04 Thread L. C. Hsieh (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944899#comment-16944899
 ] 

L. C. Hsieh commented on SPARK-29358:
-

My concern is that it breaks the current behavior of unionByName. 

> Make unionByName optionally fill missing columns with nulls
> ---
>
> Key: SPARK-29358
> URL: https://issues.apache.org/jira/browse/SPARK-29358
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Mukul Murthy
>Priority: Major
>
> Currently, unionByName requires two DataFrames to have the same set of 
> columns (even though the order can be different). It would be good to add 
> either an option to unionByName or a new type of union which fills in missing 
> columns with nulls. 
> {code:java}
> val df1 = Seq(1, 2, 3).toDF("x")
> val df2 = Seq("a", "b", "c").toDF("y")
> df1.unionByName(df2){code}
> This currently throws 
> {code:java}
> org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among 
> (y);
> {code}
> Ideally, there would be a way to make this return a DataFrame containing:
> {code:java}
> +----+----+
> |   x|   y|
> +----+----+
> |   1|null|
> |   2|null|
> |   3|null|
> |null|   a|
> |null|   b|
> |null|   c|
> +----+----+
> {code}
> Currently the workaround to make this possible is by using unionByName, but 
> this is clunky:
> {code:java}
> df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null)))
> {code}






[jira] [Comment Edited] (SPARK-27681) Use scala.collection.Seq explicitly instead of scala.Seq alias

2019-10-04 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944879#comment-16944879
 ] 

Sean R. Owen edited comment on SPARK-27681 at 10/4/19 10:33 PM:


I looked at this one more time today after clearing out some earlier 2.13 
related issues.

I'm pretty sure this is what we should do, all in all, which is much more in 
line with [~vanzin]'s take:

- Generally, don't change all the Spark {{Seq}} usages in methods and return 
values. Just too much change.
- Definitely fix all the compile errors within Spark that result in 2.13, by 
adding {{.toSeq}} or {{.toMap}} where applicable to get immutable versions from 
mutable Seqs, and Maps from MapViews (similar but different change in 2.13). 
This is what SPARK-29292 covers.
- ... and that may be it. Maybe we find a few corner cases where a public API 
method really does need to fix its Seq type to make sense, but I hadn't found 
it yet after fixing most of core

Yes, this means that user apps will experience many of the same compile errors 
when moving to 2.13. But that's true of any app at all moving from 2.12 to 
2.13, and they're relatively easy to fix into a form that works on 2.12 and 
2.13 by explicitly calling {{.toSeq}} etc. I don't think we need to fix that 
for users.


was (Author: srowen):
I looked at this one more time today after clearing out some earlier 2.13 
related issues.

I'm pretty sure this is what we should do, all in all, which is much more in 
line with [~vanzin]'s take:

- Generally, don't change all the Spark {{Seq}} usages in methods and return 
values. Just too much change.
- Definitely fix all the compile errors within Spark that result in 2.13, by 
adding {{.toSeq}} or {{.toMap}} where applicable to get immutable versions from 
mutable Seqs, and Maps from MapViews (similar but different change in 2.13). 
This is what SPARK-29292 covers.
- ... and that may be it. Maybe we find a few corner cases where a public API 
method really does need to fix its Seq type to make sense, but I hadn't found 
it yet after fixing most of core

> Use scala.collection.Seq explicitly instead of scala.Seq alias
> --
>
> Key: SPARK-27681
> URL: https://issues.apache.org/jira/browse/SPARK-27681
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib, Spark Core, SQL, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Major
>
> {{scala.Seq}} is widely used in the code, and is an alias for 
> {{scala.collection.Seq}} in Scala 2.12. It will become an alias for 
> {{scala.collection.immutable.Seq}} in Scala 2.13. In many cases, this will be 
> fine, as Spark users using Scala 2.13 will also have this changed alias. In 
> some cases it may be undesirable, as it will cause some code to compile in 
> 2.12 but not in 2.13. In some cases, making the type {{scala.collection.Seq}} 
> explicit so that it doesn't vary can help avoid this, so that Spark apps 
> might cross-compile for 2.12 and 2.13 with the same source.
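
As a small illustration of the alias change described above (a sketch, not Spark code; 
takesSeq and buf are made-up names):

{code:scala}
import scala.collection.mutable.ArrayBuffer

// Plain Seq here is scala.Seq: scala.collection.Seq on 2.12,
// but scala.collection.immutable.Seq on 2.13.
def takesSeq(xs: Seq[Int]): Int = xs.sum

val buf = ArrayBuffer(1, 2, 3)
// takesSeq(buf)     // compiles on 2.12, but not on 2.13 (mutable vs immutable Seq)
takesSeq(buf.toSeq)  // compiles on both 2.12 and 2.13
{code}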






[jira] [Commented] (SPARK-27681) Use scala.collection.Seq explicitly instead of scala.Seq alias

2019-10-04 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944879#comment-16944879
 ] 

Sean R. Owen commented on SPARK-27681:
--

I looked at this one more time today after clearing out some earlier 2.13 
related issues.

I'm pretty sure this is what we should do, all in all, which is much more in 
line with [~vanzin]'s take:

- Generally, don't change all the Spark {{Seq}} usages in methods and return 
values. Just too much change.
- Definitely fix all the compile errors within Spark that result in 2.13, by 
adding {{.toSeq}} or {{.toMap}} where applicable to get immutable versions from 
mutable Seqs, and Maps from MapViews (similar but different change in 2.13). 
This is what SPARK-29292 covers.
- ... and that may be it. Maybe we find a few corner cases where a public API 
method really does need to fix its Seq type to make sense, but I hadn't found 
it yet after fixing most of core

> Use scala.collection.Seq explicitly instead of scala.Seq alias
> --
>
> Key: SPARK-27681
> URL: https://issues.apache.org/jira/browse/SPARK-27681
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib, Spark Core, SQL, Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Sean R. Owen
>Assignee: Sean R. Owen
>Priority: Major
>
> {{scala.Seq}} is widely used in the code, and is an alias for 
> {{scala.collection.Seq}} in Scala 2.12. It will become an alias for 
> {{scala.collection.immutable.Seq}} in Scala 2.13. In many cases, this will be 
> fine, as Spark users using Scala 2.13 will also have this changed alias. In 
> some cases it may be undesirable, as it will cause some code to compile in 
> 2.12 but not in 2.13. In some cases, making the type {{scala.collection.Seq}} 
> explicit so that it doesn't vary can help avoid this, so that Spark apps 
> might cross-compile for 2.12 and 2.13 with the same source.






[jira] [Assigned] (SPARK-28813) Document SHOW CREATE TABLE in SQL Reference.

2019-10-04 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-28813:


Assignee: Huaxin Gao

> Document SHOW CREATE TABLE in SQL Reference.
> 
>
> Key: SPARK-28813
> URL: https://issues.apache.org/jira/browse/SPARK-28813
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.3
>Reporter: Dilip Biswal
>Assignee: Huaxin Gao
>Priority: Major
>







[jira] [Resolved] (SPARK-28813) Document SHOW CREATE TABLE in SQL Reference.

2019-10-04 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-28813.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25885
[https://github.com/apache/spark/pull/25885]

> Document SHOW CREATE TABLE in SQL Reference.
> 
>
> Key: SPARK-28813
> URL: https://issues.apache.org/jira/browse/SPARK-28813
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.3
>Reporter: Dilip Biswal
>Assignee: Huaxin Gao
>Priority: Major
> Fix For: 3.0.0
>
>







[jira] [Updated] (SPARK-28813) Document SHOW CREATE TABLE in SQL Reference.

2019-10-04 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-28813:
-
Priority: Minor  (was: Major)

> Document SHOW CREATE TABLE in SQL Reference.
> 
>
> Key: SPARK-28813
> URL: https://issues.apache.org/jira/browse/SPARK-28813
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SQL
>Affects Versions: 2.4.3
>Reporter: Dilip Biswal
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 3.0.0
>
>







[jira] [Updated] (SPARK-29108) Add window.sql - Part 2

2019-10-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29108:
--
Fix Version/s: (was: 3.0.0)

> Add window.sql - Part 2
> ---
>
> Key: SPARK-29108
> URL: https://issues.apache.org/jira/browse/SPARK-29108
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Dylan Guedes
>Priority: Major
>
> In this ticket, we plan to add the regression test cases of 
> [https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L320-L562|https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L320-L562]






[jira] [Updated] (SPARK-29109) Add window.sql - Part 3

2019-10-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29109:
--
Fix Version/s: (was: 3.0.0)

> Add window.sql - Part 3
> ---
>
> Key: SPARK-29109
> URL: https://issues.apache.org/jira/browse/SPARK-29109
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Dylan Guedes
>Priority: Major
>
> In this ticket, we plan to add the regression test cases of 
> [https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L553-L911|https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L553-L911]






[jira] [Updated] (SPARK-29110) Add window.sql - Part 4

2019-10-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29110:
--
Fix Version/s: (was: 3.0.0)

> Add window.sql - Part 4
> ---
>
> Key: SPARK-29110
> URL: https://issues.apache.org/jira/browse/SPARK-29110
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Dylan Guedes
>Priority: Major
>
> In this ticket, we plan to add the regression test cases of 
> [https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L912-L1259|https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L912-L1259]






[jira] [Updated] (SPARK-29107) Add window.sql - Part 1

2019-10-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29107:
--
Fix Version/s: (was: 3.0.0)

> Add window.sql - Part 1
> ---
>
> Key: SPARK-29107
> URL: https://issues.apache.org/jira/browse/SPARK-29107
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL, Tests
>Affects Versions: 3.0.0
>Reporter: Dylan Guedes
>Priority: Major
>
> In this ticket, we plan to add the regression test cases of 
> https://github.com/postgres/postgres/blob/REL_12_BETA3/src/test/regress/sql/window.sql#L1-L319






[jira] [Updated] (SPARK-29323) Add tooltip for The Executors Tab's column names in the Spark history server Page

2019-10-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29323:
--
Affects Version/s: (was: 2.4.4)
   3.0.0

> Add tooltip for The Executors Tab's column names in the Spark history server 
> Page
> -
>
> Key: SPARK-29323
> URL: https://issues.apache.org/jira/browse/SPARK-29323
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: liucht-inspur
>Priority: Major
> Attachments: image-2019-10-04-09-42-14-174.png
>
>
> On the Executors tab of the history server page, the Summary section shows the 
> column headers, but the formatting is inconsistent.
> Some column names have a tooltip, such as Storage Memory, Task Time (GC Time), 
> Input, Shuffle Read, Shuffle Write, and Blacklisted, but some columns still have 
> no tooltip: RDD Blocks, Disk Used, Cores, Active Tasks, Failed Tasks, Complete 
> Tasks, and Total Tasks. Oddly, in the Executors section below, all of these 
> column names do have a tooltip.






[jira] [Commented] (SPARK-29323) Add tooltip for The Executors Tab's column names in the Spark history server Page

2019-10-04 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944827#comment-16944827
 ] 

Dongjoon Hyun commented on SPARK-29323:
---

Hi, [~liucht-inspur]. Thank you for creating a JIRA, but please don't set 
`Fix Version` because this is not fixed yet.
For details, please see http://spark.apache.org/contributing.html .

> Add tooltip for The Executors Tab's column names in the Spark history server 
> Page
> -
>
> Key: SPARK-29323
> URL: https://issues.apache.org/jira/browse/SPARK-29323
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 2.4.4
>Reporter: liucht-inspur
>Priority: Major
> Fix For: 2.4.4
>
> Attachments: image-2019-10-04-09-42-14-174.png
>
>
> On the Executors tab of the history server page, the Summary section shows the 
> column headers, but the formatting is inconsistent.
> Some column names have a tooltip, such as Storage Memory, Task Time (GC Time), 
> Input, Shuffle Read, Shuffle Write, and Blacklisted, but some columns still have 
> no tooltip: RDD Blocks, Disk Used, Cores, Active Tasks, Failed Tasks, Complete 
> Tasks, and Total Tasks. Oddly, in the Executors section below, all of these 
> column names do have a tooltip.






[jira] [Updated] (SPARK-29323) Add tooltip for The Executors Tab's column names in the Spark history server Page

2019-10-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29323:
--
Fix Version/s: (was: 2.4.4)

> Add tooltip for The Executors Tab's column names in the Spark history server 
> Page
> -
>
> Key: SPARK-29323
> URL: https://issues.apache.org/jira/browse/SPARK-29323
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Affects Versions: 2.4.4
>Reporter: liucht-inspur
>Priority: Major
> Attachments: image-2019-10-04-09-42-14-174.png
>
>
> On the Executors tab of the history server page, the Summary section shows the 
> column headers, but the formatting is inconsistent.
> Some column names have a tooltip, such as Storage Memory, Task Time (GC Time), 
> Input, Shuffle Read, Shuffle Write, and Blacklisted, but some columns still have 
> no tooltip: RDD Blocks, Disk Used, Cores, Active Tasks, Failed Tasks, Complete 
> Tasks, and Total Tasks. Oddly, in the Executors section below, all of these 
> column names do have a tooltip.






[jira] [Commented] (SPARK-29225) Spark SQL 'DESC FORMATTED TABLE' show different format with hive

2019-10-04 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944826#comment-16944826
 ] 

Dongjoon Hyun commented on SPARK-29225:
---

Hi, [~angerszhuuu]. Please use `3.0.0` for the improvement issue.

> Spark SQL 'DESC FORMATTED TABLE' show different format with hive
> 
>
> Key: SPARK-29225
> URL: https://issues.apache.org/jira/browse/SPARK-29225
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Minor
> Attachments: Screen Shot 2019-09-24 at 9.14.39 PM.png, Screen Shot 
> 2019-09-24 at 9.26.14 PM.png, current saprk.jpg
>
>
> Currently, `DESC FORMATTED TABLE` shows a table description format different from 
> Hive's; this prevents HUE from parsing column information correctly:
> *spark*
> !current saprk.jpg!  
> *HIVE*
> !Screen Shot 2019-09-24 at 9.26.14 PM.png!
> *Spark SQL* *expected*
>   !Screen Shot 2019-09-24 at 9.14.39 PM.png!






[jira] [Updated] (SPARK-29225) Spark SQL 'DESC FORMATTED TABLE' show different format with hive

2019-10-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29225:
--
Priority: Minor  (was: Major)

> Spark SQL 'DESC FORMATTED TABLE' show different format with hive
> 
>
> Key: SPARK-29225
> URL: https://issues.apache.org/jira/browse/SPARK-29225
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0
>Reporter: angerszhu
>Priority: Minor
> Attachments: Screen Shot 2019-09-24 at 9.14.39 PM.png, Screen Shot 
> 2019-09-24 at 9.26.14 PM.png, current saprk.jpg
>
>
> Currently, `DESC FORMATTED TABLE` shows a table description format different from 
> Hive's; this prevents HUE from parsing column information correctly:
> *spark*
> !current saprk.jpg!  
> *HIVE*
> !Screen Shot 2019-09-24 at 9.26.14 PM.png!
> *Spark SQL* *expected*
>   !Screen Shot 2019-09-24 at 9.14.39 PM.png!






[jira] [Updated] (SPARK-29225) Spark SQL 'DESC FORMATTED TABLE' show different format with hive

2019-10-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29225:
--
Affects Version/s: (was: 2.4.0)

> Spark SQL 'DESC FORMATTED TABLE' show different format with hive
> 
>
> Key: SPARK-29225
> URL: https://issues.apache.org/jira/browse/SPARK-29225
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: angerszhu
>Priority: Minor
> Attachments: Screen Shot 2019-09-24 at 9.14.39 PM.png, Screen Shot 
> 2019-09-24 at 9.26.14 PM.png, current saprk.jpg
>
>
> Currently, `DESC FORMATTED TABLE` shows a table description format different from 
> Hive's; this prevents HUE from parsing column information correctly:
> *spark*
> !current saprk.jpg!  
> *HIVE*
> !Screen Shot 2019-09-24 at 9.26.14 PM.png!
> *Spark SQL* *expected*
>   !Screen Shot 2019-09-24 at 9.14.39 PM.png!






[jira] [Commented] (SPARK-25440) Dump query execution info to a file

2019-10-04 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944822#comment-16944822
 ] 

Maxim Gekk commented on SPARK-25440:


[~shashidha...@gmail.com] Call *df.queryExecution.debug.toFile()*. 
You can see examples in this test suite: 
https://github.com/apache/spark/blob/97dc4c0bfc3a15d364a376c6f87cb921d8d6980d/sql/core/src/test/scala/org/apache/spark/sql/execution/QueryExecutionSuite.scala
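
A minimal usage sketch (assuming an active SparkSession named spark; the query and the 
output path are illustrative, and toFile is the method added by this ticket for 
Fix Version 3.0.0):

{code:scala}
// Write the full (untruncated) query execution info to a file instead of
// building the whole string in memory.
val df = spark.range(1000).selectExpr("id", "id % 10 AS bucket")
df.queryExecution.debug.toFile("/tmp/query-execution-debug.txt")
{code}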

> Dump query execution info to a file
> ---
>
> Key: SPARK-25440
> URL: https://issues.apache.org/jira/browse/SPARK-25440
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> Output of the explain() doesn't contain full information and in some cases 
> can be truncated. Besides of that it saves info to a string in memory which 
> can cause OOM. The ticket aims to solve the problem and dump info about query 
> execution to a file. Need to add new method to queryExecution.debug which 
> accepts a path to a file. 






[jira] [Commented] (SPARK-29360) PySpark FPGrowthModel supports getter/setter

2019-10-04 Thread Huaxin Gao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944821#comment-16944821
 ] 

Huaxin Gao commented on SPARK-29360:


I will submit a PR soon. 

> PySpark FPGrowthModel supports getter/setter
> 
>
> Key: SPARK-29360
> URL: https://issues.apache.org/jira/browse/SPARK-29360
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: Huaxin Gao
>Priority: Minor
>







[jira] [Created] (SPARK-29360) PySpark FPGrowthModel supports getter/setter

2019-10-04 Thread Huaxin Gao (Jira)
Huaxin Gao created SPARK-29360:
--

 Summary: PySpark FPGrowthModel supports getter/setter
 Key: SPARK-29360
 URL: https://issues.apache.org/jira/browse/SPARK-29360
 Project: Spark
  Issue Type: Sub-task
  Components: ML, PySpark
Affects Versions: 3.0.0
Reporter: Huaxin Gao









[jira] [Comment Edited] (SPARK-25753) binaryFiles broken for small files

2019-10-04 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944817#comment-16944817
 ] 

Dongjoon Hyun edited comment on SPARK-25753 at 10/4/19 8:51 PM:


This was merged to master via https://github.com/apache/spark/pull/22725 and 
backported to branch-2.4 via https://github.com/apache/spark/pull/26026 .


was (Author: dongjoon):
This is backported to branch-2.4 via https://github.com/apache/spark/pull/26026 
.

> binaryFiles broken for small files
> --
>
> Key: SPARK-25753
> URL: https://issues.apache.org/jira/browse/SPARK-25753
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.4, 3.0.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Minor
> Fix For: 2.4.5, 3.0.0
>
>
> _{{StreamFileInputFormat}}_ and 
> {{_WholeTextFileInputFormat_(https://issues.apache.org/jira/browse/SPARK-24610)}}
>  have the same problem: for small sized files, the computed maxSplitSize by 
> `_{{StreamFileInputFormat}}_ `  is way smaller than the default or commonly 
> used split size of 64/128M and spark throws an exception while trying to read 
> them.
> {{Exception info:}}
> _{{Minimum split size pernode 5123456 cannot be larger than maximum split 
> size 4194304 java.io.IOException: Minimum split size pernode 5123456 cannot 
> be larger than maximum split size 4194304 at 
> org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:
>  201) at 
> org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:52) at 
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:254) at 
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at 
> scala.Option.getOrElse(Option.scala:121) at 
> org.apache.spark.rdd.RDD.partitions(RDD.scala:252) at 
> org.apache.spark.SparkContext.runJob(SparkContext.scala:2138)}}_






[jira] [Updated] (SPARK-25753) binaryFiles broken for small files

2019-10-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25753:
--
Fix Version/s: 2.4.5

> binaryFiles broken for small files
> --
>
> Key: SPARK-25753
> URL: https://issues.apache.org/jira/browse/SPARK-25753
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.4, 3.0.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Minor
> Fix For: 2.4.5, 3.0.0
>
>
> _{{StreamFileInputFormat}}_ and 
> {{_WholeTextFileInputFormat_(https://issues.apache.org/jira/browse/SPARK-24610)}}
>  have the same problem: for small sized files, the computed maxSplitSize by 
> `_{{StreamFileInputFormat}}_ `  is way smaller than the default or commonly 
> used split size of 64/128M and spark throws an exception while trying to read 
> them.
> {{Exception info:}}
> _{{Minimum split size pernode 5123456 cannot be larger than maximum split 
> size 4194304 java.io.IOException: Minimum split size pernode 5123456 cannot 
> be larger than maximum split size 4194304 at 
> org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:
>  201) at 
> org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:52) at 
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:254) at 
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at 
> scala.Option.getOrElse(Option.scala:121) at 
> org.apache.spark.rdd.RDD.partitions(RDD.scala:252) at 
> org.apache.spark.SparkContext.runJob(SparkContext.scala:2138)}}_






[jira] [Commented] (SPARK-25753) binaryFiles broken for small files

2019-10-04 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25753?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944817#comment-16944817
 ] 

Dongjoon Hyun commented on SPARK-25753:
---

This is backported to branch-2.4 via https://github.com/apache/spark/pull/26026 
.

> binaryFiles broken for small files
> --
>
> Key: SPARK-25753
> URL: https://issues.apache.org/jira/browse/SPARK-25753
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.4, 3.0.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Minor
> Fix For: 2.4.5, 3.0.0
>
>
> _{{StreamFileInputFormat}}_ and 
> {{_WholeTextFileInputFormat_(https://issues.apache.org/jira/browse/SPARK-24610)}}
>  have the same problem: for small sized files, the computed maxSplitSize by 
> `_{{StreamFileInputFormat}}_ `  is way smaller than the default or commonly 
> used split size of 64/128M and spark throws an exception while trying to read 
> them.
> {{Exception info:}}
> _{{Minimum split size pernode 5123456 cannot be larger than maximum split 
> size 4194304 java.io.IOException: Minimum split size pernode 5123456 cannot 
> be larger than maximum split size 4194304 at 
> org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:
>  201) at 
> org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:52) at 
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:254) at 
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at 
> scala.Option.getOrElse(Option.scala:121) at 
> org.apache.spark.rdd.RDD.partitions(RDD.scala:252) at 
> org.apache.spark.SparkContext.runJob(SparkContext.scala:2138)}}_






[jira] [Updated] (SPARK-25753) binaryFiles broken for small files

2019-10-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25753?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25753:
--
Affects Version/s: 2.4.4

> binaryFiles broken for small files
> --
>
> Key: SPARK-25753
> URL: https://issues.apache.org/jira/browse/SPARK-25753
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.4.4, 3.0.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Minor
> Fix For: 3.0.0
>
>
> _{{StreamFileInputFormat}}_ and 
> {{_WholeTextFileInputFormat_(https://issues.apache.org/jira/browse/SPARK-24610)}}
>  have the same problem: for small sized files, the computed maxSplitSize by 
> `_{{StreamFileInputFormat}}_ `  is way smaller than the default or commonly 
> used split size of 64/128M and spark throws an exception while trying to read 
> them.
> {{Exception info:}}
> _{{Minimum split size pernode 5123456 cannot be larger than maximum split 
> size 4194304 java.io.IOException: Minimum split size pernode 5123456 cannot 
> be larger than maximum split size 4194304 at 
> org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java:
>  201) at 
> org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:52) at 
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:254) at 
> org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252) at 
> scala.Option.getOrElse(Option.scala:121) at 
> org.apache.spark.rdd.RDD.partitions(RDD.scala:252) at 
> org.apache.spark.SparkContext.runJob(SparkContext.scala:2138)}}_






[jira] [Updated] (SPARK-24540) Support for multiple character delimiter in Spark CSV read

2019-10-04 Thread Jeff Evans (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-24540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jeff Evans updated SPARK-24540:
---
Summary: Support for multiple character delimiter in Spark CSV read  (was: 
Support for multiple delimiter in Spark CSV read)

> Support for multiple character delimiter in Spark CSV read
> --
>
> Key: SPARK-24540
> URL: https://issues.apache.org/jira/browse/SPARK-24540
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Ashwin K
>Priority: Major
>
> Currently, the delimiter option used by Spark 2.0+ to read and split CSV files/data 
> only supports a single-character delimiter. If we try to provide a multi-character 
> delimiter, we observe the following error message.
> eg: Dataset<Row> df = spark.read().option("inferSchema", "true")
>                                   .option("header", "false")
>                                   .option("delimiter", ", ")
>                                   .csv("C:\test.txt");
> Exception in thread "main" java.lang.IllegalArgumentException: Delimiter 
> cannot be more than one character: , 
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:111)
>  at 
> org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:83)
>  at 
> org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:39)
>  at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:55)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
>  at scala.Option.orElse(Option.scala:289)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392)
>  at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
>  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596)
>  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473)
>  
> Generally, the data to be processed contains multi-character delimiters, and at 
> present we need to manually clean up the source/input file, which doesn't work well 
> in large applications that consume numerous files.
> There is a workaround of reading the data as text and using the split option (see 
> the sketch below), but in my opinion this defeats the purpose, advantage, and 
> efficiency of a direct read from a CSV file.
>  
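
A sketch of that text-and-split workaround, assuming an active SparkSession named 
spark; the path, the ", " delimiter from the example above, and the column names are 
illustrative:

{code:scala}
import org.apache.spark.sql.functions.split

// Read each line as a single string column named "value", then split it.
val raw = spark.read.text("C:/test.txt")
val parts = split(raw("value"), ", ")
val df = raw.select(
  parts.getItem(0).as("col1"),
  parts.getItem(1).as("col2"),
  parts.getItem(2).as("col3"))
{code}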






[jira] [Commented] (SPARK-24540) Support for multiple delimiter in Spark CSV read

2019-10-04 Thread Jeff Evans (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-24540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944783#comment-16944783
 ] 

Jeff Evans commented on SPARK-24540:


I created a pull request to support this (which was linked above).  I'm not 
entirely clear on why SPARK-17967 would be a blocker, though.

> Support for multiple delimiter in Spark CSV read
> 
>
> Key: SPARK-24540
> URL: https://issues.apache.org/jira/browse/SPARK-24540
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Ashwin K
>Priority: Major
>
> Currently, the delimiter option used by Spark 2.0+ to read and split CSV files/data 
> only supports a single-character delimiter. If we try to provide a multi-character 
> delimiter, we observe the following error message.
> eg: Dataset<Row> df = spark.read().option("inferSchema", "true")
>                                   .option("header", "false")
>                                   .option("delimiter", ", ")
>                                   .csv("C:\test.txt");
> Exception in thread "main" java.lang.IllegalArgumentException: Delimiter 
> cannot be more than one character: , 
> at 
> org.apache.spark.sql.execution.datasources.csv.CSVUtils$.toChar(CSVUtils.scala:111)
>  at 
> org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:83)
>  at 
> org.apache.spark.sql.execution.datasources.csv.CSVOptions.<init>(CSVOptions.scala:39)
>  at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.inferSchema(CSVFileFormat.scala:55)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$8.apply(DataSource.scala:202)
>  at scala.Option.orElse(Option.scala:289)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:201)
>  at 
> org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:392)
>  at 
> org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
>  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
>  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596)
>  at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473)
>  
> Generally, the data to be processed contains multi-character delimiters, and at 
> present we need to manually clean up the source/input file, which doesn't work well 
> in large applications that consume numerous files.
> There is a workaround of reading the data as text and using the split option, but 
> in my opinion this defeats the purpose, advantage, and efficiency of a direct read 
> from a CSV file.
>  






[jira] [Created] (SPARK-29359) Better exception handling in SQLQueryTestSuite and ThriftServerQueryTestSuite

2019-10-04 Thread Peter Toth (Jira)
Peter Toth created SPARK-29359:
--

 Summary: Better exception handling in SQLQueryTestSuite and 
ThriftServerQueryTestSuite
 Key: SPARK-29359
 URL: https://issues.apache.org/jira/browse/SPARK-29359
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 3.0.0
Reporter: Peter Toth


SQLQueryTestSuite and ThriftServerQueryTestSuite should have the same exception 
handling to avoid issues like this:
{noformat}
  Expected "[Recursion level limit 100 reached but query has not exhausted, try 
increasing spark.sql.cte.recursion.level.limit
  org.apache.spark.SparkException]", but got "[org.apache.spark.SparkException
  Recursion level limit 100 reached but query has not exhausted, try increasing 
spark.sql.cte.recursion.level.limit]"
{noformat}






[jira] [Resolved] (SPARK-29340) Spark Sql executions do not use thread local jobgroup

2019-10-04 Thread Navdeep Poonia (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Navdeep Poonia resolved SPARK-29340.

Resolution: Not A Problem

Hi [~hyukjin.kwon], thanks for your response. You made me question our internal 
codebase, and after a few hours of debugging it turns out that one of our Scala 
collections used .par iteration around Spark actions, which creates new threads, so 
the thread-local Spark configs were not propagated to the task scheduler. Job groups 
with Spark SQL work perfectly as expected.
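
A sketch of the pitfall described above (group name, description, and data are 
illustrative): the job group is set as a local property on the calling thread, so 
actions triggered from .par worker threads may not carry it.

{code:scala}
// Job group is attached to the current thread's local properties.
spark.sparkContext.setJobGroup("etl-group", "demo job group")

// .par runs the closure on ForkJoinPool worker threads; jobs submitted from
// those threads may not see the job group set above.
Seq("a", "b", "c").par.foreach { _ =>
  spark.sparkContext.parallelize(1 to 10).count()
}
{code}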

> Spark Sql executions do not use thread local jobgroup
> -
>
> Key: SPARK-29340
> URL: https://issues.apache.org/jira/browse/SPARK-29340
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Navdeep Poonia
>Priority: Major
>
> val sparkThreadLocal: SparkSession = DataCurator.spark.newSession()
> sparkThreadLocal.sparkContext.setJobGroup("", "")
> OR
> sparkThreadLocal.sparkContext.setLocalProperty("spark.job.description", 
> "")
> sparkThreadLocal.sparkContext.setLocalProperty("spark.jobGroup.id", 
> "")
>  
> The job group property works fine for Spark jobs/stages created by DataFrame 
> operations, but with Spark SQL the job group is sometimes assigned to stages 
> seemingly at random, or is null.






[jira] [Updated] (SPARK-11150) Dynamic partition pruning

2019-10-04 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-11150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-11150:

Labels: release-notes  (was: )

> Dynamic partition pruning
> -
>
> Key: SPARK-11150
> URL: https://issues.apache.org/jira/browse/SPARK-11150
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0, 2.0.0, 2.1.2, 2.2.1, 2.3.0
>Reporter: Younes
>Assignee: Wei Xue
>Priority: Major
>  Labels: release-notes
> Fix For: 3.0.0
>
> Attachments: image-2019-10-04-11-20-02-616.png
>
>
> Implements dynamic partition pruning by adding a dynamic-partition-pruning 
> filter if there is a partitioned table and a filter on the dimension table. 
> The filter is then planned using a heuristic approach:
>  # As a broadcast relation if it is a broadcast hash join. The broadcast 
> relation will then be transformed into a reused broadcast exchange by the 
> {{ReuseExchange}} rule; or
>  # As a subquery duplicate if the estimated benefit of partition table scan 
> being saved is greater than the estimated cost of the extra scan of the 
> duplicated subquery; otherwise
>  # As a bypassed condition ({{true}}).
>  Below shows a basic example of DPP.
> !image-2019-10-04-11-20-02-616.png|width=521,height=225!
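
A minimal query shape where DPP applies (table and column names are made up): the 
selective filter on the dimension table is used at runtime to prune partitions of the 
fact table.

{code:scala}
// Illustrative only: fact_sales is partitioned by date_id; the filter on
// dim_date lets the scan of fact_sales skip non-matching partitions.
val pruned = spark.sql(
  """SELECT f.*
    |FROM fact_sales f
    |JOIN dim_date d ON f.date_id = d.date_id
    |WHERE d.year = 2019""".stripMargin)
{code}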






[jira] [Updated] (SPARK-11150) Dynamic partition pruning

2019-10-04 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-11150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-11150:

Description: 
Implements dynamic partition pruning by adding a dynamic-partition-pruning 
filter if there is a partitioned table and a filter on the dimension table. The 
filter is then planned using a heuristic approach:
 # As a broadcast relation if it is a broadcast hash join. The broadcast 
relation will then be transformed into a reused broadcast exchange by the 
{{ReuseExchange}} rule; or
 # As a subquery duplicate if the estimated benefit of partition table scan 
being saved is greater than the estimated cost of the extra scan of the 
duplicated subquery; otherwise
 # As a bypassed condition ({{true}}).

 Below shows a basic example of DPP.

!image-2019-10-04-11-20-02-616.png|width=521,height=225!

  was:
Implements dynamic partition pruning by adding a dynamic-partition-pruning 
filter if there is a partitioned table and a filter on the dimension table. The 
filter is then planned using a heuristic approach:
 # As a broadcast relation if it is a broadcast hash join. The broadcast 
relation will then be transformed into a reused broadcast exchange by the 
{{ReuseExchange}} rule; or
 # As a subquery duplicate if the estimated benefit of partition table scan 
being saved is greater than the estimated cost of the extra scan of the 
duplicated subquery; otherwise
 # As a bypassed condition ({{true}}).

 Below is an example to show how it takes an effect

!image-2019-10-04-11-20-02-616.png|width=521,height=225!


> Dynamic partition pruning
> -
>
> Key: SPARK-11150
> URL: https://issues.apache.org/jira/browse/SPARK-11150
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0, 2.0.0, 2.1.2, 2.2.1, 2.3.0
>Reporter: Younes
>Assignee: Wei Xue
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: image-2019-10-04-11-20-02-616.png
>
>
> Implements dynamic partition pruning by adding a dynamic-partition-pruning 
> filter if there is a partitioned table and a filter on the dimension table. 
> The filter is then planned using a heuristic approach:
>  # As a broadcast relation if it is a broadcast hash join. The broadcast 
> relation will then be transformed into a reused broadcast exchange by the 
> {{ReuseExchange}} rule; or
>  # As a subquery duplicate if the estimated benefit of partition table scan 
> being saved is greater than the estimated cost of the extra scan of the 
> duplicated subquery; otherwise
>  # As a bypassed condition ({{true}}).
>  Below shows a basic example of DPP.
> !image-2019-10-04-11-20-02-616.png|width=521,height=225!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11150) Dynamic partition pruning

2019-10-04 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-11150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-11150:

Description: 
Implements dynamic partition pruning by adding a dynamic-partition-pruning 
filter if there is a partitioned table and a filter on the dimension table. The 
filter is then planned using a heuristic approach:
 # As a broadcast relation if it is a broadcast hash join. The broadcast 
relation will then be transformed into a reused broadcast exchange by the 
{{ReuseExchange}} rule; or
 # As a subquery duplicate if the estimated benefit of partition table scan 
being saved is greater than the estimated cost of the extra scan of the 
duplicated subquery; otherwise
 # As a bypassed condition ({{true}}).

 Below is an example to show how it takes an effect

!image-2019-10-04-11-20-02-616.png|width=521,height=225!

  was:
Implements dynamic partition pruning by adding a dynamic-partition-pruning 
filter if there is a partitioned table and a filter on the dimension table. The 
filter is then planned using a heuristic approach:
 # As a broadcast relation if it is a broadcast hash join. The broadcast 
relation will then be transformed into a reused broadcast exchange by the 
{{ReuseExchange}} rule; or
 # As a subquery duplicate if the estimated benefit of partition table scan 
being saved is greater than the estimated cost of the extra scan of the 
duplicated subquery; otherwise
 # As a bypassed condition ({{true}}).

 

!image-2019-10-04-11-20-02-616.png|width=521,height=225!


> Dynamic partition pruning
> -
>
> Key: SPARK-11150
> URL: https://issues.apache.org/jira/browse/SPARK-11150
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0, 2.0.0, 2.1.2, 2.2.1, 2.3.0
>Reporter: Younes
>Assignee: Wei Xue
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: image-2019-10-04-11-20-02-616.png
>
>
> Implements dynamic partition pruning by adding a dynamic-partition-pruning 
> filter if there is a partitioned table and a filter on the dimension table. 
> The filter is then planned using a heuristic approach:
>  # As a broadcast relation if it is a broadcast hash join. The broadcast 
> relation will then be transformed into a reused broadcast exchange by the 
> {{ReuseExchange}} rule; or
>  # As a subquery duplicate if the estimated benefit of partition table scan 
> being saved is greater than the estimated cost of the extra scan of the 
> duplicated subquery; otherwise
>  # As a bypassed condition ({{true}}).
>  Below is an example to show how it takes an effect
> !image-2019-10-04-11-20-02-616.png|width=521,height=225!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11150) Dynamic partition pruning

2019-10-04 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-11150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-11150:

Attachment: image-2019-10-04-11-20-02-616.png

> Dynamic partition pruning
> -
>
> Key: SPARK-11150
> URL: https://issues.apache.org/jira/browse/SPARK-11150
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0, 2.0.0, 2.1.2, 2.2.1, 2.3.0
>Reporter: Younes
>Assignee: Wei Xue
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: image-2019-10-04-11-20-02-616.png
>
>
> Partitions are not pruned when joined on the partition columns.
> This is the same issue as HIVE-9152.
> Ex: 
> Select  from tab where partcol=1 will prune on value 1
> Select  from tab join dim on (dim.partcol=tab.partcol) where 
> dim.partcol=1 will scan all partitions.
> Tables are based on parquets.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11150) Dynamic partition pruning

2019-10-04 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-11150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-11150:

Description: 
Implements dynamic partition pruning by adding a dynamic-partition-pruning 
filter if there is a partitioned table and a filter on the dimension table. The 
filter is then planned using a heuristic approach:
 # As a broadcast relation if it is a broadcast hash join. The broadcast 
relation will then be transformed into a reused broadcast exchange by the 
{{ReuseExchange}} rule; or
 # As a subquery duplicate if the estimated benefit of partition table scan 
being saved is greater than the estimated cost of the extra scan of the 
duplicated subquery; otherwise
 # As a bypassed condition ({{true}}).

 

!image-2019-10-04-11-20-02-616.png!

  was:
Partitions are not pruned when joined on the partition columns.
This is the same issue as HIVE-9152.
Ex: 
Select  from tab where partcol=1 will prune on value 1
Select  from tab join dim on (dim.partcol=tab.partcol) where dim.partcol=1 
will scan all partitions.
Tables are based on parquets.


> Dynamic partition pruning
> -
>
> Key: SPARK-11150
> URL: https://issues.apache.org/jira/browse/SPARK-11150
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0, 2.0.0, 2.1.2, 2.2.1, 2.3.0
>Reporter: Younes
>Assignee: Wei Xue
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: image-2019-10-04-11-20-02-616.png
>
>
> Implements dynamic partition pruning by adding a dynamic-partition-pruning 
> filter if there is a partitioned table and a filter on the dimension table. 
> The filter is then planned using a heuristic approach:
>  # As a broadcast relation if it is a broadcast hash join. The broadcast 
> relation will then be transformed into a reused broadcast exchange by the 
> {{ReuseExchange}} rule; or
>  # As a subquery duplicate if the estimated benefit of partition table scan 
> being saved is greater than the estimated cost of the extra scan of the 
> duplicated subquery; otherwise
>  # As a bypassed condition ({{true}}).
>  
> !image-2019-10-04-11-20-02-616.png!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11150) Dynamic partition pruning

2019-10-04 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-11150?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-11150:

Description: 
Implements dynamic partition pruning by adding a dynamic-partition-pruning 
filter if there is a partitioned table and a filter on the dimension table. The 
filter is then planned using a heuristic approach:
 # As a broadcast relation if it is a broadcast hash join. The broadcast 
relation will then be transformed into a reused broadcast exchange by the 
{{ReuseExchange}} rule; or
 # As a subquery duplicate if the estimated benefit of partition table scan 
being saved is greater than the estimated cost of the extra scan of the 
duplicated subquery; otherwise
 # As a bypassed condition ({{true}}).

 

!image-2019-10-04-11-20-02-616.png|width=521,height=225!

  was:
Implements dynamic partition pruning by adding a dynamic-partition-pruning 
filter if there is a partitioned table and a filter on the dimension table. The 
filter is then planned using a heuristic approach:
 # As a broadcast relation if it is a broadcast hash join. The broadcast 
relation will then be transformed into a reused broadcast exchange by the 
{{ReuseExchange}} rule; or
 # As a subquery duplicate if the estimated benefit of partition table scan 
being saved is greater than the estimated cost of the extra scan of the 
duplicated subquery; otherwise
 # As a bypassed condition ({{true}}).

 

!image-2019-10-04-11-20-02-616.png!


> Dynamic partition pruning
> -
>
> Key: SPARK-11150
> URL: https://issues.apache.org/jira/browse/SPARK-11150
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 1.5.1, 1.6.0, 2.0.0, 2.1.2, 2.2.1, 2.3.0
>Reporter: Younes
>Assignee: Wei Xue
>Priority: Major
> Fix For: 3.0.0
>
> Attachments: image-2019-10-04-11-20-02-616.png
>
>
> Implements dynamic partition pruning by adding a dynamic-partition-pruning 
> filter if there is a partitioned table and a filter on the dimension table. 
> The filter is then planned using a heuristic approach:
>  # As a broadcast relation if it is a broadcast hash join. The broadcast 
> relation will then be transformed into a reused broadcast exchange by the 
> {{ReuseExchange}} rule; or
>  # As a subquery duplicate if the estimated benefit of partition table scan 
> being saved is greater than the estimated cost of the extra scan of the 
> duplicated subquery; otherwise
>  # As a bypassed condition ({{true}}).
>  
> !image-2019-10-04-11-20-02-616.png|width=521,height=225!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29337) How to Cache Table and Pin it in Memory and should not Spill to Disk on Thrift Server

2019-10-04 Thread Srini E (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944736#comment-16944736
 ] 

Srini E commented on SPARK-29337:
-

Hi Wang,

We don't have cache options when we try to cache a table through spark-sql. We 
can only use CACHE TABLE <table name>, and there are no options such as storage 
level.
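
For what it's worth, a minimal workaround sketch in Scala (the table name is illustrative): the SQL CACHE TABLE statement in 2.3 takes no options, but the DataFrame API accepts an explicit storage level, so MEMORY_ONLY can be requested there. Note that MEMORY_ONLY evicts blocks under memory pressure (they get recomputed) rather than spilling them to disk.

{code:java}
import org.apache.spark.storage.StorageLevel

// Cache the table with an explicit storage level instead of the default
// MEMORY_AND_DISK used by CACHE TABLE.
spark.table("highly_used_table").persist(StorageLevel.MEMORY_ONLY)

// Materialize the cache eagerly.
spark.table("highly_used_table").count()
{code}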

> How to Cache Table and Pin it in Memory and should not Spill to Disk on 
> Thrift Server 
> --
>
> Key: SPARK-29337
> URL: https://issues.apache.org/jira/browse/SPARK-29337
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Srini E
>Priority: Major
> Attachments: Cache+Image.png
>
>
> Hi Team,
> How to pin the table in cache so it would not swap out of memory?
> Situation: We are using MicroStrategy BI reporting and a semantic layer is 
> built. We wanted to cache highly used tables using the Spark SQL CACHE TABLE 
> statement; we did cache them for the Spark context (Thrift server). Please see 
> the attached snapshot of the cached table, which went to disk over time. 
> Initially it was all in cache; now some is in cache and some on disk. That 
> disk may be local disk, which is relatively more expensive to read from than 
> s3. Queries may take longer and show inconsistent times from a user experience 
> perspective. If more queries run using cached tables, copies of the cached 
> table are made and those copies do not stay in memory, causing reports to run 
> longer. So how can we pin the table so it would not swap to disk? Spark memory 
> management uses dynamic allocation, and how can we pin those few tables in 
> memory?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29358) Make unionByName optionally fill missing columns with nulls

2019-10-04 Thread Mukul Murthy (Jira)
Mukul Murthy created SPARK-29358:


 Summary: Make unionByName optionally fill missing columns with 
nulls
 Key: SPARK-29358
 URL: https://issues.apache.org/jira/browse/SPARK-29358
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.4
Reporter: Mukul Murthy


Currently, unionByName requires two DataFrames to have the same set of columns 
(even though the order can be different). It would be good to add either an 
option to unionByName or a new type of union which fills in missing columns 
with nulls. 
{code:java}
val df1 = Seq(1, 2, 3).toDF("x")
val df2 = Seq("a", "b", "c").toDF("y")
df1.unionByName(df2){code}
This currently throws 
{code:java}
org.apache.spark.sql.AnalysisException: Cannot resolve column name "x" among 
(y);
{code}
Ideally, there would be a way to make this return a DataFrame containing:
{code:java}
+----+----+
|   x|   y|
+----+----+
|   1|null|
|   2|null|
|   3|null|
|null|   a|
|null|   b|
|null|   c|
+----+----+
{code}
Currently the workaround is to add the missing columns explicitly before calling 
unionByName, but this is clunky (a generic helper is sketched after the block):
{code:java}
df1.withColumn("y", lit(null)).unionByName(df2.withColumn("x", lit(null)))
{code}
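
A minimal sketch of the kind of helper being asked for, written against the existing 2.4 API (the helper name is made up here):

{code:java}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

// Union two DataFrames by column name, adding null columns for names
// that exist on only one side.
def unionByNameWithNulls(left: DataFrame, right: DataFrame): DataFrame = {
  val allColumns = (left.columns ++ right.columns).distinct
  def withAllColumns(df: DataFrame): DataFrame =
    df.select(allColumns.map { c =>
      if (df.columns.contains(c)) col(c) else lit(null).as(c)
    }: _*)
  withAllColumns(left).unionByName(withAllColumns(right))
}

// unionByNameWithNulls(df1, df2) then produces the x/y table shown above.
{code}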



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29357) Fix the flaky test in DataFrameSuite

2019-10-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29357:
--
Priority: Minor  (was: Major)

> Fix the flaky test in DataFrameSuite
> 
>
> Key: SPARK-29357
> URL: https://issues.apache.org/jira/browse/SPARK-29357
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Minor
>
> Fix the test `SPARK-25159: json schema inference should only trigger one job` 
> by changing to use AtomicLong instead of a var that will not always be 
> updated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29357) Fix the flaky test in DataFrameSuite

2019-10-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29357:
--
Issue Type: Bug  (was: Improvement)

> Fix the flaky test in DataFrameSuite
> 
>
> Key: SPARK-29357
> URL: https://issues.apache.org/jira/browse/SPARK-29357
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Priority: Major
>
> Fix the test `SPARK-25159: json schema inference should only trigger one job` 
> by changing to use AtomicLong instead of a var that will not always be 
> updated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29357) Fix the flaky test in DataFrameSuite

2019-10-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29357.
---
Fix Version/s: 3.0.0
 Assignee: Yuanjian Li
   Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/26020

> Fix the flaky test in DataFrameSuite
> 
>
> Key: SPARK-29357
> URL: https://issues.apache.org/jira/browse/SPARK-29357
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 3.0.0
>Reporter: Yuanjian Li
>Assignee: Yuanjian Li
>Priority: Minor
> Fix For: 3.0.0
>
>
> Fix the test `SPARK-25159: json schema inference should only trigger one job` 
> by changing to use AtomicLong instead of a var that will not always be 
> updated.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29286) UnicodeDecodeError raised when running python tests on arm instance

2019-10-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29286:
--
Affects Version/s: 2.4.4

> UnicodeDecodeError raised when running python tests on arm instance
> ---
>
> Key: SPARK-29286
> URL: https://issues.apache.org/jira/browse/SPARK-29286
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 2.4.4, 3.0.0
>Reporter: huangtianhua
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> Run the command 'python/run-tests --python-executables=python2.7,python3.6' on 
> an ARM instance, and a UnicodeDecodeError is raised:
> 
> Starting test(python2.7): pyspark.broadcast
> Got an exception while trying to store skipped test output:
> Traceback (most recent call last):
>  File "./python/run-tests.py", line 137, in run_individual_python_test
>  decoded_lines = map(lambda line: line.decode(), iter(per_test_output))
>  File "./python/run-tests.py", line 137, in 
>  decoded_lines = map(lambda line: line.decode(), iter(per_test_output))
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 51: 
> ordinal not in range(128)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29286) UnicodeDecodeError raised when running python tests on arm instance

2019-10-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-29286:
--
Fix Version/s: 2.4.5

> UnicodeDecodeError raised when running python tests on arm instance
> ---
>
> Key: SPARK-29286
> URL: https://issues.apache.org/jira/browse/SPARK-29286
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: huangtianhua
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.4.5, 3.0.0
>
>
> Run the command 'python/run-tests --python-executables=python2.7,python3.6' on 
> an ARM instance, and a UnicodeDecodeError is raised:
> 
> Starting test(python2.7): pyspark.broadcast
> Got an exception while trying to store skipped test output:
> Traceback (most recent call last):
>  File "./python/run-tests.py", line 137, in run_individual_python_test
>  decoded_lines = map(lambda line: line.decode(), iter(per_test_output))
>  File "./python/run-tests.py", line 137, in 
>  decoded_lines = map(lambda line: line.decode(), iter(per_test_output))
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 51: 
> ordinal not in range(128)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29286) UnicodeDecodeError raised when running python tests on arm instance

2019-10-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29286.
---
Fix Version/s: 3.0.0
 Assignee: Hyukjin Kwon
   Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/26021

> UnicodeDecodeError raised when running python tests on arm instance
> ---
>
> Key: SPARK-29286
> URL: https://issues.apache.org/jira/browse/SPARK-29286
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 3.0.0
>Reporter: huangtianhua
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.0.0
>
>
> Run the command 'python/run-tests --python-executables=python2.7,python3.6' on 
> an ARM instance, and a UnicodeDecodeError is raised:
> 
> Starting test(python2.7): pyspark.broadcast
> Got an exception while trying to store skipped test output:
> Traceback (most recent call last):
>  File "./python/run-tests.py", line 137, in run_individual_python_test
>  decoded_lines = map(lambda line: line.decode(), iter(per_test_output))
>  File "./python/run-tests.py", line 137, in 
>  decoded_lines = map(lambda line: line.decode(), iter(per_test_output))
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 51: 
> ordinal not in range(128)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29355) Support timestamps subtraction

2019-10-04 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-29355.
---
Fix Version/s: 3.0.0
 Assignee: Maxim Gekk
   Resolution: Fixed

This is resolved via https://github.com/apache/spark/pull/26022

> Support timestamps subtraction
> --
>
> Key: SPARK-29355
> URL: https://issues.apache.org/jira/browse/SPARK-29355
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Assignee: Maxim Gekk
>Priority: Major
> Fix For: 3.0.0
>
>
> ||Operator||Example||Result||
> |{{-}}|{{timestamp '2001-09-29 03:00' - timestamp '2001-09-27 
> 12:00'}}|{{interval '1 day 15:00:00'}}|
> https://www.postgresql.org/docs/11/functions-datetime.html
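
A quick usage sketch against 3.0.0, mirroring the PostgreSQL example from the ticket (the exact interval rendering may differ by version):

{code:java}
// Subtracting two timestamps yields an interval.
spark.sql(
  "SELECT timestamp '2001-09-29 03:00' - timestamp '2001-09-27 12:00' AS diff"
).show(false)
// roughly: 1 days 15 hours
{code}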



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13587) Support virtualenv in PySpark

2019-10-04 Thread Furcy Pin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-13587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944564#comment-16944564
 ] 

Furcy Pin commented on SPARK-13587:
---

Hello,

I don't know where to ask this, but we have been using this feature on 
HDInsight 2.6.5 and we sometimes have a concurrency issue with pip.
Basically it looks like, on rare occasions, several executors set up the 
virtualenv simultaneously, which ends up in a kind of deadlock.

When running the pip install command used by the executor manually, it suddenly 
hangs, and when cancelled it throws this error:
{code:java}
File 
"/mnt/resource/hadoop/yarn/local/usercache/livy/appcache/application_XXX/container_XXX/virtualenv_application_XXX/lib/python3.5/site-packages/pip/_vendor/lockfile/linklockfile.py",
 line 31, in acquire
 os.link(self.unique_name, self.lock_file)
 FileExistsError: [Errno 17] File exists: '/home/yarn/-' -> 
'/home/yarn/selfcheck.json.lock'{code}
This happens with "spark.pyspark.virtualenv.type=native". 
We haven't tried with conda yet.

It is pretty bad because when it happens:
 - some executors of the Spark job just get stuck, and so the Spark job gets stuck
 - even if the job is restarted, the lock file stays there and makes the whole 
YARN host unusable.

Any suggestion or workaround would be appreciated. 
 One idea would be to remove the "--cache-dir /home/yarn" option which is 
currently used in the pip install command, but it doesn't seem to be 
configurable right now.

> Support virtualenv in PySpark
> -
>
> Key: SPARK-13587
> URL: https://issues.apache.org/jira/browse/SPARK-13587
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark
>Affects Versions: 1.6.3, 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Jeff Zhang
>Priority: Major
>
> Currently, it's not easy for users to add third-party Python packages in 
> PySpark.
> * One way is to use --py-files (suitable for a simple dependency, but not 
> suitable for a complicated dependency, especially one with transitive 
> dependencies)
> * Another way is to install packages manually on each node (time-wasting, and 
> not easy to switch to a different environment)
> Python now has 2 different virtualenv implementations: one is native 
> virtualenv, the other is through conda. This JIRA is trying to bring these 2 
> tools to the distributed environment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29276) Spark job fails because of timeout to Driver

2019-10-04 Thread Jochen Hebbrecht (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944526#comment-16944526
 ] 

Jochen Hebbrecht commented on SPARK-29276:
--

Thanks, I've just sent out a mail to the mailing list :-)

> Spark job fails because of timeout to Driver
> 
>
> Key: SPARK-29276
> URL: https://issues.apache.org/jira/browse/SPARK-29276
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.4.2
>Reporter: Jochen Hebbrecht
>Priority: Major
>
> Hi,
> I'm using Spark 2.4.2 on AWS EMR 5.24.0. I'm trying to submit a Spark job 
> to the cluster. The job gets accepted, but the YARN application fails 
> with:
> {code}
> 19/09/27 14:33:35 ERROR ApplicationMaster: Uncaught exception: 
> java.util.concurrent.TimeoutException: Futures timed out after [10 
> milliseconds]
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
> 19/09/27 14:33:35 INFO ApplicationMaster: Final app status: FAILED, exitCode: 
> 13, (reason: Uncaught exception: java.util.concurrent.TimeoutException: 
> Futures timed out after [10 milliseconds]
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
> {code}
> It actually goes wrong at this line: 
> https://github.com/apache/spark/blob/v2.4.2/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L468
> Now, I'm 100% sure Spark is OK and there's no bug, but there must be 
> something wrong with my setup. I don't understand the code of the 
> ApplicationMaster, so could somebody explain to me what it is trying to reach? 
> Where exactly does the connection time out? That way I can at least debug it 
> further, because I don't have a clue what it is doing :-)
> Thanks for any help!
> Jochen



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

--

[jira] [Commented] (SPARK-25440) Dump query execution info to a file

2019-10-04 Thread Shasidhar E S (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-25440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944434#comment-16944434
 ] 

Shasidhar E S commented on SPARK-25440:
---

[~maxgekk] How do we use this feature? Is there an example of how to configure 
the file path to dump the query plan into a file?
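
Based on the ticket description (a new method on queryExecution.debug that takes a file path), a hedged usage sketch for 3.0.0; the path and DataFrame below are illustrative:

{code:java}
val df = spark.range(10).selectExpr("id", "id * 2 AS doubled")

// Dump the full query execution info (plans, codegen, etc.) to a local file
// instead of truncating it in an in-memory string.
df.queryExecution.debug.toFile("/tmp/query-execution.txt")
{code}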

> Dump query execution info to a file
> ---
>
> Key: SPARK-25440
> URL: https://issues.apache.org/jira/browse/SPARK-25440
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Maxim Gekk
>Priority: Minor
> Fix For: 3.0.0
>
>
> Output of explain() doesn't contain full information and in some cases can 
> be truncated. Besides that, it saves the info to a string in memory, which 
> can cause OOM. This ticket aims to solve the problem by dumping info about 
> query execution to a file. We need to add a new method to 
> queryExecution.debug which accepts a path to a file.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29357) Fix the flaky test in DataFrameSuite

2019-10-04 Thread Yuanjian Li (Jira)
Yuanjian Li created SPARK-29357:
---

 Summary: Fix the flaky test in DataFrameSuite
 Key: SPARK-29357
 URL: https://issues.apache.org/jira/browse/SPARK-29357
 Project: Spark
  Issue Type: Improvement
  Components: Tests
Affects Versions: 3.0.0
Reporter: Yuanjian Li


Fix the test `SPARK-25159: json schema inference should only trigger one job` 
by changing to use AtomicLong instead of a var that will not always be updated.
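
For illustration, a minimal sketch of the pattern (the counter name and JSON path are made up): listener callbacks run on a separate thread, so a plain var may not be visibly updated when the assertion runs, whereas an AtomicLong is safe to read across threads.

{code:java}
import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

val numJobs = new AtomicLong(0L)
spark.sparkContext.addSparkListener(new SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    numJobs.incrementAndGet()
  }
})

// Schema inference over a JSON file should trigger exactly one job.
spark.read.json("/tmp/sample.json").schema

// A real test would also wait for listener events to be delivered
// before asserting.
assert(numJobs.get() == 1L)
{code}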



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29356) Stopping Spark doesn't shut down all network connections

2019-10-04 Thread Malthe Borch (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Malthe Borch updated SPARK-29356:
-
Description: The Spark session's gateway client still has an open network 
connection after a call to `spark.stop()`. This is unexpected and for example 
in a test suite, this triggers a resource warning when tearing down the test 
case.  (was: The Spark session's gateway client still has an open network 
connection after a call to `spark.stop()`. This is unexpected and in for 
example a test suite, this triggers a resource warning when tearing down the 
test case.)

> Stopping Spark doesn't shut down all network connections
> 
>
> Key: SPARK-29356
> URL: https://issues.apache.org/jira/browse/SPARK-29356
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.4.4
>Reporter: Malthe Borch
>Priority: Minor
>
> The Spark session's gateway client still has an open network connection after 
> a call to `spark.stop()`. This is unexpected and for example in a test suite, 
> this triggers a resource warning when tearing down the test case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29356) Stopping Spark doesn't shut down all network connections

2019-10-04 Thread Malthe Borch (Jira)
Malthe Borch created SPARK-29356:


 Summary: Stopping Spark doesn't shut down all network connections
 Key: SPARK-29356
 URL: https://issues.apache.org/jira/browse/SPARK-29356
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.4.4
Reporter: Malthe Borch


The Spark session's gateway client still has an open network connection after a 
call to `spark.stop()`. This is unexpected and, for example in a test suite, 
this triggers a resource warning when tearing down the test case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28425) Add more Date/Time Operators

2019-10-04 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-28425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-28425:
---
Description: 
||Operator||Example||Result||
|{{+}}|{{date '2001-09-28' + interval '1 hour'}}|{{timestamp '2001-09-28 
01:00:00'}}|
|{{-}}|{{date '2001-09-28' - interval '1 hour'}}|{{timestamp '2001-09-27 
23:00:00'}}|
|{{*}}|{{900 * interval '1 second'}}|{{interval '00:15:00'}}|
|{{*}}|{{21 * interval '1 day'}}|{{interval '21 days'}}|
|{{*}}|{{double precision '3.5' * interval '1 hour'}}|{{interval '03:30:00'}}|
|{{/}}|{{interval '1 hour' / double precision '1.5'}}|{{interval '00:40:00'}}|


https://www.postgresql.org/docs/11/functions-datetime.html

  was:
||Operator||Example||Result||
|{{+}}|{{date '2001-09-28' + interval '1 hour'}}|{{timestamp '2001-09-28 
01:00:00'}}|
|{{-}}|{{date '2001-09-28' - interval '1 hour'}}|{{timestamp '2001-09-27 
23:00:00'}}|
|{{-}}|{{timestamp '2001-09-29 03:00' - timestamp '2001-09-27 
12:00'}}|{{interval '1 day 15:00:00'}}|
|{{*}}|{{900 * interval '1 second'}}|{{interval '00:15:00'}}|
|{{*}}|{{21 * interval '1 day'}}|{{interval '21 days'}}|
|{{*}}|{{double precision '3.5' * interval '1 hour'}}|{{interval '03:30:00'}}|
|{{/}}|{{interval '1 hour' / double precision '1.5'}}|{{interval '00:40:00'}}|


https://www.postgresql.org/docs/11/functions-datetime.html


> Add more Date/Time Operators
> 
>
> Key: SPARK-28425
> URL: https://issues.apache.org/jira/browse/SPARK-28425
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> ||Operator||Example||Result||
> |{{+}}|{{date '2001-09-28' + interval '1 hour'}}|{{timestamp '2001-09-28 
> 01:00:00'}}|
> |{{-}}|{{date '2001-09-28' - interval '1 hour'}}|{{timestamp '2001-09-27 
> 23:00:00'}}|
> |{{*}}|{{900 * interval '1 second'}}|{{interval '00:15:00'}}|
> |{{*}}|{{21 * interval '1 day'}}|{{interval '21 days'}}|
> |{{*}}|{{double precision '3.5' * interval '1 hour'}}|{{interval '03:30:00'}}|
> |{{/}}|{{interval '1 hour' / double precision '1.5'}}|{{interval '00:40:00'}}|
> https://www.postgresql.org/docs/11/functions-datetime.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29355) Support timestamps subtraction

2019-10-04 Thread Maxim Gekk (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Maxim Gekk updated SPARK-29355:
---
Description: 
||Operator||Example||Result||
|{{-}}|{{timestamp '2001-09-29 03:00' - timestamp '2001-09-27 
12:00'}}|{{interval '1 day 15:00:00'}}|


https://www.postgresql.org/docs/11/functions-datetime.html

  was:
||Operator||Example||Result||
|{{+}}|{{date '2001-09-28' + interval '1 hour'}}|{{timestamp '2001-09-28 
01:00:00'}}|
|{{-}}|{{date '2001-09-28' - interval '1 hour'}}|{{timestamp '2001-09-27 
23:00:00'}}|
|{{-}}|{{timestamp '2001-09-29 03:00' - timestamp '2001-09-27 
12:00'}}|{{interval '1 day 15:00:00'}}|
|{{*}}|{{900 * interval '1 second'}}|{{interval '00:15:00'}}|
|{{*}}|{{21 * interval '1 day'}}|{{interval '21 days'}}|
|{{*}}|{{double precision '3.5' * interval '1 hour'}}|{{interval '03:30:00'}}|
|{{/}}|{{interval '1 hour' / double precision '1.5'}}|{{interval '00:40:00'}}|


https://www.postgresql.org/docs/11/functions-datetime.html


> Support timestamps subtraction
> --
>
> Key: SPARK-29355
> URL: https://issues.apache.org/jira/browse/SPARK-29355
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Maxim Gekk
>Priority: Major
>
> ||Operator||Example||Result||
> |{{-}}|{{timestamp '2001-09-29 03:00' - timestamp '2001-09-27 
> 12:00'}}|{{interval '1 day 15:00:00'}}|
> https://www.postgresql.org/docs/11/functions-datetime.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-29355) Support timestamps subtraction

2019-10-04 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-29355:
--

 Summary: Support timestamps subtraction
 Key: SPARK-29355
 URL: https://issues.apache.org/jira/browse/SPARK-29355
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Maxim Gekk


||Operator||Example||Result||
|{{+}}|{{date '2001-09-28' + interval '1 hour'}}|{{timestamp '2001-09-28 
01:00:00'}}|
|{{-}}|{{date '2001-09-28' - interval '1 hour'}}|{{timestamp '2001-09-27 
23:00:00'}}|
|{{-}}|{{timestamp '2001-09-29 03:00' - timestamp '2001-09-27 
12:00'}}|{{interval '1 day 15:00:00'}}|
|{{*}}|{{900 * interval '1 second'}}|{{interval '00:15:00'}}|
|{{*}}|{{21 * interval '1 day'}}|{{interval '21 days'}}|
|{{*}}|{{double precision '3.5' * interval '1 hour'}}|{{interval '03:30:00'}}|
|{{/}}|{{interval '1 hour' / double precision '1.5'}}|{{interval '00:40:00'}}|


https://www.postgresql.org/docs/11/functions-datetime.html



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29276) Spark job fails because of timeout to Driver

2019-10-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29276.
--
Resolution: Invalid

Let's ask questions on the mailing list or Stack Overflow. You would be able to 
get a better answer there.

> Spark job fails because of timeout to Driver
> 
>
> Key: SPARK-29276
> URL: https://issues.apache.org/jira/browse/SPARK-29276
> Project: Spark
>  Issue Type: Question
>  Components: Spark Core
>Affects Versions: 2.4.2
>Reporter: Jochen Hebbrecht
>Priority: Major
>
> Hi,
> I'm using Spark 2.4.2 on AWS EMR 5.24.0. I'm trying to submit a Spark job 
> to the cluster. The job gets accepted, but the YARN application fails 
> with:
> {code}
> 19/09/27 14:33:35 ERROR ApplicationMaster: Uncaught exception: 
> java.util.concurrent.TimeoutException: Futures timed out after [10 
> milliseconds]
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
> 19/09/27 14:33:35 INFO ApplicationMaster: Final app status: FAILED, exitCode: 
> 13, (reason: Uncaught exception: java.util.concurrent.TimeoutException: 
> Futures timed out after [10 milliseconds]
>   at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
>   at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:220)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:468)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$runImpl(ApplicationMaster.scala:305)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply$mcV$sp(ApplicationMaster.scala:245)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$run$1.apply(ApplicationMaster.scala:245)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:779)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1844)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.doAsUser(ApplicationMaster.scala:778)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:244)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
> {code}
> It actually goes wrong at this line: 
> https://github.com/apache/spark/blob/v2.4.2/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L468
> Now, I'm 100% sure Spark is OK and there's no bug, but there must be 
> something wrong with my setup. I don't understand the code of the 
> ApplicationMaster, so could somebody explain to me what it is trying to reach? 
> Where exactly does the connection time out? That way I can at least debug it 
> further, because I don't have a clue what it is doing :-)
> Thanks for any help!
> Jochen



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (SPARK-29302) dynamic partition overwrite with speculation enabled

2019-10-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29302.
--
Resolution: Invalid

> dynamic partition overwrite with speculation enabled
> 
>
> Key: SPARK-29302
> URL: https://issues.apache.org/jira/browse/SPARK-29302
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: feiwang
>Priority: Major
>
> Now, for a dynamic partition overwrite operation, the filename of a task 
> output is deterministic.
> So, if speculation is enabled, would a task conflict with its corresponding 
> speculative task?
> Would the two tasks concurrently write to the same file?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-29304) Input Bytes Metric for Datasource v2 is absent

2019-10-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-29304:
-
Target Version/s:   (was: 3.0.0)

> Input Bytes Metric for Datasource v2 is absent
> --
>
> Key: SPARK-29304
> URL: https://issues.apache.org/jira/browse/SPARK-29304
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 3.0.0
>Reporter: Kuhu Shukla
>Priority: Major
> Attachments: jira-spark.png
>
>
> Input metrics while reading a simple CSV file with 
> {code}
> spark.read.csv()
> {code}
> are absent. 
> Switching to v1 data source works. Adding the inputMetrics calculation from 
> FileScanRDD to DataSourceRDD helps get the values.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29309) StreamingContext.binaryRecordsStream() is useless

2019-10-04 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944318#comment-16944318
 ] 

Hyukjin Kwon commented on SPARK-29309:
--

So what do you propose?

> StreamingContext.binaryRecordsStream() is useless
> -
>
> Key: SPARK-29309
> URL: https://issues.apache.org/jira/browse/SPARK-29309
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 2.4.4
>Reporter: Alberto Andreotti
>Priority: Major
>
> Supporting only fixed-length binary records makes this function really 
> difficult to use.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29316) CLONE - schemaInference option not to convert strings with leading zeros to int/long

2019-10-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29316.
--
Resolution: Won't Fix

> CLONE - schemaInference option not to convert strings with leading zeros to 
> int/long 
> -
>
> Key: SPARK-29316
> URL: https://issues.apache.org/jira/browse/SPARK-29316
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.1.1, 2.2.0, 2.3.0
>Reporter: Ambar Raghuvanshi
>Priority: Critical
>  Labels: csv, csvparser, easy-fix, inference, ramp-up, schema
>
> It would be great to have an option in Spark's schema inference *not* to 
> convert a column that has leading zeros to an int/long datatype. Think zip 
> codes, for example.
> {code:java}
> df = (sqlc.read.format('csv')
>   .option('inferSchema', True)
>   .option('header', True)
>   .option('delimiter', '|')
>   .option('leadingZeros', 'KEEP')   # this is the new 
> proposed option
>   .option('mode', 'FAILFAST')
>   .load('csvfile_withzipcodes_to_ingest.csv')
> )
> Data with leading zeros is generally used for identifiers. If such columns 
> are converted to int/long, it defeats the purpose of inferSchema. The 
> conversion should be controlled by a flag indicating whether the data should 
> be converted to int/long or not. {code}
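
Since this was closed as Won't Fix, a workaround sketch: supply an explicit schema (or read everything as string) so that zip-code-like columns are never inferred as int/long. The column names below are illustrative; the file name is taken from the snippet above.

{code:java}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Keep identifier-like columns as strings by bypassing schema inference.
val schema = StructType(Seq(
  StructField("zip_code", StringType),
  StructField("city", StringType)
))

val df = spark.read
  .format("csv")
  .option("header", "true")
  .option("delimiter", "|")
  .schema(schema)
  .load("csvfile_withzipcodes_to_ingest.csv")
{code}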



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29039) centralize the catalog and table lookup logic

2019-10-04 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29039.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25747
[https://github.com/apache/spark/pull/25747]

> centralize the catalog and table lookup logic
> -
>
> Key: SPARK-29039
> URL: https://issues.apache.org/jira/browse/SPARK-29039
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
>Priority: Major
> Fix For: 3.0.0
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29337) How to Cache Table and Pin it in Memory and should not Spill to Disk on Thrift Server

2019-10-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29337.
--
Resolution: Invalid

Questions should go to Stack Overflow or the mailing list.

> How to Cache Table and Pin it in Memory and should not Spill to Disk on 
> Thrift Server 
> --
>
> Key: SPARK-29337
> URL: https://issues.apache.org/jira/browse/SPARK-29337
> Project: Spark
>  Issue Type: Question
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Srini E
>Priority: Major
> Attachments: Cache+Image.png
>
>
> Hi Team,
> How to pin the table in cache so it would not swap out of memory?
> Situation: We are using MicroStrategy BI reporting and a semantic layer is 
> built. We wanted to cache highly used tables using the Spark SQL CACHE TABLE 
> statement; we did cache them for the Spark context (Thrift server). Please see 
> the attached snapshot of the cached table, which went to disk over time. 
> Initially it was all in cache; now some is in cache and some on disk. That 
> disk may be local disk, which is relatively more expensive to read from than 
> s3. Queries may take longer and show inconsistent times from a user experience 
> perspective. If more queries run using cached tables, copies of the cached 
> table are made and those copies do not stay in memory, causing reports to run 
> longer. So how can we pin the table so it would not swap to disk? Spark memory 
> management uses dynamic allocation, and how can we pin those few tables in 
> memory?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29340) Spark Sql executions do not use thread local jobgroup

2019-10-04 Thread Hyukjin Kwon (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16944314#comment-16944314
 ] 

Hyukjin Kwon commented on SPARK-29340:
--

[~navdeepniku], is it possible to show the full reproducer? 

> Spark Sql executions do not use thread local jobgroup
> -
>
> Key: SPARK-29340
> URL: https://issues.apache.org/jira/browse/SPARK-29340
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.4
>Reporter: Navdeep Poonia
>Priority: Major
>
> val sparkThreadLocal: SparkSession = DataCurator.spark.newSession()
> sparkThreadLocal.sparkContext.setJobGroup("", "")
> OR
> sparkThreadLocal.sparkContext.setLocalProperty("spark.job.description", 
> "")
> sparkThreadLocal.sparkContext.setLocalProperty("spark.jobGroup.id", 
> "")
>  
> The job group property works fine for Spark jobs/stages created by Spark 
> DataFrame operations, but in the case of Spark SQL the job group is randomly 
> assigned to stages or is sometimes null.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29344) Spark application hang

2019-10-04 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-29344.
--
Resolution: Cannot Reproduce

I haven't seen such behavior. Please specify the steps to reproduce and the 
environment.

> Spark application hang
> --
>
> Key: SPARK-29344
> URL: https://issues.apache.org/jira/browse/SPARK-29344
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.1
>Reporter: Kitti
>Priority: Major
> Attachments: stderr
>
>
> We found an issue where the Spark application hangs and stops working 
> sometimes, without any log in the Spark driver, until we kill the application. 
>  
> 19/10/03 06:07:03 INFO spark.ContextCleaner: Cleaned accumulator 117
> 19/10/03 06:07:03 INFO spark.ContextCleaner: Cleaned accumulator 80
> 19/10/03 06:07:03 INFO spark.ContextCleaner: Cleaned accumulator 105
> 19/10/03 06:07:03 INFO spark.ContextCleaner: Cleaned accumulator 88
> 19/10/03 10:36:59 ERROR yarn.ApplicationMaster: RECEIVED SIGNAL TERM



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-29326) ANSI store assignment policy: throw exception on insertion failure

2019-10-04 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-29326:
---

Assignee: Gengliang Wang

> ANSI store assignment policy: throw exception on insertion failure
> --
>
> Key: SPARK-29326
> URL: https://issues.apache.org/jira/browse/SPARK-29326
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> As per the ANSI SQL standard, the ANSI store assignment policy should throw 
> an exception on insertion failure, such as inserting an out-of-range value 
> into a numeric field.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-29326) ANSI store assignment policy: throw exception on insertion failure

2019-10-04 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-29326.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 25997
[https://github.com/apache/spark/pull/25997]

> ANSI store assignment policy: throw exception on insertion failure
> --
>
> Key: SPARK-29326
> URL: https://issues.apache.org/jira/browse/SPARK-29326
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Gengliang Wang
>Priority: Major
> Fix For: 3.0.0
>
>
> As per the ANSI SQL standard, the ANSI store assignment policy should throw 
> an exception on insertion failure, such as inserting an out-of-range value 
> into a numeric field.
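
A hedged sketch of what this looks like from SQL in 3.0.0 (the table and column names are illustrative):

{code:java}
// With the ANSI store assignment policy, writing an out-of-range value into a
// narrower numeric column fails at runtime instead of silently overflowing.
spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")
spark.sql("CREATE TABLE t(i TINYINT) USING parquet")
spark.sql("INSERT INTO t VALUES (1000)")  // expected to throw under the ANSI policy
{code}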



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org