[jira] [Resolved] (SPARK-30533) Add classes to represent Java Regressors and RegressionModels

2020-01-17 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-30533.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27241
[https://github.com/apache/spark/pull/27241]

> Add classes to represent Java Regressors and RegressionModels
> -
>
> Key: SPARK-30533
> URL: https://issues.apache.org/jira/browse/SPARK-30533
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
> Fix For: 3.0.0
>
>
> Right now PySpark provides classes representing Java {{Classifiers}} and 
> {{ClassifierModels}}, but lacks their regression counterparts.
> We should provide these for consistency, feature parity, and as a 
> prerequisite for SPARK-29212.






[jira] [Updated] (SPARK-30533) Add classes to represent Java Regressors and RegressionModels

2020-01-17 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen updated SPARK-30533:
-
Priority: Minor  (was: Major)

> Add classes to represent Java Regressors and RegressionModels
> -
>
> Key: SPARK-30533
> URL: https://issues.apache.org/jira/browse/SPARK-30533
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Minor
> Fix For: 3.0.0
>
>
> Right now PySpark provides classes representing Java {{Classifiers}} and 
> {{ClassifierModels}}, but lacks their regression counterparts.
> We should provide these for consistency, feature parity, and as a 
> prerequisite for SPARK-29212.






[jira] [Assigned] (SPARK-30533) Add classes to represent Java Regressors and RegressionModels

2020-01-17 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-30533:


Assignee: Maciej Szymkiewicz

> Add classes to represent Java Regressors and RegressionModels
> -
>
> Key: SPARK-30533
> URL: https://issues.apache.org/jira/browse/SPARK-30533
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: Maciej Szymkiewicz
>Assignee: Maciej Szymkiewicz
>Priority: Major
>
> Right now PySpark provides classes representing Java {{Classifiers}} and 
> {{ClassifierModels}}, but lacks their regression counterparts.
> We should provide these for consistency, feature parity, and as a 
> prerequisite for SPARK-29212.






[jira] [Updated] (SPARK-25993) Add test cases for CREATE EXTERNAL TABLE with subdirectories

2020-01-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25993:
--
Issue Type: Improvement  (was: Bug)

> Add test cases for CREATE EXTERNAL TABLE with subdirectories
> 
>
> Key: SPARK-25993
> URL: https://issues.apache.org/jira/browse/SPARK-25993
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Tests
>Affects Versions: 2.3.2
>Reporter: Xiao Li
>Assignee: kevin yu
>Priority: Major
> Fix For: 3.0.0
>
>
> Add a test case based on the following example. The behavior was changed in 
> the 2.3 release. We also need to update the migration guide.
> {code:java}
> val someDF1 = Seq(
>   (1, 1, "blah"),
>   (1, 2, "blahblah")
> ).toDF("folder", "number", "word").repartition(1)
> someDF1.write.orc("/tmp/orctab1/dir1/")
> someDF1.write.orc("/mnt/orctab1/dir2/")
> create external table tab1(folder int, number int, word string) STORED AS ORC
> LOCATION '/tmp/orctab1/';
> select * from tab1;
> create external table tab2(folder int, number int, word string) STORED AS ORC
> LOCATION '/tmp/orctab1/*';
> select * from tab2;
> {code}






[jira] [Updated] (SPARK-25993) Add test cases for CREATE EXTERNAL TABLE with subdirectories

2020-01-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25993:
--
Summary: Add test cases for CREATE EXTERNAL TABLE with subdirectories  
(was: Add test cases for resolution of ORC table location)

> Add test cases for CREATE EXTERNAL TABLE with subdirectories
> 
>
> Key: SPARK-25993
> URL: https://issues.apache.org/jira/browse/SPARK-25993
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.3.2
>Reporter: Xiao Li
>Assignee: kevin yu
>Priority: Major
>  Labels: starter
> Fix For: 3.0.0
>
>
> Add a test case based on the following example. The behavior was changed in 
> the 2.3 release. We also need to update the migration guide.
> {code:java}
> val someDF1 = Seq(
>   (1, 1, "blah"),
>   (1, 2, "blahblah")
> ).toDF("folder", "number", "word").repartition(1)
> someDF1.write.orc("/tmp/orctab1/dir1/")
> someDF1.write.orc("/mnt/orctab1/dir2/")
> create external table tab1(folder int, number int, word string) STORED AS ORC
> LOCATION '/tmp/orctab1/';
> select * from tab1;
> create external table tab2(folder int, number int, word string) STORED AS ORC
> LOCATION '/tmp/orctab1/*';
> select * from tab2;
> {code}






[jira] [Updated] (SPARK-25993) Add test cases for CREATE EXTERNAL TABLE with subdirectories

2020-01-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-25993:
--
Labels:   (was: starter)

> Add test cases for CREATE EXTERNAL TABLE with subdirectories
> 
>
> Key: SPARK-25993
> URL: https://issues.apache.org/jira/browse/SPARK-25993
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.3.2
>Reporter: Xiao Li
>Assignee: kevin yu
>Priority: Major
> Fix For: 3.0.0
>
>
> Add a test case based on the following example. The behavior was changed in 
> the 2.3 release. We also need to update the migration guide.
> {code:java}
> val someDF1 = Seq(
>   (1, 1, "blah"),
>   (1, 2, "blahblah")
> ).toDF("folder", "number", "word").repartition(1)
> someDF1.write.orc("/tmp/orctab1/dir1/")
> someDF1.write.orc("/mnt/orctab1/dir2/")
> create external table tab1(folder int, number int, word string) STORED AS ORC
> LOCATION '/tmp/orctab1/';
> select * from tab1;
> create external table tab2(folder int, number int, word string) STORED AS ORC
> LOCATION '/tmp/orctab1/*';
> select * from tab2;
> {code}






[jira] [Assigned] (SPARK-25993) Add test cases for resolution of ORC table location

2020-01-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-25993:
-

Assignee: kevin yu

> Add test cases for resolution of ORC table location
> ---
>
> Key: SPARK-25993
> URL: https://issues.apache.org/jira/browse/SPARK-25993
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.3.2
>Reporter: Xiao Li
>Assignee: kevin yu
>Priority: Major
>  Labels: starter
>
> Add a test case based on the following example. The behavior was changed in 
> the 2.3 release. We also need to update the migration guide.
> {code:java}
> val someDF1 = Seq(
>   (1, 1, "blah"),
>   (1, 2, "blahblah")
> ).toDF("folder", "number", "word").repartition(1)
> someDF1.write.orc("/tmp/orctab1/dir1/")
> someDF1.write.orc("/mnt/orctab1/dir2/")
> create external table tab1(folder int, number int, word string) STORED AS ORC
> LOCATION '/tmp/orctab1/';
> select * from tab1;
> create external table tab2(folder int, number int, word string) STORED AS ORC
> LOCATION '/tmp/orctab1/*';
> select * from tab2;
> {code}






[jira] [Resolved] (SPARK-25993) Add test cases for resolution of ORC table location

2020-01-17 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-25993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-25993.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 27130
[https://github.com/apache/spark/pull/27130]

> Add test cases for resolution of ORC table location
> ---
>
> Key: SPARK-25993
> URL: https://issues.apache.org/jira/browse/SPARK-25993
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.3.2
>Reporter: Xiao Li
>Assignee: kevin yu
>Priority: Major
>  Labels: starter
> Fix For: 3.0.0
>
>
> Add a test case based on the following example. The behavior was changed in 
> the 2.3 release. We also need to update the migration guide.
> {code:java}
> val someDF1 = Seq(
>   (1, 1, "blah"),
>   (1, 2, "blahblah")
> ).toDF("folder", "number", "word").repartition(1)
> someDF1.write.orc("/tmp/orctab1/dir1/")
> someDF1.write.orc("/mnt/orctab1/dir2/")
> create external table tab1(folder int, number int, word string) STORED AS ORC
> LOCATION '/tmp/orctab1/';
> select * from tab1;
> create external table tab2(folder int, number int, word string) STORED AS ORC
> LOCATION '/tmp/orctab1/*';
> select * from tab2;
> {code}






[jira] [Comment Edited] (SPARK-27249) Developers API for Transformers beyond UnaryTransformer

2020-01-17 Thread Nick Afshartous (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018329#comment-17018329
 ] 

Nick Afshartous edited comment on SPARK-27249 at 1/17/20 9:29 PM:
--

[~enrush] Hi Everett,
The {{Dataset}} API has an experimental function {{mapPartitions}} for 
transforming {{Datasets}}. Does this satisfy your requirements?

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset


was (Author: nafshartous):
[~enrush] Hi Everett, 
The {{Dataset}} API has an experimental function {{mapPartitions}} for 
transforming {{Dataset}}s.  Does this satisfy your requirements ?  

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset

> Developers API for Transformers beyond UnaryTransformer
> ---
>
> Key: SPARK-27249
> URL: https://issues.apache.org/jira/browse/SPARK-27249
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Everett Rush
>Priority: Minor
>  Labels: starter
> Attachments: Screen Shot 2020-01-17 at 4.20.57 PM.png
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> It would be nice to have a developers' API for dataset transformations that 
> need more than one column from a row (i.e. UnaryTransformer takes one input 
> column and produces one output column) or that contain objects too expensive 
> to initialize repeatedly in a UDF, such as a database connection.
>  
> Design:
> Abstract class PartitionTransformer extends Transformer and defines the 
> partition transformation function as Iterator[Row] => Iterator[Row].
> NB: This parallels the UnaryTransformer createTransformFunc method.
>  
> When developers subclass this transformer, they can provide their own schema 
> for the output Row, in which case the PartitionTransformer creates a row 
> encoder and executes the transformation. Alternatively, the developer can set 
> the output DataType and output column name; the PartitionTransformer class 
> will then create a new schema, a row encoder, and execute the transformation.
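
To make the proposed design concrete, here is a minimal sketch of what such a
base class could look like, assuming the Iterator[Row] => Iterator[Row]
signature described above. The class and member names are illustrative only;
none of this is an existing Spark API.

{code:scala}
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.StructType

// Illustrative sketch only; not an existing Spark class.
abstract class PartitionTransformer(override val uid: String) extends Transformer {

  def this() = this(Identifiable.randomUID("partitionTransformer"))

  /** Schema of the rows produced by the partition function. */
  def outputSchema: StructType

  /** Per-partition transformation, analogous to UnaryTransformer.createTransformFunc
   *  but operating on a whole partition instead of a single value. */
  def createPartitionFunc: Iterator[Row] => Iterator[Row]

  override def transformSchema(schema: StructType): StructType = outputSchema

  override def transform(dataset: Dataset[_]): DataFrame = {
    val func = createPartitionFunc
    // A row encoder built from the declared output schema drives the conversion back to rows.
    dataset.toDF().mapPartitions(func)(RowEncoder(outputSchema))
  }

  override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
}
{code}

A subclass would then only supply outputSchema and createPartitionFunc, for
example to push each partition through a database connection that is opened
once per partition rather than once per row.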






[jira] [Commented] (SPARK-27249) Developers API for Transformers beyond UnaryTransformer

2020-01-17 Thread Nick Afshartous (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018329#comment-17018329
 ] 

Nick Afshartous commented on SPARK-27249:
-

[~enrush] Hi Everett,
The {{Dataset}} API has an experimental function {{mapPartitions}} for 
transforming {{Datasets}}. Does this satisfy your requirements?

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset

> Developers API for Transformers beyond UnaryTransformer
> ---
>
> Key: SPARK-27249
> URL: https://issues.apache.org/jira/browse/SPARK-27249
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Everett Rush
>Priority: Minor
>  Labels: starter
> Attachments: Screen Shot 2020-01-17 at 4.20.57 PM.png
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> It would be nice to have a developers' API for dataset transformations that 
> need more than one column from a row (i.e. UnaryTransformer takes one input 
> column and produces one output column) or that contain objects too expensive 
> to initialize repeatedly in a UDF, such as a database connection.
>  
> Design:
> Abstract class PartitionTransformer extends Transformer and defines the 
> partition transformation function as Iterator[Row] => Iterator[Row].
> NB: This parallels the UnaryTransformer createTransformFunc method.
>  
> When developers subclass this transformer, they can provide their own schema 
> for the output Row, in which case the PartitionTransformer creates a row 
> encoder and executes the transformation. Alternatively, the developer can set 
> the output DataType and output column name; the PartitionTransformer class 
> will then create a new schema, a row encoder, and execute the transformation.






[jira] [Comment Edited] (SPARK-27249) Developers API for Transformers beyond UnaryTransformer

2020-01-17 Thread Nick Afshartous (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018329#comment-17018329
 ] 

Nick Afshartous edited comment on SPARK-27249 at 1/17/20 9:28 PM:
--

[~enrush] Hi Everett,
The {{Dataset}} API has an experimental function {{mapPartitions}} for 
transforming {{Dataset}}s. Does this satisfy your requirements?

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset


was (Author: nafshartous):
[~enrush] Hi Everett, 
The {{Dataset}} API has an experimental function {{mapPartitions}} for 
transforming {{Dataset}} .  Does this satisfy your requirements ?  

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset

> Developers API for Transformers beyond UnaryTransformer
> ---
>
> Key: SPARK-27249
> URL: https://issues.apache.org/jira/browse/SPARK-27249
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Everett Rush
>Priority: Minor
>  Labels: starter
> Attachments: Screen Shot 2020-01-17 at 4.20.57 PM.png
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> It would be nice to have a developers' API for dataset transformations that 
> need more than one column from a row (i.e. UnaryTransformer takes one input 
> column and produces one output column) or that contain objects too expensive 
> to initialize repeatedly in a UDF, such as a database connection.
>  
> Design:
> Abstract class PartitionTransformer extends Transformer and defines the 
> partition transformation function as Iterator[Row] => Iterator[Row].
> NB: This parallels the UnaryTransformer createTransformFunc method.
>  
> When developers subclass this transformer, they can provide their own schema 
> for the output Row, in which case the PartitionTransformer creates a row 
> encoder and executes the transformation. Alternatively, the developer can set 
> the output DataType and output column name; the PartitionTransformer class 
> will then create a new schema, a row encoder, and execute the transformation.






[jira] [Updated] (SPARK-27249) Developers API for Transformers beyond UnaryTransformer

2020-01-17 Thread Nick Afshartous (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nick Afshartous updated SPARK-27249:

Attachment: Screen Shot 2020-01-17 at 4.20.57 PM.png

> Developers API for Transformers beyond UnaryTransformer
> ---
>
> Key: SPARK-27249
> URL: https://issues.apache.org/jira/browse/SPARK-27249
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Everett Rush
>Priority: Minor
>  Labels: starter
> Attachments: Screen Shot 2020-01-17 at 4.20.57 PM.png
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> It would be nice to have a developers' API for dataset transformations that 
> need more than one column from a row (i.e. UnaryTransformer takes one input 
> column and produces one output column) or that contain objects too expensive 
> to initialize repeatedly in a UDF, such as a database connection.
>  
> Design:
> Abstract class PartitionTransformer extends Transformer and defines the 
> partition transformation function as Iterator[Row] => Iterator[Row].
> NB: This parallels the UnaryTransformer createTransformFunc method.
>  
> When developers subclass this transformer, they can provide their own schema 
> for the output Row, in which case the PartitionTransformer creates a row 
> encoder and executes the transformation. Alternatively, the developer can set 
> the output DataType and output column name; the PartitionTransformer class 
> will then create a new schema, a row encoder, and execute the transformation.






[jira] [Commented] (SPARK-30557) Add public documentation for SPARK_SUBMIT_OPTS

2020-01-17 Thread Marcelo Masiero Vanzin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018256#comment-17018256
 ] 

Marcelo Masiero Vanzin commented on SPARK-30557:


I don't exactly remember what that does, but a quick look seems to indicate 
it's basically another way of setting JVM options, used in some internal code. 
We already have {{--driver-java-options}} for users.

> Add public documentation for SPARK_SUBMIT_OPTS
> --
>
> Key: SPARK-30557
> URL: https://issues.apache.org/jira/browse/SPARK-30557
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Documentation
>Affects Versions: 2.4.4
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Is `SPARK_SUBMIT_OPTS` part of Spark's public interface? If so, it needs some 
> documentation. I cannot see it documented 
> [anywhere|https://github.com/apache/spark/search?q=SPARK_SUBMIT_OPTS_q=SPARK_SUBMIT_OPTS]
>  in the docs.
> How do you use it? What is it useful for? What's an example usage? etc.






[jira] [Commented] (SPARK-30557) Add public documentation for SPARK_SUBMIT_OPTS

2020-01-17 Thread Nicholas Chammas (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018254#comment-17018254
 ] 

Nicholas Chammas commented on SPARK-30557:
--

[~vanzin] - Do you know if this is something we should document? Would the 
documentation go in [Submitting 
Applications|https://spark.apache.org/docs/latest/submitting-applications.html]?
 (Another possibility is to put it in spark-env.sh.template.)

> Add public documentation for SPARK_SUBMIT_OPTS
> --
>
> Key: SPARK-30557
> URL: https://issues.apache.org/jira/browse/SPARK-30557
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy, Documentation
>Affects Versions: 2.4.4
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Is `SPARK_SUBMIT_OPTS` part of Spark's public interface? If so, it needs some 
> documentation. I cannot see it documented 
> [anywhere|https://github.com/apache/spark/search?q=SPARK_SUBMIT_OPTS_q=SPARK_SUBMIT_OPTS]
>  in the docs.
> How do you use it? What is it useful for? What's an example usage? etc.






[jira] [Commented] (SPARK-27868) Better document shuffle / RPC listen backlog

2020-01-17 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018242#comment-17018242
 ] 

Dongjoon Hyun commented on SPARK-27868:
---

I see. Thank you, [~vanzin].

> Better document shuffle / RPC listen backlog
> 
>
> Key: SPARK-27868
> URL: https://issues.apache.org/jira/browse/SPARK-27868
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Core
>Affects Versions: 2.4.3
>Reporter: Marcelo Masiero Vanzin
>Assignee: Marcelo Masiero Vanzin
>Priority: Minor
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> The option to control the listen socket backlog for RPC and shuffle servers 
> is not documented in our public docs.
> The only piece of documentation is in a Java class, and even that 
> documentation is incorrect:
> {code}
>   /** Requested maximum length of the queue of incoming connections. Default -1 for no backlog. */
>   public int backLog() { return conf.getInt(SPARK_NETWORK_IO_BACKLOG_KEY, -1); }
> {code}
> The default value actually causes the default value from the JRE to be used, 
> which is 50 according to the docs.
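
As a rough illustration of what the missing public documentation could show:
the backlog surfaces through module-scoped keys such as
{{spark.shuffle.io.backLog}}. The snippet below is only a sketch; the exact key
name and semantics should be checked against the configuration docs added by
this ticket.

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Sketch: raise the shuffle server's listen backlog instead of relying on the
// JRE default of 50. Key name assumed from SPARK_NETWORK_IO_BACKLOG_KEY above.
val conf = new SparkConf()
  .setAppName("backlog-example")
  .set("spark.shuffle.io.backLog", "256")

val spark = SparkSession.builder().config(conf).getOrCreate()
{code}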






[jira] [Resolved] (SPARK-29876) Delete/archive file source completed files in separate thread

2020-01-17 Thread Marcelo Masiero Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Masiero Vanzin resolved SPARK-29876.

Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26502
[https://github.com/apache/spark/pull/26502]

> Delete/archive file source completed files in separate thread
> -
>
> Key: SPARK-29876
> URL: https://issues.apache.org/jira/browse/SPARK-29876
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Major
> Fix For: 3.0.0
>
>
> SPARK-20568 added the possibility to clean up completed files in a streaming 
> query. Deleting/archiving uses the main thread, which can slow down 
> processing. It would be good to do this on separate thread(s).






[jira] [Assigned] (SPARK-29876) Delete/archive file source completed files in separate thread

2020-01-17 Thread Marcelo Masiero Vanzin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29876?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Masiero Vanzin reassigned SPARK-29876:
--

Assignee: Gabor Somogyi

> Delete/archive file source completed files in separate thread
> -
>
> Key: SPARK-29876
> URL: https://issues.apache.org/jira/browse/SPARK-29876
> Project: Spark
>  Issue Type: Improvement
>  Components: Structured Streaming
>Affects Versions: 3.0.0
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Major
>
> SPARK-20568 added the possibility to clean up completed files in a streaming 
> query. Deleting/archiving uses the main thread, which can slow down 
> processing. It would be good to do this on separate thread(s).






[jira] [Created] (SPARK-30557) Add public documentation for SPARK_SUBMIT_OPTS

2020-01-17 Thread Nicholas Chammas (Jira)
Nicholas Chammas created SPARK-30557:


 Summary: Add public documentation for SPARK_SUBMIT_OPTS
 Key: SPARK-30557
 URL: https://issues.apache.org/jira/browse/SPARK-30557
 Project: Spark
  Issue Type: Improvement
  Components: Deploy, Documentation
Affects Versions: 2.4.4
Reporter: Nicholas Chammas


Is `SPARK_SUBMIT_OPTS` part of Spark's public interface? If so, it needs some 
documentation. I cannot see it documented 
[anywhere|https://github.com/apache/spark/search?q=SPARK_SUBMIT_OPTS_q=SPARK_SUBMIT_OPTS]
 in the docs.

How do you use it? What is it useful for? What's an example usage? etc.






[jira] [Commented] (SPARK-27868) Better document shuffle / RPC listen backlog

2020-01-17 Thread Marcelo Masiero Vanzin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018219#comment-17018219
 ] 

Marcelo Masiero Vanzin commented on SPARK-27868:


It's ok for now, it's done. Hopefully 3.0 will come out soon and the "main" 
documentation on the site will have the info.

> Better document shuffle / RPC listen backlog
> 
>
> Key: SPARK-27868
> URL: https://issues.apache.org/jira/browse/SPARK-27868
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, Spark Core
>Affects Versions: 2.4.3
>Reporter: Marcelo Masiero Vanzin
>Assignee: Marcelo Masiero Vanzin
>Priority: Minor
>  Labels: release-notes
> Fix For: 3.0.0
>
>
> The option to control the listen socket backlog for RPC and shuffle servers 
> is not documented in our public docs.
> The only piece of documentation is in a Java class, and even that 
> documentation is incorrect:
> {code}
>   /** Requested maximum length of the queue of incoming connections. Default -1 for no backlog. */
>   public int backLog() { return conf.getInt(SPARK_NETWORK_IO_BACKLOG_KEY, -1); }
> {code}
> The default value actually causes the default value from the JRE to be used, 
> which is 50 according to the docs.






[jira] [Updated] (SPARK-22590) Broadcast thread propagates the localProperties to task

2020-01-17 Thread Ajith S (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ajith S updated SPARK-22590:

Affects Version/s: 3.0.0
   2.4.4

> Broadcast thread propagates the localProperties to task
> ---
>
> Key: SPARK-22590
> URL: https://issues.apache.org/jira/browse/SPARK-22590
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0, 2.4.4, 3.0.0
>Reporter: Ajith S
>Priority: Major
>  Labels: bulk-closed
> Attachments: TestProps.scala
>
>
> Local properties set via sparkContext are not available as TaskContext 
> properties when executing parallel jobs and thread pools have idle threads.
> Explanation: 
>  When executing parallel jobs via {{BroadcastExchangeExec}}, the 
> {{relationFuture}} is evaluated via a separate thread. The threads inherit 
> the {{localProperties}} from sparkContext because they are child threads.
>  These threads are controlled via the executionContext (thread pools). Each 
> thread pool has a default {{keepAliveSeconds}} of 60 seconds for idle 
> threads. 
>  In scenarios where the thread pool has idle threads that are reused for a 
> subsequent new query, the thread-local properties will not be inherited from 
> the spark context (thread properties are inherited only on thread creation), 
> so the threads end up with old or no properties set. This causes taskset 
> properties to be missing when properties are transferred by the child thread 
> via {{sparkContext.runJob/submitJob}}.
> Attached is a test case to simulate this behavior.






[jira] [Created] (SPARK-30556) SubqueryExec passes local properties to SubqueryExec.executionContext

2020-01-17 Thread Ajith S (Jira)
Ajith S created SPARK-30556:
---

 Summary: SubqueryExec passes local properties to 
SubqueryExec.executionContext
 Key: SPARK-30556
 URL: https://issues.apache.org/jira/browse/SPARK-30556
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.4, 3.0.0
Reporter: Ajith S


Local properties set via sparkContext are not available as TaskContext 
properties when executing jobs and the thread pools have idle threads which 
are reused.

Explanation:
When executing {{SubqueryExec}}, the {{relationFuture}} is evaluated via a 
separate thread. The threads inherit the {{localProperties}} from sparkContext 
because they are child threads.
These threads are controlled via the executionContext (thread pools). Each 
thread pool has a default {{keepAliveSeconds}} of 60 seconds for idle threads.
In scenarios where the thread pool has idle threads that are reused for a 
subsequent new query, the thread-local properties will not be inherited from 
the spark context (thread properties are inherited only on thread creation), 
so the threads end up with old or no properties set. This causes taskset 
properties to be missing when properties are transferred by the child thread 
via {{sparkContext.runJob/submitJob}}.
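
To illustrate the mechanism described above with a self-contained sketch (this
is not Spark code): an InheritableThreadLocal is copied only when a thread is
created, so a pooled thread that outlives the context it was created under
keeps the stale value.

{code:scala}
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object StaleThreadLocalDemo {
  // Stands in for SparkContext's localProperties, which is also inheritable.
  val localProp = new InheritableThreadLocal[String] {
    override def initialValue(): String = "unset"
  }

  def main(args: Array[String]): Unit = {
    val pool = Executors.newFixedThreadPool(1)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

    localProp.set("query-1")
    // The pool's single worker thread is created for this first task,
    // so it inherits "query-1".
    Await.result(Future { println(s"first job sees:  ${localProp.get}") }, 10.seconds)

    localProp.set("query-2")
    // The worker thread is reused, not recreated, so it still sees "query-1".
    Await.result(Future { println(s"second job sees: ${localProp.get}") }, 10.seconds)

    pool.shutdown()
  }
}
{code}

One way around this, rather than relying on inheritance, is to capture the
properties in the submitting thread and re-apply them inside the pooled task
before running the job.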






[jira] [Assigned] (SPARK-30041) Add Codegen Stage Id to Stage DAG visualization in Web UI

2020-01-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30041:
---

Assignee: Luca Canali

> Add Codegen Stage Id to Stage DAG visualization in Web UI
> -
>
> Key: SPARK-30041
> URL: https://issues.apache.org/jira/browse/SPARK-30041
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 3.0.0
>Reporter: Luca Canali
>Assignee: Luca Canali
>Priority: Minor
> Attachments: Snippet_StagesDags_with_CodegenId _annotated.png
>
>
> SPARK-29894 provides information on the Codegen Stage Id in WEBUI for SQL 
> Plan graphs. Similarly, this proposes to add Codegen Stage Id in the DAG 
> visualization for Stage execution. DAGs for Stage execution are available in 
> the WEBUI under the Jobs and Stages tabs.
>  This is proposed as an aid for drill-down analysis of complex SQL statement 
> execution, as it is not always easy to match parts of the SQL Plan graph with 
> the corresponding Stage DAG execution graph. Adding Codegen Stage Id for 
> WholeStageCodegen operations makes this task easier.
>  






[jira] [Resolved] (SPARK-30041) Add Codegen Stage Id to Stage DAG visualization in Web UI

2020-01-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30041.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26675
[https://github.com/apache/spark/pull/26675]

> Add Codegen Stage Id to Stage DAG visualization in Web UI
> -
>
> Key: SPARK-30041
> URL: https://issues.apache.org/jira/browse/SPARK-30041
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL, Web UI
>Affects Versions: 3.0.0
>Reporter: Luca Canali
>Assignee: Luca Canali
>Priority: Minor
> Fix For: 3.0.0
>
> Attachments: Snippet_StagesDags_with_CodegenId _annotated.png
>
>
> SPARK-29894 provides information on the Codegen Stage Id in WEBUI for SQL 
> Plan graphs. Similarly, this proposes to add Codegen Stage Id in the DAG 
> visualization for Stage execution. DAGs for Stage execution are available in 
> the WEBUI under the Jobs and Stages tabs.
>  This is proposed as an aid for drill-down analysis of complex SQL statement 
> execution, as it is not always easy to match parts of the SQL Plan graph with 
> the corresponding Stage DAG execution graph. Adding Codegen Stage Id for 
> WholeStageCodegen operations makes this task easier.
>  






[jira] [Updated] (SPARK-22590) Broadcast thread propagates the localProperties to task

2020-01-17 Thread Ajith S (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ajith S updated SPARK-22590:

Summary: Broadcast thread propagates the localProperties to task  (was: 
SparkContext's local properties missing from TaskContext properties)

> Broadcast thread propagates the localProperties to task
> ---
>
> Key: SPARK-22590
> URL: https://issues.apache.org/jira/browse/SPARK-22590
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Ajith S
>Priority: Major
>  Labels: bulk-closed
> Attachments: TestProps.scala
>
>
> Local properties set via sparkContext are not available as TaskContext 
> properties when executing parallel jobs and thread pools have idle threads.
> Explanation: 
>  When executing parallel jobs via {{BroadcastExchangeExec}}, the 
> {{relationFuture}} is evaluated via a separate thread. The threads inherit 
> the {{localProperties}} from sparkContext because they are child threads.
>  These threads are controlled via the executionContext (thread pools). Each 
> thread pool has a default {{keepAliveSeconds}} of 60 seconds for idle 
> threads. 
>  In scenarios where the thread pool has idle threads that are reused for a 
> subsequent new query, the thread-local properties will not be inherited from 
> the spark context (thread properties are inherited only on thread creation), 
> so the threads end up with old or no properties set. This causes taskset 
> properties to be missing when properties are transferred by the child thread 
> via {{sparkContext.runJob/submitJob}}.
> Attached is a test case to simulate this behavior.






[jira] [Updated] (SPARK-22590) SparkContext's local properties missing from TaskContext properties

2020-01-17 Thread Ajith S (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ajith S updated SPARK-22590:

Description: 
Local properties set via sparkContext are not available as TaskContext 
properties when executing parallel jobs and thread pools have idle threads.

Explanation: 
 When executing parallel jobs via {{BroadcastExchangeExec}}, the 
{{relationFuture}} is evaluated via a separate thread. The threads inherit the 
{{localProperties}} from sparkContext because they are child threads.
 These threads are controlled via the executionContext (thread pools). Each 
thread pool has a default {{keepAliveSeconds}} of 60 seconds for idle threads. 
 In scenarios where the thread pool has idle threads that are reused for a 
subsequent new query, the thread-local properties will not be inherited from 
the spark context (thread properties are inherited only on thread creation), 
so the threads end up with old or no properties set. This causes taskset 
properties to be missing when properties are transferred by the child thread 
via {{sparkContext.runJob/submitJob}}.

Attached is a test case to simulate this behavior.

  was:
Local properties set via sparkContext are not available as TaskContext 
properties when executing parallel jobs and thread pools have idle threads.

Explanation:  
When executing parallel jobs via {{BroadcastExchangeExec}} or {{SubqueryExec}}, 
the {{relationFuture}} is evaluated via a separate thread. The threads inherit 
the {{localProperties}} from sparkContext because they are child threads.
These threads are controlled via the executionContext (thread pools). Each 
thread pool has a default {{keepAliveSeconds}} of 60 seconds for idle threads. 
In scenarios where the thread pool has idle threads that are reused for a 
subsequent new query, the thread-local properties will not be inherited from 
the spark context (thread properties are inherited only on thread creation), 
so the threads end up with old or no properties set. This causes taskset 
properties to be missing when properties are transferred by the child thread 
via {{sparkContext.runJob/submitJob}}.

Attached is a test case to simulate this behavior.



> SparkContext's local properties missing from TaskContext properties
> ---
>
> Key: SPARK-22590
> URL: https://issues.apache.org/jira/browse/SPARK-22590
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Ajith S
>Priority: Major
>  Labels: bulk-closed
> Attachments: TestProps.scala
>
>
> Local properties set via sparkContext are not available as TaskContext 
> properties when executing parallel jobs and thread pools have idle threads.
> Explanation: 
>  When executing parallel jobs via {{BroadcastExchangeExec}}, the 
> {{relationFuture}} is evaluated via a separate thread. The threads inherit 
> the {{localProperties}} from sparkContext because they are child threads.
>  These threads are controlled via the executionContext (thread pools). Each 
> thread pool has a default {{keepAliveSeconds}} of 60 seconds for idle 
> threads. 
>  In scenarios where the thread pool has idle threads that are reused for a 
> subsequent new query, the thread-local properties will not be inherited from 
> the spark context (thread properties are inherited only on thread creation), 
> so the threads end up with old or no properties set. This causes taskset 
> properties to be missing when properties are transferred by the child thread 
> via {{sparkContext.runJob/submitJob}}.
> Attached is a test case to simulate this behavior.






[jira] [Reopened] (SPARK-22590) SparkContext's local properties missing from TaskContext properties

2020-01-17 Thread Ajith S (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-22590?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ajith S reopened SPARK-22590:
-

Adding Fix

> SparkContext's local properties missing from TaskContext properties
> ---
>
> Key: SPARK-22590
> URL: https://issues.apache.org/jira/browse/SPARK-22590
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Ajith S
>Priority: Major
>  Labels: bulk-closed
> Attachments: TestProps.scala
>
>
> Local properties set via sparkContext are not available as TaskContext 
> properties when executing parallel jobs and thread pools have idle threads.
> Explanation:  
> When executing parallel jobs via {{BroadcastExchangeExec}} or 
> {{SubqueryExec}}, the {{relationFuture}} is evaluated via a separate thread. 
> The threads inherit the {{localProperties}} from sparkContext because they 
> are child threads.
> These threads are controlled via the executionContext (thread pools). Each 
> thread pool has a default {{keepAliveSeconds}} of 60 seconds for idle 
> threads. 
> In scenarios where the thread pool has idle threads that are reused for a 
> subsequent new query, the thread-local properties will not be inherited from 
> the spark context (thread properties are inherited only on thread creation), 
> so the threads end up with old or no properties set. This causes taskset 
> properties to be missing when properties are transferred by the child thread 
> via {{sparkContext.runJob/submitJob}}.
> Attached is a test case to simulate this behavior.






[jira] [Created] (SPARK-30555) MERGE INTO insert action should only access columns from source table

2020-01-17 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-30555:
---

 Summary: MERGE INTO insert action should only access columns from 
source table
 Key: SPARK-30555
 URL: https://issues.apache.org/jira/browse/SPARK-30555
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Wenchen Fan
Assignee: Wenchen Fan









[jira] [Created] (SPARK-30554) Return Iterable from FailureSafeParser.rawParser

2020-01-17 Thread Maxim Gekk (Jira)
Maxim Gekk created SPARK-30554:
--

 Summary: Return Iterable from FailureSafeParser.rawParser
 Key: SPARK-30554
 URL: https://issues.apache.org/jira/browse/SPARK-30554
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.4
Reporter: Maxim Gekk


Currently, the rawParser of FailureSafeParser has the signature `IN => 
Seq[InternalRow]`, which is a strict requirement on the underlying rawParser. 
It could return an Option instead, as the CSV datasource does, for example. 
This ticket aims to refactor FailureSafeParser and change Seq to Iterable. The 
CSV and JSON datasources, where FailureSafeParser is used, also need to be 
modified accordingly.
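
A rough sketch of the proposed relaxation, using stand-in types rather than
Spark's internal ones (the real FailureSafeParser differs in detail): widening
Seq to Iterable lets a raw parser hand back an Option, or any other
collection-like result, without first materializing a Seq.

{code:scala}
// Illustrative only; not Spark's actual internal class.
class FailureSafeParser[IN, OUT](
    rawParser: IN => Iterable[OUT],                 // proposed: was IN => Seq[OUT]
    onError: (IN, Throwable) => Iterable[OUT]) {

  def parse(input: IN): Iterable[OUT] =
    try rawParser(input)
    catch { case e: Exception => onError(input, e) }
}

object FailureSafeParserExample extends App {
  // An Option-returning raw parser now fits directly (via option2Iterable),
  // without wrapping every result in a Seq.
  val intParser = new FailureSafeParser[String, Int](
    rawParser = s => Option(s.trim).filter(_.nonEmpty).map(_.toInt),
    onError = (in, _) => { println(s"dropping malformed record: $in"); Nil })

  println(Seq("1", "2", "x", "").flatMap(intParser.parse))  // List(1, 2)
}
{code}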






[jira] [Updated] (SPARK-30553) structured-streaming documentation java watermark group by

2020-01-17 Thread bettermouse (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bettermouse updated SPARK-30553:

Description: 
[http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#handling-late-data-and-watermarking]

I wrote code following this in both Java and Scala.

java
{code:java}
public static void main(String[] args) throws StreamingQueryException {
    SparkSession spark = SparkSession.builder().appName("test").master("local[*]")
            .config("spark.sql.shuffle.partitions", 1)
            .getOrCreate();
    Dataset<Row> lines = spark.readStream().format("socket")
            .option("host", "skynet")
            .option("includeTimestamp", true)
            .option("port", ).load();
    Dataset<Row> words = lines.select("timestamp", "value");
    Dataset<Row> count = words.withWatermark("timestamp", "10 seconds")
            .groupBy(functions.window(words.col("timestamp"), "10 seconds", "10 seconds"),
                    words.col("value")).count();
    StreamingQuery start = count.writeStream()
            .outputMode("update")
            .format("console").start();
    start.awaitTermination();
}
{code}
scala

 
{code:java}
 def main(args: Array[String]): Unit = {
val spark = SparkSession.builder.appName("test").
  master("local[*]").
  config("spark.sql.shuffle.partitions", 1)
  .getOrCreate
import spark.implicits._
val lines = spark.readStream.format("socket").
  option("host", "skynet").option("includeTimestamp", true).
  option("port", ).load
val words = lines.select("timestamp", "value")
val count = words.withWatermark("timestamp", "10 seconds").
  groupBy(window($"timestamp", "10 seconds", "10 seconds"), $"value")
  .count()
val start = count.writeStream.outputMode("update").format("console").start
start.awaitTermination()
  }
{code}
This follows the official documentation. With the Java version I found that the 
metric "stateOnCurrentVersionSizeBytes" always increases, but the Scala version 
is fine.

 

java

 
{code:java}
== Physical Plan ==
WriteToDataSourceV2 
org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@4176a001
+- *(4) HashAggregate(keys=[window#11, value#0], functions=[count(1)], 
output=[window#11, value#0, count#10L])
   +- StateStoreSave [window#11, value#0], state info [ checkpoint = 
file:/C:/Users/chenhao/AppData/Local/Temp/temporary-63acf9b1-9249-40db-ab33-9dcadf5736aa/state,
 runId = d38b8fee-6cd0-441c-87da-a4e3660856a3, opId = 0, ver = 5, numPartitions 
= 1], Update, 1579274372624, 2
  +- *(3) HashAggregate(keys=[window#11, value#0], 
functions=[merge_count(1)], output=[window#11, value#0, count#21L])
 +- StateStoreRestore [window#11, value#0], state info [ checkpoint = 
file:/C:/Users/chenhao/AppData/Local/Temp/temporary-63acf9b1-9249-40db-ab33-9dcadf5736aa/state,
 runId = d38b8fee-6cd0-441c-87da-a4e3660856a3, opId = 0, ver = 5, numPartitions 
= 1], 2
+- *(2) HashAggregate(keys=[window#11, value#0], 
functions=[merge_count(1)], output=[window#11, value#0, count#21L])
   +- Exchange hashpartitioning(window#11, value#0, 1)
  +- *(1) HashAggregate(keys=[window#11, value#0], 
functions=[partial_count(1)], output=[window#11, value#0, count#21L])
 +- *(1) Project [named_struct(start, 
precisetimestampconversion(CASE WHEN 
(cast(CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, 
LongType) - 0) as double) / 1.0E7)) as double) = 
(cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 0) as 
double) / 1.0E7)) THEN (CEIL((cast((precisetimestampconversion(timestamp#1, 
TimestampType, LongType) - 0) as double) / 1.0E7)) + 1) ELSE 
CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 
0) as double) / 1.0E7)) END + 0) - 1) * 1000) + 0), LongType, 
TimestampType), end, precisetimestampconversion(CASE WHEN 
(cast(CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, 
LongType) - 0) as double) / 1.0E7)) as double) = 
(cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 0) as 
double) / 1.0E7)) THEN (CEIL((cast((precisetimestampconversion(timestamp#1, 
TimestampType, LongType) - 0) as double) / 1.0E7)) + 1) ELSE 
CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 
0) as double) / 1.0E7)) END + 0) - 1) * 1000) + 1000), LongType, 
TimestampType)) AS window#11, value#0]
+- *(1) Filter isnotnull(timestamp#1)
   +- EventTimeWatermark timestamp#1: timestamp, 
interval 10 seconds
  +- LocalTableScan , [timestamp#1, value#0]

{code}
 

 

scala 

 

 
{code:java}
WriteToDataSourceV2 
org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@4149892c
+- *(4) 

[jira] [Updated] (SPARK-30553) structured-streaming documentation java watermark group by

2020-01-17 Thread bettermouse (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

bettermouse updated SPARK-30553:

Priority: Trivial  (was: Major)

> structured-streaming documentation  java   watermark group by
> -
>
> Key: SPARK-30553
> URL: https://issues.apache.org/jira/browse/SPARK-30553
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation
>Affects Versions: 2.4.4
>Reporter: bettermouse
>Priority: Trivial
>
> [http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#handling-late-data-and-watermarking]
> I wrote code following this in both Java and Scala.
> java
> {code:java}
> public static void main(String[] args) throws StreamingQueryException {
>     SparkSession spark = SparkSession.builder().appName("test").master("local[*]")
>             .config("spark.sql.shuffle.partitions", 1)
>             .getOrCreate();
>     Dataset<Row> lines = spark.readStream().format("socket")
>             .option("host", "skynet")
>             .option("includeTimestamp", true)
>             .option("port", ).load();
>     Dataset<Row> words = lines.select("timestamp", "value");
>     Dataset<Row> count = words.withWatermark("timestamp", "10 seconds")
>             .groupBy(functions.window(words.col("timestamp"), "10 seconds", "10 seconds"),
>                     words.col("value")).count();
>     StreamingQuery start = count.writeStream()
>             .outputMode("update")
>             .format("console").start();
>     start.awaitTermination();
> }
> {code}
> scala
>  
> {code:java}
>  def main(args: Array[String]): Unit = {
> val spark = SparkSession.builder.appName("test").
>   master("local[*]").
>   config("spark.sql.shuffle.partitions", 1)
>   .getOrCreate
> import spark.implicits._
> val lines = spark.readStream.format("socket").
>   option("host", "skynet").option("includeTimestamp", true).
>   option("port", ).load
> val words = lines.select("timestamp", "value")
> val count = words.withWatermark("timestamp", "10 seconds").
>   groupBy(window($"timestamp", "10 seconds", "10 seconds"), $"value")
>   .count()
> val start = count.writeStream.outputMode("update").format("console").start
> start.awaitTermination()
>   }
> {code}
> This follows the official documentation. With the Java version I found that 
> the metric "stateOnCurrentVersionSizeBytes" always increases, but the Scala 
> version is fine.
>  
> java
>  
> {code:java}
> == Physical Plan ==
> WriteToDataSourceV2 
> org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@4176a001
> +- *(4) HashAggregate(keys=[window#11, value#0], functions=[count(1)], 
> output=[window#11, value#0, count#10L])
>+- StateStoreSave [window#11, value#0], state info [ checkpoint = 
> file:/C:/Users/chenhao/AppData/Local/Temp/temporary-63acf9b1-9249-40db-ab33-9dcadf5736aa/state,
>  runId = d38b8fee-6cd0-441c-87da-a4e3660856a3, opId = 0, ver = 5, 
> numPartitions = 1], Update, 1579274372624, 2
>   +- *(3) HashAggregate(keys=[window#11, value#0], 
> functions=[merge_count(1)], output=[window#11, value#0, count#21L])
>  +- StateStoreRestore [window#11, value#0], state info [ checkpoint = 
> file:/C:/Users/chenhao/AppData/Local/Temp/temporary-63acf9b1-9249-40db-ab33-9dcadf5736aa/state,
>  runId = d38b8fee-6cd0-441c-87da-a4e3660856a3, opId = 0, ver = 5, 
> numPartitions = 1], 2
> +- *(2) HashAggregate(keys=[window#11, value#0], 
> functions=[merge_count(1)], output=[window#11, value#0, count#21L])
>+- Exchange hashpartitioning(window#11, value#0, 1)
>   +- *(1) HashAggregate(keys=[window#11, value#0], 
> functions=[partial_count(1)], output=[window#11, value#0, count#21L])
>  +- *(1) Project [named_struct(start, 
> precisetimestampconversion(CASE WHEN 
> (cast(CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, 
> LongType) - 0) as double) / 1.0E7)) as double) = 
> (cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 0) 
> as double) / 1.0E7)) THEN 
> (CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) 
> - 0) as double) / 1.0E7)) + 1) ELSE 
> CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) 
> - 0) as double) / 1.0E7)) END + 0) - 1) * 1000) + 0), LongType, 
> TimestampType), end, precisetimestampconversion(CASE WHEN 
> (cast(CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, 
> LongType) - 0) as double) / 1.0E7)) as double) = 
> (cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 0) 
> as double) / 1.0E7)) THEN 
> (CEIL((cast((precisetimestampconversion(timestamp#1, 

[jira] [Created] (SPARK-30553) structured-streaming documentation java watermark group by

2020-01-17 Thread bettermouse (Jira)
bettermouse created SPARK-30553:
---

 Summary: structured-streaming documentation  java   watermark 
group by
 Key: SPARK-30553
 URL: https://issues.apache.org/jira/browse/SPARK-30553
 Project: Spark
  Issue Type: Bug
  Components: Documentation
Affects Versions: 2.4.4
Reporter: bettermouse


[http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#handling-late-data-and-watermarking]

I wrote code following this in both Java and Scala.

java
{code:java}
public static void main(String[] args) throws StreamingQueryException {
    SparkSession spark = SparkSession.builder().appName("test").master("local[*]")
            .config("spark.sql.shuffle.partitions", 1)
            .getOrCreate();
    Dataset<Row> lines = spark.readStream().format("socket")
            .option("host", "skynet")
            .option("includeTimestamp", true)
            .option("port", ).load();
    Dataset<Row> words = lines.select("timestamp", "value");
    Dataset<Row> count = words.withWatermark("timestamp", "10 seconds")
            .groupBy(functions.window(words.col("timestamp"), "10 seconds", "10 seconds"),
                    words.col("value")).count();
    StreamingQuery start = count.writeStream()
            .outputMode("update")
            .format("console").start();
    start.awaitTermination();
}
{code}
scala

 
{code:java}
 def main(args: Array[String]): Unit = {
val spark = SparkSession.builder.appName("test").
  master("local[*]").
  config("spark.sql.shuffle.partitions", 1)
  .getOrCreate
import spark.implicits._
val lines = spark.readStream.format("socket").
  option("host", "skynet").option("includeTimestamp", true).
  option("port", ).load
val words = lines.select("timestamp", "value")
val count = words.withWatermark("timestamp", "10 seconds").
  groupBy(window($"timestamp", "10 seconds", "10 seconds"), $"value")
  .count()
val start = count.writeStream.outputMode("update").format("console").start
start.awaitTermination()
  }
{code}
This follows the official documentation. With the Java version I found that the 
metric "stateOnCurrentVersionSizeBytes" always increases, but the Scala version 
is fine.

 

java

 
{code:java}
== Physical Plan ==
WriteToDataSourceV2 
org.apache.spark.sql.execution.streaming.sources.MicroBatchWriter@4176a001
+- *(4) HashAggregate(keys=[window#11, value#0], functions=[count(1)], 
output=[window#11, value#0, count#10L])
   +- StateStoreSave [window#11, value#0], state info [ checkpoint = 
file:/C:/Users/chenhao/AppData/Local/Temp/temporary-63acf9b1-9249-40db-ab33-9dcadf5736aa/state,
 runId = d38b8fee-6cd0-441c-87da-a4e3660856a3, opId = 0, ver = 5, numPartitions 
= 1], Update, 1579274372624, 2
  +- *(3) HashAggregate(keys=[window#11, value#0], 
functions=[merge_count(1)], output=[window#11, value#0, count#21L])
 +- StateStoreRestore [window#11, value#0], state info [ checkpoint = 
file:/C:/Users/chenhao/AppData/Local/Temp/temporary-63acf9b1-9249-40db-ab33-9dcadf5736aa/state,
 runId = d38b8fee-6cd0-441c-87da-a4e3660856a3, opId = 0, ver = 5, numPartitions 
= 1], 2
+- *(2) HashAggregate(keys=[window#11, value#0], 
functions=[merge_count(1)], output=[window#11, value#0, count#21L])
   +- Exchange hashpartitioning(window#11, value#0, 1)
  +- *(1) HashAggregate(keys=[window#11, value#0], 
functions=[partial_count(1)], output=[window#11, value#0, count#21L])
 +- *(1) Project [named_struct(start, 
precisetimestampconversion(CASE WHEN 
(cast(CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, 
LongType) - 0) as double) / 1.0E7)) as double) = 
(cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 0) as 
double) / 1.0E7)) THEN (CEIL((cast((precisetimestampconversion(timestamp#1, 
TimestampType, LongType) - 0) as double) / 1.0E7)) + 1) ELSE 
CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 
0) as double) / 1.0E7)) END + 0) - 1) * 1000) + 0), LongType, 
TimestampType), end, precisetimestampconversion(CASE WHEN 
(cast(CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, 
LongType) - 0) as double) / 1.0E7)) as double) = 
(cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 0) as 
double) / 1.0E7)) THEN (CEIL((cast((precisetimestampconversion(timestamp#1, 
TimestampType, LongType) - 0) as double) / 1.0E7)) + 1) ELSE 
CEIL((cast((precisetimestampconversion(timestamp#1, TimestampType, LongType) - 
0) as double) / 1.0E7)) END + 0) - 1) * 1000) + 1000), LongType, 
TimestampType)) AS window#11, value#0]
+- *(1) Filter isnotnull(timestamp#1)
   +- EventTimeWatermark timestamp#1: timestamp, 
interval 10 seconds
  

[jira] [Assigned] (SPARK-30310) SparkUncaughtExceptionHandler halts running process unexpectedly

2020-01-17 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen reassigned SPARK-30310:


Assignee: Tin Hang To

> SparkUncaughtExceptionHandler halts running process unexpectedly
> 
>
> Key: SPARK-30310
> URL: https://issues.apache.org/jira/browse/SPARK-30310
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Tin Hang To
>Assignee: Tin Hang To
>Priority: Major
>
> During 2.4.x testing, we have seen many occasions where the Worker process
> just goes DEAD unexpectedly, with the Worker log ending with:
>  
> {{ERROR SparkUncaughtExceptionHandler: scala.MatchError:  <...callstack...>}}
>  
> We get the same callstack during our 2.3.x testing but the Worker process 
> stays up.
> Upon comparing the 2.4.x SparkUncaughtExceptionHandler.scala with the
> 2.3.x version, we found that SPARK-24294 introduced the following change:
> {{exception catch {}}
> {{  case _: OutOfMemoryError =>}}
> {{    System.exit(SparkExitCode.OOM)}}
> {{  case e: SparkFatalException if e.throwable.isInstanceOf[OutOfMemoryError] 
> =>}}
> {{    // SPARK-24294: This is defensive code, in case that 
> SparkFatalException is}}
> {{    // misused and uncaught.}}
> {{    System.exit(SparkExitCode.OOM)}}
> {{  case _ if exitOnUncaughtException =>}}
> {{    System.exit(SparkExitCode.UNCAUGHT_EXCEPTION)}}
> {{}}}
>  
> This code has the _ if exitOnUncaughtException case, but not the other _ 
> cases.  As a result, when exitOnUncaughtException is false (Master and 
> Worker) and exception doesn't match any of the match cases (e.g., 
> IllegalStateException), Scala throws MatchError(exception) ("MatchError" 
> wrapper of the original exception).  Then the other catch block down below 
> thinks we have another uncaught exception, and halts the entire process with 
> SparkExitCode.UNCAUGHT_EXCEPTION_TWICE.
>  
> {{catch {}}
> {{  case oom: OutOfMemoryError => Runtime.getRuntime.halt(SparkExitCode.OOM)}}
> {{  case t: Throwable => 
> Runtime.getRuntime.halt(SparkExitCode.UNCAUGHT_EXCEPTION_TWICE)}}
> {{}}}
>  
> Therefore, even when exitOnUncaughtException is false, the process will halt.
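
For context, below is a minimal sketch of an exhaustive match that avoids the MatchError described above. It is an editor's illustration only, not the patch that was actually merged; the names exception, exitOnUncaughtException and SparkExitCode are taken from the snippet quoted in the description.

{code:scala}
// Hedged sketch: a catch-all case keeps a non-matching exception (for example
// IllegalStateException) from escaping as scala.MatchError and tripping the
// UNCAUGHT_EXCEPTION_TWICE path in the outer catch block.
exception match {
  case _: OutOfMemoryError =>
    System.exit(SparkExitCode.OOM)
  case e: SparkFatalException if e.throwable.isInstanceOf[OutOfMemoryError] =>
    System.exit(SparkExitCode.OOM)
  case _ if exitOnUncaughtException =>
    System.exit(SparkExitCode.UNCAUGHT_EXCEPTION)
  case _ =>
    // exitOnUncaughtException is false (Master/Worker): log and keep the
    // process alive instead of halting it.
    ()
}
{code}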



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30310) SparkUncaughtExceptionHandler halts running process unexpectedly

2020-01-17 Thread Sean R. Owen (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean R. Owen resolved SPARK-30310.
--
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26955
[https://github.com/apache/spark/pull/26955]

> SparkUncaughtExceptionHandler halts running process unexpectedly
> 
>
> Key: SPARK-30310
> URL: https://issues.apache.org/jira/browse/SPARK-30310
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0, 3.0.0
>Reporter: Tin Hang To
>Assignee: Tin Hang To
>Priority: Major
> Fix For: 3.0.0
>
>
> During 2.4.x testing, we have seen many occasions where the Worker process
> just goes DEAD unexpectedly, with the Worker log ending with:
>  
> {{ERROR SparkUncaughtExceptionHandler: scala.MatchError:  <...callstack...>}}
>  
> We get the same callstack during our 2.3.x testing but the Worker process 
> stays up.
> Upon comparing the 2.4.x SparkUncaughtExceptionHandler.scala with the
> 2.3.x version, we found that SPARK-24294 introduced the following change:
> {{exception catch {}}
> {{  case _: OutOfMemoryError =>}}
> {{    System.exit(SparkExitCode.OOM)}}
> {{  case e: SparkFatalException if e.throwable.isInstanceOf[OutOfMemoryError] 
> =>}}
> {{    // SPARK-24294: This is defensive code, in case that 
> SparkFatalException is}}
> {{    // misused and uncaught.}}
> {{    System.exit(SparkExitCode.OOM)}}
> {{  case _ if exitOnUncaughtException =>}}
> {{    System.exit(SparkExitCode.UNCAUGHT_EXCEPTION)}}
> {{}}}
>  
> This code has the _ if exitOnUncaughtException case, but not the other _ 
> cases.  As a result, when exitOnUncaughtException is false (Master and 
> Worker) and exception doesn't match any of the match cases (e.g., 
> IllegalStateException), Scala throws MatchError(exception) ("MatchError" 
> wrapper of the original exception).  Then the other catch block down below 
> thinks we have another uncaught exception, and halts the entire process with 
> SparkExitCode.UNCAUGHT_EXCEPTION_TWICE.
>  
> {{catch {}}
> {{  case oom: OutOfMemoryError => Runtime.getRuntime.halt(SparkExitCode.OOM)}}
> {{  case t: Throwable => 
> Runtime.getRuntime.halt(SparkExitCode.UNCAUGHT_EXCEPTION_TWICE)}}
> {{}}}
>  
> Therefore, even when exitOnUncaughtException is false, the process will halt.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-27249) Developers API for Transformers beyond UnaryTransformer

2020-01-17 Thread Nick Afshartous (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-27249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17018105#comment-17018105
 ] 

Nick Afshartous commented on SPARK-27249:
-

Thanks Everett. Could someone with permission assign this ticket to me?

> Developers API for Transformers beyond UnaryTransformer
> ---
>
> Key: SPARK-27249
> URL: https://issues.apache.org/jira/browse/SPARK-27249
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Everett Rush
>Priority: Minor
>  Labels: starter
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> It would be nice to have a developers' API for dataset transformations that
> need more than one column from a row (i.e., UnaryTransformer inputs one column
> and outputs one column) or that involve objects too expensive to initialize
> repeatedly in a UDF, such as a database connection.
>  
> Design:
> Abstract class PartitionTransformer extends Transformer and defines the 
> partition transformation function as Iterator[Row] => Iterator[Row]
> NB: This parallels the UnaryTransformer createTransformFunc method
>  
> When developers subclass this transformer, they can provide their own schema
> for the output Row, in which case the PartitionTransformer creates a row
> encoder and executes the transformation. Alternatively, the developer can set
> the output DataType and output column name; the PartitionTransformer class will
> then create a new schema and a row encoder, and execute the transformation.
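
Below is a rough sketch of how the PartitionTransformer idea described above could look. This is an editor's illustration of the stated design, not an agreed API: the class name, method names and use of RowEncoder are assumptions.

{code:scala}
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types.StructType

// Hypothetical developer API: subclasses supply a partition-level function
// (parallel to UnaryTransformer.createTransformFunc) plus the schema of the
// rows that function produces.
abstract class PartitionTransformer(override val uid: String) extends Transformer {

  def this() = this(Identifiable.randomUID("partitionTransformer"))

  // Transformation applied once per partition, so expensive resources such as
  // database connections can be initialized once per partition.
  protected def createPartitionFunc: Iterator[Row] => Iterator[Row]

  // Schema of the rows emitted by createPartitionFunc.
  protected def outputSchema: StructType

  override def transformSchema(schema: StructType): StructType = outputSchema

  override def transform(dataset: Dataset[_]): DataFrame = {
    // Build a row encoder for the declared output schema and run the
    // partition-level function over the input rows.
    val encoder = RowEncoder(outputSchema)
    dataset.toDF().mapPartitions(createPartitionFunc)(encoder)
  }

  override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
}
{code}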



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30552) Chained spark column expressions with distinct windows specs produce inefficient DAG

2020-01-17 Thread Franz (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Franz updated SPARK-30552:
--
Environment: 
python : 3.6.9.final.0
 python-bits : 64
 OS : Windows
 OS-release : 10
 machine : AMD64
 processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel



pyspark: 2.4.4

pandas : 0.25.3
 numpy : 1.17.4

pyarrow : 0.15.1

  was:
INSTALLED VERSIONS
--
commit : None
python : 3.6.9.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : de_DE.UTF-8
LOCALE : None.None

pandas : 0.25.3
numpy : 1.17.4
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 41.6.0.post20191030
Cython : None
pytest : 5.3.0
hypothesis : None
sphinx : 2.2.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None


> Chained spark column expressions with distinct windows specs produce 
> inefficient DAG
> 
>
> Key: SPARK-30552
> URL: https://issues.apache.org/jira/browse/SPARK-30552
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.4.4
> Environment: python : 3.6.9.final.0
>  python-bits : 64
>  OS : Windows
>  OS-release : 10
>  machine : AMD64
>  processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
> pyspark: 2.4.4
> pandas : 0.25.3
>  numpy : 1.17.4
> pyarrow : 0.15.1
>Reporter: Franz
>Priority: Major
>
> h2.  Context
> Let's say you deal with time series data. Your desired outcome relies on 
> multiple window functions with distinct window specifications. The result may 
> resemble a single spark column expression, like an identifier for intervals.
> h2. Status Quo
> Usually, I don't store intermediate results with `df.withColumn` but rather 
> chain/stack column expressions and trust Spark to find the most effective DAG 
> (when dealing with DataFrame).
> h2. Reproducible example
> However, in the following example (PySpark 2.4.4 standalone), storing an 
> intermediate result with `df.withColumn` reduces the DAG complexity. Let's 
> consider following test setup:
> {code:python}
> import pandas as pd
> import numpy as np
> from pyspark.sql import SparkSession, Window
> from pyspark.sql import functions as F
> spark = SparkSession.builder.getOrCreate()
> dfp = pd.DataFrame(
> {
> "col1": np.random.randint(0, 5, size=100),
> "col2": np.random.randint(0, 5, size=100),
> "col3": np.random.randint(0, 5, size=100),
> "col4": np.random.randint(0, 5, size=100),
> }
> )
> df = spark.createDataFrame(dfp)
> df.show(5)
> +++++
> |col1|col2|col3|col4|
> +++++
> |   1|   2|   4|   1|
> |   0|   2|   3|   0|
> |   2|   0|   1|   0|
> |   4|   1|   1|   2|
> |   1|   3|   0|   4|
> +++++
> only showing top 5 rows
> {code}
> The computation is arbitrary. Basically we have 2 window specs and 3 
> computational steps. The 3 computational steps are dependent on each other 
> and use alternating window specs:
> {code:python}
> w1 = Window.partitionBy("col1").orderBy("col2")
> w2 = Window.partitionBy("col3").orderBy("col4")
> # first step, arbitrary window func over 1st window
> step1 = F.lag("col3").over(w1)
> # second step, arbitrary window func over 2nd window with step 1
> step2 = F.lag(step1).over(w2)
> # third step, arbitrary window func over 1st window with step 2
> step3 = F.when(step2 > 1, F.max(step2).over(w1))
> df_result = df.withColumn("result", step3)
> {code}
> Inspecting the physical plan via `df_result.explain()` reveals 4 exchanges 
> and sorts! However, only 3 should be necessary here because we change the 
> window spec only twice. 
> {code:python}
> df_result.explain()
> == Physical Plan ==
> *(7) Project [col1#0L, col2#1L, col3#2L, col4#3L, CASE WHEN (_we0#25L > 1) 
> THEN _we1#26L END AS result#22L]
> +- Window [lag(_w0#23L, 1, null) windowspecdefinition(col3#2L, col4#3L ASC 
> NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS _we0#25L], [col3#2L], 
> [col4#3L ASC NULLS FIRST]
>+- *(6) Sort [col3#2L ASC NULLS FIRST, col4#3L ASC NULLS FIRST], false, 0
>   +- Exchange hashpartitioning(col3#2L, 200)
>  +- *(5) Project [col1#0L, col2#1L, col3#2L, col4#3L, _w0#23L, 
> _we1#26L]
> +- Window [max(_w1#24L) 

[jira] [Created] (SPARK-30552) Chained spark column expressions with distinct windows specs produce inefficient DAG

2020-01-17 Thread Franz (Jira)
Franz created SPARK-30552:
-

 Summary: Chained spark column expressions with distinct windows 
specs produce inefficient DAG
 Key: SPARK-30552
 URL: https://issues.apache.org/jira/browse/SPARK-30552
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core
Affects Versions: 2.4.4
 Environment: INSTALLED VERSIONS
--
commit : None
python : 3.6.9.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : de_DE.UTF-8
LOCALE : None.None

pandas : 0.25.3
numpy : 1.17.4
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 41.6.0.post20191030
Cython : None
pytest : 5.3.0
hypothesis : None
sphinx : 2.2.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.11.1
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
Reporter: Franz


h2.  Context
Let's say you deal with time series data. Your desired outcome relies on 
multiple window functions with distinct window specifications. The result may 
resemble a single spark column expression, like an identifier for intervals.

h2. Status Quo
Usually, I don't store intermediate results with `df.withColumn` but rather 
chain/stack column expressions and trust Spark to find the most effective DAG 
(when dealing with DataFrame).

h2. Reproducible example
However, in the following example (PySpark 2.4.4 standalone), storing an 
intermediate result with `df.withColumn` reduces the DAG complexity. Let's 
consider following test setup:

{code:python}
import pandas as pd
import numpy as np

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

dfp = pd.DataFrame(
{
"col1": np.random.randint(0, 5, size=100),
"col2": np.random.randint(0, 5, size=100),
"col3": np.random.randint(0, 5, size=100),
"col4": np.random.randint(0, 5, size=100),
}
)

df = spark.createDataFrame(dfp)
df.show(5)


+++++
|col1|col2|col3|col4|
+++++
|   1|   2|   4|   1|
|   0|   2|   3|   0|
|   2|   0|   1|   0|
|   4|   1|   1|   2|
|   1|   3|   0|   4|
+++++
only showing top 5 rows
{code}



The computation is arbitrary. Basically we have 2 window specs and 3 
computational steps. The 3 computational steps are dependent on each other and 
use alternating window specs:


{code:python}
w1 = Window.partitionBy("col1").orderBy("col2")
w2 = Window.partitionBy("col3").orderBy("col4")

# first step, arbitrary window func over 1st window
step1 = F.lag("col3").over(w1)

# second step, arbitrary window func over 2nd window with step 1
step2 = F.lag(step1).over(w2)

# third step, arbitrary window func over 1st window with step 2
step3 = F.when(step2 > 1, F.max(step2).over(w1))

df_result = df.withColumn("result", step3)
{code}


Inspecting the physical plan via `df_result.explain()` reveals 4 exchanges and 
sorts! However, only 3 should be necessary here because we change the window 
spec only twice. 

{code:python}
df_result.explain()

== Physical Plan ==
*(7) Project [col1#0L, col2#1L, col3#2L, col4#3L, CASE WHEN (_we0#25L > 1) THEN 
_we1#26L END AS result#22L]
+- Window [lag(_w0#23L, 1, null) windowspecdefinition(col3#2L, col4#3L ASC 
NULLS FIRST, specifiedwindowframe(RowFrame, -1, -1)) AS _we0#25L], [col3#2L], 
[col4#3L ASC NULLS FIRST]
   +- *(6) Sort [col3#2L ASC NULLS FIRST, col4#3L ASC NULLS FIRST], false, 0
  +- Exchange hashpartitioning(col3#2L, 200)
 +- *(5) Project [col1#0L, col2#1L, col3#2L, col4#3L, _w0#23L, _we1#26L]
+- Window [max(_w1#24L) windowspecdefinition(col1#0L, col2#1L ASC 
NULLS FIRST, specifiedwindowframe(RangeFrame, unboundedpreceding$(), 
currentrow$())) AS _we1#26L], [col1#0L], [col2#1L ASC NULLS FIRST]
   +- *(4) Sort [col1#0L ASC NULLS FIRST, col2#1L ASC NULLS FIRST], 
false, 0
  +- Exchange hashpartitioning(col1#0L, 200)
 +- *(3) Project [col1#0L, col2#1L, col3#2L, col4#3L, 
_w0#23L, _w1#24L]
+- Window [lag(_w0#27L, 1, null) 
windowspecdefinition(col3#2L, col4#3L ASC NULLS FIRST, 
specifiedwindowframe(RowFrame, -1, -1)) AS _w1#24L], [col3#2L], [col4#3L ASC 
NULLS FIRST]
   +- *(2) Sort [col3#2L ASC NULLS FIRST, col4#3L ASC 
NULLS FIRST], false, 0
  +- Exchange hashpartitioning(col3#2L, 200)
 +- Window [lag(col3#2L, 1, 

[jira] [Resolved] (SPARK-29306) Executors need to track what ResourceProfile they are created with

2020-01-17 Thread Thomas Graves (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-29306.
---
Fix Version/s: 3.0.0
   Resolution: Fixed

> Executors need to track what ResourceProfile they are created with 
> ---
>
> Key: SPARK-29306
> URL: https://issues.apache.org/jira/browse/SPARK-29306
> Project: Spark
>  Issue Type: Story
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Major
> Fix For: 3.0.0
>
>
> For stage level scheduling, the Executors need to report what ResourceProfile 
> they are created with so that the ExecutorMonitor can track them and the 
> ExecutorAllocationManager can use that information to know how many to 
> request, etc.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-30542) Two Spark structured streaming jobs cannot write to same base path

2020-01-17 Thread Sivakumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017820#comment-17017820
 ] 

Sivakumar edited comment on SPARK-30542 at 1/17/20 12:40 PM:
-

Hi Jungtaek,

I thought this might be a feature that should be added to Structured Streaming.

Earlier, with Spark DStreams, two jobs could share the same base path.

But with Spark Structured Streaming I don't have that flexibility. I think this
is a feature that Structured Streaming should support.

Also, please let me know if you have any workaround for this.


was (Author: sparksiva):
Earlier with Spark Dstreams two jobs can have a same base path.

But with Spark structured streaming I don't have that flexibility. I guess this 
should be a feature that structured streaming should support.

> Two Spark structured streaming jobs cannot write to same base path
> --
>
> Key: SPARK-30542
> URL: https://issues.apache.org/jira/browse/SPARK-30542
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Sivakumar
>Priority: Major
>
> Hi All,
> Spark Structured Streaming doesn't allow two structured streaming jobs to
> write data to the same base directory, which is possible with DStreams.
> Because the _spark_metadata directory is created by default for one job, the
> second job cannot use the same directory as its base path; since the
> _spark_metadata directory has already been created by the other job, an
> exception is thrown.
> Is there any workaround for this, other than creating separate base paths
> for both jobs?
> Is it possible to create the _spark_metadata directory elsewhere, or to
> disable it without any data loss?
> If I had to change the base path for both jobs, my whole framework
> would be impacted, so I don't want to do that.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-30542) Two Spark structured streaming jobs cannot write to same base path

2020-01-17 Thread Sivakumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sivakumar updated SPARK-30542:
--
Comment: was deleted

(was: Hi Jungtaek,

I thought this might be a feature that should be added to structured streaming. 
Also Please lemme know If you have any work around for this.)

> Two Spark structured streaming jobs cannot write to same base path
> --
>
> Key: SPARK-30542
> URL: https://issues.apache.org/jira/browse/SPARK-30542
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Sivakumar
>Priority: Major
>
> Hi All,
> Spark Structured Streaming doesn't allow two structured streaming jobs to
> write data to the same base directory, which is possible with DStreams.
> Because the _spark_metadata directory is created by default for one job, the
> second job cannot use the same directory as its base path; since the
> _spark_metadata directory has already been created by the other job, an
> exception is thrown.
> Is there any workaround for this, other than creating separate base paths
> for both jobs?
> Is it possible to create the _spark_metadata directory elsewhere, or to
> disable it without any data loss?
> If I had to change the base path for both jobs, my whole framework
> would be impacted, so I don't want to do that.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-30542) Two Spark structured streaming jobs cannot write to same base path

2020-01-17 Thread Sivakumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017967#comment-17017967
 ] 

Sivakumar edited comment on SPARK-30542 at 1/17/20 12:40 PM:
-

Hi Jungtaek,

I thought this might be a feature that should be added to Structured Streaming.
Also, please let me know if you have any workaround for this.


was (Author: sparksiva):
Hi Jungtaek,

I thought this might be a feature that should be added to structured streaming. 
Also Please lemme know If you have any work around for this.

> Two Spark structured streaming jobs cannot write to same base path
> --
>
> Key: SPARK-30542
> URL: https://issues.apache.org/jira/browse/SPARK-30542
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Sivakumar
>Priority: Major
>
> Hi All,
> Spark Structured Streaming doesn't allow two structured streaming jobs to
> write data to the same base directory, which is possible with DStreams.
> Because the _spark_metadata directory is created by default for one job, the
> second job cannot use the same directory as its base path; since the
> _spark_metadata directory has already been created by the other job, an
> exception is thrown.
> Is there any workaround for this, other than creating separate base paths
> for both jobs?
> Is it possible to create the _spark_metadata directory elsewhere, or to
> disable it without any data loss?
> If I had to change the base path for both jobs, my whole framework
> would be impacted, so I don't want to do that.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30542) Two Spark structured streaming jobs cannot write to same base path

2020-01-17 Thread Sivakumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017967#comment-17017967
 ] 

Sivakumar commented on SPARK-30542:
---

Hi Jungtaek,

I thought this might be a feature that should be added to Structured Streaming.
Also, please let me know if you have any workaround for this.

> Two Spark structured streaming jobs cannot write to same base path
> --
>
> Key: SPARK-30542
> URL: https://issues.apache.org/jira/browse/SPARK-30542
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Sivakumar
>Priority: Major
>
> Hi All,
> Spark Structured Streaming doesn't allow two structured streaming jobs to
> write data to the same base directory, which is possible with DStreams.
> Because the _spark_metadata directory is created by default for one job, the
> second job cannot use the same directory as its base path; since the
> _spark_metadata directory has already been created by the other job, an
> exception is thrown.
> Is there any workaround for this, other than creating separate base paths
> for both jobs?
> Is it possible to create the _spark_metadata directory elsewhere, or to
> disable it without any data loss?
> If I had to change the base path for both jobs, my whole framework
> would be impacted, so I don't want to do that.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30282) Migrate SHOW TBLPROPERTIES to new framework

2020-01-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-30282:

Summary: Migrate SHOW TBLPROPERTIES to new framework  (was: 
UnresolvedV2Relation should be resolved to temp view first)

> Migrate SHOW TBLPROPERTIES to new framework
> ---
>
> Key: SPARK-30282
> URL: https://issues.apache.org/jira/browse/SPARK-30282
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
> Fix For: 3.0.0
>
>
> For the following v2 commands, _Analyzer.ResolveTables_ does not check 
> against the temp views before resolving _UnresolvedV2Relation_, thus it 
> always resolves _UnresolvedV2Relation_ to a table:
>  * ALTER TABLE
>  * DESCRIBE TABLE
>  * SHOW TBLPROPERTIES
> Thus, in the following example, 't' will be resolved to a table, not a temp 
> view:
> {code:java}
> sql("CREATE TEMPORARY VIEW t AS SELECT 2 AS i")
> sql("CREATE TABLE testcat.ns.t USING csv AS SELECT 1 AS i")
> sql("USE testcat.ns")
> sql("SHOW TBLPROPERTIES t") // 't' is resolved to a table
> {code}
> For V2 commands, if a table is resolved to a temp view, it should error out 
> with a message that v2 command cannot handle temp views.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29299) Intermittently getting "Cannot create the managed table error" while creating table from spark 2.4

2020-01-17 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017932#comment-17017932
 ] 

Steve Loughran commented on SPARK-29299:


Have you tried using the s3 optimised committer in EMR?

It still materializes files in the destination on task commit, so it potentially
retains the issue - I'd like to know whether it does.
Thanks

> Intermittently getting "Cannot create the managed table error" while creating 
> table from spark 2.4
> --
>
> Key: SPARK-29299
> URL: https://issues.apache.org/jira/browse/SPARK-29299
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Abhijeet
>Priority: Major
>
> We are intermittently facing the below error in Spark 2.4 when saving a managed
> table from Spark.
> Error -
>  pyspark.sql.utils.AnalysisException: u"Can not create the managed 
> table('`hive_issue`.`table`'). The associated 
> location('s3://\{bucket_name}/EMRFS_WARE_TEST167_new/warehouse/hive_issue.db/table')
>  already exists.;"
> Steps to reproduce--
>  1. Create dataframe from spark mid size data (30MB CSV file)
>  2. Save dataframe as a table
>  3. Terminate the session when above mentioned operation is in progress
> Note--
>  Session termination is just a way to reproduce this issue. In real time we 
> are facing this issue intermittently when we are running same spark jobs 
> multiple times. We use EMRFS and HDFS from EMR cluster and we face the same 
> issue on both of the systems.
>  The only way we can fix this is by deleting the target folder where the table
> keeps its files, which is not an option for us since we need to keep historical
> information in the table; hence we use APPEND mode while writing to the table.
> Sample code--
>  from pyspark.sql import SparkSession
>  sc = SparkSession.builder.enableHiveSupport().getOrCreate()
>  df = sc.read.csv("s3://\{sample-bucket}1/DATA/consumecomplians.csv")
>  print "STARTED WRITING TO TABLE"
>  # Terminate session using ctrl + c after this statement post df.write action 
> started
>  df.write.mode("append").saveAsTable("hive_issue.table")
>  print "COMPLETED WRITING TO TABLE"
> We went through the documentation of Spark 2.4 [1] and found that Spark no
> longer allows creating managed tables on non-empty folders.
> 1. Is there any reason behind this change in Spark behavior?
>  2. To us it looks like a breaking change, as despite specifying the "overwrite"
> option Spark is unable to wipe out existing data and create the tables.
>  3. Do we have any solution for this issue other than setting the
> "spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation" flag?
> [1]
>  [https://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html]
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30393) Too much ProvisionedThroughputExceededException while recover from checkpoint

2020-01-17 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017930#comment-17017930
 ] 

Steve Loughran commented on SPARK-30393:


{{ProvisionedThroughputExceededException}} means your client(s) are making more 
requests per second than that AWS Endpoint permits.

Applications are expected to recognise this and perform some kind of
exponential backoff. It looks suspiciously like the spark-kinesis module is not
doing this.

If this is the case, I recommend doing so (it's what s3A does for S3 503's and 
DDB ProvisionedThroughput). See 
https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AUtils.java#L390
 for the code to identify the problem; retries are straightforward as you can 
be confident the request was not processed.

In the meantime, reduce the number of workers trying to talk to that particular
stream. AWS endpoint throttling means that their scalability can be sub-linear.

Side issue: EMR's Spark is AWS's own fork of Spark, so you probably need to talk
to them.
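
As a generic illustration of the exponential backoff being suggested here, the sketch below shows the pattern in isolation. It is an editor's example only and is not taken from spark-streaming-kinesis-asl or the S3A code linked above; the helper name and parameters are assumptions.

{code:scala}
import scala.annotation.tailrec
import scala.util.{Failure, Random, Success, Try}

// Retry `op` with exponential backoff plus jitter; `isThrottling` decides which
// exceptions (e.g. a throughput-exceeded error) are worth retrying.
def retryWithBackoff[T](maxRetries: Int, initialDelayMs: Long, maxDelayMs: Long)
                       (isThrottling: Throwable => Boolean)(op: => T): T = {
  @tailrec
  def loop(attempt: Int, delayMs: Long): T = Try(op) match {
    case Success(value) => value
    case Failure(e) if attempt < maxRetries && isThrottling(e) =>
      // Sleep for the current delay plus up to 50% jitter, then double the delay.
      val jitter = Random.nextInt((delayMs / 2 + 1).toInt)
      Thread.sleep(delayMs + jitter)
      loop(attempt + 1, math.min(delayMs * 2, maxDelayMs))
    case Failure(e) => throw e
  }
  loop(0, initialDelayMs)
}
{code}

A caller would wrap each shard-iterator or getRecords request in retryWithBackoff so that throttled requests back off instead of failing the task after a fixed number of immediate retries.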

> Too much ProvisionedThroughputExceededException while recover from checkpoint
> -
>
> Key: SPARK-30393
> URL: https://issues.apache.org/jira/browse/SPARK-30393
> Project: Spark
>  Issue Type: Question
>  Components: DStreams
>Affects Versions: 2.4.3
> Environment: I am using EMR 5.25.0, Spark 2.4.3, 
> spark-streaming-kinesis-asl 2.4.3 I have 6 r5.4xLarge in my cluster, plenty 
> of memory. 6 kinesis shards, I even increased to 12 shards but still see the 
> kinesis error
>Reporter: Stephen
>Priority: Major
> Attachments: kinesisexceedreadlimit.png, 
> kinesisusagewhilecheckpointrecoveryerror.png, 
> sparkuiwhilecheckpointrecoveryerror.png
>
>
> I have a spark application which consume from Kinesis with 6 shards. Data was 
> produced to Kinesis at at most 2000 records/second. At non peak time data 
> only comes in at 200 records/second. Each record is 0.5K Bytes. So 6 shards 
> is enough to handle that.
> I use reduceByKeyAndWindow and mapWithState in the program and the sliding 
> window is one hour long.
> Recently I am trying to checkpoint the application to S3. I am testing this 
> at nonpeak time so the data incoming rate is very low like 200 records/sec. I 
> run the Spark application by creating new context, checkpoint is created at 
> s3, but when I kill the app and restarts, it failed to recover from 
> checkpoint, and the error message is the following and my SparkUI shows all 
> the batches are stucked, and it takes a long time for the checkpoint recovery 
> to complete, 15 minutes to over an hour.
> I found lots of error message in the log related to Kinesis exceeding read 
> limit:
> {quote}19/12/24 00:15:21 WARN TaskSetManager: Lost task 571.0 in stage 33.0 
> (TID 4452, ip-172-17-32-11.ec2.internal, executor 9): 
> org.apache.spark.SparkException: Gave up after 3 retries while getting shard 
> iterator from sequence number 
> 49601654074184110438492229476281538439036626028298502210, last exception:
> bq. at 
> org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator$$anonfun$retryOrTimeout$2.apply(KinesisBackedBlockRDD.scala:288)
> bq. at scala.Option.getOrElse(Option.scala:121)
> bq. at 
> org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.retryOrTimeout(KinesisBackedBlockRDD.scala:282)
> bq. at 
> org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getKinesisIterator(KinesisBackedBlockRDD.scala:246)
> bq. at 
> org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getRecords(KinesisBackedBlockRDD.scala:206)
> bq. at 
> org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getNext(KinesisBackedBlockRDD.scala:162)
> bq. at 
> org.apache.spark.streaming.kinesis.KinesisSequenceRangeIterator.getNext(KinesisBackedBlockRDD.scala:133)
> bq. at 
> org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
> bq. at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> bq. at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
> bq. at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> bq. at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:462)
> bq. at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
> bq. at 
> org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:187)
> bq. at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
> bq. at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
> bq. at org.apache.spark.scheduler.Task.run(Task.scala:121)
> bq. 

[jira] [Commented] (SPARK-30460) Spark checkpoint failing after some run with S3 path

2020-01-17 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017928#comment-17017928
 ] 

Steve Loughran commented on SPARK-30460:


bq. I understand spark 3.0 has new committer but as you said it is not deeply 
tested. 

well, been tested and shipping with Hortonworks and Cloudera releases of Spark 
for a while

but: 
# doesn't do stream checkpoints. Nobody has looked at that, though it's 
something I'd like to see.
# you are seeing a stack trace on EMR; they have their own reimplementation of 
the committer; it may be different.


bq. I had a long discussion with AWS folks but they are asking to report this 
to open source to verify it

That stack trace shows it's their S3 connector which is rejecting the request
- but the S3A one is going to reject it in exactly the same way.

You're going to need a way to checkpoint that does not use append. Sorry.


> Spark checkpoint failing after some run with S3 path 
> -
>
> Key: SPARK-30460
> URL: https://issues.apache.org/jira/browse/SPARK-30460
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.4.4
>Reporter: Sachin Pasalkar
>Priority: Major
>
> We are using EMR with the SQS as source of stream. However it is failing, 
> after 4-6 hours of run, with below exception. Application shows its running 
> but stops the processing the messages
> {code:java}
> 2020-01-06 13:04:10,548 WARN [BatchedWriteAheadLog Writer] 
> org.apache.spark.streaming.util.BatchedWriteAheadLog:BatchedWriteAheadLog 
> Writer failed to write ArrayBuffer(Record(java.nio.HeapByteBuffer[pos=0 
> lim=1226 cap=1226],1578315850302,Future()))
> java.lang.UnsupportedOperationException
>   at 
> com.amazon.ws.emr.hadoop.fs.s3n2.S3NativeFileSystem2.append(S3NativeFileSystem2.java:150)
>   at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1181)
>   at 
> com.amazon.ws.emr.hadoop.fs.EmrFileSystem.append(EmrFileSystem.java:295)
>   at 
> org.apache.spark.streaming.util.HdfsUtils$.getOutputStream(HdfsUtils.scala:35)
>   at 
> org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.stream$lzycompute(FileBasedWriteAheadLogWriter.scala:32)
>   at 
> org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.stream(FileBasedWriteAheadLogWriter.scala:32)
>   at 
> org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter.(FileBasedWriteAheadLogWriter.scala:35)
>   at 
> org.apache.spark.streaming.util.FileBasedWriteAheadLog.getLogWriter(FileBasedWriteAheadLog.scala:229)
>   at 
> org.apache.spark.streaming.util.FileBasedWriteAheadLog.write(FileBasedWriteAheadLog.scala:94)
>   at 
> org.apache.spark.streaming.util.FileBasedWriteAheadLog.write(FileBasedWriteAheadLog.scala:50)
>   at 
> org.apache.spark.streaming.util.BatchedWriteAheadLog.org$apache$spark$streaming$util$BatchedWriteAheadLog$$flushRecords(BatchedWriteAheadLog.scala:175)
>   at 
> org.apache.spark.streaming.util.BatchedWriteAheadLog$$anon$1.run(BatchedWriteAheadLog.scala:142)
>   at java.lang.Thread.run(Thread.java:748)
> 2020-01-06 13:04:10,554 WARN [wal-batching-thread-pool-0] 
> org.apache.spark.streaming.scheduler.ReceivedBlockTracker:Exception thrown 
> while writing record: 
> BlockAdditionEvent(ReceivedBlockInfo(0,Some(3),None,WriteAheadLogBasedStoreResult(input-0-1578315849800,Some(3),FileBasedWriteAheadLogSegment(s3://mss-prod-us-east-1-ueba-bucket/streaming/checkpoint/receivedData/0/log-1578315850001-1578315910001,0,5175
>  to the WriteAheadLog.
> org.apache.spark.SparkException: Exception thrown in awaitResult: 
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
>   at 
> org.apache.spark.streaming.util.BatchedWriteAheadLog.write(BatchedWriteAheadLog.scala:84)
>   at 
> org.apache.spark.streaming.scheduler.ReceivedBlockTracker.writeToLog(ReceivedBlockTracker.scala:242)
>   at 
> org.apache.spark.streaming.scheduler.ReceivedBlockTracker.addBlock(ReceivedBlockTracker.scala:89)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker.org$apache$spark$streaming$scheduler$ReceiverTracker$$addBlock(ReceiverTracker.scala:347)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$receiveAndReply$1$$anon$1$$anonfun$run$1.apply$mcV$sp(ReceiverTracker.scala:522)
>   at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1340)
>   at 
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverTrackerEndpoint$$anonfun$receiveAndReply$1$$anon$1.run(ReceiverTracker.scala:520)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at 

[jira] [Created] (SPARK-30551) Disable comparison for interval type

2020-01-17 Thread Kent Yao (Jira)
Kent Yao created SPARK-30551:


 Summary: Disable comparison for interval type
 Key: SPARK-30551
 URL: https://issues.apache.org/jira/browse/SPARK-30551
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Kent Yao


As we are not going to follow ANSI, it is odd to compare the year-month part
to the day-time part in our current implementation of interval. Additionally,
the current ordering logic comes from PostgreSQL, where the implementation of
interval is messy, and we are not aiming for PostgreSQL compliance at all.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30550) Random pyspark-shell applications being generated

2020-01-17 Thread Ram (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ram updated SPARK-30550:

Attachment: Screenshot from 2020-01-17 15-43-33.png

> Random pyspark-shell applications being generated
> -
>
> Key: SPARK-30550
> URL: https://issues.apache.org/jira/browse/SPARK-30550
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Shell, Spark Submit, Web UI
>Affects Versions: 2.3.2, 2.4.4
>Reporter: Ram
>Priority: Major
> Attachments: Screenshot from 2020-01-17 15-43-33.png
>
>
>  
> When we submit a particular Spark job, this happens. We are not sure where
> these pyspark-shell applications get generated from, but they persist for
> about 5 seconds and then get killed. We're not able to figure out why this happens.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30550) Random pyspark-shell applications being generated

2020-01-17 Thread Ram (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ram updated SPARK-30550:

Description: 
 

When we submit a particular Spark job, this happens. We are not sure where these
pyspark-shell applications get generated from, but they persist for about 5
seconds and then get killed. We're not able to figure out why this happens.

  was:
!image-2020-01-17-15-40-12-899.png!

When we submit a particular spark job, this happens. Not sure from where these 
pyspark-shell applications get generated, but they persist for like 5s and gets 
killed. We're not able to figure put why this happens


> Random pyspark-shell applications being generated
> -
>
> Key: SPARK-30550
> URL: https://issues.apache.org/jira/browse/SPARK-30550
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Shell, Spark Submit, Web UI
>Affects Versions: 2.3.2, 2.4.4
>Reporter: Ram
>Priority: Major
>
>  
> When we submit a particular Spark job, this happens. We are not sure where
> these pyspark-shell applications get generated from, but they persist for
> about 5 seconds and then get killed. We're not able to figure out why this happens.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30550) Random pyspark-shell applications being generated

2020-01-17 Thread Ram (Jira)
Ram created SPARK-30550:
---

 Summary: Random pyspark-shell applications being generated
 Key: SPARK-30550
 URL: https://issues.apache.org/jira/browse/SPARK-30550
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Shell, Spark Submit, Web UI
Affects Versions: 2.4.4, 2.3.2
Reporter: Ram


!image-2020-01-17-15-40-12-899.png!

When we submit a particular Spark job, this happens. We are not sure where these
pyspark-shell applications get generated from, but they persist for about 5
seconds and then get killed. We're not able to figure out why this happens.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30549) Fix the subquery metrics showing issue in UI When enable AQE

2020-01-17 Thread Ke Jia (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-30549:
---
Summary: Fix the subquery metrics showing issue in UI When enable AQE  
(was: Fix the subquery metrics showing issue in UI)

> Fix the subquery metrics showing issue in UI When enable AQE
> 
>
> Key: SPARK-30549
> URL: https://issues.apache.org/jira/browse/SPARK-30549
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Priority: Major
>
> After [PR#25316|https://github.com/apache/spark/pull/25316] was merged, the
> subquery metrics cannot be shown in the UI. This PR will fix the issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30549) Fix the subquery metrics showing issue in UI When enable AQE

2020-01-17 Thread Ke Jia (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ke Jia updated SPARK-30549:
---
Description: After [https://github.com/apache/spark/pull/25316] was merged, the
subquery metrics cannot be shown in the UI when AQE is enabled. This PR will fix
the issue.  (was: After merged 
[PR#25316|[https://github.com/apache/spark/pull/25316]], the subquery metrics 
can not be shown in UI. This PR will fix the subquery shown issue.)

> Fix the subquery metrics showing issue in UI When enable AQE
> 
>
> Key: SPARK-30549
> URL: https://issues.apache.org/jira/browse/SPARK-30549
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Ke Jia
>Priority: Major
>
> After [https://github.com/apache/spark/pull/25316] was merged, the subquery
> metrics cannot be shown in the UI when AQE is enabled. This PR will fix the
> issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30549) Fix the subquery metrics showing issue in UI

2020-01-17 Thread Ke Jia (Jira)
Ke Jia created SPARK-30549:
--

 Summary: Fix the subquery metrics showing issue in UI
 Key: SPARK-30549
 URL: https://issues.apache.org/jira/browse/SPARK-30549
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Ke Jia


After [PR#25316|https://github.com/apache/spark/pull/25316] was merged, the
subquery metrics cannot be shown in the UI. This PR will fix the issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30319) Adds a stricter version of as[T]

2020-01-17 Thread Enrico Minack (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enrico Minack updated SPARK-30319:
--
Affects Version/s: (was: 2.4.4)
   3.0.0

> Adds a stricter version of as[T]
> 
>
> Key: SPARK-30319
> URL: https://issues.apache.org/jira/browse/SPARK-30319
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Enrico Minack
>Priority: Major
>
> The behaviour of as[T] is not intuitive when you read code like 
> df.as[T].write.csv("data.csv"). The result depends on the actual schema of 
> df, where def as[T](): Dataset[T] should be agnostic to the schema of df. The 
> expected behaviour is not provided elsewhere:
>  * Extra columns that are not part of the type {{T}} are not dropped.
>  * Order of columns is not aligned with schema of {{T}}.
>  * Columns are not cast to the types of {{T}}'s fields. They have to be cast 
> explicitly.
> A method that enforces the schema of T on a given Dataset would be very
> convenient and would allow you to articulate and guarantee the above assumptions
> about your data with the native Spark Dataset API. This method plays a more
> explicit and enforcing role than as[T] with respect to columns, column order
> and column type.
> Possible naming of a stricter version of {{as[T]}}:
>  * {{as[T](strict = true)}}
>  * {{toDS[T]}} (as in {{toDF}})
>  * {{selectAs[T]}} (as this is merely selecting the columns of schema {{T}})
> The naming {{toDS[T]}} is chosen here.
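
As a rough illustration of the proposed behaviour, the helper below selects, reorders and casts columns according to the encoder schema of T before converting. This is an editor's sketch of the idea, not the proposed toDS[T] implementation; the method name selectAs and its signature are assumptions.

{code:scala}
import org.apache.spark.sql.{Dataset, Encoder}
import org.apache.spark.sql.functions.col

// Hypothetical strict conversion: drop extra columns, align column order with
// T's schema, and cast each column to the field type before calling as[T].
def selectAs[T](ds: Dataset[_])(implicit enc: Encoder[T]): Dataset[T] = {
  val projected = enc.schema.fields.map(f => col(f.name).cast(f.dataType).as(f.name))
  ds.select(projected: _*).as[T]
}
{code}

With such a helper, selectAs[MyType](df).write.csv("data.csv") (MyType being any case class of yours) would no longer depend on extra or reordered columns in df.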



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30319) Adds a stricter version of as[T]

2020-01-17 Thread Enrico Minack (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Enrico Minack updated SPARK-30319:
--
Fix Version/s: (was: 3.0.0)

> Adds a stricter version of as[T]
> 
>
> Key: SPARK-30319
> URL: https://issues.apache.org/jira/browse/SPARK-30319
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.4.4
>Reporter: Enrico Minack
>Priority: Major
>
> The behaviour of as[T] is not intuitive when you read code like 
> df.as[T].write.csv("data.csv"). The result depends on the actual schema of 
> df, where def as[T](): Dataset[T] should be agnostic to the schema of df. The 
> expected behaviour is not provided elsewhere:
>  * Extra columns that are not part of the type {{T}} are not dropped.
>  * Order of columns is not aligned with schema of {{T}}.
>  * Columns are not cast to the types of {{T}}'s fields. They have to be cast 
> explicitly.
> A method that enforces the schema of T on a given Dataset would be very
> convenient and would allow you to articulate and guarantee the above assumptions
> about your data with the native Spark Dataset API. This method plays a more
> explicit and enforcing role than as[T] with respect to columns, column order
> and column type.
> Possible naming of a stricter version of {{as[T]}}:
>  * {{as[T](strict = true)}}
>  * {{toDS[T]}} (as in {{toDF}})
>  * {{selectAs[T]}} (as this is merely selecting the columns of schema {{T}})
> The naming {{toDS[T]}} is chosen here.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30542) Two Spark structured streaming jobs cannot write to same base path

2020-01-17 Thread Sivakumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017820#comment-17017820
 ] 

Sivakumar commented on SPARK-30542:
---

Earlier, with Spark DStreams, two jobs could share the same base path.

But with Spark Structured Streaming I don't have that flexibility. I think this
is a feature that Structured Streaming should support.

> Two Spark structured streaming jobs cannot write to same base path
> --
>
> Key: SPARK-30542
> URL: https://issues.apache.org/jira/browse/SPARK-30542
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Sivakumar
>Priority: Major
>
> Hi All,
> Spark Structured Streaming doesn't allow two structured streaming jobs to
> write data to the same base directory, which is possible with DStreams.
> Because the _spark_metadata directory is created by default for one job, the
> second job cannot use the same directory as its base path; since the
> _spark_metadata directory has already been created by the other job, an
> exception is thrown.
> Is there any workaround for this, other than creating separate base paths
> for both jobs?
> Is it possible to create the _spark_metadata directory elsewhere, or to
> disable it without any data loss?
> If I had to change the base path for both jobs, my whole framework
> would be impacted, so I don't want to do that.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30542) Two Spark structured streaming jobs cannot write to same base path

2020-01-17 Thread Sivakumar (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30542?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sivakumar updated SPARK-30542:
--
Description: 
Hi All,

Spark Structured Streaming doesn't allow two structured streaming jobs to write
data to the same base directory, which is possible with DStreams.

Because the _spark_metadata directory is created by default for one job, the
second job cannot use the same directory as its base path; since the
_spark_metadata directory has already been created by the other job, an
exception is thrown.

Is there any workaround for this, other than creating separate base paths for
both jobs?

Is it possible to create the _spark_metadata directory elsewhere, or to disable
it without any data loss?

If I had to change the base path for both jobs, my whole framework would be
impacted, so I don't want to do that.

 

  was:
Hi All,

I have two structured streaming jobs which should write data to the same base 
directory.

As __spark___metadata directory will be created by default for one job, second 
job cannot use the same directory as base path as already _spark__metadata 
directory is created by other job, It is throwing exception.

Is there any workaround for this, other than creating separate base path's for 
both the jobs.

Is it possible to create the __spark__metadata directory else where or disable 
without any data loss.

If I had to change the base path for both the jobs, then my whole framework 
will get impacted, So i don't want to do that.

 


> Two Spark structured streaming jobs cannot write to same base path
> --
>
> Key: SPARK-30542
> URL: https://issues.apache.org/jira/browse/SPARK-30542
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Sivakumar
>Priority: Major
>
> Hi All,
> Spark Structured Streaming doesn't allow two structured streaming jobs to
> write data to the same base directory, which is possible with DStreams.
> Because the _spark_metadata directory is created by default for one job, the
> second job cannot use the same directory as its base path; since the
> _spark_metadata directory has already been created by the other job, an
> exception is thrown.
> Is there any workaround for this, other than creating separate base paths
> for both jobs?
> Is it possible to create the _spark_metadata directory elsewhere, or to
> disable it without any data loss?
> If I had to change the base path for both jobs, my whole framework
> would be impacted, so I don't want to do that.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-30282) UnresolvedV2Relation should be resolved to temp view first

2020-01-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-30282.
-
Fix Version/s: 3.0.0
   Resolution: Fixed

Issue resolved by pull request 26921
[https://github.com/apache/spark/pull/26921]

> UnresolvedV2Relation should be resolved to temp view first
> --
>
> Key: SPARK-30282
> URL: https://issues.apache.org/jira/browse/SPARK-30282
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
> Fix For: 3.0.0
>
>
> For the following v2 commands, _Analyzer.ResolveTables_ does not check 
> against the temp views before resolving _UnresolvedV2Relation_, thus it 
> always resolves _UnresolvedV2Relation_ to a table:
>  * ALTER TABLE
>  * DESCRIBE TABLE
>  * SHOW TBLPROPERTIES
> Thus, in the following example, 't' will be resolved to a table, not a temp 
> view:
> {code:java}
> sql("CREATE TEMPORARY VIEW t AS SELECT 2 AS i")
> sql("CREATE TABLE testcat.ns.t USING csv AS SELECT 1 AS i")
> sql("USE testcat.ns")
> sql("SHOW TBLPROPERTIES t") // 't' is resolved to a table
> {code}
> For V2 commands, if a table name is resolved to a temp view, the command 
> should error out with a message saying that v2 commands cannot handle temp 
> views.
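
For illustration, a user-level way to check whether a name currently refers to a 
temporary view (this is just the public catalog API, not the Analyzer change 
itself):

{code:scala}
// Returns true if "t" is registered as a temporary view in the session.
val isTempView =
  spark.catalog.tableExists("t") && spark.catalog.getTable("t").isTemporary
{code}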



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30282) UnresolvedV2Relation should be resolved to temp view first

2020-01-17 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-30282:
---

Assignee: Terry Kim

> UnresolvedV2Relation should be resolved to temp view first
> --
>
> Key: SPARK-30282
> URL: https://issues.apache.org/jira/browse/SPARK-30282
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Terry Kim
>Assignee: Terry Kim
>Priority: Major
>
> For the following v2 commands, _Analyzer.ResolveTables_ does not check 
> against the temp views before resolving _UnresolvedV2Relation_, thus it 
> always resolves _UnresolvedV2Relation_ to a table:
>  * ALTER TABLE
>  * DESCRIBE TABLE
>  * SHOW TBLPROPERTIES
> Thus, in the following example, 't' will be resolved to a table, not a temp 
> view:
> {code:java}
> sql("CREATE TEMPORARY VIEW t AS SELECT 2 AS i")
> sql("CREATE TABLE testcat.ns.t USING csv AS SELECT 1 AS i")
> sql("USE testcat.ns")
> sql("SHOW TBLPROPERTIES t") // 't' is resolved to a table
> {code}
> For V2 commands, if a table name is resolved to a temp view, the command 
> should error out with a message saying that v2 commands cannot handle temp 
> views.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30548) Cached blockInfo in BlockMatrix.scala is never released

2020-01-17 Thread Dong Wang (Jira)
Dong Wang created SPARK-30548:
-

 Summary: Cached blockInfo in BlockMatrix.scala is never released
 Key: SPARK-30548
 URL: https://issues.apache.org/jira/browse/SPARK-30548
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 2.4.4
Reporter: Dong Wang


The private variable _blockInfo_ in mllib.linalg.distributed.BlockMatrix is 
cached but never unpersisted for the lifetime of a BlockMatrix instance.

{code:scala}
  private lazy val blockInfo =
    blocks.mapValues(block => (block.numRows, block.numCols)).cache()
{code}

I think we should add an API to unpersist this variable.
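
One possible shape for such an API, added inside BlockMatrix (the method name is 
hypothetical, not something that exists in Spark today):

{code:scala}
/** Hypothetical sketch: release the cached (numRows, numCols) metadata RDD. */
def unpersistBlockInfo(blocking: Boolean = false): this.type = {
  blockInfo.unpersist(blocking)
  this
}
{code}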



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30530) CSV load followed by "is null" filter produces incorrect results

2020-01-17 Thread Maxim Gekk (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017806#comment-17017806
 ] 

Maxim Gekk commented on SPARK-30530:


[~jlowe] I have prepared a fix for the issue. [~hyukjin.kwon] [~cloud_fan] Could 
you review it, please?

> CSV load followed by "is null" filter produces incorrect results
> 
>
> Key: SPARK-30530
> URL: https://issues.apache.org/jira/browse/SPARK-30530
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Jason Darrell Lowe
>Priority: Major
>
> Filtering with "is null" on values loaded from a CSV file has regressed 
> recently and now produces incorrect results.
> Given a CSV file with the contents:
> {noformat:title=floats.csv}
> 100.0,1.0,
> 200.0,,
> 300.0,3.0,
> 1.0,4.0,
> ,4.0,
> 500.0,,
> ,6.0,
> -500.0,50.5
>  {noformat}
> Filtering this data for the first column being null should return exactly two 
> rows, but it is returning extraneous rows with nulls:
> {noformat}
> scala> val schema = StructType(Array(StructField("floats", FloatType, 
> true),StructField("more_floats", FloatType, true)))
> schema: org.apache.spark.sql.types.StructType = 
> StructType(StructField(floats,FloatType,true), 
> StructField(more_floats,FloatType,true))
> scala> val df = spark.read.schema(schema).csv("floats.csv")
> df: org.apache.spark.sql.DataFrame = [floats: float, more_floats: float]
> scala> df.filter("floats is null").show
> +------+-----------+
> |floats|more_floats|
> +------+-----------+
> |  null|       null|
> |  null|       null|
> |  null|       null|
> |  null|       null|
> |  null|        4.0|
> |  null|       null|
> |  null|        6.0|
> +------+-----------+
> {noformat}
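
A possible workaround sketch while the fix is in review, under the assumption 
(not confirmed in this ticket) that the regression comes from the CSV filter 
pushdown added in 3.0; disabling it should fall back to the previous behavior:

{code:scala}
// Assumption: the extraneous rows come from CSV filter pushdown (new in 3.0).
spark.conf.set("spark.sql.csv.filterPushdown.enabled", "false")
spark.read.schema(schema).csv("floats.csv").filter("floats is null").show()
{code}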



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30547) Add unstable annotation to the CalendarInterval class

2020-01-17 Thread Kent Yao (Jira)
Kent Yao created SPARK-30547:


 Summary: Add unstable annotation to the CalendarInterval class
 Key: SPARK-30547
 URL: https://issues.apache.org/jira/browse/SPARK-30547
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.0.0
Reporter: Kent Yao


People already use CalendarInterval as a UDF input type, so it's better to make 
it clear that it's unstable.
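
For context, the usage pattern referred to above looks roughly like the sketch 
below (field access assumes the 3.0 layout of CalendarInterval; whether such a 
UDF is advisable is exactly what the annotation should signal):

{code:scala}
import org.apache.spark.sql.functions.udf
import org.apache.spark.unsafe.types.CalendarInterval

// A UDF taking a CalendarInterval input; code like this makes the class part
// of the de-facto public surface even though it is internal.
val intervalMonths = udf((i: CalendarInterval) => i.months)
{code}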



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30544) Upgrade Genjavadoc to 0.15

2020-01-17 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-30544:
---
Parent: SPARK-25075
Issue Type: Sub-task  (was: Improvement)

> Upgrade Genjavadoc to 0.15
> --
>
> Key: SPARK-30544
> URL: https://issues.apache.org/jira/browse/SPARK-30544
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> Genjavadoc 0.14 doesn't support Scala 2.13, so builds with sbt -Pscala-2.13 
> will fail.
> Let's upgrade it to 0.15.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30542) Two Spark structured streaming jobs cannot write to same base path

2020-01-17 Thread Jungtaek Lim (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017791#comment-17017791
 ] 

Jungtaek Lim commented on SPARK-30542:
--

This looks more like a question than an actual bug; questions like this should 
be posted to the user/dev mailing list.

> Two Spark structured streaming jobs cannot write to same base path
> --
>
> Key: SPARK-30542
> URL: https://issues.apache.org/jira/browse/SPARK-30542
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Sivakumar
>Priority: Major
>
> Hi All,
> I have two structured streaming jobs which should write data to the same base 
> directory.
> Because a _spark_metadata directory is created by default for the first job, 
> the second job cannot use the same directory as its base path; the 
> _spark_metadata directory already belongs to the other job, so the second job 
> throws an exception.
> Is there any workaround for this, other than creating separate base paths for 
> the two jobs?
> Is it possible to create the _spark_metadata directory elsewhere, or to 
> disable it without any data loss?
> If I had to change the base path for both jobs, my whole framework would be 
> impacted, so I don't want to do that.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30546) Make interval type more future-proofing

2020-01-17 Thread Kent Yao (Jira)
Kent Yao created SPARK-30546:


 Summary: Make interval type more future-proofing
 Key: SPARK-30546
 URL: https://issues.apache.org/jira/browse/SPARK-30546
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Kent Yao




Before 3.0 we may make some efforts on the current interval type to make it
more future-proof, e.g.:
1. Add an unstable annotation to the CalendarInterval class. People already use
it as a UDF input type, so it's better to make it clear that it's unstable.
2. Add a schema checker to prohibit creating v2 custom catalog tables with
intervals, the same as what we do for the built-in catalog.
3. Add a schema checker for DataFrameWriterV2 too.
4. Make the interval type incomparable, as in version 2.4, to avoid ambiguous
comparisons between year-month and day-time fields (see the sketch after this
list).
5. The newly added to_csv in 3.0 should not support outputting intervals, the
same as the CSV file format.
6. The function to_json should not allow using an interval as a key field, the
same as for the value field and the JSON datasource, with a legacy config to
restore the old behavior.
7. Revert the interval ISO/ANSI SQL standard output, since we decided not to
follow ANSI, so there is no round trip.
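
As an illustration of item 4, the ambiguity comes from comparisons like the
sketch below: a month has no fixed length in days, so with an incomparable
interval type (the 2.4 behavior) such a query is rejected at analysis time
rather than answered by an arbitrary conversion.

{code:scala}
// Ambiguous: is one month greater than 30 days? The answer depends on the
// month, so the comparison should not type-check.
spark.sql("SELECT interval 1 month > interval 30 days").show()
{code}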



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30544) Upgrade Genjavadoc to 0.15

2020-01-17 Thread Kousuke Saruta (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kousuke Saruta updated SPARK-30544:
---
Affects Version/s: (was: 3.1.0)
   3.0.0

> Upgrade Genjavadoc to 0.15
> --
>
> Key: SPARK-30544
> URL: https://issues.apache.org/jira/browse/SPARK-30544
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 3.0.0
>Reporter: Kousuke Saruta
>Assignee: Kousuke Saruta
>Priority: Minor
>
> Genjavadoc 0.14 doesn't support Scala 2.13, so builds with sbt -Pscala-2.13 
> will fail.
> Let's upgrade it to 0.15.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-30543) RandomForest add Param bootstrap to control sampling method

2020-01-17 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30543?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-30543:
-
Issue Type: Improvement  (was: Bug)

> RandomForest add Param bootstrap to control sampling method
> ---
>
> Key: SPARK-30543
> URL: https://issues.apache.org/jira/browse/SPARK-30543
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Priority: Minor
>
> Currently, RF with numTrees=1 directly builds a tree on the original dataset, 
> while with numTrees>1 it uses bootstrap samples to build the trees.
> This design exists so that a DecisionTreeModel can be trained via the 
> RandomForest implementation; however, it is somewhat surprising.
> In Scikit-Learn, there is a param bootstrap to control whether bootstrap 
> samples are used.
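
A usage sketch of the proposed parameter (the setter name mirrors Scikit-Learn's 
bootstrap and is hypothetical until the change lands):

{code:scala}
import org.apache.spark.ml.classification.RandomForestClassifier

// With the proposed param, training a single tree on the original dataset
// becomes an explicit choice rather than a side effect of numTrees == 1.
val rf = new RandomForestClassifier()
  .setNumTrees(1)
  .setBootstrap(false)  // hypothetical setter proposed by this ticket
{code}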



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30545) Impl Extremely Randomized Trees

2020-01-17 Thread zhengruifeng (Jira)
zhengruifeng created SPARK-30545:


 Summary: Impl Extremely Randomized Trees
 Key: SPARK-30545
 URL: https://issues.apache.org/jira/browse/SPARK-30545
 Project: Spark
  Issue Type: New Feature
  Components: ML, PySpark
Affects Versions: 3.0.0
Reporter: zhengruifeng


1. Extremely Randomized Trees (ExtraTrees) is widely used and is implemented in 
Scikit-Learn and OpenCV.

2. ExtraTrees is quite similar to RandomForest; the main difference is that, at 
each node, candidate splits (only one per feature in Scikit-Learn's 
implementation) are drawn at random for each feature, and the best of these 
randomly chosen splits is selected.

It can be implemented on top of the current implementation of ensemble trees.
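
A conceptual sketch of the split selection described above (illustrative only, 
not the actual ml tree internals): one random threshold is drawn per feature, 
and the best of these random candidates is kept.

{code:scala}
import scala.util.Random

// For each feature, draw a single random threshold within its observed range,
// then keep the candidate split with the highest impurity gain.
def pickExtraTreesSplit(
    featureRanges: Map[Int, (Double, Double)],
    impurityGain: (Int, Double) => Double,
    rng: Random): (Int, Double) = {
  val candidates = featureRanges.map { case (feature, (lo, hi)) =>
    (feature, lo + rng.nextDouble() * (hi - lo))
  }
  candidates.maxBy { case (feature, threshold) => impurityGain(feature, threshold) }
}
{code}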



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-30545) Impl Extremely Randomized Trees

2020-01-17 Thread zhengruifeng (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-30545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng reassigned SPARK-30545:


Assignee: zhengruifeng

> Impl Extremely Randomized Trees
> ---
>
> Key: SPARK-30545
> URL: https://issues.apache.org/jira/browse/SPARK-30545
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 3.0.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
>
> 1. Extremely Randomized Trees (ExtraTrees) is widely used and is implemented 
> in Scikit-Learn and OpenCV.
> 2. ExtraTrees is quite similar to RandomForest; the main difference is that, 
> at each node, candidate splits (only one per feature in Scikit-Learn's 
> implementation) are drawn at random for each feature, and the best of these 
> randomly chosen splits is selected.
> It can be implemented on top of the current implementation of ensemble trees.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-30544) Upgrade Genjavadoc to 0.15

2020-01-17 Thread Kousuke Saruta (Jira)
Kousuke Saruta created SPARK-30544:
--

 Summary: Upgrade Genjavadoc to 0.15
 Key: SPARK-30544
 URL: https://issues.apache.org/jira/browse/SPARK-30544
 Project: Spark
  Issue Type: Improvement
  Components: Build
Affects Versions: 3.1.0
Reporter: Kousuke Saruta
Assignee: Kousuke Saruta


Genjavadoc 0.14 doesn't support Scala 2.13, so builds with sbt -Pscala-2.13 will 
fail.
Let's upgrade it to 0.15.
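
In sbt terms the bump amounts to something like the sketch below (coordinates as 
published by the genjavadoc project; Spark actually wires the version up through 
its own build settings, so this is illustrative only):

{code:scala}
// Compiler plugin that generates Java-facing sources for javadoc/unidoc.
addCompilerPlugin(
  "com.typesafe.genjavadoc" %% "genjavadoc-plugin" % "0.15" cross CrossVersion.full)
{code}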



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-30462) Structured Streaming _spark_metadata fills up Spark Driver memory when having lots of objects

2020-01-17 Thread Sivakumar (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-30462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17017782#comment-17017782
 ] 

Sivakumar commented on SPARK-30462:
---

Hi All,

I have two structured streaming jobs which should write data to the same base 
directory.

Because a _spark_metadata directory is created by default for the first job, the 
second job cannot use the same directory as its base path; the _spark_metadata 
directory already belongs to the other job, so the second job throws an 
exception.

Is there any workaround for this, other than creating separate base paths for 
the two jobs?

Is it possible to create the _spark_metadata directory elsewhere, or to disable 
it without any data loss?

If I had to change the base path for both jobs, my whole framework would be 
impacted, so I don't want to do that.

I have created a separate ticket for this: SPARK-30542.

> Structured Streaming _spark_metadata fills up Spark Driver memory when having 
> lots of objects
> -
>
> Key: SPARK-30462
> URL: https://issues.apache.org/jira/browse/SPARK-30462
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.3, 2.4.4, 3.0.0
>Reporter: Vladimir Yankov
>Priority: Critical
>
> Hi,
> With the current implementation of Spark Structured Streaming, it does not 
> seem possible to have a constantly running stream that writes millions of 
> files without increasing the Spark driver's memory to dozens of GBs.
> In our scenario we are using Spark Structured Streaming to consume messages 
> from a Kafka cluster, transform them, and write them as compressed Parquet 
> files to an S3 object store service.
> Every 30 seconds a new micro-batch of the stream writes hundreds of objects, 
> which over time adds up to millions of objects in S3.
> As all written objects are recorded in _spark_metadata, the size of the 
> compact files there grows to GBs, which eventually fills up the Spark driver's 
> memory and leads to OOM errors.
> We need the ability to configure Spark Structured Streaming to run without 
> loading all the historically accumulated metadata into its memory. Regularly 
> resetting the _spark_metadata and the checkpoint folders is not an option in 
> our use case, as we use the information from _spark_metadata as a register of 
> the written objects for faster querying and search.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org