[jira] [Resolved] (SPARK-34371) Run datetime rebasing tests for parquet DSv1 and DSv2
[ https://issues.apache.org/jira/browse/SPARK-34371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-34371. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31478 [https://github.com/apache/spark/pull/31478] > Run datetime rebasing tests for parquet DSv1 and DSv2 > - > > Key: SPARK-34371 > URL: https://issues.apache.org/jira/browse/SPARK-34371 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.2.0 > > > Extract datetime rebasing tests from ParquetIOSuite and place them in a separate > test suite to run them for both implementations, DS v1 and v2. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
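For context, a minimal sketch of the test-suite split this ticket describes, assuming the usual Spark test pattern of toggling `spark.sql.sources.useV1SourceList` between "parquet" (DSv1) and "" (DSv2). The suite, test, and config names below are illustrative assumptions, not the actual code from PR 31478:

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.sql.{QueryTest, Row}
import org.apache.spark.sql.test.SharedSparkSession

// Hypothetical names; the shared trait holds the rebasing tests once.
trait ParquetRebaseDatetimeSuite extends QueryTest with SharedSparkSession {
  import testImplicits._

  test("roundtrip an ancient date through parquet") {
    withTempPath { path =>
      withSQLConf("spark.sql.legacy.parquet.datetimeRebaseModeInWrite" -> "LEGACY") {
        Seq("1001-01-01").toDF("s").selectExpr("CAST(s AS DATE) AS d")
          .write.parquet(path.getAbsolutePath)
      }
      checkAnswer(
        spark.read.parquet(path.getAbsolutePath),
        Row(java.sql.Date.valueOf("1001-01-01")))
    }
  }
}

// The same tests run against the DSv1 parquet implementation...
class ParquetRebaseDatetimeV1Suite extends ParquetRebaseDatetimeSuite {
  override protected def sparkConf: SparkConf =
    super.sparkConf.set("spark.sql.sources.useV1SourceList", "parquet")
}

// ...and against DSv2.
class ParquetRebaseDatetimeV2Suite extends ParquetRebaseDatetimeSuite {
  override protected def sparkConf: SparkConf =
    super.sparkConf.set("spark.sql.sources.useV1SourceList", "")
}
{code}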
[jira] [Assigned] (SPARK-34371) Run datetime rebasing tests for parquet DSv1 and DSv2
[ https://issues.apache.org/jira/browse/SPARK-34371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-34371: --- Assignee: Maxim Gekk > Run datetime rebasing tests for parquet DSv1 and DSv2 > - > > Key: SPARK-34371 > URL: https://issues.apache.org/jira/browse/SPARK-34371 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > > Extract datetime rebasing tests from ParquetIOSuite and place them in a separate > test suite to run them for both implementations, DS v1 and v2. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34374) Use standard methods to extract keys or values from a Map.
[ https://issues.apache.org/jira/browse/SPARK-34374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34374: Assignee: (was: Apache Spark) > Use standard methods to extract keys or values from a Map. > -- > > Key: SPARK-34374 > URL: https://issues.apache.org/jira/browse/SPARK-34374 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yang Jie >Priority: Minor > > For keys: > *before* > {code:scala} > map.map(_._1) > {code} > *after* > {code:scala} > map.keys > {code} > For values: > *before* > {code:scala} > map.map(_._2) > {code} > *after* > {code:scala} > map.values > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34374) Use standard methods to extract keys or values from a Map.
[ https://issues.apache.org/jira/browse/SPARK-34374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34374: Assignee: Apache Spark > Use standard methods to extract keys or values from a Map. > -- > > Key: SPARK-34374 > URL: https://issues.apache.org/jira/browse/SPARK-34374 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Minor > > For keys: > *before* > {code:scala} > map.map(_._1) > {code} > *after* > {code:scala} > map.keys > {code} > For values: > *before* > {code:scala} > map.map(_._2) > {code} > *after* > {code:scala} > map.values > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34374) Use standard methods to extract keys or values from a Map.
[ https://issues.apache.org/jira/browse/SPARK-34374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279401#comment-17279401 ] Apache Spark commented on SPARK-34374: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/31484 > Use standard methods to extract keys or values from a Map. > -- > > Key: SPARK-34374 > URL: https://issues.apache.org/jira/browse/SPARK-34374 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Yang Jie >Priority: Minor > > For keys: > *before* > {code:scala} > map.map(_._1) > {code} > *after* > {code:scala} > map.keys > {code} > For values: > *before* > {code:scala} > map.map(_._2) > {code} > *after* > {code:scala} > map.values > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33434) Document spark.conf.isModifiable()
[ https://issues.apache.org/jira/browse/SPARK-33434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279398#comment-17279398 ] Apache Spark commented on SPARK-33434: -- User 'Eric-Lemmon' has created a pull request for this issue: https://github.com/apache/spark/pull/31483 > Document spark.conf.isModifiable() > -- > > Key: SPARK-33434 > URL: https://issues.apache.org/jira/browse/SPARK-33434 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Priority: Minor > > PySpark's docs make no mention of {{conf.isModifiable()}}, though it appears > to be a public method introduced in SPARK-24761. > http://spark.apache.org/docs/3.0.1/api/python/pyspark.sql.html#pyspark.sql.SparkSession.conf -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33434) Document spark.conf.isModifiable()
[ https://issues.apache.org/jira/browse/SPARK-33434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33434: Assignee: (was: Apache Spark) > Document spark.conf.isModifiable() > -- > > Key: SPARK-33434 > URL: https://issues.apache.org/jira/browse/SPARK-33434 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Priority: Minor > > PySpark's docs make no mention of {{conf.isModifiable()}}, though it appears > to be a public method introduced in SPARK-24761. > http://spark.apache.org/docs/3.0.1/api/python/pyspark.sql.html#pyspark.sql.SparkSession.conf -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-33434) Document spark.conf.isModifiable()
[ https://issues.apache.org/jira/browse/SPARK-33434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-33434: Assignee: Apache Spark > Document spark.conf.isModifiable() > -- > > Key: SPARK-33434 > URL: https://issues.apache.org/jira/browse/SPARK-33434 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Assignee: Apache Spark >Priority: Minor > > PySpark's docs make no mention of {{conf.isModifiable()}}, though it appears > to be a public method introduced in SPARK-24761. > http://spark.apache.org/docs/3.0.1/api/python/pyspark.sql.html#pyspark.sql.SparkSession.conf -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33434) Document spark.conf.isModifiable()
[ https://issues.apache.org/jira/browse/SPARK-33434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279397#comment-17279397 ] Apache Spark commented on SPARK-33434: -- User 'Eric-Lemmon' has created a pull request for this issue: https://github.com/apache/spark/pull/31483 > Document spark.conf.isModifiable() > -- > > Key: SPARK-33434 > URL: https://issues.apache.org/jira/browse/SPARK-33434 > Project: Spark > Issue Type: Improvement > Components: Documentation, PySpark >Affects Versions: 3.1.0 >Reporter: Nicholas Chammas >Priority: Minor > > PySpark's docs make no mention of {{conf.isModifiable()}}, though it appears > to be a public method introduced in SPARK-24761. > http://spark.apache.org/docs/3.0.1/api/python/pyspark.sql.html#pyspark.sql.SparkSession.conf -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
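For reference, {{conf.isModifiable()}} also exists in the Scala API ({{org.apache.spark.sql.RuntimeConfig.isModifiable}}, likewise added in SPARK-24761), and PySpark mirrors its behavior. A quick Scala illustration of what it reports:

{code:scala}
import org.apache.spark.sql.SparkSession

object IsModifiableDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("demo").getOrCreate()

    // Runtime SQL confs can be changed on a live session.
    println(spark.conf.isModifiable("spark.sql.shuffle.partitions")) // true

    // Static SQL confs and core Spark confs are fixed once the session exists.
    println(spark.conf.isModifiable("spark.sql.warehouse.dir")) // false
    println(spark.conf.isModifiable("spark.executor.cores"))    // false

    spark.stop()
  }
}
{code}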
[jira] [Created] (SPARK-34374) Use standard methods to extract keys or values from a Map.
Yang Jie created SPARK-34374: Summary: Use standard methods to extract keys or values from a Map. Key: SPARK-34374 URL: https://issues.apache.org/jira/browse/SPARK-34374 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Yang Jie For keys: *before* {code:scala} map.map(_._1) {code} *after* {code:scala} map.keys {code} For values: *before* {code:scala} map.map(_._2) {code} *after* {code:scala} map.values {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
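The refactor is behavior-preserving; a small self-contained Scala illustration (the map contents here are just example data):

{code:scala}
object MapAccessDemo {
  def main(args: Array[String]): Unit = {
    val conf = Map("spark.app.name" -> "demo", "spark.master" -> "local[2]")

    // Before: deconstructs every (key, value) tuple just to keep one side.
    val keysBefore   = conf.map(_._1)
    val valuesBefore = conf.map(_._2)

    // After: the standard accessors state the intent directly.
    val keysAfter   = conf.keys
    val valuesAfter = conf.values

    assert(keysBefore.toSet == keysAfter.toSet)     // same keys
    assert(valuesBefore.toSet == valuesAfter.toSet) // same values
  }
}
{code}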
[jira] [Commented] (SPARK-34346) io.file.buffer.size set by spark.buffer.size will be overridden by hive-site.xml and may cause perf regression
[ https://issues.apache.org/jira/browse/SPARK-34346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279367#comment-17279367 ] Apache Spark commented on SPARK-34346: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/31482 > io.file.buffer.size set by spark.buffer.size will be overridden by hive-site.xml > and may cause perf regression > - > > Key: SPARK-34346 > URL: https://issues.apache.org/jira/browse/SPARK-34346 > Project: Spark > Issue Type: Bug > Components: Spark Core, SQL >Affects Versions: 3.1.1 >Reporter: Kent Yao >Priority: Blocker > > In many real-world cases, when interacting with the hive catalog through Spark > SQL, users may just share the `hive-site.xml` for their hive jobs and make a > copy to `SPARK_HOME`/conf w/o modification. In Spark, when we generate Hadoop > configurations, we use `spark.buffer.size(65536)` to reset > `io.file.buffer.size(4096)`. But when we load the hive-site.xml, we ignore > this behavior and reset `io.file.buffer.size` again according to > `hive-site.xml`. > 1. The configuration priority for setting Hadoop and Hive config here is not > right; the order should be `spark > spark.hive > > spark.hadoop > hive > hadoop` > 2. This breaks the `spark.buffer.size` config's behavior for tuning the IO > performance w/ HDFS if there is an existing `io.file.buffer.size` in > hive-site.xml -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
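To make the intended precedence concrete, a hedged sketch (the object and method names are hypothetical, not Spark's actual code) of assembling a Hadoop Configuration so that hive-site.xml can no longer clobber Spark's own setting:

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf

// Hypothetical sketch of the priority `spark > spark.hive > spark.hadoop > hive > hadoop`.
object HadoopConfPrecedence {
  def newHadoopConf(sparkConf: SparkConf, hiveSite: Configuration): Configuration = {
    val hadoopConf = new Configuration() // Hadoop defaults, io.file.buffer.size = 4096

    // Apply hive-site.xml early, so anything Spark-level can still win.
    hiveSite.iterator().forEachRemaining(e => hadoopConf.set(e.getKey, e.getValue))

    // spark.hadoop.* and spark.hive.* entries would be applied here (omitted).

    // Spark's own knob goes last and therefore always wins.
    hadoopConf.set("io.file.buffer.size", sparkConf.get("spark.buffer.size", "65536"))
    hadoopConf
  }
}
{code}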
[jira] [Resolved] (SPARK-34343) Add missing test for some non-array types in PostgreSQL
[ https://issues.apache.org/jira/browse/SPARK-34343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-34343. -- Fix Version/s: 3.2.0 Resolution: Fixed Resolved by https://github.com/apache/spark/pull/31456 > Add missing test for some non-array types in PostgreSQL > --- > > Key: SPARK-34343 > URL: https://issues.apache.org/jira/browse/SPARK-34343 > Project: Spark > Issue Type: Improvement > Components: SQL, Tests >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Minor > Fix For: 3.2.0 > > > PostgresIntegrationSuite tests some non-array types for PostgreSQL but tests > for some types are missing. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-26399) Add new stage-level REST APIs and parameters
[ https://issues.apache.org/jira/browse/SPARK-26399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ron Hu updated SPARK-26399: --- Description: Add the peak values for the metrics to the stages REST API. Also add a new executorSummary REST API, which will return executor summary metrics for a specified stage: {code:java} curl http://<server>:18080/api/v1/applicationsexecutorMetricsSummary{code} Add parameters to the stages REST API to specify: * filtering for task status, and returning tasks that match (for example, FAILED tasks). * task metric quantiles, and adding the task summary if specified * executor metric quantiles, and adding the executor summary if specified Note that the above description is too brief to be clear. [~angerszhuuu] and [~ron8hu] discussed a generic and consistent way for endpoint /application/\{app-id}/stages. It can be: /application/\{app-id}/stages?details=[true|false]&status=[ACTIVE|COMPLETE|FAILED|PENDING|SKIPPED]&withSummaries=[true|false]&taskStatus=[RUNNING|SUCCESS|FAILED|KILLED|PENDING] where * query parameter details=true is to show the detailed task information within each stage. The default value is details=false; * query parameter status can select those stages with the specified status. When the status parameter is not specified, a list of all stages is generated. * query parameter withSummaries=true is to show both task summary information and executor summary information in percentile distribution. The default value is withSummaries=false. * query parameter taskStatus is to show only those tasks with the specified status within their corresponding stages. This parameter can be set when details=true (i.e. this parameter will be ignored when details=false). was: Add the peak values for the metrics to the stages REST API. Also add a new executorSummary REST API, which will return executor summary metrics for a specified stage: {code:java} curl http://<server>:18080/api/v1/applicationsexecutorMetricsSummary{code} Add parameters to the stages REST API to specify: * filtering for task status, and returning tasks that match (for example, FAILED tasks). * task metric quantiles, and adding the task summary if specified * executor metric quantiles, and adding the executor summary if specified Note that the above description is too brief to be clear. [~angerszhuuu] and [~ron8hu] discussed a generic and consistent way for endpoint /application/\{app-id}/stages. It can be: /application/\{app-id}/stages?details=[true|false]&status=[ACTIVE|COMPLETE|FAILED|PENDING|SKIPPED]&withSummaries=[true|false]&taskStatus=[RUNNING|SUCCESS|FAILED|KILLED|PENDING] where * query parameter details=true is to show the detailed task information within each stage. The default value is details=false; * query parameter status can select those stages with the specified status. When the status parameter is not specified, a list of all stages is generated. * query parameter withSummaries=true is to show both task summary information and executor summary information in percentile distribution. The default value is withSummaries=false. * query parameter taskStatus is to show only those tasks with the specified status within their corresponding stages. This parameter will be set when details=true (i.e. this parameter will be ignored when details=false). 
> Add new stage-level REST APIs and parameters > > > Key: SPARK-26399 > URL: https://issues.apache.org/jira/browse/SPARK-26399 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: Edward Lu >Priority: Major > Attachments: executorMetricsSummary.json, > lispark230_restapi_ex2_stages_failedTasks.json, > lispark230_restapi_ex2_stages_withSummaries.json, > stage_executorSummary_image1.png > > > Add the peak values for the metrics to the stages REST API. Also add a new > executorSummary REST API, which will return executor summary metrics for a > specified stage: > {code:java} > curl http://<server>:18080/api/v1/applicationsexecutorMetricsSummary{code} > Add parameters to the stages REST API to specify: > * filtering for task status, and returning tasks that match (for example, > FAILED tasks). > * task metric quantiles, and adding the task summary if specified > * executor metric quantiles, and adding the executor summary if specified > > Note that the above description is too brief to be clear. [~angerszhuuu] and > [~ron8hu] discussed a generic and consistent way for endpoint > /application/\{app-id}/stages. It can be: > /application/\{app-id}/stages?details=[true|false]&status=[ACTIVE|COMPLETE|FAILED|PENDING|SKIPPED]&withSummaries=[true|f
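As a concrete (hypothetical) example of how the proposed query parameters would combine; the host and application id below are placeholders:

{code:scala}
// Hypothetical request against the proposed endpoint; host and app-id are placeholders.
object StagesRestExample {
  def main(args: Array[String]): Unit = {
    val base = "http://localhost:18080/api/v1/applications/app-20210204000000-0001/stages"
    // Failed stages only, with per-task details and task/executor percentile summaries.
    val url  = s"$base?details=true&status=FAILED&withSummaries=true&taskStatus=FAILED"
    println(scala.io.Source.fromURL(url).mkString)
  }
}
{code}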
[jira] [Resolved] (SPARK-34359) add a legacy config to restore the output schema of SHOW DATABASES
[ https://issues.apache.org/jira/browse/SPARK-34359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-34359. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31474 [https://github.com/apache/spark/pull/31474] > add a legacy config to restore the output schema of SHOW DATABASES > -- > > Key: SPARK-34359 > URL: https://issues.apache.org/jira/browse/SPARK-34359 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.2 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.2.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
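The description is empty, but the linked PR (31474) appears to introduce a flag named {{spark.sql.legacy.keepCommandOutputSchema}}; treating that name as an assumption, usage in spark-shell would look like:

{code:scala}
// Assumes the flag name from PR 31474; verify against your Spark version.
spark.sql("SHOW DATABASES").printSchema()  // default: one column named `namespace`

spark.conf.set("spark.sql.legacy.keepCommandOutputSchema", "true")
spark.sql("SHOW DATABASES").printSchema()  // legacy: column named `databaseName`
{code}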
[jira] [Updated] (SPARK-34359) add a legacy config to restore the output schema of SHOW DATABASES
[ https://issues.apache.org/jira/browse/SPARK-34359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jiaan.geng updated SPARK-34359: --- Parent: SPARK-34156 Issue Type: Sub-task (was: Improvement) > add a legacy config to restore the output schema of SHOW DATABASES > -- > > Key: SPARK-34359 > URL: https://issues.apache.org/jira/browse/SPARK-34359 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.2 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34330) Literal constructor support UTFString
[ https://issues.apache.org/jira/browse/SPARK-34330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] angerszhu resolved SPARK-34330. --- Resolution: Won't Fix > Literal constructor support UTFString > -- > > Key: SPARK-34330 > URL: https://issues.apache.org/jira/browse/SPARK-34330 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Priority: Major > > Literal constructor support UTFString -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32384) repartitionAndSortWithinPartitions avoid shuffle with same partitioner
[ https://issues.apache.org/jira/browse/SPARK-32384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279293#comment-17279293 ] Apache Spark commented on SPARK-32384: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/31480 > repartitionAndSortWithinPartitions avoid shuffle with same partitioner > -- > > Key: SPARK-32384 > URL: https://issues.apache.org/jira/browse/SPARK-32384 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Priority: Minor > > In {{combineByKeyWithClassTag}}, there is a check that skips the shuffle if > the partitioner is the same as the RDD's current partitioner: > {code:scala} > if (self.partitioner == Some(partitioner)) { > self.mapPartitions(iter => { > val context = TaskContext.get() > new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, > context)) > }, preservesPartitioning = true) > } else { > new ShuffledRDD[K, V, C](self, partitioner) > .setSerializer(serializer) > .setAggregator(aggregator) > .setMapSideCombine(mapSideCombine) > } > {code} > > In {{repartitionAndSortWithinPartitions}}, the shuffle can be skipped in the > same way. > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32384) repartitionAndSortWithinPartitions avoid shuffle with same partitioner
[ https://issues.apache.org/jira/browse/SPARK-32384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279292#comment-17279292 ] Apache Spark commented on SPARK-32384: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/31480 > repartitionAndSortWithinPartitions avoid shuffle with same partitioner > -- > > Key: SPARK-32384 > URL: https://issues.apache.org/jira/browse/SPARK-32384 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Priority: Minor > > In {{combineByKeyWithClassTag}}, there is a check that skips the shuffle if > the partitioner is the same as the RDD's current partitioner: > {code:scala} > if (self.partitioner == Some(partitioner)) { > self.mapPartitions(iter => { > val context = TaskContext.get() > new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, > context)) > }, preservesPartitioning = true) > } else { > new ShuffledRDD[K, V, C](self, partitioner) > .setSerializer(serializer) > .setAggregator(aggregator) > .setMapSideCombine(mapSideCombine) > } > {code} > > In {{repartitionAndSortWithinPartitions}}, the shuffle can be skipped in the > same way. > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
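A minimal sketch of applying the same short-circuit to {{repartitionAndSortWithinPartitions}}, written as a standalone helper for illustration; this is not the actual change in PR 31480:

{code:scala}
import scala.reflect.ClassTag

import org.apache.spark.Partitioner
import org.apache.spark.rdd.{RDD, ShuffledRDD}

// Sketch only: skip the shuffle when the RDD is already partitioned as requested,
// and just sort records inside each existing partition.
def repartitionAndSortWithinPartitions[K: Ordering: ClassTag, V: ClassTag](
    rdd: RDD[(K, V)], partitioner: Partitioner): RDD[(K, V)] = {
  if (rdd.partitioner.contains(partitioner)) {
    rdd.mapPartitions(_.toList.sortBy(_._1).iterator, preservesPartitioning = true)
  } else {
    new ShuffledRDD[K, V, V](rdd, partitioner).setKeyOrdering(implicitly[Ordering[K]])
  }
}
{code}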
[jira] [Commented] (SPARK-34373) HiveThriftServer2 startWithContext may hang with a race issue
[ https://issues.apache.org/jira/browse/SPARK-34373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279280#comment-17279280 ] Apache Spark commented on SPARK-34373: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/31479 > HiveThriftServer2 startWithContext may hang with a race issue > -- > > Key: SPARK-34373 > URL: https://issues.apache.org/jira/browse/SPARK-34373 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1, 3.1.0 >Reporter: Kent Yao >Priority: Major > > ``` > 21:43:26.809 WARN org.apache.thrift.server.TThreadPoolServer: Transport error > occurred during acceptance of message. > org.apache.thrift.transport.TTransportException: No underlying server socket. > at > org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:126) > at > org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:35) > at org.apache.thrift.transport.TServerTransport.acceException in thread > "Thread-15" java.io.IOException: Stream closed > at > java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170) > at java.io.BufferedInputStream.read(BufferedInputStream.java:336) > at java.io.FilterInputStream.read(FilterInputStream.java:107) > at scala.sys.process.BasicIO$.loop$1(BasicIO.scala:238) > at scala.sys.process.BasicIO$.transferFullyImpl(BasicIO.scala:246) > at scala.sys.process.BasicIO$.transferFully(BasicIO.scala:227) > at scala.sys.process.BasicIO$.$anonfun$toStdOut$1(BasicIO.scala:221) > ``` > the TServer might still try to serve even after stop is called -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34373) HiveThriftServer2 startWithContext may hang with a race issue
[ https://issues.apache.org/jira/browse/SPARK-34373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34373: Assignee: Apache Spark > HiveThriftServer2 startWithContext may hang with a race issue > -- > > Key: SPARK-34373 > URL: https://issues.apache.org/jira/browse/SPARK-34373 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1, 3.1.0 >Reporter: Kent Yao >Assignee: Apache Spark >Priority: Major > > ``` > 21:43:26.809 WARN org.apache.thrift.server.TThreadPoolServer: Transport error > occurred during acceptance of message. > org.apache.thrift.transport.TTransportException: No underlying server socket. > at > org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:126) > at > org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:35) > at org.apache.thrift.transport.TServerTransport.acceException in thread > "Thread-15" java.io.IOException: Stream closed > at > java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170) > at java.io.BufferedInputStream.read(BufferedInputStream.java:336) > at java.io.FilterInputStream.read(FilterInputStream.java:107) > at scala.sys.process.BasicIO$.loop$1(BasicIO.scala:238) > at scala.sys.process.BasicIO$.transferFullyImpl(BasicIO.scala:246) > at scala.sys.process.BasicIO$.transferFully(BasicIO.scala:227) > at scala.sys.process.BasicIO$.$anonfun$toStdOut$1(BasicIO.scala:221) > ``` > the TServer might still try to serve even after stop is called -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34373) HiveThriftServer2 startWithContext may hang with a race issue
[ https://issues.apache.org/jira/browse/SPARK-34373?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279279#comment-17279279 ] Apache Spark commented on SPARK-34373: -- User 'yaooqinn' has created a pull request for this issue: https://github.com/apache/spark/pull/31479 > HiveThriftServer2 startWithContext may hang with a race issue > -- > > Key: SPARK-34373 > URL: https://issues.apache.org/jira/browse/SPARK-34373 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1, 3.1.0 >Reporter: Kent Yao >Priority: Major > > ``` > 21:43:26.809 WARN org.apache.thrift.server.TThreadPoolServer: Transport error > occurred during acceptance of message. > org.apache.thrift.transport.TTransportException: No underlying server socket. > at > org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:126) > at > org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:35) > at org.apache.thrift.transport.TServerTransport.acceException in thread > "Thread-15" java.io.IOException: Stream closed > at > java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170) > at java.io.BufferedInputStream.read(BufferedInputStream.java:336) > at java.io.FilterInputStream.read(FilterInputStream.java:107) > at scala.sys.process.BasicIO$.loop$1(BasicIO.scala:238) > at scala.sys.process.BasicIO$.transferFullyImpl(BasicIO.scala:246) > at scala.sys.process.BasicIO$.transferFully(BasicIO.scala:227) > at scala.sys.process.BasicIO$.$anonfun$toStdOut$1(BasicIO.scala:221) > ``` > the TServer might still try to serve even after stop is called -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34373) HiveThriftServer2 startWithContext may hang with a race issue
[ https://issues.apache.org/jira/browse/SPARK-34373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34373: Assignee: (was: Apache Spark) > HiveThriftServer2 startWithContext may hang with a race issue > -- > > Key: SPARK-34373 > URL: https://issues.apache.org/jira/browse/SPARK-34373 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.1, 3.1.0 >Reporter: Kent Yao >Priority: Major > > ``` > 21:43:26.809 WARN org.apache.thrift.server.TThreadPoolServer: Transport error > occurred during acceptance of message. > org.apache.thrift.transport.TTransportException: No underlying server socket. > at > org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:126) > at > org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:35) > at org.apache.thrift.transport.TServerTransport.acceException in thread > "Thread-15" java.io.IOException: Stream closed > at > java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170) > at java.io.BufferedInputStream.read(BufferedInputStream.java:336) > at java.io.FilterInputStream.read(FilterInputStream.java:107) > at scala.sys.process.BasicIO$.loop$1(BasicIO.scala:238) > at scala.sys.process.BasicIO$.transferFullyImpl(BasicIO.scala:246) > at scala.sys.process.BasicIO$.transferFully(BasicIO.scala:227) > at scala.sys.process.BasicIO$.$anonfun$toStdOut$1(BasicIO.scala:221) > ``` > the TServer might still try to serve even after stop is called -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34373) HiveThriftServer2 startWithContext may hang with a race issue
Kent Yao created SPARK-34373: Summary: HiveThriftServer2 startWithContext may hang with a race issue Key: SPARK-34373 URL: https://issues.apache.org/jira/browse/SPARK-34373 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.1, 3.1.0 Reporter: Kent Yao ``` 21:43:26.809 WARN org.apache.thrift.server.TThreadPoolServer: Transport error occurred during acceptance of message. org.apache.thrift.transport.TTransportException: No underlying server socket. at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:126) at org.apache.thrift.transport.TServerSocket.acceptImpl(TServerSocket.java:35) at org.apache.thrift.transport.TServerTransport.acceException in thread "Thread-15" java.io.IOException: Stream closed at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170) at java.io.BufferedInputStream.read(BufferedInputStream.java:336) at java.io.FilterInputStream.read(FilterInputStream.java:107) at scala.sys.process.BasicIO$.loop$1(BasicIO.scala:238) at scala.sys.process.BasicIO$.transferFullyImpl(BasicIO.scala:246) at scala.sys.process.BasicIO$.transferFully(BasicIO.scala:227) at scala.sys.process.BasicIO$.$anonfun$toStdOut$1(BasicIO.scala:221) ``` the TServer might still try to serve even after stop is called -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
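The race is generic: if stop() can run before (or while) the serve loop starts, the loop must observe it. A hedged, generic illustration of the guard pattern, not HiveThriftServer2's actual fix in PR 31479:

{code:scala}
import java.util.concurrent.atomic.AtomicBoolean

// Generic illustration: make stop() visible to a serve loop that may not have
// started yet. Class and parameter names are hypothetical.
class GuardedServer(serveLoop: () => Unit, closeTransport: () => Unit) {
  private val stopped = new AtomicBoolean(false)

  def start(): Unit = {
    val t = new Thread(() => if (!stopped.get()) serveLoop())
    t.setDaemon(true)
    t.start()
  }

  def stop(): Unit =
    if (stopped.compareAndSet(false, true)) {
      closeTransport() // unblocks accept() so a running serve loop can exit
    }
}
{code}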
[jira] [Updated] (SPARK-34372) Speculation results in broken CSV files in Amazon S3
[ https://issues.apache.org/jira/browse/SPARK-34372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daehee Han updated SPARK-34372: --- Description: Hi, we've been experiencing some rows getting corrupted while partitioned CSV files were written to Amazon S3. Some records were found broken without any error on Spark. Digging into the root cause, we found out that Spark speculation tried to re-upload a partition that was being uploaded slowly, and ended up uploading only a part of the partition, leaving broken data in S3. Here are the stack traces we've found. There are two executors involved - A: the first executor, which tried to upload the file but took much longer than the other executors (it still succeeded), which made Spark speculation cut in and kick off another executor, B. Executor B started to upload the file too, but was interrupted during the upload (killed: another attempt succeeded), and ended up uploading only a part of the whole file. You can see in the logs that the file executor A uploaded (8461990 bytes originally) was overwritten by executor B (which uploaded only 3145728 bytes). Executor A: {quote}21/01/28 17:22:21 INFO Executor: Running task 426.0 in stage 45.0 (TID 13201) 21/01/28 17:22:21 INFO ShuffleBlockFetcherIterator: Getting 470 non-empty blocks including 10 local blocks and 460 remote blocks 21/01/28 17:22:21 INFO ShuffleBlockFetcherIterator: Started 46 remote fetches in 18 ms 21/01/28 17:22:21 INFO FileOutputCommitter: File Output Committer Algorithm version is 2 21/01/28 17:22:21 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: true 21/01/28 17:22:21 INFO DirectFileOutputCommitter: Direct Write: ENABLED 21/01/28 17:22:21 INFO SQLConfCommitterProvider: Using output committer class 21/01/28 17:22:21 INFO INFO CSEMultipartUploadOutputStream: close closed:false s3://\{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv 21/01/28 17:22:31 INFO DefaultMultipartUploadDispatcher: Completed multipart upload of 1 parts 8461990 bytes 21/01/28 17:22:31 INFO CSEMultipartUploadOutputStream: Finished uploading \{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv. Elapsed seconds: 10. 21/01/28 17:22:31 INFO SparkHadoopMapRedUtil: No need to commit output of task because needsTaskCommit=false: attempt_20210128172219_0045_m_000426_13201 21/01/28 17:22:31 INFO Executor: Finished task 426.0 in stage 45.0 (TID 13201). 
8782 bytes result sent to driver {quote} Executor B: {quote}21/01/28 17:22:31 INFO CoarseGrainedExecutorBackend: Got assigned task 13245 21/01/28 17:22:31 INFO Executor: Running task 426.1 in stage 45.0 (TID 13245) 21/01/28 17:22:31 INFO ShuffleBlockFetcherIterator: Getting 470 non-empty blocks including 11 local blocks and 459 remote blocks 21/01/28 17:22:31 INFO ShuffleBlockFetcherIterator: Started 46 remote fetches in 2 ms 21/01/28 17:22:31 INFO FileOutputCommitter: File Output Committer Algorithm version is 2 21/01/28 17:22:31 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: true 21/01/28 17:22:31 INFO DirectFileOutputCommitter: Direct Write: ENABLED 21/01/28 17:22:31 INFO SQLConfCommitterProvider: Using output committer class org.apache.hadoop.mapreduce.lib.output.DirectFileOutputCommitter 21/01/28 17:22:31 INFO Executor: Executor is trying to kill task 426.1 in stage 45.0 (TID 13245), reason: another attempt succeeded 21/01/28 17:22:31 INFO CSEMultipartUploadOutputStream: close closed:false s3://\{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv 21/01/28 17:22:32 INFO DefaultMultipartUploadDispatcher: Completed multipart upload of 1 parts 3145728 bytes 21/01/28 17:22:32 INFO CSEMultipartUploadOutputStream: Finished uploading \{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv. Elapsed seconds: 0. 21/01/28 17:22:32 ERROR Utils: Aborting task com.univocity.parsers.common.TextWritingException: Error writing row. Internal state when error was thrown: recordCount=18449, recordData=[ Unknown macro: \{obfuscated} ] at com.univocity.parsers.common.AbstractWriter.throwExceptionAndClose(AbstractWriter.java:935) at com.univocity.parsers.common.AbstractWriter.writeRow(AbstractWriter.java:714) at org.apache.spark.sql.execution.datasources.csv.UnivocityGenerator.write(UnivocityGenerator.scala:84) at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.write(CSVFileFormat.scala:181) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:137) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:245) at org.apache.spark.sql.execution.datasources.FileFo
[jira] [Created] (SPARK-34372) Speculation results in broken CSV files in Amazon S3
Daehee Han created SPARK-34372: -- Summary: Speculation results in broken CSV files in Amazon S3 Key: SPARK-34372 URL: https://issues.apache.org/jira/browse/SPARK-34372 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 2.4.7 Environment: Amazon EMR with AMI version 5.32.0 Reporter: Daehee Han Hi, we've been experiencing some rows getting corrupted while partitioned CSV files were written to Amazon S3. Some records were found broken without any error on Spark. Digging into the root cause, we found out that Spark speculation tried to re-upload a partition that was being uploaded slowly, and ended up uploading only a part of the partition, leaving broken data in S3. Here are the stack traces we've found. There are two executors involved - A: the first executor, which tried to upload the file but took much longer than the other executors (it still succeeded), which made Spark speculation cut in and kick off another executor, B. Executor B started to upload the file too, but was interrupted during the upload (killed: another attempt succeeded), and ended up uploading only a part of the whole file. Executor A: {quote}21/01/28 17:22:21 INFO Executor: Running task 426.0 in stage 45.0 (TID 13201) 21/01/28 17:22:21 INFO ShuffleBlockFetcherIterator: Getting 470 non-empty blocks including 10 local blocks and 460 remote blocks 21/01/28 17:22:21 INFO ShuffleBlockFetcherIterator: Started 46 remote fetches in 18 ms 21/01/28 17:22:21 INFO FileOutputCommitter: File Output Committer Algorithm version is 2 21/01/28 17:22:21 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: true 21/01/28 17:22:21 INFO DirectFileOutputCommitter: Direct Write: ENABLED 21/01/28 17:22:21 INFO SQLConfCommitterProvider: Using output committer class 21/01/28 17:22:21 INFO INFO CSEMultipartUploadOutputStream: close closed:false s3://\{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv 21/01/28 17:22:31 INFO DefaultMultipartUploadDispatcher: Completed multipart upload of 1 parts 8461990 bytes 21/01/28 17:22:31 INFO CSEMultipartUploadOutputStream: Finished uploading \{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv. Elapsed seconds: 10. 21/01/28 17:22:31 INFO SparkHadoopMapRedUtil: No need to commit output of task because needsTaskCommit=false: attempt_20210128172219_0045_m_000426_13201 21/01/28 17:22:31 INFO Executor: Finished task 426.0 in stage 45.0 (TID 13201). 
8782 bytes result sent to driver{quote} Executor B: {quote}21/01/28 17:22:31 INFO CoarseGrainedExecutorBackend: Got assigned task 13245 21/01/28 17:22:31 INFO Executor: Running task 426.1 in stage 45.0 (TID 13245) 21/01/28 17:22:31 INFO ShuffleBlockFetcherIterator: Getting 470 non-empty blocks including 11 local blocks and 459 remote blocks 21/01/28 17:22:31 INFO ShuffleBlockFetcherIterator: Started 46 remote fetches in 2 ms 21/01/28 17:22:31 INFO FileOutputCommitter: File Output Committer Algorithm version is 2 21/01/28 17:22:31 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: true 21/01/28 17:22:31 INFO DirectFileOutputCommitter: Direct Write: ENABLED 21/01/28 17:22:31 INFO SQLConfCommitterProvider: Using output committer class org.apache.hadoop.mapreduce.lib.output.DirectFileOutputCommitter 21/01/28 17:22:31 INFO Executor: Executor is trying to kill task 426.1 in stage 45.0 (TID 13245), reason: another attempt succeeded 21/01/28 17:22:31 INFO CSEMultipartUploadOutputStream: close closed:false s3://\{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv 21/01/28 17:22:32 INFO DefaultMultipartUploadDispatcher: Completed multipart upload of 1 parts 3145728 bytes 21/01/28 17:22:32 INFO CSEMultipartUploadOutputStream: Finished uploading \{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv. Elapsed seconds: 0. 21/01/28 17:22:32 ERROR Utils: Aborting task com.univocity.parsers.common.TextWritingException: Error writing row. Internal state when error was thrown: recordCount=18449, recordData=[\{obfuscated}] at com.univocity.parsers.common.AbstractWriter.throwExceptionAndClose(AbstractWriter.java:935) at com.univocity.parsers.common.AbstractWriter.writeRow(AbstractWriter.java:714) at org.apache.spark.sql.execution.datasources.csv.UnivocityGenerator.write(UnivocityGenerator.scala:84) at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.write(CSVFileFormat.scala:181) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:137) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scal
[jira] [Updated] (SPARK-34372) Speculation results in broken CSV files in Amazon S3
[ https://issues.apache.org/jira/browse/SPARK-34372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Daehee Han updated SPARK-34372: --- Description: Hi, we've been experiencing some rows getting corrupted while partitioned CSV files were written to Amazon S3. Some records were found broken without any error on Spark. Digging into the root cause, we found out that Spark speculation tried to re-upload a partition that was being uploaded slowly, and ended up uploading only a part of the partition, leaving broken data in S3. Here are the stack traces we've found. There are two executors involved - A: the first executor, which tried to upload the file but took much longer than the other executors (it still succeeded), which made Spark speculation cut in and kick off another executor, B. Executor B started to upload the file too, but was interrupted during the upload (killed: another attempt succeeded), and ended up uploading only a part of the whole file. Executor A: {quote}21/01/28 17:22:21 INFO Executor: Running task 426.0 in stage 45.0 (TID 13201) 21/01/28 17:22:21 INFO ShuffleBlockFetcherIterator: Getting 470 non-empty blocks including 10 local blocks and 460 remote blocks 21/01/28 17:22:21 INFO ShuffleBlockFetcherIterator: Started 46 remote fetches in 18 ms 21/01/28 17:22:21 INFO FileOutputCommitter: File Output Committer Algorithm version is 2 21/01/28 17:22:21 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: true 21/01/28 17:22:21 INFO DirectFileOutputCommitter: Direct Write: ENABLED 21/01/28 17:22:21 INFO SQLConfCommitterProvider: Using output committer class 21/01/28 17:22:21 INFO INFO CSEMultipartUploadOutputStream: close closed:false s3://\{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv 21/01/28 17:22:31 INFO DefaultMultipartUploadDispatcher: Completed multipart upload of 1 parts 8461990 bytes 21/01/28 17:22:31 INFO CSEMultipartUploadOutputStream: Finished uploading \{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv. Elapsed seconds: 10. 21/01/28 17:22:31 INFO SparkHadoopMapRedUtil: No need to commit output of task because needsTaskCommit=false: attempt_20210128172219_0045_m_000426_13201 21/01/28 17:22:31 INFO Executor: Finished task 426.0 in stage 45.0 (TID 13201). 
8782 bytes result sent to driver {quote} Executor B: {quote}21/01/28 17:22:31 INFO CoarseGrainedExecutorBackend: Got assigned task 13245 21/01/28 17:22:31 INFO Executor: Running task 426.1 in stage 45.0 (TID 13245) 21/01/28 17:22:31 INFO ShuffleBlockFetcherIterator: Getting 470 non-empty blocks including 11 local blocks and 459 remote blocks 21/01/28 17:22:31 INFO ShuffleBlockFetcherIterator: Started 46 remote fetches in 2 ms 21/01/28 17:22:31 INFO FileOutputCommitter: File Output Committer Algorithm version is 2 21/01/28 17:22:31 INFO FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: true 21/01/28 17:22:31 INFO DirectFileOutputCommitter: Direct Write: ENABLED 21/01/28 17:22:31 INFO SQLConfCommitterProvider: Using output committer class org.apache.hadoop.mapreduce.lib.output.DirectFileOutputCommitter 21/01/28 17:22:31 INFO Executor: Executor is trying to kill task 426.1 in stage 45.0 (TID 13245), reason: another attempt succeeded 21/01/28 17:22:31 INFO CSEMultipartUploadOutputStream: close closed:false s3://\{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv 21/01/28 17:22:32 INFO DefaultMultipartUploadDispatcher: Completed multipart upload of 1 parts 3145728 bytes 21/01/28 17:22:32 INFO CSEMultipartUploadOutputStream: Finished uploading \{obfuscated}/part-00426-7d5677a9-f740-4db6-9d3c-dc589d75e965-c000.csv. Elapsed seconds: 0. 21/01/28 17:22:32 ERROR Utils: Aborting task com.univocity.parsers.common.TextWritingException: Error writing row. Internal state when error was thrown: recordCount=18449, recordData=[\\{obfuscated}] at com.univocity.parsers.common.AbstractWriter.throwExceptionAndClose(AbstractWriter.java:935) at com.univocity.parsers.common.AbstractWriter.writeRow(AbstractWriter.java:714) at org.apache.spark.sql.execution.datasources.csv.UnivocityGenerator.write(UnivocityGenerator.scala:84) at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.write(CSVFileFormat.scala:181) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:137) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:245) at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:242) at org.apache.spark.util.Utils
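Pending a proper fix, the usual mitigations (assumptions on our side, not something stated in the ticket) are to disable speculation for jobs that write through a direct-output committer, or to use a committer that stages each task attempt under a unique temporary path and only promotes it on commit:

{code:scala}
import org.apache.spark.sql.SparkSession

// Mitigation sketch: avoid speculative attempts racing on the same final S3 key.
val spark = SparkSession.builder()
  .appName("csv-to-s3")
  .config("spark.speculation", "false") // no duplicate task attempts
  .getOrCreate()

// Alternatively, keep speculation but avoid direct-write committers, so each
// attempt writes to its own _temporary path and is renamed only on commit.
{code}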
[jira] [Assigned] (SPARK-34339) Expose the number of truncated paths in Utils.buildLocationMetadata()
[ https://issues.apache.org/jira/browse/SPARK-34339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-34339: Assignee: Jungtaek Lim > Expose the number of truncated paths in Utils.buildLocationMetadata() > - > > Key: SPARK-34339 > URL: https://issues.apache.org/jira/browse/SPARK-34339 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > > SPARK-31793 introduced Utils.buildLocationMetadata() to reduce the length of > location metadata. It effectively reduced memory usage, as only a few paths > are included (controlled by a threshold value), but there is no indication > that truncation occurred. > If only the first 2 of 5 paths fit within the threshold, > Utils.buildLocationMetadata() shows the first 2 paths, with no mention that 3 > paths were truncated; the output looks the same as if only those 2 paths > existed. This could cause confusion. > We should allow more space (like 10+ chars) to represent the fact that N > paths were truncated. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34339) Expose the number of truncated paths in Utils.buildLocationMetadata()
[ https://issues.apache.org/jira/browse/SPARK-34339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-34339. -- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31464 [https://github.com/apache/spark/pull/31464] > Expose the number of truncated paths in Utils.buildLocationMetadata() > - > > Key: SPARK-34339 > URL: https://issues.apache.org/jira/browse/SPARK-34339 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.2.0 > > > SPARK-31793 introduced Utils.buildLocationMetadata() to reduce the length of > location metadata. It effectively reduced memory usage, as only a few paths > are included (controlled by a threshold value), but there is no indication > that truncation occurred. > If only the first 2 of 5 paths fit within the threshold, > Utils.buildLocationMetadata() shows the first 2 paths, with no mention that 3 > paths were truncated; the output looks the same as if only those 2 paths > existed. This could cause confusion. > We should allow more space (like 10+ chars) to represent the fact that N > paths were truncated. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
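A small sketch of the requested behavior, i.e., reserving room for an explicit truncation hint. This is illustrative only; the signature and formatting of Spark's actual Utils.buildLocationMetadata differ:

{code:scala}
// Illustrative only; not Spark's real implementation.
def buildLocationMetadata(paths: Seq[String], maxLen: Int): String = {
  val kept = scala.collection.mutable.ArrayBuffer.empty[String]
  var used = 0
  val it = paths.iterator
  var stop = false
  while (it.hasNext && !stop) {
    val p = it.next()
    if (used + p.length <= maxLen) { kept += p; used += p.length }
    else stop = true
  }
  val dropped = paths.length - kept.length
  // The requested change: say explicitly how many paths were cut off.
  val hint =
    if (dropped > 0) (if (kept.nonEmpty) ", " else "") + s"... ($dropped path(s) omitted)"
    else ""
  kept.mkString("[", ", ", hint + "]")
}

// buildLocationMetadata(Seq("/a/b", "/c/d", "/e/f/g/h"), maxLen = 10)
// => "[/a/b, /c/d, ... (1 path(s) omitted)]"
{code}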
[jira] [Assigned] (SPARK-34371) Run datetime rebasing tests for parquet DSv1 and DSv2
[ https://issues.apache.org/jira/browse/SPARK-34371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34371: Assignee: Apache Spark > Run datetime rebasing tests for parquet DSv1 and DSv2 > - > > Key: SPARK-34371 > URL: https://issues.apache.org/jira/browse/SPARK-34371 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > > Extract datetime rebasing tests from ParquetIOSuite and place them in a separate > test suite to run them for both implementations, DS v1 and v2. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34371) Run datetime rebasing tests for parquet DSv1 and DSv2
[ https://issues.apache.org/jira/browse/SPARK-34371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34371: Assignee: (was: Apache Spark) > Run datetime rebasing tests for parquet DSv1 and DSv2 > - > > Key: SPARK-34371 > URL: https://issues.apache.org/jira/browse/SPARK-34371 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Priority: Major > > Extract datetime rebasing tests from ParquetIOSuite and place them in a separate > test suite to run them for both implementations, DS v1 and v2. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34371) Run datetime rebasing tests for parquet DSv1 and DSv2
[ https://issues.apache.org/jira/browse/SPARK-34371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279207#comment-17279207 ] Apache Spark commented on SPARK-34371: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/31478 > Run datetime rebasing tests for parquet DSv1 and DSv2 > - > > Key: SPARK-34371 > URL: https://issues.apache.org/jira/browse/SPARK-34371 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Priority: Major > > Extract datetime rebasing tests from ParquetIOSuite and place them in a separate > test suite to run them for both implementations, DS v1 and v2. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34370) Supporting Avro schema evolution for partitioned Hive tables using "avro.schema.url"
[ https://issues.apache.org/jira/browse/SPARK-34370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-34370: --- Description: This came up in https://github.com/apache/spark/pull/31133#issuecomment-773567152. The use case is the following: there is a partitioned Hive table with Avro data. The schema is specified via "avro.schema.url". Over time the schema evolves and a new schema is set for the table via "avro.schema.url"; when data is read from an old partition, this new evolved schema must be used. was: This came up in https://github.com/apache/spark/pull/31133#issuecomment-773567152. The use case is the following there is a partitioned Hive table with Avro data. The schema is specified via "avro.schema.url". With time the schema is evolved and the new schema is set for the table "avro.schema.url" when data is read from the old p > Supporting Avro schema evolution for partitioned Hive tables using > "avro.schema.url" > > > Key: SPARK-34370 > URL: https://issues.apache.org/jira/browse/SPARK-34370 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1, 2.4.0, 3.0.1, 3.1.0, 3.2.0 >Reporter: Attila Zsolt Piros >Priority: Major > > This came up in > https://github.com/apache/spark/pull/31133#issuecomment-773567152. > The use case is the following: there is a partitioned Hive table with Avro > data. The schema is specified via "avro.schema.url". > Over time the schema evolves and a new schema is set for the table via > "avro.schema.url"; when data is read from an old partition, this new evolved > schema must be used. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34370) Supporting Avro schema evolution for partitioned Hive tables using "avro.schema.url"
[ https://issues.apache.org/jira/browse/SPARK-34370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-34370: --- Description: This came up in https://github.com/apache/spark/pull/31133#issuecomment-773567152. The use case is the following: there is a partitioned Hive table with Avro data. The schema is specified via "avro.schema.url". With time the schema is evolved and the new schema is set for the table "avro.schema.url" when data is read from the old p was: This came up in https://github.com/apache/spark/pull/31133#issuecomment-773567152. The use case is the following there is a partitioned Hive table with Avro data. The schema is specified via https://github.com/apache/spark/pull/31133#discussion_r570561321 > Supporting Avro schema evolution for partitioned Hive tables using > "avro.schema.url" > > > Key: SPARK-34370 > URL: https://issues.apache.org/jira/browse/SPARK-34370 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1, 2.4.0, 3.0.1, 3.1.0, 3.2.0 >Reporter: Attila Zsolt Piros >Priority: Major > > This came up in > https://github.com/apache/spark/pull/31133#issuecomment-773567152. > The use case is the following: there is a partitioned Hive table with Avro > data. The schema is specified via "avro.schema.url". > With time the schema is evolved and the new schema is set for the table > "avro.schema.url" when data is read from the old p -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34370) Supporting Avro schema evolution for partitioned Hive tables using "avro.schema.url"
[ https://issues.apache.org/jira/browse/SPARK-34370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-34370: --- Description: This came up in https://github.com/apache/spark/pull/31133#issuecomment-773567152. The use case is the following: there is a partitioned Hive table with Avro data. The schema is specified via https://github.com/apache/spark/pull/31133#discussion_r570561321 was: This came up in https://github.com/apache/spark/pull/31133#issuecomment-773567152. The use case is the following there is a partitioned Hive table with Avro data. The schema is specified via > Supporting Avro schema evolution for partitioned Hive tables using > "avro.schema.url" > > > Key: SPARK-34370 > URL: https://issues.apache.org/jira/browse/SPARK-34370 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1, 2.4.0, 3.0.1, 3.1.0, 3.2.0 >Reporter: Attila Zsolt Piros >Priority: Major > > This came up in > https://github.com/apache/spark/pull/31133#issuecomment-773567152. > The use case is the following: there is a partitioned Hive table with Avro > data. The schema is specified via > https://github.com/apache/spark/pull/31133#discussion_r570561321 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34370) Supporting Avro schema evolution for partitioned Hive tables using "avro.schema.url"
[ https://issues.apache.org/jira/browse/SPARK-34370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-34370: --- Description: This came up in https://github.com/apache/spark/pull/31133#issuecomment-773567152. The use case is the following: there is a partitioned Hive table with Avro data. The schema is specified via was: This came up in https://github.com/apache/spark/pull/31133#issuecomment-773567152. > Supporting Avro schema evolution for partitioned Hive tables using > "avro.schema.url" > > > Key: SPARK-34370 > URL: https://issues.apache.org/jira/browse/SPARK-34370 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1, 2.4.0, 3.0.1, 3.1.0, 3.2.0 >Reporter: Attila Zsolt Piros >Priority: Major > > This came up in > https://github.com/apache/spark/pull/31133#issuecomment-773567152. > The use case is the following: there is a partitioned Hive table with Avro > data. The schema is specified via -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34371) Run datetime rebasing tests for parquet DSv1 and DSv2
Maxim Gekk created SPARK-34371: -- Summary: Run datetime rebasing tests for parquet DSv1 and DSv2 Key: SPARK-34371 URL: https://issues.apache.org/jira/browse/SPARK-34371 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.2.0 Reporter: Maxim Gekk Extract the datetime rebasing tests from ParquetIOSuite and place them in a separate test suite so they run against both implementations, DS v1 and DS v2. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
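The extraction can follow Spark's usual pattern for running one suite against both implementations: a shared base suite holds the tests, and thin subclasses pin "spark.sql.sources.useV1SourceList". A sketch of that pattern follows; class and test names here are illustrative, not the ticket's final code.

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.sql.QueryTest
import org.apache.spark.sql.test.SharedSparkSession

// Shared tests extracted from ParquetIOSuite would live in the base suite.
abstract class ParquetRebaseDatetimeSuite extends QueryTest with SharedSparkSession {
  test("rebase ancient dates/timestamps in read/write") {
    // datetime rebasing assertions go here
  }
}

// DS v1: route parquet through the v1 file source.
class ParquetRebaseDatetimeV1Suite extends ParquetRebaseDatetimeSuite {
  override protected def sparkConf: SparkConf =
    super.sparkConf.set("spark.sql.sources.useV1SourceList", "parquet")
}

// DS v2: an empty v1 list routes parquet through the v2 implementation.
class ParquetRebaseDatetimeV2Suite extends ParquetRebaseDatetimeSuite {
  override protected def sparkConf: SparkConf =
    super.sparkConf.set("spark.sql.sources.useV1SourceList", "")
}
{code}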
[jira] [Commented] (SPARK-34370) Supporting Avro schema evolution for partitioned Hive tables using "avro.schema.url"
[ https://issues.apache.org/jira/browse/SPARK-34370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279156#comment-17279156 ] Attila Zsolt Piros commented on SPARK-34370: I am working on this. > Supporting Avro schema evolution for partitioned Hive tables using > "avro.schema.url" > > > Key: SPARK-34370 > URL: https://issues.apache.org/jira/browse/SPARK-34370 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1, 2.4.0, 3.0.1, 3.1.0, 3.2.0 >Reporter: Attila Zsolt Piros >Priority: Major > > This came up in > https://github.com/apache/spark/pull/31133#issuecomment-773567152. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34370) Supporting Avro schema evolution for partitioned Hive tables using "avro.schema.url"
[ https://issues.apache.org/jira/browse/SPARK-34370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Attila Zsolt Piros updated SPARK-34370: --- Description: This came up in https://github.com/apache/spark/pull/31133#issuecomment-773567152. > Supporting Avro schema evolution for partitioned Hive tables using > "avro.schema.url" > > > Key: SPARK-34370 > URL: https://issues.apache.org/jira/browse/SPARK-34370 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.1, 2.4.0, 3.0.1, 3.1.0, 3.2.0 >Reporter: Attila Zsolt Piros >Priority: Major > > This came up in > https://github.com/apache/spark/pull/31133#issuecomment-773567152. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34370) Supporting Avro schema evolution for partitioned Hive tables using "avro.schema.url"
Attila Zsolt Piros created SPARK-34370: -- Summary: Supporting Avro schema evolution for partitioned Hive tables using "avro.schema.url" Key: SPARK-34370 URL: https://issues.apache.org/jira/browse/SPARK-34370 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.1, 2.4.0, 2.3.1, 3.1.0, 3.2.0 Reporter: Attila Zsolt Piros -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34351) Running into "Py4JJavaError" while counting to text file or list using Pyspark, Jupyter notebook
[ https://issues.apache.org/jira/browse/SPARK-34351?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279128#comment-17279128 ] Huseyin Elci commented on SPARK-34351: -- I searched StackOverflow for this issue but didn't find anything, and I have spent over 3 days trying to solve it. I also looked at http://spark.apache.org/community.html, which has lots of "Py4JJavaError" reports; I checked a few of the comments, but almost none of them describe the same issue, and the rest concern a different "Py4JJavaError" > Running into "Py4JJavaError" while counting to text file or list using > Pyspark, Jupyter notebook > > > Key: SPARK-34351 > URL: https://issues.apache.org/jira/browse/SPARK-34351 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 > Environment: PS> python --version > *Python 3.6.8* > PS> jupyter --version > *jupyter core : 4.7.0* > *jupyter-notebook : 6.2.0* > qtconsole : 5.0.2 > ipython : 7.16.1 > ipykernel : 5.4.3 > jupyter client : 6.1.11 > jupyter lab : not installed > nbconvert : 6.0.7 > ipywidgets : 7.6.3 > nbformat : 5.1.2 > traitlets : 4.3.3 > PS > java -version > *java version "1.8.0_271"* > Java(TM) SE Runtime Environment (build 1.8.0_271-b09) > Java HotSpot(TM) 64-Bit Server VM (build 25.271-b09, mixed mode) > > Spark version > *spark-2.3.1-bin-hadoop2.7* >Reporter: Huseyin Elci >Priority: Major > > I run into the following error. > Any help resolving this error is greatly appreciated. > *My Code 1:* > {code:python} > import findspark > findspark.init("C:\Spark") > from pyspark.sql import SparkSession > from pyspark.conf import SparkConf > spark = SparkSession.builder\ > .master("local[4]")\ > .appName("WordCount_RDD")\ > .getOrCreate() > sc = spark.sparkContext > data = "D:\\05 Spark\\data\\MyArticle.txt" > story_rdd = sc.textFile(data) > story_rdd.count() > {code} > *My Code 2:* > {code:python} > import findspark > findspark.init("C:\Spark") > from pyspark import SparkContext > sc = SparkContext() > mylist = [1,2,2,3,5,48,98,62,14,55] > mylist_rdd = sc.parallelize(mylist) > mylist_rdd.map(lambda x: x*x) > mylist_rdd.map(lambda x: x*x).collect() > {code} > *ERROR:* > I get the same error for both code samples.
> {code:python} > --- > Py4JJavaError Traceback (most recent call last) > in > > 1 story_rdd.count() > C:\Spark\python\pyspark\rdd.py in count(self) > 1071 3 > 1072 """ > -> 1073 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() > 1074 > 1075 def stats(self): > C:\Spark\python\pyspark\rdd.py in sum(self) > 1062 6.0 > 1063 """ > -> 1064 return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add) > 1065 > 1066 def count(self): > C:\Spark\python\pyspark\rdd.py in fold(self, zeroValue, op) > 933 # zeroValue provided to each partition is unique from the one provided > 934 # to the final reduce call > --> 935 vals = self.mapPartitions(func).collect() > 936 return reduce(op, vals, zeroValue) > 937 > C:\Spark\python\pyspark\rdd.py in collect(self) > 832 """ > 833 with SCCallSiteSync(self.context) as css: > --> 834 sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd()) > 835 return list(_load_from_socket(sock_info, self._jrdd_deserializer)) > 836 > C:\Spark\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py in > __call__(self, *args) > 1255 answer = self.gateway_client.send_command(command) > 1256 return_value = get_return_value( > -> 1257 answer, self.gateway_client, self.target_id, self.name) > 1258 > 1259 for temp_arg in temp_args: > C:\Spark\python\pyspark\sql\utils.py in deco(*a, **kw) > 61 def deco(*a, **kw): > 62 try: > ---> 63 return f(*a, **kw) > 64 except py4j.protocol.Py4JJavaError as e: > 65 s = e.java_exception.toString() > C:\Spark\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py in > get_return_value(answer, gateway_client, target_id, name) > 326 raise Py4JJavaError( > 327 "An error occurred while calling {0}{1}{2}.\n". > --> 328 format(target_id, ".", name), value) > 329 else: > 330 raise Py4JError( > Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.collectAndServe. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 > in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 > (TID 1, localhost, executor driver): org.apache.spark.SparkException: Python > worker failed to connect back. > at > org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:148) > at > org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:76) > at
[jira] [Commented] (SPARK-34369) Track number of pairs processed out of Join
[ https://issues.apache.org/jira/browse/SPARK-34369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279126#comment-17279126 ] Apache Spark commented on SPARK-34369: -- User 'sririshindra' has created a pull request for this issue: https://github.com/apache/spark/pull/31477 > Track number of pairs processed out of Join > --- > > Key: SPARK-34369 > URL: https://issues.apache.org/jira/browse/SPARK-34369 > Project: Spark > Issue Type: New Feature > Components: Web UI >Affects Versions: 3.2.0 >Reporter: Srinivas Rishindra Pothireddi >Priority: Major > > Often users face a scenario where even a modest skew in a join can lead to > tasks appearing to be stuck, due to the O(n^2) nature of a join considering > all pairs of rows with matching keys. When this happens users think that > Spark has gotten deadlocked. If there is a bound condition, the "number of > output rows" metric may look typical. Other metrics may look very modest (e.g. > shuffle read). In those cases, it is very hard to understand what the problem > is. There is no conclusive proof without getting a heap dump and looking at > some internal data structures. > It would be much better if Spark had a metric (which we propose be titled > “number of matched pairs” as a companion to “number of output rows”) which > showed the user how many pairs were being processed in the join. This would > get updated in the live UI (when metrics get collected during heartbeats), so > the user could easily see what was going on. > This would even help in cases where there was some other cause of a stuck > executor (e.g. network issues) just to disprove this theory. For example, you > may have 100k records with the same key on each side of a join. That probably > won't really show up as extreme skew in task input data. But it'll become 10B > join pairs that Spark works through, in one task. > > To further demonstrate the usefulness of this metric, please follow the steps > below. > > _val df1 = spark.range(0, 20).map { x => (x % 20, 20) }.toDF("b", > "c")_ > _val df2 = spark.range(0, 30).map { x => (77, 20) }.toDF("b", "c")_ > > _val df3 = spark.range(0, 20).map(x => (x + 1, x + 2)).toDF("b", "c")_ > _val df4 = spark.range(0, 30).map(x => (77, x + 2)).toDF("b", "c")_ > > _val df5 = df1.union(df2)_ > _val df6 = df3.union(df4)_ > > _df5.createOrReplaceTempView("table1")_ > _df6.createOrReplaceTempView("table2")_ > h3. InnerJoin > _sql("select p.*, f.* from table2 p join table1 f on f.b = p.b and f.c > > p.c").count_ > _number of output rows: 5,580,000_ > _number of matched pairs: 90,000,490,000_ > h3. FullOuterJoin > _spark.sql("select p.*, f.* from table2 p full outer join table1 f on f.b = > p.b and f.c > p.c").count_ > _number of output rows: 6,099,964_ > _number of matched pairs: 90,000,490,000_ > h3. LeftOuterJoin > _sql("select p.*, f.* from table2 p left outer join table1 f on f.b = p.b > and f.c > p.c").count_ > _number of output rows: 6,079,964_ > _number of matched pairs: 90,000,490,000_ > h3. RightOuterJoin > _spark.sql("select p.*, f.* from table2 p right outer join table1 f on f.b > = p.b and f.c > p.c").count_ > _number of output rows: 5,600,000_ > _number of matched pairs: 90,000,490,000_ > h3. LeftSemiJoin > _spark.sql("select * from table2 p left semi join table1 f on f.b = p.b and > f.c > p.c").count_ > _number of output rows: 36_ > _number of matched pairs: 89,994,910,036_ > h3. 
CrossJoin > _spark.sql("select p.*, f.* from table2 p cross join table1 f on f.b = p.b > and f.c > p.c").count_ > _number of output rows: 5,580,000_ > _number of matched pairs: 90,000,490,000_ > h3. LeftAntiJoin > _spark.sql("select * from table2 p anti join table1 f on f.b = p.b and f.c > > p.c").count_ > number of output rows: 499,964 > number of matched pairs: 89,994,910,036 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
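For readers who want to reproduce the effect, here is a runnable spark-shell sketch of the demo above. Row counts are scaled down (the counts in the ticket text are partly garbled), so the metric values will differ, but the single hot key (77) on both sides produces the same blow-up.

{code:scala}
import spark.implicits._

val df1 = spark.range(0, 200).map { x => (x % 20, 20L) }.toDF("b", "c")
val df2 = spark.range(0, 300).map { x => (77L, 20L) }.toDF("b", "c")
val df3 = spark.range(0, 200).map(x => (x + 1, x + 2)).toDF("b", "c")
val df4 = spark.range(0, 300).map(x => (77L, x + 2)).toDF("b", "c")

val df5 = df1.union(df2)
val df6 = df3.union(df4)

df5.createOrReplaceTempView("table1")
df6.createOrReplaceTempView("table2")

// The bound condition f.c > p.c keeps "number of output rows" modest, yet the
// join still examines every pair sharing key 77; the proposed "number of
// matched pairs" metric would surface that hidden work in the SQL tab.
spark.sql("select p.*, f.* from table2 p join table1 f on f.b = p.b and f.c > p.c").count()
{code}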
[jira] [Assigned] (SPARK-34369) Track number of pairs processed out of Join
[ https://issues.apache.org/jira/browse/SPARK-34369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34369: Assignee: Apache Spark > Track number of pairs processed out of Join > --- > > Key: SPARK-34369 > URL: https://issues.apache.org/jira/browse/SPARK-34369 > Project: Spark > Issue Type: New Feature > Components: Web UI >Affects Versions: 3.2.0 >Reporter: Srinivas Rishindra Pothireddi >Assignee: Apache Spark >Priority: Major > > Often users face a scenario where even a modest skew in a join can lead to > tasks appearing to be stuck, due to the O(n^2) nature of a join considering > all pairs of rows with matching keys. When this happens users think that > Spark has gotten deadlocked. If there is a bound condition, the "number of > output rows" metric may look typical. Other metrics may look very modest (e.g. > shuffle read). In those cases, it is very hard to understand what the problem > is. There is no conclusive proof without getting a heap dump and looking at > some internal data structures. > It would be much better if Spark had a metric (which we propose be titled > “number of matched pairs” as a companion to “number of output rows”) which > showed the user how many pairs were being processed in the join. This would > get updated in the live UI (when metrics get collected during heartbeats), so > the user could easily see what was going on. > This would even help in cases where there was some other cause of a stuck > executor (e.g. network issues) just to disprove this theory. For example, you > may have 100k records with the same key on each side of a join. That probably > won't really show up as extreme skew in task input data. But it'll become 10B > join pairs that Spark works through, in one task. > > To further demonstrate the usefulness of this metric, please follow the steps > below. > > _val df1 = spark.range(0, 20).map { x => (x % 20, 20) }.toDF("b", > "c")_ > _val df2 = spark.range(0, 30).map { x => (77, 20) }.toDF("b", "c")_ > > _val df3 = spark.range(0, 20).map(x => (x + 1, x + 2)).toDF("b", "c")_ > _val df4 = spark.range(0, 30).map(x => (77, x + 2)).toDF("b", "c")_ > > _val df5 = df1.union(df2)_ > _val df6 = df3.union(df4)_ > > _df5.createOrReplaceTempView("table1")_ > _df6.createOrReplaceTempView("table2")_ > h3. InnerJoin > _sql("select p.*, f.* from table2 p join table1 f on f.b = p.b and f.c > > p.c").count_ > _number of output rows: 5,580,000_ > _number of matched pairs: 90,000,490,000_ > h3. FullOuterJoin > _spark.sql("select p.*, f.* from table2 p full outer join table1 f on f.b = > p.b and f.c > p.c").count_ > _number of output rows: 6,099,964_ > _number of matched pairs: 90,000,490,000_ > h3. LeftOuterJoin > _sql("select p.*, f.* from table2 p left outer join table1 f on f.b = p.b > and f.c > p.c").count_ > _number of output rows: 6,079,964_ > _number of matched pairs: 90,000,490,000_ > h3. RightOuterJoin > _spark.sql("select p.*, f.* from table2 p right outer join table1 f on f.b > = p.b and f.c > p.c").count_ > _number of output rows: 5,600,000_ > _number of matched pairs: 90,000,490,000_ > h3. LeftSemiJoin > _spark.sql("select * from table2 p left semi join table1 f on f.b = p.b and > f.c > p.c").count_ > _number of output rows: 36_ > _number of matched pairs: 89,994,910,036_ > h3. CrossJoin > _spark.sql("select p.*, f.* from table2 p cross join table1 f on f.b = p.b > and f.c > p.c").count_ > _number of output rows: 5,580,000_ > _number of matched pairs: 90,000,490,000_ > h3. 
LeftAntiJoin > _spark.sql("select * from table2 p anti join table1 f on f.b = p.b and f.c > > p.c").count_ > number of output rows: 499,964 > number of matched pairs: 89,994,910,036 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34369) Track number of pairs processed out of Join
[ https://issues.apache.org/jira/browse/SPARK-34369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34369: Assignee: (was: Apache Spark) > Track number of pairs processed out of Join > --- > > Key: SPARK-34369 > URL: https://issues.apache.org/jira/browse/SPARK-34369 > Project: Spark > Issue Type: New Feature > Components: Web UI >Affects Versions: 3.2.0 >Reporter: Srinivas Rishindra Pothireddi >Priority: Major > > Often users face a scenario where even a modest skew in a join can lead to > tasks appearing to be stuck, due to the O(n^2) nature of a join considering > all pairs of rows with matching keys. When this happens users think that > Spark has gotten deadlocked. If there is a bound condition, the "number of > output rows" metric may look typical. Other metrics may look very modest (e.g. > shuffle read). In those cases, it is very hard to understand what the problem > is. There is no conclusive proof without getting a heap dump and looking at > some internal data structures. > It would be much better if Spark had a metric (which we propose be titled > “number of matched pairs” as a companion to “number of output rows”) which > showed the user how many pairs were being processed in the join. This would > get updated in the live UI (when metrics get collected during heartbeats), so > the user could easily see what was going on. > This would even help in cases where there was some other cause of a stuck > executor (e.g. network issues) just to disprove this theory. For example, you > may have 100k records with the same key on each side of a join. That probably > won't really show up as extreme skew in task input data. But it'll become 10B > join pairs that Spark works through, in one task. > > To further demonstrate the usefulness of this metric, please follow the steps > below. > > _val df1 = spark.range(0, 20).map { x => (x % 20, 20) }.toDF("b", > "c")_ > _val df2 = spark.range(0, 30).map { x => (77, 20) }.toDF("b", "c")_ > > _val df3 = spark.range(0, 20).map(x => (x + 1, x + 2)).toDF("b", "c")_ > _val df4 = spark.range(0, 30).map(x => (77, x + 2)).toDF("b", "c")_ > > _val df5 = df1.union(df2)_ > _val df6 = df3.union(df4)_ > > _df5.createOrReplaceTempView("table1")_ > _df6.createOrReplaceTempView("table2")_ > h3. InnerJoin > _sql("select p.*, f.* from table2 p join table1 f on f.b = p.b and f.c > > p.c").count_ > _number of output rows: 5,580,000_ > _number of matched pairs: 90,000,490,000_ > h3. FullOuterJoin > _spark.sql("select p.*, f.* from table2 p full outer join table1 f on f.b = > p.b and f.c > p.c").count_ > _number of output rows: 6,099,964_ > _number of matched pairs: 90,000,490,000_ > h3. LeftOuterJoin > _sql("select p.*, f.* from table2 p left outer join table1 f on f.b = p.b > and f.c > p.c").count_ > _number of output rows: 6,079,964_ > _number of matched pairs: 90,000,490,000_ > h3. RightOuterJoin > _spark.sql("select p.*, f.* from table2 p right outer join table1 f on f.b > = p.b and f.c > p.c").count_ > _number of output rows: 5,600,000_ > _number of matched pairs: 90,000,490,000_ > h3. LeftSemiJoin > _spark.sql("select * from table2 p left semi join table1 f on f.b = p.b and > f.c > p.c").count_ > _number of output rows: 36_ > _number of matched pairs: 89,994,910,036_ > h3. CrossJoin > _spark.sql("select p.*, f.* from table2 p cross join table1 f on f.b = p.b > and f.c > p.c").count_ > _number of output rows: 5,580,000_ > _number of matched pairs: 90,000,490,000_ > h3. 
LeftAntiJoin > _spark.sql("select * from table2 p anti join table1 f on f.b = p.b and f.c > > p.c").count_ > number of output rows: 499,964 > number of matched pairs: 89,994,910,036 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34369) Track number of pairs processed out of Join
[ https://issues.apache.org/jira/browse/SPARK-34369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279125#comment-17279125 ] Apache Spark commented on SPARK-34369: -- User 'sririshindra' has created a pull request for this issue: https://github.com/apache/spark/pull/31477 > Track number of pairs processed out of Join > --- > > Key: SPARK-34369 > URL: https://issues.apache.org/jira/browse/SPARK-34369 > Project: Spark > Issue Type: New Feature > Components: Web UI >Affects Versions: 3.2.0 >Reporter: Srinivas Rishindra Pothireddi >Priority: Major > > Often users face a scenario where even a modest skew in a join can lead to > tasks appearing to be stuck, due to the O(n^2) nature of a join considering > all pairs of rows with matching keys. When this happens users think that > Spark has gotten deadlocked. If there is a bound condition, the "number of > output rows" metric may look typical. Other metrics may look very modest (e.g. > shuffle read). In those cases, it is very hard to understand what the problem > is. There is no conclusive proof without getting a heap dump and looking at > some internal data structures. > It would be much better if Spark had a metric (which we propose be titled > “number of matched pairs” as a companion to “number of output rows”) which > showed the user how many pairs were being processed in the join. This would > get updated in the live UI (when metrics get collected during heartbeats), so > the user could easily see what was going on. > This would even help in cases where there was some other cause of a stuck > executor (e.g. network issues) just to disprove this theory. For example, you > may have 100k records with the same key on each side of a join. That probably > won't really show up as extreme skew in task input data. But it'll become 10B > join pairs that Spark works through, in one task. > > To further demonstrate the usefulness of this metric, please follow the steps > below. > > _val df1 = spark.range(0, 20).map { x => (x % 20, 20) }.toDF("b", > "c")_ > _val df2 = spark.range(0, 30).map { x => (77, 20) }.toDF("b", "c")_ > > _val df3 = spark.range(0, 20).map(x => (x + 1, x + 2)).toDF("b", "c")_ > _val df4 = spark.range(0, 30).map(x => (77, x + 2)).toDF("b", "c")_ > > _val df5 = df1.union(df2)_ > _val df6 = df3.union(df4)_ > > _df5.createOrReplaceTempView("table1")_ > _df6.createOrReplaceTempView("table2")_ > h3. InnerJoin > _sql("select p.*, f.* from table2 p join table1 f on f.b = p.b and f.c > > p.c").count_ > _number of output rows: 5,580,000_ > _number of matched pairs: 90,000,490,000_ > h3. FullOuterJoin > _spark.sql("select p.*, f.* from table2 p full outer join table1 f on f.b = > p.b and f.c > p.c").count_ > _number of output rows: 6,099,964_ > _number of matched pairs: 90,000,490,000_ > h3. LeftOuterJoin > _sql("select p.*, f.* from table2 p left outer join table1 f on f.b = p.b > and f.c > p.c").count_ > _number of output rows: 6,079,964_ > _number of matched pairs: 90,000,490,000_ > h3. RightOuterJoin > _spark.sql("select p.*, f.* from table2 p right outer join table1 f on f.b > = p.b and f.c > p.c").count_ > _number of output rows: 5,600,000_ > _number of matched pairs: 90,000,490,000_ > h3. LeftSemiJoin > _spark.sql("select * from table2 p left semi join table1 f on f.b = p.b and > f.c > p.c").count_ > _number of output rows: 36_ > _number of matched pairs: 89,994,910,036_ > h3. 
CrossJoin > _spark.sql("select p.*, f.* from table2 p cross join table1 f on f.b = p.b > and f.c > p.c").count_ > _number of output rows: 5,580,000_ > _number of matched pairs: 90,000,490,000_ > h3. LeftAntiJoin > _spark.sql("select * from table2 p anti join table1 f on f.b = p.b and f.c > > p.c").count_ > number of output rows: 499,964 > number of matched pairs: 89,994,910,036 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34366) Add metric interfaces to DS v2
[ https://issues.apache.org/jira/browse/SPARK-34366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279114#comment-17279114 ] Apache Spark commented on SPARK-34366: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/31476 > Add metric interfaces to DS v2 > -- > > Key: SPARK-34366 > URL: https://issues.apache.org/jira/browse/SPARK-34366 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > Add a few public API changes to DS v2 so that a DS v2 scan can report metrics > to Spark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
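For context, here is a sketch of what such metric interfaces could look like. The actual API was still under review in the linked PR at the time, so the trait and method names below are illustrative, not the final Spark API.

{code:scala}
// Driver-side description of one metric, including how per-task values are
// aggregated for display in the UI.
trait CustomMetric {
  def name(): String
  def description(): String
  def aggregateTaskMetrics(taskMetrics: Array[Long]): String
}

// A metric value emitted by one task on an executor.
trait CustomTaskMetric {
  def name(): String
  def value(): Long
}

// A DS v2 scan would declare the metrics it supports; partition readers would
// then report current values, which Spark collects alongside task metrics.
trait SupportsReportMetrics {
  def supportedCustomMetrics(): Array[CustomMetric]
}
{code}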
[jira] [Assigned] (SPARK-34366) Add metric interfaces to DS v2
[ https://issues.apache.org/jira/browse/SPARK-34366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34366: Assignee: L. C. Hsieh (was: Apache Spark) > Add metric interfaces to DS v2 > -- > > Key: SPARK-34366 > URL: https://issues.apache.org/jira/browse/SPARK-34366 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > Add a few public API changes to DS v2 so that a DS v2 scan can report metrics > to Spark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34366) Add metric interfaces to DS v2
[ https://issues.apache.org/jira/browse/SPARK-34366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34366: Assignee: Apache Spark (was: L. C. Hsieh) > Add metric interfaces to DS v2 > -- > > Key: SPARK-34366 > URL: https://issues.apache.org/jira/browse/SPARK-34366 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: Apache Spark >Priority: Major > > Add a few public API changes to DS v2 so that a DS v2 scan can report metrics > to Spark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34365) Support configurable Avro schema field matching for positional or by-name
[ https://issues.apache.org/jira/browse/SPARK-34365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erik Krogen updated SPARK-34365: Description: When reading an Avro dataset (using the dataset's schema or by overriding it with 'avroSchema') or writing an Avro dataset with a schema provided by 'avroSchema', currently the matching of Catalyst-to-Avro fields is done by field name. This behavior is somewhat recent; prior to SPARK-27762 (fixed in 3.0.0), at least on the write path, we would match the schemas positionally (a "structural" comparison). While I agree that this is much more sensible for default behavior, I propose that we make this behavior configurable using an {{option}} for the Avro datasource. Even at the time that SPARK-27762 was handled, there was [interest in making this behavior configurable|https://github.com/apache/spark/pull/24635#issuecomment-494205251], but it appears it went unaddressed. There is precedent for making this behavior configurable, as seen in SPARK-32864, which added this support for ORC. Besides this precedent, the behavior of Hive is to perform matching positionally ([ref|https://cwiki.apache.org/confluence/display/Hive/AvroSerDe#AvroSerDe-WritingtablestoAvrofiles]), so this is behavior that Hadoop/Hive ecosystem users are familiar with: {quote} Hive is very forgiving about types: it will attempt to store whatever value matches the provided column in the equivalent column position in the new table. No matching is done on column names, for instance. {quote} was: When reading an Avro dataset (using the dataset's schema or by overriding it with 'avroSchema') or writing an Avro dataset with a schema provided by 'avroSchema', currently the matching of Catalyst-to-Avro fields is done by field name. This behavior is somewhat recent; prior to SPARK-27762 (fixed in 3.0.0), at least on the write path, we would match the schemas positionally (a "structural" comparison). While I agree that this is much more sensible for default behavior, I propose that we make this behavior configurable using an {{option}} for the Avro datasource. There is precedent for making this behavior configurable, as seen in SPARK-32864, which added this support for ORC. Besides this precedent, the behavior of Hive is to perform matching positionally ([ref|https://cwiki.apache.org/confluence/display/Hive/AvroSerDe#AvroSerDe-WritingtablestoAvrofiles]), so this is behavior that Hadoop/Hive ecosystem users are familiar with: {quote} Hive is very forgiving about types: it will attempt to store whatever value matches the provided column in the equivalent column position in the new table. No matching is done on column names, for instance. {quote} > Support configurable Avro schema field matching for positional or by-name > - > > Key: SPARK-34365 > URL: https://issues.apache.org/jira/browse/SPARK-34365 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.1 >Reporter: Erik Krogen >Priority: Major > > When reading an Avro dataset (using the dataset's schema or by overriding it > with 'avroSchema') or writing an Avro dataset with a schema provided by > 'avroSchema', currently the matching of Catalyst-to-Avro fields is done by > field name. > This behavior is somewhat recent; prior to SPARK-27762 (fixed in 3.0.0), at > least on the write path, we would match the schemas positionally > (a "structural" comparison). 
While I agree that this is much more sensible for > default behavior, I propose that we make this behavior configurable using an > {{option}} for the Avro datasource. Even at the time that SPARK-27762 was > handled, there was [interest in making this behavior > configurable|https://github.com/apache/spark/pull/24635#issuecomment-494205251], > but it appears it went unaddressed. > There is precedent for making this behavior configurable, as seen in > SPARK-32864, which added this support for ORC. Besides this precedent, the > behavior of Hive is to perform matching positionally > ([ref|https://cwiki.apache.org/confluence/display/Hive/AvroSerDe#AvroSerDe-WritingtablestoAvrofiles]), > so this is behavior that Hadoop/Hive ecosystem users are familiar with: > {quote} > Hive is very forgiving about types: it will attempt to store whatever value > matches the provided column in the equivalent column position in the new > table. No matching is done on column names, for instance. > {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
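As an illustration of the proposal, reading with an explicit 'avroSchema' plus a positional-matching switch might look like the sketch below. The option name "positionalFieldMatching" and the path are hypothetical here, since the ticket only proposes making the behavior configurable.

{code:scala}
// Caller-supplied Avro schema; with positional matching, "first" and "second"
// would bind to the Catalyst columns by position rather than by name.
val avroSchema =
  """{"type": "record", "name": "Rec", "fields": [
    |  {"name": "first", "type": "long"},
    |  {"name": "second", "type": "string"}
    |]}""".stripMargin

val df = spark.read
  .format("avro")
  .option("avroSchema", avroSchema)
  .option("positionalFieldMatching", "true") // hypothetical option name
  .load("/tmp/events.avro")
{code}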
[jira] [Updated] (SPARK-34369) Track number of pairs processed out of Join
[ https://issues.apache.org/jira/browse/SPARK-34369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Srinivas Rishindra Pothireddi updated SPARK-34369: -- Description: Often users face a scenario where even a modest skew in a join can lead to tasks appearing to be stuck, due to the O(n^2) nature of a join considering all pairs of rows with matching keys. When this happens users think that Spark has gotten deadlocked. If there is a bound condition, the "number of output rows" metric may look typical. Other metrics may look very modest (e.g. shuffle read). In those cases, it is very hard to understand what the problem is. There is no conclusive proof without getting a heap dump and looking at some internal data structures. It would be much better if Spark had a metric (which we propose be titled “number of matched pairs” as a companion to “number of output rows”) which showed the user how many pairs were being processed in the join. This would get updated in the live UI (when metrics get collected during heartbeats), so the user could easily see what was going on. This would even help in cases where there was some other cause of a stuck executor (e.g. network issues) just to disprove this theory. For example, you may have 100k records with the same key on each side of a join. That probably won't really show up as extreme skew in task input data. But it'll become 10B join pairs that Spark works through, in one task. To further demonstrate the usefulness of this metric, please follow the steps below. _val df1 = spark.range(0, 20).map { x => (x % 20, 20) }.toDF("b", "c")_ _val df2 = spark.range(0, 30).map { x => (77, 20) }.toDF("b", "c")_ _val df3 = spark.range(0, 20).map(x => (x + 1, x + 2)).toDF("b", "c")_ _val df4 = spark.range(0, 30).map(x => (77, x + 2)).toDF("b", "c")_ _val df5 = df1.union(df2)_ _val df6 = df3.union(df4)_ _df5.createOrReplaceTempView("table1")_ _df6.createOrReplaceTempView("table2")_ h3. InnerJoin _sql("select p.*, f.* from table2 p join table1 f on f.b = p.b and f.c > p.c").count_ _number of output rows: 5,580,000_ _number of matched pairs: 90,000,490,000_ h3. FullOuterJoin _spark.sql("select p.*, f.* from table2 p full outer join table1 f on f.b = p.b and f.c > p.c").count_ _number of output rows: 6,099,964_ _number of matched pairs: 90,000,490,000_ h3. LeftOuterJoin _sql("select p.*, f.* from table2 p left outer join table1 f on f.b = p.b and f.c > p.c").count_ _number of output rows: 6,079,964_ _number of matched pairs: 90,000,490,000_ h3. RightOuterJoin _spark.sql("select p.*, f.* from table2 p right outer join table1 f on f.b = p.b and f.c > p.c").count_ _number of output rows: 5,600,000_ _number of matched pairs: 90,000,490,000_ h3. LeftSemiJoin _spark.sql("select * from table2 p left semi join table1 f on f.b = p.b and f.c > p.c").count_ _number of output rows: 36_ _number of matched pairs: 89,994,910,036_ h3. CrossJoin _spark.sql("select p.*, f.* from table2 p cross join table1 f on f.b = p.b and f.c > p.c").count_ _number of output rows: 5,580,000_ _number of matched pairs: 90,000,490,000_ h3. LeftAntiJoin _spark.sql("select * from table2 p anti join table1 f on f.b = p.b and f.c > p.c").count_ number of output rows: 499,964 number of matched pairs: 89,994,910,036 was: Often users face a scenario where even a modest skew in a join can lead to tasks appearing to be stuck, due to the O(n^2) nature of a join considering all pairs of rows with matching keys. When this happens users think that Spark has gotten deadlocked. 
If there is a bound condition, the "number of output rows" metric may look typical. Other metrics may look very modest (e.g. shuffle read). In those cases, it is very hard to understand what the problem is. There is no conclusive proof without getting a heap dump and looking at some internal data structures. It would be much better if Spark had a metric (which we propose be titled “number of matched pairs” as a companion to “number of output rows”) which showed the user how many pairs were being processed in the join. This would get updated in the live UI (when metrics get collected during heartbeats), so the user could easily see what was going on. This would even help in cases where there was some other cause of a stuck executor (e.g. network issues) just to disprove this theory. For example, you may have 100k records with the same key on each side of a join. That probably won't really show up as extreme skew in task input data. But it'll become 10B join pairs that Spark works through, in one task. To further demonstrate the usefulness of this metric, please follow the steps below. _val df1 = spark.range(0, 20).map { x => (x % 20, 20) }.toDF("b", "c")_ _val df2 = spark.range(0, 30).map { x => (77, 20) }.toDF("b", "c")_
[jira] [Commented] (SPARK-34369) Track number of pairs processed out of Join
[ https://issues.apache.org/jira/browse/SPARK-34369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279098#comment-17279098 ] Srinivas Rishindra Pothireddi commented on SPARK-34369: --- I am working on this > Track number of pairs processed out of Join > --- > > Key: SPARK-34369 > URL: https://issues.apache.org/jira/browse/SPARK-34369 > Project: Spark > Issue Type: New Feature > Components: Web UI >Affects Versions: 3.2.0 >Reporter: Srinivas Rishindra Pothireddi >Priority: Major > > Often users face a scenario where even a modest skew in a join can lead to > tasks appearing to be stuck, due to the O(n^2) nature of a join considering > all pairs of rows with matching keys. When this happens users think that > Spark has gotten deadlocked. If there is a bound condition, the "number of > output rows" metric may look typical. Other metrics may look very modest (e.g. > shuffle read). In those cases, it is very hard to understand what the problem > is. There is no conclusive proof without getting a heap dump and looking at > some internal data structures. > It would be much better if Spark had a metric (which we propose be titled > “number of matched pairs” as a companion to “number of output rows”) which > showed the user how many pairs were being processed in the join. This would > get updated in the live UI (when metrics get collected during heartbeats), so > the user could easily see what was going on. > This would even help in cases where there was some other cause of a stuck > executor (e.g. network issues) just to disprove this theory. For example, you > may have 100k records with the same key on each side of a join. That probably > won't really show up as extreme skew in task input data. But it'll become 10B > join pairs that Spark works through, in one task. > > To further demonstrate the usefulness of this metric, please follow the steps > below. > > _val df1 = spark.range(0, 20).map { x => (x % 20, 20) }.toDF("b", > "c")_ > _val df2 = spark.range(0, 30).map { x => (77, 20) }.toDF("b", "c")_ > > _val df3 = spark.range(0, 20).map(x => (x + 1, x + 2)).toDF("b", "c")_ > _val df4 = spark.range(0, 30).map(x => (77, x + 2)).toDF("b", "c")_ > > _val df5 = df1.union(df2)_ > _val df6 = df3.union(df4)_ > > _df5.createOrReplaceTempView("table1")_ > _df6.createOrReplaceTempView("table2")_ > h3. InnerJoin > _sql("select p.*, f.* from table2 p join table1 f on f.b = p.b and f.c > > p.c").count_ > _number of output rows: 5,580,000_ > _number of matched pairs: 90,000,490,000_ > h3. FullOuterJoin > _spark.sql("select p.*, f.* from table2 p full outer join table1 f on f.b = > p.b and f.c > p.c").count_ > _number of output rows: 6,099,964_ > _number of matched pairs: 90,000,490,000_ > h3. LeftOuterJoin > _sql("select p.*, f.* from table2 p left outer join table1 f on f.b = p.b and > f.c > p.c").count_ > _number of output rows: 6,079,964_ > _number of matched pairs: 90,000,490,000_ > h3. RightOuterJoin > _spark.sql("select p.*, f.* from table2 p right outer join table1 f on f.b = > p.b and f.c > p.c").count_ > _number of output rows: 5,600,000_ > _number of matched pairs: 90,000,490,000_ > h3. LeftSemiJoin > _spark.sql("select * from table2 p left semi join table1 f on f.b = p.b and > f.c > p.c").count_ > _number of output rows: 36_ > _number of matched pairs: 89,994,910,036_ > h3. 
CrossJoin > _spark.sql("select p.*, f.* from table2 p cross join table1 f on f.b = p.b > and f.c > p.c").count_ > _number of output rows: 5,580,000_ > _number of matched pairs: 90,000,490,000_ > h3. LeftAntiJoin > _spark.sql("select * from table2 p anti join table1 f on f.b = p.b and f.c > > p.c").count_ > number of output rows: 499,964 > number of matched pairs: 89,994,910,036 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34369) Track number of pairs processed out of Join
Srinivas Rishindra Pothireddi created SPARK-34369: - Summary: Track number of pairs processed out of Join Key: SPARK-34369 URL: https://issues.apache.org/jira/browse/SPARK-34369 Project: Spark Issue Type: New Feature Components: Web UI Affects Versions: 3.2.0 Reporter: Srinivas Rishindra Pothireddi Often users face a scenario where even a modest skew in a join can lead to tasks appearing to be stuck, due to the O(n^2) nature of a join considering all pairs of rows with matching keys. When this happens users think that Spark has gotten deadlocked. If there is a bound condition, the "number of output rows" metric may look typical. Other metrics may look very modest (e.g. shuffle read). In those cases, it is very hard to understand what the problem is. There is no conclusive proof without getting a heap dump and looking at some internal data structures. It would be much better if Spark had a metric (which we propose be titled “number of matched pairs” as a companion to “number of output rows”) which showed the user how many pairs were being processed in the join. This would get updated in the live UI (when metrics get collected during heartbeats), so the user could easily see what was going on. This would even help in cases where there was some other cause of a stuck executor (e.g. network issues) just to disprove this theory. For example, you may have 100k records with the same key on each side of a join. That probably won't really show up as extreme skew in task input data. But it'll become 10B join pairs that Spark works through, in one task. To further demonstrate the usefulness of this metric, please follow the steps below. _val df1 = spark.range(0, 20).map { x => (x % 20, 20) }.toDF("b", "c")_ _val df2 = spark.range(0, 30).map { x => (77, 20) }.toDF("b", "c")_ _val df3 = spark.range(0, 20).map(x => (x + 1, x + 2)).toDF("b", "c")_ _val df4 = spark.range(0, 30).map(x => (77, x + 2)).toDF("b", "c")_ _val df5 = df1.union(df2)_ _val df6 = df3.union(df4)_ _df5.createOrReplaceTempView("table1")_ _df6.createOrReplaceTempView("table2")_ h3. InnerJoin _sql("select p.*, f.* from table2 p join table1 f on f.b = p.b and f.c > p.c").count_ _number of output rows: 5,580,000_ _number of matched pairs: 90,000,490,000_ h3. FullOuterJoin _spark.sql("select p.*, f.* from table2 p full outer join table1 f on f.b = p.b and f.c > p.c").count_ _number of output rows: 6,099,964_ _number of matched pairs: 90,000,490,000_ h3. LeftOuterJoin _sql("select p.*, f.* from table2 p left outer join table1 f on f.b = p.b and f.c > p.c").count_ _number of output rows: 6,079,964_ _number of matched pairs: 90,000,490,000_ h3. RightOuterJoin _spark.sql("select p.*, f.* from table2 p right outer join table1 f on f.b = p.b and f.c > p.c").count_ _number of output rows: 5,600,000_ _number of matched pairs: 90,000,490,000_ h3. LeftSemiJoin _spark.sql("select * from table2 p left semi join table1 f on f.b = p.b and f.c > p.c").count_ _number of output rows: 36_ _number of matched pairs: 89,994,910,036_ h3. CrossJoin _spark.sql("select p.*, f.* from table2 p cross join table1 f on f.b = p.b and f.c > p.c").count_ _number of output rows: 5,580,000_ _number of matched pairs: 90,000,490,000_ h3. 
LeftAntiJoin _spark.sql("select * from table2 p anti join table1 f on f.b = p.b and f.c > p.c").count_ number of output rows: 499,964 number of matched pairs: 89,994,910,036 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34368) Streaming implementation for metrics from Datasource v2 scan
L. C. Hsieh created SPARK-34368: --- Summary: Streaming implementation for metrics from Datasource v2 scan Key: SPARK-34368 URL: https://issues.apache.org/jira/browse/SPARK-34368 Project: Spark Issue Type: Sub-task Components: Structured Streaming Affects Versions: 3.2.0 Reporter: L. C. Hsieh Assignee: L. C. Hsieh Use the metrics interface of DS v2 to report metrics for the streaming scan. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34367) Batch implementation for metrics from Datasource v2 scan
[ https://issues.apache.org/jira/browse/SPARK-34367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh updated SPARK-34367: Description: Use the metrics interface of DS v2 to report metrics for the batch scan. > Batch implementation for metrics from Datasource v2 scan > > > Key: SPARK-34367 > URL: https://issues.apache.org/jira/browse/SPARK-34367 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > Use the metrics interface of DS v2 to report metrics for the batch scan. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34367) Batch implementation for metrics from Datasource v2 scan
L. C. Hsieh created SPARK-34367: --- Summary: Batch implementation for metrics from Datasource v2 scan Key: SPARK-34367 URL: https://issues.apache.org/jira/browse/SPARK-34367 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: L. C. Hsieh Assignee: L. C. Hsieh -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34338) Report metrics from Datasource v2 scan
[ https://issues.apache.org/jira/browse/SPARK-34338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] L. C. Hsieh updated SPARK-34338: Issue Type: Umbrella (was: Improvement) > Report metrics from Datasource v2 scan > -- > > Key: SPARK-34338 > URL: https://issues.apache.org/jira/browse/SPARK-34338 > Project: Spark > Issue Type: Umbrella > Components: SQL >Affects Versions: 3.2.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > This is related to SPARK-34297. > In SPARK-34297, we want to add a couple of useful metrics when reading from > Kafka in SS. We need some public API changes in DS v2 to make this possible. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34366) Add metric interfaces to DS v2
L. C. Hsieh created SPARK-34366: --- Summary: Add metric interfaces to DS v2 Key: SPARK-34366 URL: https://issues.apache.org/jira/browse/SPARK-34366 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: L. C. Hsieh Assignee: L. C. Hsieh Add a few public API changes to DS v2 so that a DS v2 scan can report metrics to Spark. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34365) Support configurable Avro schema field matching for positional or by-name
Erik Krogen created SPARK-34365: --- Summary: Support configurable Avro schema field matching for positional or by-name Key: SPARK-34365 URL: https://issues.apache.org/jira/browse/SPARK-34365 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.1 Reporter: Erik Krogen When reading an Avro dataset (using the dataset's schema or by overriding it with 'avroSchema') or writing an Avro dataset with a schema provided by 'avroSchema', currently the matching of Catalyst-to-Avro fields is done by field name. This behavior is somewhat recent; prior to SPARK-27762 (fixed in 3.0.0), at least on the write path, we would match the schemas positionally (a "structural" comparison). While I agree that this is much more sensible for default behavior, I propose that we make this behavior configurable using an {{option}} for the Avro datasource. There is precedent for making this behavior configurable, as seen in SPARK-32864, which added this support for ORC. Besides this precedent, the behavior of Hive is to perform matching positionally ([ref|https://cwiki.apache.org/confluence/display/Hive/AvroSerDe#AvroSerDe-WritingtablestoAvrofiles]), so this is behavior that Hadoop/Hive ecosystem users are familiar with: {quote} Hive is very forgiving about types: it will attempt to store whatever value matches the provided column in the equivalent column position in the new table. No matching is done on column names, for instance. {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34365) Support configurable Avro schema field matching for positional or by-name
[ https://issues.apache.org/jira/browse/SPARK-34365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279073#comment-17279073 ] Erik Krogen commented on SPARK-34365: - I plan to post a PR for this in the next few days, unless I hear pushback that this is a bad idea. > Support configurable Avro schema field matching for positional or by-name > - > > Key: SPARK-34365 > URL: https://issues.apache.org/jira/browse/SPARK-34365 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.1 >Reporter: Erik Krogen >Priority: Major > > When reading an Avro dataset (using the dataset's schema or by overriding it > with 'avroSchema') or writing an Avro dataset with a schema provided by > 'avroSchema', currently the matching of Catalyst-to-Avro fields is done by > field name. > This behavior is somewhat recent; prior to SPARK-27762 (fixed in 3.0.0), at > least on the write path, we would match the schemas positionally (a > "structural" comparison). While I agree that this is much more sensible for > default behavior, I propose that we make this behavior configurable using an > {{option}} for the Avro datasource. > There is precedent for making this behavior configurable, as seen in > SPARK-32864, which added this support for ORC. Besides this precedent, the > behavior of Hive is to perform matching positionally > ([ref|https://cwiki.apache.org/confluence/display/Hive/AvroSerDe#AvroSerDe-WritingtablestoAvrofiles]), > so this is behavior that Hadoop/Hive ecosystem users are familiar with: > {quote} > Hive is very forgiving about types: it will attempt to store whatever value > matches the provided column in the equivalent column position in the new > table. No matching is done on column names, for instance. > {quote} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34033) SparkR Daemon Initialization
[ https://issues.apache.org/jira/browse/SPARK-34033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tom Howland updated SPARK-34033: Description: Provide a way for users to initialize the sparkR daemon before it forks. I'm a contractor to Target, where we have several projects doing ML with sparkR. The changes proposed here result in weeks of compute-time saved with every run. Please see [docs/sparkr.md#daemon-initialization|https://github.com/WamBamBoozle/spark/blob/daemon_init/docs/sparkr.md#daemon-initialization]. was: Provide a way for users to initialize the sparkR daemon before it forks. I'm a contractor to Target, where we have several projects doing ML with sparkR. The changes proposed here result in weeks of compute-time saved with every run. (40,000 partitions) * (5 seconds to load our R libraries) * (2 calls to gapply in our app) / 60 / 60 = 111 hours. (from [docs/sparkr.md|https://github.com/WamBamBoozle/spark/blob/daemon_init/docs/sparkr.md#daemon-initialization]) h3. Daemon Initialization If your worker function has a lengthy initialization, and your application has lots of partitions, you may find you are spending weeks of compute time repeatedly doing something that should have taken a few seconds during daemon initialization. Every Spark executor spawns a process running an R daemon. The daemon "forks a copy" of itself whenever Spark finds work for it to do. It may be applying a predefined method such as "max", or it may be applying your worker function. SparkR::gapply arranges things so that your worker function will be called with each group. A group is the pair Key-Seq[Row]. In the absence of partitioning, the daemon will fork for every group found. With partitioning, the daemon will fork for every partition found. A partition may have several groups in it. All the initializations and library loading your worker function manages are thrown away when the fork concludes. Every fork has to be initialized. The configuration spark.r.daemonInit provides a way to avoid reloading packages every time the daemon forks by having the daemon pre-load packages. You do this by providing R code to initialize the daemon for your application. h4. Examples Suppose we want library(wow) to be pre-loaded for our workers. {{sparkR.session(spark.r.daemonInit = 'library(wow)')}} Of course, that would only work if we knew that library(wow) was on our path and available on the executor. If we have to ship the library, we can use YARN: sparkR.session( master = 'yarn', spark.r.daemonInit = '.libPaths(c("wowTarget", .libPaths())); library(wow)', spark.submit.deployMode = 'client', spark.yarn.dist.archives = 'wow.zip#wowTarget') YARN creates a directory for the new executor, unzips 'wow.zip' in some other directory, and then provides a symlink to it called ./wowTarget. When the executor starts the daemon, the daemon loads library(wow) from the newly created wowTarget. Warning: if your initialization takes longer than 10 seconds, consider increasing the configuration [spark.r.daemonTimeout](configuration.md#sparkr). > SparkR Daemon Initialization > > > Key: SPARK-34033 > URL: https://issues.apache.org/jira/browse/SPARK-34033 > Project: Spark > Issue Type: Improvement > Components: R, SparkR >Affects Versions: 3.2.0 > Environment: tested on centos 7 & spark 2.3.1 and on my mac & spark > at master >Reporter: Tom Howland >Priority: Major > Original Estimate: 0h > Remaining Estimate: 0h > > Provide a way for users to initialize the sparkR daemon before it forks. 
> I'm a contractor to Target, where we have several projects doing ML with > sparkR. The changes proposed here result in weeks of compute-time saved with > every run. > Please see > [docs/sparkr.md#daemon-initialization|https://github.com/WamBamBoozle/spark/blob/daemon_init/docs/sparkr.md#daemon-initialization]. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34364) Monitor disk usage and use to reject blocks when under disk pressure
Holden Karau created SPARK-34364: Summary: Monitor disk usage and use to reject blocks when under disk pressure Key: SPARK-34364 URL: https://issues.apache.org/jira/browse/SPARK-34364 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.2.0 Reporter: Holden Karau Has some limitations when combined with emptyDir on K8s. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34363) Allow users to configure a maximum amount of remote shuffle block storage
Holden Karau created SPARK-34363: Summary: Allow users to configure a maximum amount of remote shuffle block storage Key: SPARK-34363 URL: https://issues.apache.org/jira/browse/SPARK-34363 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.2.0 Reporter: Holden Karau -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34298) SaveMode.Overwrite not usable when using s3a root paths
[ https://issues.apache.org/jira/browse/SPARK-34298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279006#comment-17279006 ] cornel creanga commented on SPARK-34298: Thanks for the answer. In this case, wouldn't it be better to implement option a) - throw an explicit error with a meaningful message (e.g. root dirs are not supported in overwrite mode) when trying to use a root dir? Right now one will get a java.lang.IndexOutOfBoundsException and will have to dig into the Spark code in order to understand what the problem is (as happened to me). > SaveMode.Overwrite not usable when using s3a root paths > > > Key: SPARK-34298 > URL: https://issues.apache.org/jira/browse/SPARK-34298 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: cornel creanga >Priority: Minor > > SaveMode.Overwrite does not work when using paths containing just the root eg > "s3a://peakhour-report". To reproduce the issue (an s3 bucket + credentials > are needed): > val out = "s3a://peakhour-report" > val sparkContext: SparkContext = SparkContext.getOrCreate() > val someData = Seq(Row(24, "mouse")) > val someSchema = List(StructField("age", IntegerType, true), StructField("word", StringType, true)) > val someDF = spark.createDataFrame(spark.sparkContext.parallelize(someData), StructType(someSchema)) > sparkContext.hadoopConfiguration.set("fs.s3a.access.key", accessKey) > sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", secretKey) > sparkContext.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider") > sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") > someDF.write.format("parquet").partitionBy("age").mode(SaveMode.Overwrite).save(out) > Error stacktrace: > Exception in thread "main" java.lang.IllegalArgumentException: Can not create > a Path from an empty string > at org.apache.hadoop.fs.Path.checkPathArg(Path.java:168) > at org.apache.hadoop.fs.Path.suffix(Path.java:446) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions(InsertIntoHadoopFsRelationCommand.scala:240) > If you change out from val out = "s3a://peakhour-report" to val out = "s3a://peakhour-report/folder" the code works. > There are two problems in the actual code from InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions: > a) it uses the org.apache.hadoop.fs.Path.suffix method, which doesn't work on root paths > b) it tries to delete the root folder directly (in our case the s3 bucket name) and this is prohibited (in the S3AFileSystem class) > I think that there are two choices: > a) throw an explicit error when using overwrite mode for root folders > b) fix the actual issue: don't use the Path.suffix method and change the clean-up code in InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions to list the root folder content and delete the entries one by one. > I can provide a patch for both choices, assuming that they make sense. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
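A minimal sketch of choice b) above, using only the Hadoop FileSystem API (this is an illustration of the proposed clean-up strategy, not the actual Spark patch; the helper name is made up):

{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Path.suffix appends to the path's name, which is empty for a root path,
// so it throws "Can not create a Path from an empty string". Listing the
// directory and deleting each child avoids both that call and deleting
// the root (e.g. the S3 bucket) itself.
def clearDirectory(dir: Path, conf: Configuration): Unit = {
  val fs: FileSystem = dir.getFileSystem(conf)
  fs.listStatus(dir).foreach(status => fs.delete(status.getPath, true))
}
{code}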
[jira] [Commented] (SPARK-34337) Reject disk blocks when out of disk space
[ https://issues.apache.org/jira/browse/SPARK-34337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279000#comment-17279000 ] Holden Karau commented on SPARK-34337: -- Initially we should allow the user to configure a maximum amount of shuffle block storage. In the future we can try to use underlying FS info. > Reject disk blocks when out of disk space > - > > Key: SPARK-34337 > URL: https://issues.apache.org/jira/browse/SPARK-34337 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0, 3.1.1, 3.1.2 >Reporter: Holden Karau >Priority: Major > > Now that we have the ability to store shuffle blocks on disaggregated > storage (when configured), we should add the option to reject storing blocks > locally on an executor at a certain disk pressure threshold. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
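A sketch of what such a user-facing limit could look like as a Spark internal config entry (the config name, doc text, and default below are assumptions, not a merged setting):

{code:scala}
import org.apache.spark.internal.config.ConfigBuilder
import org.apache.spark.network.util.ByteUnit

// Hypothetical setting: executors would reject new disk blocks once local
// shuffle storage exceeds this many bytes.
private[spark] val MAX_LOCAL_SHUFFLE_BYTES =
  ConfigBuilder("spark.shuffle.storage.maxLocalBytes")
    .doc("Maximum bytes of shuffle blocks kept on an executor's local disk " +
      "before further blocks are rejected and pushed to remote storage.")
    .bytesConf(ByteUnit.BYTE)
    .createWithDefaultString("100g")
{code}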
[jira] [Assigned] (SPARK-33888) JDBC SQL TIME type represents incorrectly as TimestampType, it should be physical Int in millis
[ https://issues.apache.org/jira/browse/SPARK-33888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-33888: --- Assignee: Duc Hoa Nguyen (was: Apache Spark) > JDBC SQL TIME type represents incorrectly as TimestampType, it should be > physical Int in millis > --- > > Key: SPARK-33888 > URL: https://issues.apache.org/jira/browse/SPARK-33888 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.3, 3.0.0, 3.0.1 >Reporter: Duc Hoa Nguyen >Assignee: Duc Hoa Nguyen >Priority: Minor > Fix For: 3.2.0 > > > Currently, for JDBC, the SQL TIME type is represented incorrectly as Spark > TimestampType. It should be represented as a physical int in millis: a time of day, > with no reference to a particular calendar, time zone or date, with a precision of > one millisecond, stored as the number of milliseconds after midnight, 00:00:00.000. > We encountered the issue of the Avro logical type `TimeMillis` not being > converted correctly to the Spark `Timestamp` struct type using the > `SchemaConverters`; it converts to a regular `int` instead. Reproducible by > ingesting data from a MySQL table with a column of TIME type: the Spark JDBC dataframe > will get the correct type (Timestamp), but enforcing our avro schema > (`{"type": "int", "logicalType": "time-millis"}`) externally will fail to > apply with the following exception: > {{java.lang.RuntimeException: java.sql.Timestamp is not a valid external type > for schema of int}} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
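For reference, the "physical int in millis" representation described above is just milliseconds after midnight. A minimal sketch of the conversion (illustration only, not Spark's converter):

{code:scala}
import java.sql.Time

// Milliseconds after midnight, 00:00:00.000, as described in the ticket.
def timeToMillisOfDay(t: Time): Int = {
  val localTime = t.toLocalTime            // keeps only the time-of-day portion
  (localTime.toNanoOfDay / 1000000L).toInt // nanos -> millis; max is 86399999
}

// Example: 01:02:03 -> (1*3600 + 2*60 + 3) * 1000 = 3723000 millis
assert(timeToMillisOfDay(Time.valueOf("01:02:03")) == 3723000)
{code}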
[jira] [Assigned] (SPARK-34357) Map JDBC SQL TIME type to TimestampType with time portion fixed regardless of timezone
[ https://issues.apache.org/jira/browse/SPARK-34357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-34357: --- Assignee: Duc Hoa Nguyen > Map JDBC SQL TIME type to TimestampType with time portion fixed regardless of > timezone > -- > > Key: SPARK-34357 > URL: https://issues.apache.org/jira/browse/SPARK-34357 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Duc Hoa Nguyen >Assignee: Duc Hoa Nguyen >Priority: Minor > Fix For: 3.2.0 > > > Due to user-experience issues (confusing to Spark users - java.sql.Time uses > milliseconds vs Spark's microseconds; and users lose useful functions > like hour(), minute(), etc. on the column), we have decided to revert back to > TimestampType, but this time we will enforce the hour to be consistent > across system timezones (via offset manipulation). > Full discussion with Wenchen Fan [~cloud_fan] regarding this ticket is here: > https://github.com/apache/spark/pull/30902#discussion_r569186823 > Related issues: > [SPARK-33888|https://issues.apache.org/jira/browse/SPARK-33888] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
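One plausible reading of the "offset manipulation" mentioned above, sketched for illustration (this is not the merged patch): shift the java.sql.Time's epoch millis by the JVM default zone offset, so the UTC rendering of the resulting timestamp carries the original wall-clock time portion.

{code:scala}
import java.sql.{Time, Timestamp}
import java.util.TimeZone

// java.sql.Time holds millis since epoch interpreted in the default timezone.
// Adding the zone offset yields an instant whose UTC time-of-day equals the
// original local time portion, so the stored value no longer depends on the
// system timezone of the machine that read it.
def toTimestampWithFixedTimePortion(t: Time): Timestamp = {
  val millis = t.getTime
  new Timestamp(millis + TimeZone.getDefault.getOffset(millis))
}
{code}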
[jira] [Resolved] (SPARK-34357) Map JDBC SQL TIME type to TimestampType with time portion fixed regardless of timezone
[ https://issues.apache.org/jira/browse/SPARK-34357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-34357. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31473 [https://github.com/apache/spark/pull/31473] > Map JDBC SQL TIME type to TimestampType with time portion fixed regardless of > timezone > -- > > Key: SPARK-34357 > URL: https://issues.apache.org/jira/browse/SPARK-34357 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Duc Hoa Nguyen >Priority: Minor > Fix For: 3.2.0 > > > Due to user-experience issues (confusing to Spark users - java.sql.Time uses > milliseconds vs Spark's microseconds; and users lose useful functions > like hour(), minute(), etc. on the column), we have decided to revert back to > TimestampType, but this time we will enforce the hour to be consistent > across system timezones (via offset manipulation). > Full discussion with Wenchen Fan [~cloud_fan] regarding this ticket is here: > https://github.com/apache/spark/pull/30902#discussion_r569186823 > Related issues: > [SPARK-33888|https://issues.apache.org/jira/browse/SPARK-33888] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34362) scala.MatchError: 2.3 (of class java.lang.String) for Hive on Google Dataproc 1.4
[ https://issues.apache.org/jira/browse/SPARK-34362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278954#comment-17278954 ] Svitlana Ponomarova commented on SPARK-34362: - According to the Google Dataproc 1.4 documentation, it supports Apache Hive 2.3.7: [https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-1.4] As I see from the sources in *branch-2.4*: [https://github.com/apache/spark/blob/e7acca22cd1ed9a70fabc9ca143aa06fa8573864/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala#L101] there is no case matching the version string "2.3", which causes the MatchError. Could this part be back-ported from *branch-3.0*? https://github.com/apache/spark/blob/06942331a7db1e6d5e6709ac7009c180c94cc7c0/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/IsolatedClientLoader.scala#L107 > scala.MatchError: 2.3 (of class java.lang.String) for Hive on Google Dataproc > 1.4 > - > > Key: SPARK-34362 > URL: https://issues.apache.org/jira/browse/SPARK-34362 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.7 >Reporter: Svitlana Ponomarova >Priority: Critical > > According to the Google Dataproc 1.4 release notes: > > [https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-1.4] > 2.4.7 is a supported Spark version. > Using *spark-hive_2.11-2.4.7.jar* for Hive on Dataproc 1.4 causes: > {noformat} > scala.MatchError: 2.3 (of class java.lang.String) scala.MatchError: 2.3 > (of class java.lang.String) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader$.hiveVersion(IsolatedClientLoader.scala:89) > > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:300) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:287) > at > org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66) > > at > org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
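A simplified sketch of the branch-3.0 style of matching the comment points at (abridged; the real IsolatedClientLoader maps each case to a HiveVersion value rather than a string):

{code:scala}
// Simplified: branch-3.0 accepts the bare "2.3" alongside the full patch
// versions, so Hive 2.3.x clients no longer hit scala.MatchError: 2.3.
def hiveClientVersion(version: String): String = version match {
  case "2.3" | "2.3.0" | "2.3.1" | "2.3.2" | "2.3.3" | "2.3.4" | "2.3.5" |
       "2.3.6" | "2.3.7" => "2.3"
  case other =>
    throw new IllegalArgumentException(s"Unsupported Hive version: $other")
}
{code}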
[jira] [Comment Edited] (SPARK-34362) scala.MatchError: 2.3 (of class java.lang.String) for Hive on Google Dataproc 1.4
[ https://issues.apache.org/jira/browse/SPARK-34362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278938#comment-17278938 ] Svitlana Ponomarova edited comment on SPARK-34362 at 2/4/21, 4:33 PM: -- Maybe the "Components" field is wrong. Unfortunately, I don't see "Hive" in the list. was (Author: sponomarova): Maybe the "Components" field was filled in wrong. Unfortunately, I don't see "Hive" in the list. > scala.MatchError: 2.3 (of class java.lang.String) for Hive on Google Dataproc > 1.4 > - > > Key: SPARK-34362 > URL: https://issues.apache.org/jira/browse/SPARK-34362 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.7 >Reporter: Svitlana Ponomarova >Priority: Critical > > According to the Google Dataproc 1.4 release notes: > > [https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-1.4] > 2.4.7 is a supported Spark version. > Using *spark-hive_2.11-2.4.7.jar* for Hive on Dataproc 1.4 causes: > {noformat} > scala.MatchError: 2.3 (of class java.lang.String) scala.MatchError: 2.3 > (of class java.lang.String) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader$.hiveVersion(IsolatedClientLoader.scala:89) > > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:300) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:287) > at > org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66) > > at > org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34362) scala.MatchError: 2.3 (of class java.lang.String) for Hive on Google Dataproc 1.4
[ https://issues.apache.org/jira/browse/SPARK-34362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Svitlana Ponomarova updated SPARK-34362: Description: According to the Google Dataproc 1.4 release notes: [https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-1.4] 2.4.7 is a supported Spark version. Using *spark-hive_2.11-2.4.7.jar* for Hive on Dataproc 1.4 causes: {noformat} scala.MatchError: 2.3 (of class java.lang.String) scala.MatchError: 2.3 (of class java.lang.String) at org.apache.spark.sql.hive.client.IsolatedClientLoader$.hiveVersion(IsolatedClientLoader.scala:89) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:300) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:287) at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66) at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65) {noformat} was: According to the Google Dataproc 1.4 release notes: [https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-1.4] 2.4.7 is a supported Spark version. Using spark-hive_2.11-2.4.7.jar for Hive on Dataproc 1.4 causes: {noformat} scala.MatchError: 2.3 (of class java.lang.String) scala.MatchError: 2.3 (of class java.lang.String) at org.apache.spark.sql.hive.client.IsolatedClientLoader$.hiveVersion(IsolatedClientLoader.scala:89) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:300) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:287) at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66) at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65) {noformat} > scala.MatchError: 2.3 (of class java.lang.String) for Hive on Google Dataproc > 1.4 > - > > Key: SPARK-34362 > URL: https://issues.apache.org/jira/browse/SPARK-34362 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.7 >Reporter: Svitlana Ponomarova >Priority: Critical > > According to the Google Dataproc 1.4 release notes: > > [https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-1.4] > 2.4.7 is a supported Spark version. > Using *spark-hive_2.11-2.4.7.jar* for Hive on Dataproc 1.4 causes: > {noformat} > scala.MatchError: 2.3 (of class java.lang.String) scala.MatchError: 2.3 > (of class java.lang.String) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader$.hiveVersion(IsolatedClientLoader.scala:89) > > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:300) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:287) > at > org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66) > > at > org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34362) scala.MatchError: 2.3 (of class java.lang.String) for Hive on Google Dataproc 1.4
[ https://issues.apache.org/jira/browse/SPARK-34362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278938#comment-17278938 ] Svitlana Ponomarova commented on SPARK-34362: - Maybe the "Components" field was filled in wrong. Unfortunately, I don't see "Hive" in the list. > scala.MatchError: 2.3 (of class java.lang.String) for Hive on Google Dataproc > 1.4 > - > > Key: SPARK-34362 > URL: https://issues.apache.org/jira/browse/SPARK-34362 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.7 >Reporter: Svitlana Ponomarova >Priority: Critical > > According to the Google Dataproc 1.4 release notes: > > [https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-1.4] > 2.4.7 is a supported Spark version. > Using spark-hive_2.11-2.4.7.jar for Hive on Dataproc 1.4 causes: > {noformat} > scala.MatchError: 2.3 (of class java.lang.String) scala.MatchError: 2.3 > (of class java.lang.String) > at > org.apache.spark.sql.hive.client.IsolatedClientLoader$.hiveVersion(IsolatedClientLoader.scala:89) > > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:300) > at > org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:287) > at > org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66) > > at > org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65) > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34362) scala.MatchError: 2.3 (of class java.lang.String) for Hive on Google Dataproc 1.4
Svitlana Ponomarova created SPARK-34362: --- Summary: scala.MatchError: 2.3 (of class java.lang.String) for Hive on Google Dataproc 1.4 Key: SPARK-34362 URL: https://issues.apache.org/jira/browse/SPARK-34362 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.7 Reporter: Svitlana Ponomarova According to the Google Dataproc 1.4 release notes: [https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-release-1.4] 2.4.7 is a supported Spark version. Using spark-hive_2.11-2.4.7.jar for Hive on Dataproc 1.4 causes: {noformat} scala.MatchError: 2.3 (of class java.lang.String) scala.MatchError: 2.3 (of class java.lang.String) at org.apache.spark.sql.hive.client.IsolatedClientLoader$.hiveVersion(IsolatedClientLoader.scala:89) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:300) at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:287) at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66) at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65) {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34361) Dynamic allocation on K8s kills executors with running tasks
[ https://issues.apache.org/jira/browse/SPARK-34361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278932#comment-17278932 ] Attila Zsolt Piros commented on SPARK-34361: I am working on this. > Dynamic allocation on K8s kills executors with running tasks > > > Key: SPARK-34361 > URL: https://issues.apache.org/jira/browse/SPARK-34361 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.2.0, 3.1.1, 3.1.2 >Reporter: Attila Zsolt Piros >Priority: Major > > There is a race between the executor POD allocator and the cluster scheduler > backend. > During downscaling (in dynamic allocation) we experienced a lot of new > executors killed with running tasks on them. > The pattern in the log is the following: > {noformat} > 21/02/01 15:12:03 INFO ExecutorMonitor: New executor 312 has registered (new > total is 138) > ... > 21/02/01 15:12:03 INFO TaskSetManager: Starting task 247.0 in stage 4.0 (TID > 2079, 100.100.18.138, executor 312, partition 247, PROCESS_LOCAL, 8777 bytes) > 21/02/01 15:12:03 INFO ExecutorPodsAllocator: Deleting 3 excess pod requests > (408,312,307). > ... > 21/02/01 15:12:04 ERROR TaskSchedulerImpl: Lost executor 312 on > 100.100.18.138: The executor with id 312 was deleted by a user or the > framework. > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34361) Dynamic allocation on K8s kills executors with running tasks
Attila Zsolt Piros created SPARK-34361: -- Summary: Dynamic allocation on K8s kills executors with running tasks Key: SPARK-34361 URL: https://issues.apache.org/jira/browse/SPARK-34361 Project: Spark Issue Type: Bug Components: Kubernetes Affects Versions: 3.0.1, 3.0.0, 3.0.2, 3.1.0, 3.2.0, 3.1.1, 3.1.2 Reporter: Attila Zsolt Piros There is a race between the executor POD allocator and the cluster scheduler backend. During downscaling (in dynamic allocation) we experienced a lot of new executors killed with running tasks on them. The pattern in the log is the following: {noformat} 21/02/01 15:12:03 INFO ExecutorMonitor: New executor 312 has registered (new total is 138) ... 21/02/01 15:12:03 INFO TaskSetManager: Starting task 247.0 in stage 4.0 (TID 2079, 100.100.18.138, executor 312, partition 247, PROCESS_LOCAL, 8777 bytes) 21/02/01 15:12:03 INFO ExecutorPodsAllocator: Deleting 3 excess pod requests (408,312,307). ... 21/02/01 15:12:04 ERROR TaskSchedulerImpl: Lost executor 312 on 100.100.18.138: The executor with id 312 was deleted by a user or the framework. {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34360) Support table truncation by v2 Table Catalogs
[ https://issues.apache.org/jira/browse/SPARK-34360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278878#comment-17278878 ] Apache Spark commented on SPARK-34360: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/31475 > Support table truncation by v2 Table Catalogs > - > > Key: SPARK-34360 > URL: https://issues.apache.org/jira/browse/SPARK-34360 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.2.0 > > > Add new method `truncateTable` to the TableCatalog interface with default > implementation. And implement this method in InMemoryTableCatalog. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
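A sketch of the API shape the description refers to. TableCatalog is actually a Java interface in org.apache.spark.sql.connector.catalog; the Scala trait name, method signature, and default body below are assumptions for illustration, not the merged change:

{code:scala}
import org.apache.spark.sql.connector.catalog.Identifier

// Hypothetical shape: a default implementation lets existing catalog
// implementations compile unchanged, while a catalog such as
// InMemoryTableCatalog can override it with real truncation for tests.
trait TruncatableTableCatalog {
  /** Remove all rows from the identified table; returns true on success. */
  def truncateTable(ident: Identifier): Boolean =
    throw new UnsupportedOperationException(
      s"Table ${ident.name()} does not support truncation")
}
{code}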
[jira] [Assigned] (SPARK-34360) Support table truncation by v2 Table Catalogs
[ https://issues.apache.org/jira/browse/SPARK-34360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34360: Assignee: Apache Spark (was: Maxim Gekk) > Support table truncation by v2 Table Catalogs > - > > Key: SPARK-34360 > URL: https://issues.apache.org/jira/browse/SPARK-34360 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Assignee: Apache Spark >Priority: Major > Fix For: 3.2.0 > > > Add new method `truncateTable` to the TableCatalog interface with default > implementation. And implement this method in InMemoryTableCatalog. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34360) Support table truncation by v2 Table Catalogs
[ https://issues.apache.org/jira/browse/SPARK-34360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34360: Assignee: Maxim Gekk (was: Apache Spark) > Support table truncation by v2 Table Catalogs > - > > Key: SPARK-34360 > URL: https://issues.apache.org/jira/browse/SPARK-34360 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.2.0 > > > Add new method `truncateTable` to the TableCatalog interface with default > implementation. And implement this method in InMemoryTableCatalog. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34360) Support table truncation by v2 Table Catalogs
[ https://issues.apache.org/jira/browse/SPARK-34360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278877#comment-17278877 ] Apache Spark commented on SPARK-34360: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/31475 > Support table truncation by v2 Table Catalogs > - > > Key: SPARK-34360 > URL: https://issues.apache.org/jira/browse/SPARK-34360 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.2.0 > > > Add new method `truncateTable` to the TableCatalog interface with default > implementation. And implement this method in InMemoryTableCatalog. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34360) Support table truncation by v2 Table Catalogs
[ https://issues.apache.org/jira/browse/SPARK-34360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Maxim Gekk updated SPARK-34360: --- Description: Add new method `truncateTable` to the TableCatalog interface with default implementation. And implement this method in InMemoryTableCatalog. (was: Add new method `truncatePartition` in `SupportsPartitionManagement` and `truncatePartitions` in `SupportsAtomicPartitionManagement`.) > Support table truncation by v2 Table Catalogs > - > > Key: SPARK-34360 > URL: https://issues.apache.org/jira/browse/SPARK-34360 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Maxim Gekk >Assignee: Maxim Gekk >Priority: Major > Fix For: 3.2.0 > > > Add new method `truncateTable` to the TableCatalog interface with default > implementation. And implement this method in InMemoryTableCatalog. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34360) Support table truncation by v2 Table Catalogs
Maxim Gekk created SPARK-34360: -- Summary: Support table truncation by v2 Table Catalogs Key: SPARK-34360 URL: https://issues.apache.org/jira/browse/SPARK-34360 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 3.2.0 Reporter: Maxim Gekk Assignee: Maxim Gekk Fix For: 3.2.0 Add new method `truncatePartition` in `SupportsPartitionManagement` and `truncatePartitions` in `SupportsAtomicPartitionManagement`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34359) add a legacy config to restore the output schema of SHOW DATABASES
[ https://issues.apache.org/jira/browse/SPARK-34359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34359: Assignee: Wenchen Fan (was: Apache Spark) > add a legacy config to restore the output schema of SHOW DATABASES > -- > > Key: SPARK-34359 > URL: https://issues.apache.org/jira/browse/SPARK-34359 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.2 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34359) add a legacy config to restore the output schema of SHOW DATABASES
[ https://issues.apache.org/jira/browse/SPARK-34359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278863#comment-17278863 ] Apache Spark commented on SPARK-34359: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/31474 > add a legacy config to restore the output schema of SHOW DATABASES > -- > > Key: SPARK-34359 > URL: https://issues.apache.org/jira/browse/SPARK-34359 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.2 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34359) add a legacy config to restore the output schema of SHOW DATABASES
[ https://issues.apache.org/jira/browse/SPARK-34359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34359: Assignee: Apache Spark (was: Wenchen Fan) > add a legacy config to restore the output schema of SHOW DATABASES > -- > > Key: SPARK-34359 > URL: https://issues.apache.org/jira/browse/SPARK-34359 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.2 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34359) add a legacy config to restore the output schema of SHOW DATABASES
Wenchen Fan created SPARK-34359: --- Summary: add a legacy config to restore the output schema of SHOW DATABASES Key: SPARK-34359 URL: https://issues.apache.org/jira/browse/SPARK-34359 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.2 Reporter: Wenchen Fan Assignee: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34359) add a legacy config to restore the output schema of SHOW DATABASES
[ https://issues.apache.org/jira/browse/SPARK-34359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-34359: Issue Type: Improvement (was: Bug) > add a legacy config to restore the output schema of SHOW DATABASES > -- > > Key: SPARK-34359 > URL: https://issues.apache.org/jira/browse/SPARK-34359 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.2 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
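For context, legacy flags like this usually live among the SQLConf definitions. A sketch of the shape such a config could take (the key, doc text, and default below are assumptions, not necessarily the merged config):

{code:scala}
// Inside the SQLConf config definitions (sketch only):
val LEGACY_KEEP_COMMAND_OUTPUT_SCHEMA =
  buildConf("spark.sql.legacy.keepCommandOutputSchema")
    .doc("When true, SHOW DATABASES keeps its pre-3.0 output schema " +
      "(a databaseName column instead of namespace).")
    .booleanConf
    .createWithDefault(false)
{code}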
[jira] [Commented] (SPARK-34298) SaveMode.Overwrite not usable when using s3a root paths
[ https://issues.apache.org/jira/browse/SPARK-34298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278846#comment-17278846 ] Steve Loughran commented on SPARK-34298: root dirs are special in that they always exist. Normally apps like spark and hive don't notice this as nobody ever runs jobs which write to the base of file:// or hdfs:// ; object stores are special there. You might find the commit algorithms get a bit confused too. In which case: * fixes for the s3a committers welcome; * there is a serialized test phase in hadoop-aws where we do stuff against the root dir; * and a PR for real integration tests can go in https://github.com/hortonworks-spark/cloud-integration * anything related to the classic MR committer will be rejected out of fear of going near it; not safe for s3 anyway Otherwise: workaround is to write into a subdir. Sorry > SaveMode.Overwrite not usable when using s3a root paths > > > Key: SPARK-34298 > URL: https://issues.apache.org/jira/browse/SPARK-34298 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: cornel creanga >Priority: Minor > > SaveMode.Overwrite does not work when using paths containing just the root eg > "s3a://peakhour-report". To reproduce the issue (an s3 bucket + credentials > are needed): > val out = "s3a://peakhour-report" > val sparkContext: SparkContext = SparkContext.getOrCreate() > val someData = Seq(Row(24, "mouse")) > val someSchema = List(StructField("age", IntegerType, true), StructField("word", StringType, true)) > val someDF = spark.createDataFrame(spark.sparkContext.parallelize(someData), StructType(someSchema)) > sparkContext.hadoopConfiguration.set("fs.s3a.access.key", accessKey) > sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", secretKey) > sparkContext.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider") > sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") > someDF.write.format("parquet").partitionBy("age").mode(SaveMode.Overwrite).save(out) > Error stacktrace: > Exception in thread "main" java.lang.IllegalArgumentException: Can not create > a Path from an empty string > at org.apache.hadoop.fs.Path.checkPathArg(Path.java:168) > at org.apache.hadoop.fs.Path.suffix(Path.java:446) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions(InsertIntoHadoopFsRelationCommand.scala:240) > If you change out from val out = "s3a://peakhour-report" to val out = "s3a://peakhour-report/folder" the code works. > There are two problems in the actual code from InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions: > a) it uses the org.apache.hadoop.fs.Path.suffix method, which doesn't work on root paths > b) it tries to delete the root folder directly (in our case the s3 bucket name) and this is prohibited (in the S3AFileSystem class) > I think that there are two choices: > a) throw an explicit error when using overwrite mode for root folders > b) fix the actual issue: don't use the Path.suffix method and change the clean-up code in InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions to list the root folder content and delete the entries one by one.
[jira] [Created] (SPARK-34358) Add API for all built-in expression functions
Malthe Borch created SPARK-34358: Summary: Add API for all built-in expression functions Key: SPARK-34358 URL: https://issues.apache.org/jira/browse/SPARK-34358 Project: Spark Issue Type: Improvement Components: Java API, PySpark Affects Versions: 3.0.1 Reporter: Malthe Borch From the [SQL functions|https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html] documentation: {quote}Commonly used functions available for DataFrame operations. Using functions defined here provides a little bit more compile-time safety to make sure the function exists. {quote} Functions such as "inline_outer" are actually commonly used, but are not currently included in the API, meaning that we lose compile-time safety for those invocations. We should implement the required function definitions for the remaining built-in functions when applicable. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
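To illustrate the gap: today an unlisted function like inline_outer is only reachable through a string expression, so a typo surfaces at analysis time rather than compile time. A sketch (the typed wrapper below is hypothetical, not the existing API):

{code:scala}
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.expr

// Current route: stringly-typed, no compile-time check that the function exists.
// Assumes df has an array-of-structs column named "structs".
def explodeStructs(df: DataFrame): DataFrame =
  df.select(expr("inline_outer(structs)"))

// Hypothetical typed entry point of the kind the ticket asks for
// (signature is an assumption):
def inline_outer(columnName: String): Column = expr(s"inline_outer($columnName)")
{code}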
[jira] [Updated] (SPARK-34357) Map JDBC SQL TIME type to TimestampType with time portion fixed regardless of timezone
[ https://issues.apache.org/jira/browse/SPARK-34357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Duc Hoa Nguyen updated SPARK-34357: --- Summary: Map JDBC SQL TIME type to TimestampType with time portion fixed regardless of timezone (was: Revert JDBC SQL TIME type to TimestampType with time portion fixed regardless of timezone) > Map JDBC SQL TIME type to TimestampType with time portion fixed regardless of > timezone > -- > > Key: SPARK-34357 > URL: https://issues.apache.org/jira/browse/SPARK-34357 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Duc Hoa Nguyen >Priority: Minor > > Due to user-experience issues (confusing to Spark users - java.sql.Time uses > milliseconds vs Spark's microseconds; and users lose useful functions > like hour(), minute(), etc. on the column), we have decided to revert back to > TimestampType, but this time we will enforce the hour to be consistent > across system timezones (via offset manipulation). > Full discussion with Wenchen Fan [~cloud_fan] regarding this ticket is here: > https://github.com/apache/spark/pull/30902#discussion_r569186823 > Related issues: > [SPARK-33888|https://issues.apache.org/jira/browse/SPARK-33888] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34351) Running into "Py4JJavaError" while counting a text file or list using Pyspark, Jupyter notebook
[ https://issues.apache.org/jira/browse/SPARK-34351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jacek Laskowski resolved SPARK-34351. - Resolution: Invalid Please use StackOverflow or the user@spark.a.o mailing list to ask this question (as described in [http://spark.apache.org/community.html]). See you there! > Running into "Py4JJavaError" while counting a text file or list using > Pyspark, Jupyter notebook > > > Key: SPARK-34351 > URL: https://issues.apache.org/jira/browse/SPARK-34351 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.3.1 > Environment: PS> python --version > *Python 3.6.8* > PS> jupyter --version > *jupyter core : 4.7.0* > *jupyter-notebook : 6.2.0* > qtconsole : 5.0.2 > ipython : 7.16.1 > ipykernel : 5.4.3 > jupyter client : 6.1.11 > jupyter lab : not installed > nbconvert : 6.0.7 > ipywidgets : 7.6.3 > nbformat : 5.1.2 > traitlets : 4.3.3 > PS> java -version > *java version "1.8.0_271"* > Java(TM) SE Runtime Environment (build 1.8.0_271-b09) > Java HotSpot(TM) 64-Bit Server VM (build 25.271-b09, mixed mode) > > Spark version > *spark-2.3.1-bin-hadoop2.7* >Reporter: Huseyin Elci >Priority: Major > > I run into the following error; any help resolving it is greatly appreciated. > *My Code 1:* > {code:python} > import findspark > findspark.init("C:\Spark") > from pyspark.sql import SparkSession > from pyspark.conf import SparkConf > spark = SparkSession.builder\ > .master("local[4]")\ > .appName("WordCount_RDD")\ > .getOrCreate() > sc = spark.sparkContext > data = "D:\\05 Spark\\data\\MyArticle.txt" > story_rdd = sc.textFile(data) > story_rdd.count() > {code} > *My Code 2:* > {code:python} > import findspark > findspark.init("C:\Spark") > from pyspark import SparkContext > sc = SparkContext() > mylist = [1,2,2,3,5,48,98,62,14,55] > mylist_rdd = sc.parallelize(mylist) > mylist_rdd.map(lambda x: x*x) > mylist_rdd.map(lambda x: x*x).collect() > {code} > *ERROR:* > I got the same error for both code samples.
> {code:python} > --- > Py4JJavaError Traceback (most recent call last) > in > > 1 story_rdd.count() > C:\Spark\python\pyspark\rdd.py in count(self) > 1071 3 > 1072 """ > -> 1073 return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum() > 1074 > 1075 def stats(self): > C:\Spark\python\pyspark\rdd.py in sum(self) > 1062 6.0 > 1063 """ > -> 1064 return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add) > 1065 > 1066 def count(self): > C:\Spark\python\pyspark\rdd.py in fold(self, zeroValue, op) > 933 # zeroValue provided to each partition is unique from the one provided > 934 # to the final reduce call > --> 935 vals = self.mapPartitions(func).collect() > 936 return reduce(op, vals, zeroValue) > 937 > C:\Spark\python\pyspark\rdd.py in collect(self) > 832 """ > 833 with SCCallSiteSync(self.context) as css: > --> 834 sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd()) > 835 return list(_load_from_socket(sock_info, self._jrdd_deserializer)) > 836 > C:\Spark\python\lib\py4j-0.10.7-src.zip\py4j\java_gateway.py in > __call__(self, *args) > 1255 answer = self.gateway_client.send_command(command) > 1256 return_value = get_return_value( > -> 1257 answer, self.gateway_client, self.target_id, self.name) > 1258 > 1259 for temp_arg in temp_args: > C:\Spark\python\pyspark\sql\utils.py in deco(*a, **kw) > 61 def deco(*a, **kw): > 62 try: > ---> 63 return f(*a, **kw) > 64 except py4j.protocol.Py4JJavaError as e: > 65 s = e.java_exception.toString() > C:\Spark\python\lib\py4j-0.10.7-src.zip\py4j\protocol.py in > get_return_value(answer, gateway_client, target_id, name) > 326 raise Py4JJavaError( > 327 "An error occurred while calling > {0} \{1} \{2} > .\n". > --> 328 format(target_id, ".", name), value) > 329 else: > 330 raise Py4JError( > Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.collectAndServe. > : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 > in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 > (TID 1, localhost, executor driver): org.apache.spark.SparkException: Python > worker failed to connect back. > at > org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:148) > at > org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:76) > at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117) > at > org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:86) > at org.apache.spark.api.python.PythonRDD.comput
[jira] [Commented] (SPARK-34357) Revert JDBC SQL TIME type to TimestampType with time portion fixed regardless of timezone
[ https://issues.apache.org/jira/browse/SPARK-34357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278776#comment-17278776 ] Apache Spark commented on SPARK-34357: -- User 'saikocat' has created a pull request for this issue: https://github.com/apache/spark/pull/31473 > Revert JDBC SQL TIME type to TimestampType with time portion fixed regardless > of timezone > - > > Key: SPARK-34357 > URL: https://issues.apache.org/jira/browse/SPARK-34357 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Duc Hoa Nguyen >Priority: Minor > > Due to user-experience issues (confusing to Spark users - java.sql.Time uses > milliseconds vs Spark's microseconds; and users lose useful functions > like hour(), minute(), etc. on the column), we have decided to revert back to > TimestampType, but this time we will enforce the hour to be consistent > across system timezones (via offset manipulation). > Full discussion with Wenchen Fan [~cloud_fan] regarding this ticket is here: > https://github.com/apache/spark/pull/30902#discussion_r569186823 > Related issues: > [SPARK-33888|https://issues.apache.org/jira/browse/SPARK-33888] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34357) Revert JDBC SQL TIME type to TimestampType with time portion fixed regardless of timezone
[ https://issues.apache.org/jira/browse/SPARK-34357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278775#comment-17278775 ] Apache Spark commented on SPARK-34357: -- User 'saikocat' has created a pull request for this issue: https://github.com/apache/spark/pull/31473 > Revert JDBC SQL TIME type to TimestampType with time portion fixed regardless > of timezone > - > > Key: SPARK-34357 > URL: https://issues.apache.org/jira/browse/SPARK-34357 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Duc Hoa Nguyen >Priority: Minor > > Due to user-experience issues (confusing to Spark users - java.sql.Time uses > milliseconds vs Spark's microseconds; and users lose useful functions > like hour(), minute(), etc. on the column), we have decided to revert back to > TimestampType, but this time we will enforce the hour to be consistent > across system timezones (via offset manipulation). > Full discussion with Wenchen Fan [~cloud_fan] regarding this ticket is here: > https://github.com/apache/spark/pull/30902#discussion_r569186823 > Related issues: > [SPARK-33888|https://issues.apache.org/jira/browse/SPARK-33888] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34357) Revert JDBC SQL TIME type to TimestampType with time portion fixed regardless of timezone
[ https://issues.apache.org/jira/browse/SPARK-34357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34357: Assignee: (was: Apache Spark) > Revert JDBC SQL TIME type to TimestampType with time portion fixed regardless > of timezone > - > > Key: SPARK-34357 > URL: https://issues.apache.org/jira/browse/SPARK-34357 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Duc Hoa Nguyen >Priority: Minor > > Due to user-experience issues (confusing to Spark users - java.sql.Time uses > milliseconds vs Spark's microseconds; and users lose useful functions > like hour(), minute(), etc. on the column), we have decided to revert back to > TimestampType, but this time we will enforce the hour to be consistent > across system timezones (via offset manipulation). > Full discussion with Wenchen Fan [~cloud_fan] regarding this ticket is here: > https://github.com/apache/spark/pull/30902#discussion_r569186823 > Related issues: > [SPARK-33888|https://issues.apache.org/jira/browse/SPARK-33888] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34357) Revert JDBC SQL TIME type to TimestampType with time portion fixed regardless of timezone
[ https://issues.apache.org/jira/browse/SPARK-34357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34357: Assignee: Apache Spark > Revert JDBC SQL TIME type to TimestampType with time portion fixed regardless > of timezone > - > > Key: SPARK-34357 > URL: https://issues.apache.org/jira/browse/SPARK-34357 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Duc Hoa Nguyen >Assignee: Apache Spark >Priority: Minor > > Due to user-experience issues (confusing to Spark users - java.sql.Time uses > milliseconds vs Spark's microseconds; and users lose useful functions > like hour(), minute(), etc. on the column), we have decided to revert back to > TimestampType, but this time we will enforce the hour to be consistent > across system timezones (via offset manipulation). > Full discussion with Wenchen Fan [~cloud_fan] regarding this ticket is here: > https://github.com/apache/spark/pull/30902#discussion_r569186823 > Related issues: > [SPARK-33888|https://issues.apache.org/jira/browse/SPARK-33888] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34357) Revert JDBC SQL TIME type to TimestampType with time portion fixed regardless of timezone
Duc Hoa Nguyen created SPARK-34357: -- Summary: Revert JDBC SQL TIME type to TimestampType with time portion fixed regardless of timezone Key: SPARK-34357 URL: https://issues.apache.org/jira/browse/SPARK-34357 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Duc Hoa Nguyen Due to user-experience issues (confusing to Spark users - java.sql.Time uses milliseconds vs Spark's microseconds; and users lose useful functions like hour(), minute(), etc. on the column), we have decided to revert back to TimestampType, but this time we will enforce the hour to be consistent across system timezones (via offset manipulation). Full discussion with Wenchen Fan [~cloud_fan] regarding this ticket is here: https://github.com/apache/spark/pull/30902#discussion_r569186823 Related issues: [SPARK-33888|https://issues.apache.org/jira/browse/SPARK-33888] -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34356) OVR transform fix potential column conflict
[ https://issues.apache.org/jira/browse/SPARK-34356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278750#comment-17278750 ] Apache Spark commented on SPARK-34356: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/31472 > OVR transform fix potential column conflict > --- > > Key: SPARK-34356 > URL: https://issues.apache.org/jira/browse/SPARK-34356 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.2.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Major > > {code:java} > import org.apache.spark.ml.classification._ > val df = > spark.read.format("libsvm").load("/d0/Dev/Opensource/spark/data/mllib/sample_multiclass_classification_data.txt").withColumn("probability", > lit(0.0)) > val classifier = new > LogisticRegression().setMaxIter(1).setTol(1E-6).setFitIntercept(true) > val ovr = new OneVsRest().setClassifier(classifier) > val ovrm = ovr.fit(df) > ovrm.transform(df) > java.lang.IllegalArgumentException: requirement failed: Column probability > already exists. > at scala.Predef$.require(Predef.scala:281) > at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:106) > at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:96) > at > org.apache.spark.ml.classification.ProbabilisticClassifierParams.validateAndTransformSchema(ProbabilisticClassifier.scala:38) > at > org.apache.spark.ml.classification.ProbabilisticClassifierParams.validateAndTransformSchema$(ProbabilisticClassifier.scala:33) > at > org.apache.spark.ml.classification.LogisticRegressionModel.org$apache$spark$ml$classification$LogisticRegressionParams$$super$validateAndTransformSchema(LogisticRegression.scala:917) > at > org.apache.spark.ml.classification.LogisticRegressionParams.validateAndTransformSchema(LogisticRegression.scala:268) > at > org.apache.spark.ml.classification.LogisticRegressionParams.validateAndTransformSchema$(LogisticRegression.scala:255) > at > org.apache.spark.ml.classification.LogisticRegressionModel.validateAndTransformSchema(LogisticRegression.scala:917) > at org.apache.spark.ml.PredictionModel.transformSchema(Predictor.scala:222) > at > org.apache.spark.ml.classification.ClassificationModel.transformSchema(Classifier.scala:182) > at > org.apache.spark.ml.classification.ProbabilisticClassificationModel.transformSchema(ProbabilisticClassifier.scala:88) > at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:71) > at > org.apache.spark.ml.classification.ProbabilisticClassificationModel.transform(ProbabilisticClassifier.scala:107) > at > org.apache.spark.ml.classification.OneVsRestModel.$anonfun$transform$4(OneVsRest.scala:215) > at > scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60) > at > scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68) > at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:198) > at > org.apache.spark.ml.classification.OneVsRestModel.transform(OneVsRest.scala:203) > ... 49 elided {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
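The conflict arises because each intermediate binary model writes its own probability column into a DataFrame that may already carry a user column of the same name. A sketch of the general avoidance pattern (not necessarily the merged fix): route the intermediate model's output to a uniquely named temporary column and drop it after use.

{code:scala}
import java.util.UUID
import org.apache.spark.ml.classification.LogisticRegressionModel
import org.apache.spark.sql.DataFrame

// Sketch: score with a uniquely named temporary probability column so a
// pre-existing user column called "probability" is never clobbered.
// (rawPredictionCol and predictionCol would need the same treatment.)
def scoreWithTempColumn(model: LogisticRegressionModel, df: DataFrame): DataFrame = {
  val tmp = s"ovr_prob_${UUID.randomUUID().toString.takeRight(12)}"
  model.setProbabilityCol(tmp).transform(df).drop(tmp)
}
{code}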
[jira] [Assigned] (SPARK-34356) OVR transform fix potential column conflict
[ https://issues.apache.org/jira/browse/SPARK-34356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34356: Assignee: Apache Spark (was: zhengruifeng) > OVR transform fix potential column conflict > --- > > Key: SPARK-34356 > URL: https://issues.apache.org/jira/browse/SPARK-34356 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.2.0 >Reporter: zhengruifeng >Assignee: Apache Spark >Priority: Major > > {code:java} > import org.apache.spark.ml.classification._ > val df = > spark.read.format("libsvm").load("/d0/Dev/Opensource/spark/data/mllib/sample_multiclass_classification_data.txt").withColumn("probability", > lit(0.0)) > val classifier = new > LogisticRegression().setMaxIter(1).setTol(1E-6).setFitIntercept(true) > val ovr = new OneVsRest().setClassifier(classifier) > val ovrm = ovr.fit(df) > ovrm.transform(df) > java.lang.IllegalArgumentException: requirement failed: Column probability > already exists. > at scala.Predef$.require(Predef.scala:281) > at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:106) > at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:96) > at > org.apache.spark.ml.classification.ProbabilisticClassifierParams.validateAndTransformSchema(ProbabilisticClassifier.scala:38) > at > org.apache.spark.ml.classification.ProbabilisticClassifierParams.validateAndTransformSchema$(ProbabilisticClassifier.scala:33) > at > org.apache.spark.ml.classification.LogisticRegressionModel.org$apache$spark$ml$classification$LogisticRegressionParams$$super$validateAndTransformSchema(LogisticRegression.scala:917) > at > org.apache.spark.ml.classification.LogisticRegressionParams.validateAndTransformSchema(LogisticRegression.scala:268) > at > org.apache.spark.ml.classification.LogisticRegressionParams.validateAndTransformSchema$(LogisticRegression.scala:255) > at > org.apache.spark.ml.classification.LogisticRegressionModel.validateAndTransformSchema(LogisticRegression.scala:917) > at org.apache.spark.ml.PredictionModel.transformSchema(Predictor.scala:222) > at > org.apache.spark.ml.classification.ClassificationModel.transformSchema(Classifier.scala:182) > at > org.apache.spark.ml.classification.ProbabilisticClassificationModel.transformSchema(ProbabilisticClassifier.scala:88) > at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:71) > at > org.apache.spark.ml.classification.ProbabilisticClassificationModel.transform(ProbabilisticClassifier.scala:107) > at > org.apache.spark.ml.classification.OneVsRestModel.$anonfun$transform$4(OneVsRest.scala:215) > at > scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60) > at > scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68) > at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:198) > at > org.apache.spark.ml.classification.OneVsRestModel.transform(OneVsRest.scala:203) > ... 49 elided {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34356) OVR transform fix potential column conflict
[ https://issues.apache.org/jira/browse/SPARK-34356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-34356:
------------------------------------

    Assignee: zhengruifeng  (was: Apache Spark)

> OVR transform fix potential column conflict
> -------------------------------------------
>
>                 Key: SPARK-34356
>                 URL: https://issues.apache.org/jira/browse/SPARK-34356
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 3.2.0
>            Reporter: zhengruifeng
>            Assignee: zhengruifeng
>            Priority: Major
>
> {code:scala}
> import org.apache.spark.ml.classification._
> import org.apache.spark.sql.functions.lit
>
> val df = spark.read.format("libsvm")
>   .load("/d0/Dev/Opensource/spark/data/mllib/sample_multiclass_classification_data.txt")
>   .withColumn("probability", lit(0.0))
> val classifier = new LogisticRegression().setMaxIter(1).setTol(1E-6).setFitIntercept(true)
> val ovr = new OneVsRest().setClassifier(classifier)
> val ovrm = ovr.fit(df)
> ovrm.transform(df)
>
> java.lang.IllegalArgumentException: requirement failed: Column probability already exists.
> at scala.Predef$.require(Predef.scala:281)
> at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:106)
> at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:96)
> at org.apache.spark.ml.classification.ProbabilisticClassifierParams.validateAndTransformSchema(ProbabilisticClassifier.scala:38)
> at org.apache.spark.ml.classification.ProbabilisticClassifierParams.validateAndTransformSchema$(ProbabilisticClassifier.scala:33)
> at org.apache.spark.ml.classification.LogisticRegressionModel.org$apache$spark$ml$classification$LogisticRegressionParams$$super$validateAndTransformSchema(LogisticRegression.scala:917)
> at org.apache.spark.ml.classification.LogisticRegressionParams.validateAndTransformSchema(LogisticRegression.scala:268)
> at org.apache.spark.ml.classification.LogisticRegressionParams.validateAndTransformSchema$(LogisticRegression.scala:255)
> at org.apache.spark.ml.classification.LogisticRegressionModel.validateAndTransformSchema(LogisticRegression.scala:917)
> at org.apache.spark.ml.PredictionModel.transformSchema(Predictor.scala:222)
> at org.apache.spark.ml.classification.ClassificationModel.transformSchema(Classifier.scala:182)
> at org.apache.spark.ml.classification.ProbabilisticClassificationModel.transformSchema(ProbabilisticClassifier.scala:88)
> at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:71)
> at org.apache.spark.ml.classification.ProbabilisticClassificationModel.transform(ProbabilisticClassifier.scala:107)
> at org.apache.spark.ml.classification.OneVsRestModel.$anonfun$transform$4(OneVsRest.scala:215)
> at scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
> at scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
> at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:198)
> at org.apache.spark.ml.classification.OneVsRestModel.transform(OneVsRest.scala:203)
> ... 49 elided
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34356) OVR transform fix potential column conflict
[ https://issues.apache.org/jira/browse/SPARK-34356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278749#comment-17278749 ]

Apache Spark commented on SPARK-34356:
--------------------------------------

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/31472

> OVR transform fix potential column conflict
> -------------------------------------------
>
>                 Key: SPARK-34356
>                 URL: https://issues.apache.org/jira/browse/SPARK-34356
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 3.2.0
>            Reporter: zhengruifeng
>            Assignee: zhengruifeng
>            Priority: Major
>
> {code:scala}
> import org.apache.spark.ml.classification._
> import org.apache.spark.sql.functions.lit
>
> val df = spark.read.format("libsvm")
>   .load("/d0/Dev/Opensource/spark/data/mllib/sample_multiclass_classification_data.txt")
>   .withColumn("probability", lit(0.0))
> val classifier = new LogisticRegression().setMaxIter(1).setTol(1E-6).setFitIntercept(true)
> val ovr = new OneVsRest().setClassifier(classifier)
> val ovrm = ovr.fit(df)
> ovrm.transform(df)
>
> java.lang.IllegalArgumentException: requirement failed: Column probability already exists.
> at scala.Predef$.require(Predef.scala:281)
> at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:106)
> at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:96)
> at org.apache.spark.ml.classification.ProbabilisticClassifierParams.validateAndTransformSchema(ProbabilisticClassifier.scala:38)
> at org.apache.spark.ml.classification.ProbabilisticClassifierParams.validateAndTransformSchema$(ProbabilisticClassifier.scala:33)
> at org.apache.spark.ml.classification.LogisticRegressionModel.org$apache$spark$ml$classification$LogisticRegressionParams$$super$validateAndTransformSchema(LogisticRegression.scala:917)
> at org.apache.spark.ml.classification.LogisticRegressionParams.validateAndTransformSchema(LogisticRegression.scala:268)
> at org.apache.spark.ml.classification.LogisticRegressionParams.validateAndTransformSchema$(LogisticRegression.scala:255)
> at org.apache.spark.ml.classification.LogisticRegressionModel.validateAndTransformSchema(LogisticRegression.scala:917)
> at org.apache.spark.ml.PredictionModel.transformSchema(Predictor.scala:222)
> at org.apache.spark.ml.classification.ClassificationModel.transformSchema(Classifier.scala:182)
> at org.apache.spark.ml.classification.ProbabilisticClassificationModel.transformSchema(ProbabilisticClassifier.scala:88)
> at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:71)
> at org.apache.spark.ml.classification.ProbabilisticClassificationModel.transform(ProbabilisticClassifier.scala:107)
> at org.apache.spark.ml.classification.OneVsRestModel.$anonfun$transform$4(OneVsRest.scala:215)
> at scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
> at scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
> at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:198)
> at org.apache.spark.ml.classification.OneVsRestModel.transform(OneVsRest.scala:203)
> ... 49 elided
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34356) OVR transform fix potential column conflict
[ https://issues.apache.org/jira/browse/SPARK-34356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhengruifeng updated SPARK-34356:
---------------------------------
    Summary: OVR transform fix potential column conflict  (was: OVR transform avoid potential column conflict)

> OVR transform fix potential column conflict
> -------------------------------------------
>
>                 Key: SPARK-34356
>                 URL: https://issues.apache.org/jira/browse/SPARK-34356
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 3.2.0
>            Reporter: zhengruifeng
>            Assignee: zhengruifeng
>            Priority: Major
>
> {code:scala}
> import org.apache.spark.ml.classification._
> import org.apache.spark.sql.functions.lit
>
> val df = spark.read.format("libsvm")
>   .load("/d0/Dev/Opensource/spark/data/mllib/sample_multiclass_classification_data.txt")
>   .withColumn("probability", lit(0.0))
> val classifier = new LogisticRegression().setMaxIter(1).setTol(1E-6).setFitIntercept(true)
> val ovr = new OneVsRest().setClassifier(classifier)
> val ovrm = ovr.fit(df)
> ovrm.transform(df)
>
> java.lang.IllegalArgumentException: requirement failed: Column probability already exists.
> at scala.Predef$.require(Predef.scala:281)
> at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:106)
> at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:96)
> at org.apache.spark.ml.classification.ProbabilisticClassifierParams.validateAndTransformSchema(ProbabilisticClassifier.scala:38)
> at org.apache.spark.ml.classification.ProbabilisticClassifierParams.validateAndTransformSchema$(ProbabilisticClassifier.scala:33)
> at org.apache.spark.ml.classification.LogisticRegressionModel.org$apache$spark$ml$classification$LogisticRegressionParams$$super$validateAndTransformSchema(LogisticRegression.scala:917)
> at org.apache.spark.ml.classification.LogisticRegressionParams.validateAndTransformSchema(LogisticRegression.scala:268)
> at org.apache.spark.ml.classification.LogisticRegressionParams.validateAndTransformSchema$(LogisticRegression.scala:255)
> at org.apache.spark.ml.classification.LogisticRegressionModel.validateAndTransformSchema(LogisticRegression.scala:917)
> at org.apache.spark.ml.PredictionModel.transformSchema(Predictor.scala:222)
> at org.apache.spark.ml.classification.ClassificationModel.transformSchema(Classifier.scala:182)
> at org.apache.spark.ml.classification.ProbabilisticClassificationModel.transformSchema(ProbabilisticClassifier.scala:88)
> at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:71)
> at org.apache.spark.ml.classification.ProbabilisticClassificationModel.transform(ProbabilisticClassifier.scala:107)
> at org.apache.spark.ml.classification.OneVsRestModel.$anonfun$transform$4(OneVsRest.scala:215)
> at scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
> at scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
> at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:198)
> at org.apache.spark.ml.classification.OneVsRestModel.transform(OneVsRest.scala:203)
> ... 49 elided
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34356) OVR transform avoid potential column conflict
[ https://issues.apache.org/jira/browse/SPARK-34356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhengruifeng reassigned SPARK-34356:
------------------------------------

    Assignee: zhengruifeng

> OVR transform avoid potential column conflict
> ---------------------------------------------
>
>                 Key: SPARK-34356
>                 URL: https://issues.apache.org/jira/browse/SPARK-34356
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 3.2.0
>            Reporter: zhengruifeng
>            Assignee: zhengruifeng
>            Priority: Major
>
> {code:scala}
> import org.apache.spark.ml.classification._
> import org.apache.spark.sql.functions.lit
>
> val df = spark.read.format("libsvm")
>   .load("/d0/Dev/Opensource/spark/data/mllib/sample_multiclass_classification_data.txt")
>   .withColumn("probability", lit(0.0))
> val classifier = new LogisticRegression().setMaxIter(1).setTol(1E-6).setFitIntercept(true)
> val ovr = new OneVsRest().setClassifier(classifier)
> val ovrm = ovr.fit(df)
> ovrm.transform(df)
>
> java.lang.IllegalArgumentException: requirement failed: Column probability already exists.
> at scala.Predef$.require(Predef.scala:281)
> at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:106)
> at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:96)
> at org.apache.spark.ml.classification.ProbabilisticClassifierParams.validateAndTransformSchema(ProbabilisticClassifier.scala:38)
> at org.apache.spark.ml.classification.ProbabilisticClassifierParams.validateAndTransformSchema$(ProbabilisticClassifier.scala:33)
> at org.apache.spark.ml.classification.LogisticRegressionModel.org$apache$spark$ml$classification$LogisticRegressionParams$$super$validateAndTransformSchema(LogisticRegression.scala:917)
> at org.apache.spark.ml.classification.LogisticRegressionParams.validateAndTransformSchema(LogisticRegression.scala:268)
> at org.apache.spark.ml.classification.LogisticRegressionParams.validateAndTransformSchema$(LogisticRegression.scala:255)
> at org.apache.spark.ml.classification.LogisticRegressionModel.validateAndTransformSchema(LogisticRegression.scala:917)
> at org.apache.spark.ml.PredictionModel.transformSchema(Predictor.scala:222)
> at org.apache.spark.ml.classification.ClassificationModel.transformSchema(Classifier.scala:182)
> at org.apache.spark.ml.classification.ProbabilisticClassificationModel.transformSchema(ProbabilisticClassifier.scala:88)
> at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:71)
> at org.apache.spark.ml.classification.ProbabilisticClassificationModel.transform(ProbabilisticClassifier.scala:107)
> at org.apache.spark.ml.classification.OneVsRestModel.$anonfun$transform$4(OneVsRest.scala:215)
> at scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
> at scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
> at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:198)
> at org.apache.spark.ml.classification.OneVsRestModel.transform(OneVsRest.scala:203)
> ... 49 elided
> {code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34356) OVR transform avoid potential column conflict
zhengruifeng created SPARK-34356:
------------------------------------

             Summary: OVR transform avoid potential column conflict
                 Key: SPARK-34356
                 URL: https://issues.apache.org/jira/browse/SPARK-34356
             Project: Spark
          Issue Type: Improvement
          Components: ML
    Affects Versions: 3.2.0
            Reporter: zhengruifeng


{code:scala}
import org.apache.spark.ml.classification._
import org.apache.spark.sql.functions.lit

val df = spark.read.format("libsvm")
  .load("/d0/Dev/Opensource/spark/data/mllib/sample_multiclass_classification_data.txt")
  .withColumn("probability", lit(0.0))
val classifier = new LogisticRegression().setMaxIter(1).setTol(1E-6).setFitIntercept(true)
val ovr = new OneVsRest().setClassifier(classifier)
val ovrm = ovr.fit(df)
ovrm.transform(df)

java.lang.IllegalArgumentException: requirement failed: Column probability already exists.
at scala.Predef$.require(Predef.scala:281)
at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:106)
at org.apache.spark.ml.util.SchemaUtils$.appendColumn(SchemaUtils.scala:96)
at org.apache.spark.ml.classification.ProbabilisticClassifierParams.validateAndTransformSchema(ProbabilisticClassifier.scala:38)
at org.apache.spark.ml.classification.ProbabilisticClassifierParams.validateAndTransformSchema$(ProbabilisticClassifier.scala:33)
at org.apache.spark.ml.classification.LogisticRegressionModel.org$apache$spark$ml$classification$LogisticRegressionParams$$super$validateAndTransformSchema(LogisticRegression.scala:917)
at org.apache.spark.ml.classification.LogisticRegressionParams.validateAndTransformSchema(LogisticRegression.scala:268)
at org.apache.spark.ml.classification.LogisticRegressionParams.validateAndTransformSchema$(LogisticRegression.scala:255)
at org.apache.spark.ml.classification.LogisticRegressionModel.validateAndTransformSchema(LogisticRegression.scala:917)
at org.apache.spark.ml.PredictionModel.transformSchema(Predictor.scala:222)
at org.apache.spark.ml.classification.ClassificationModel.transformSchema(Classifier.scala:182)
at org.apache.spark.ml.classification.ProbabilisticClassificationModel.transformSchema(ProbabilisticClassifier.scala:88)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:71)
at org.apache.spark.ml.classification.ProbabilisticClassificationModel.transform(ProbabilisticClassifier.scala:107)
at org.apache.spark.ml.classification.OneVsRestModel.$anonfun$transform$4(OneVsRest.scala:215)
at scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
at scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:198)
at org.apache.spark.ml.classification.OneVsRestModel.transform(OneVsRest.scala:203)
... 49 elided
{code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
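The root cause is visible in the trace above: OneVsRestModel.transform folds each internal binary model's transform over the input DataFrame, and every such model appends fixed-name output columns (rawPrediction, probability, prediction) that can collide with columns the caller already has. One collision-proof pattern is to point the inner models at randomly suffixed temporary columns and drop them after use. The helper below is a hypothetical sketch of that pattern, in the spirit of the fix in https://github.com/apache/spark/pull/31472 but not the committed code; the function name and column-name scheme are invented for illustration.

{code:scala}
import java.util.UUID

import org.apache.spark.ml.classification.ProbabilisticClassificationModel
import org.apache.spark.sql.DataFrame

// Hypothetical helper (not Spark API): run one inner model with temporary,
// collision-proof output column names, returning the transformed frame and
// the name of the temporary probability column for the caller to consume.
def transformWithTempCols[F, M <: ProbabilisticClassificationModel[F, M]](
    model: M, dataset: DataFrame): (DataFrame, String) = {
  val suffix  = UUID.randomUUID().toString.takeRight(12)
  val tmpRaw  = s"raw_$suffix"   // temporary rawPrediction column
  val tmpProb = s"prob_$suffix"  // temporary probability column
  val tmpPred = s"pred_$suffix"  // temporary prediction column
  val out = model
    .setRawPredictionCol(tmpRaw) // note: mutates the model's params in place
    .setProbabilityCol(tmpProb)
    .setPredictionCol(tmpPred)
    .transform(dataset)
  // Drop the other temporaries immediately; the caller drops tmpProb once
  // its values are consumed, leaving the input schema untouched.
  (out.drop(tmpRaw, tmpPred), tmpProb)
}
{code}

In the OVR fold, the caller would read the temporary probability column for each class, combine the per-class scores into the final prediction, and drop the temporary column before returning.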