[jira] [Created] (SPARK-34535) Cleanup unused symbol in Orc related code

2021-02-24 Thread Yang Jie (Jira)
Yang Jie created SPARK-34535:


 Summary: Cleanup unused symbol in Orc related code
 Key: SPARK-34535
 URL: https://issues.apache.org/jira/browse/SPARK-34535
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Yang Jie


Clean up unused symbols in ORC-related code, including `OrcDeserializer`, 
`OrcFilters` and `OrcPartitionReaderFactory`.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34534) New protocol FetchShuffleBlocks in OneForOneBlockFetcher lead to data loss or correctness

2021-02-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290744#comment-17290744
 ] 

Apache Spark commented on SPARK-34534:
--

User 'seayoun' has created a pull request for this issue:
https://github.com/apache/spark/pull/31643

> New protocol FetchShuffleBlocks in OneForOneBlockFetcher lead to data loss or 
> correctness
> -
>
> Key: SPARK-34534
> URL: https://issues.apache.org/jira/browse/SPARK-34534
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.0.0, 3.0.1, 3.0.2
>Reporter: haiyangyu
>Priority: Major
>  Labels: Correctness, data-loss
> Attachments: image-2021-02-25-11-17-12-714.png, 
> image-2021-02-25-11-27-34-429.png, image-2021-02-25-11-28-31-255.png, 
> image-2021-02-25-11-30-03-834.png, image-2021-02-25-11-31-59-110.png
>
>
> A new RPC message `FetchShuffleBlocks` is built when `OneForOneBlockFetcher` 
> is initialized, in place of `OpenBlocks`, to use the adaptive feature. This 
> introduces the following problem.
> `OneForOneBlockFetcher` initializes a `blockIds` String array to track chunk 
> fetch successes; it uses the index into `blockIds` to fetch blocks and to 
> match the blockId when chunk data is returned. So the order of `blockIds` must 
> be consistent with the fetchChunk index, but the new `FetchShuffleBlocks` 
> returns chunks in an order that differs from `blockIds`.
> As a result, the returned data does not match the blockId, which can lead to 
> data correctness issues when retrying a fetch after a block chunk fetch fails.
> The chunk fetch order code, and the code that matches the blockId when data 
> is returned, are as follows:
> !image-2021-02-25-11-27-34-429.png|width=446,height=251!!image-2021-02-25-11-28-31-255.png|width=445,height=159!
> However, the fetch order in the shuffle service is:
> !image-2021-02-25-11-30-03-834.png|width=510,height=361!
> So some wrong block data is fetched when a chunk fetch fails, because of the 
> blocks' wrong order.
> !image-2021-02-25-11-31-59-110.png|width=601,height=204!
>  
>  
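
A toy Scala sketch of the ordering assumption described above (an illustration 
with made-up block IDs, not Spark's actual `OneForOneBlockFetcher` code): the 
client resolves each returned chunk to a block ID purely by its index, so any 
server-side reordering silently attributes data to the wrong block.

{code:scala}
object ChunkOrderDemo {
  def main(args: Array[String]): Unit = {
    // Order in which the client built its blockIds array.
    val blockIds = Array("shuffle_0_1_0", "shuffle_0_2_0", "shuffle_0_3_0")

    // Order in which the server actually streams the chunks back
    // (e.g. after the FetchShuffleBlocks message regroups the blocks).
    val serverOrder = Seq("shuffle_0_2_0", "shuffle_0_1_0", "shuffle_0_3_0")

    serverOrder.zipWithIndex.foreach { case (actualBlock, chunkIndex) =>
      // The client assumes chunk i carries the data of blockIds(i).
      val assumedBlock = blockIds(chunkIndex)
      val status =
        if (assumedBlock == actualBlock) "ok" else "MISMATCH -> wrong data on retry"
      println(s"chunk $chunkIndex: data from $actualBlock recorded as $assumedBlock ($status)")
    }
  }
}
{code}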



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34534) New protocol FetchShuffleBlocks in OneForOneBlockFetcher lead to data loss or correctness

2021-02-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34534:


Assignee: (was: Apache Spark)

> New protocol FetchShuffleBlocks in OneForOneBlockFetcher lead to data loss or 
> correctness
> -
>
> Key: SPARK-34534
> URL: https://issues.apache.org/jira/browse/SPARK-34534
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.0.0, 3.0.1, 3.0.2
>Reporter: haiyangyu
>Priority: Major
>  Labels: Correctness, data-loss
> Attachments: image-2021-02-25-11-17-12-714.png, 
> image-2021-02-25-11-27-34-429.png, image-2021-02-25-11-28-31-255.png, 
> image-2021-02-25-11-30-03-834.png, image-2021-02-25-11-31-59-110.png
>
>
> A new RPC message `FetchShuffleBlocks` is built when `OneForOneBlockFetcher` 
> is initialized, in place of `OpenBlocks`, to use the adaptive feature. This 
> introduces the following problem.
> `OneForOneBlockFetcher` initializes a `blockIds` String array to track chunk 
> fetch successes; it uses the index into `blockIds` to fetch blocks and to 
> match the blockId when chunk data is returned. So the order of `blockIds` must 
> be consistent with the fetchChunk index, but the new `FetchShuffleBlocks` 
> returns chunks in an order that differs from `blockIds`.
> As a result, the returned data does not match the blockId, which can lead to 
> data correctness issues when retrying a fetch after a block chunk fetch fails.
> The chunk fetch order code, and the code that matches the blockId when data 
> is returned, are as follows:
> !image-2021-02-25-11-27-34-429.png|width=446,height=251!!image-2021-02-25-11-28-31-255.png|width=445,height=159!
> However, the fetch order in the shuffle service is:
> !image-2021-02-25-11-30-03-834.png|width=510,height=361!
> So some wrong block data is fetched when a chunk fetch fails, because of the 
> blocks' wrong order.
> !image-2021-02-25-11-31-59-110.png|width=601,height=204!
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34534) New protocol FetchShuffleBlocks in OneForOneBlockFetcher lead to data loss or correctness

2021-02-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34534:


Assignee: Apache Spark

> New protocol FetchShuffleBlocks in OneForOneBlockFetcher lead to data loss or 
> correctness
> -
>
> Key: SPARK-34534
> URL: https://issues.apache.org/jira/browse/SPARK-34534
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.0.0, 3.0.1, 3.0.2
>Reporter: haiyangyu
>Assignee: Apache Spark
>Priority: Major
>  Labels: Correctness, data-loss
> Attachments: image-2021-02-25-11-17-12-714.png, 
> image-2021-02-25-11-27-34-429.png, image-2021-02-25-11-28-31-255.png, 
> image-2021-02-25-11-30-03-834.png, image-2021-02-25-11-31-59-110.png
>
>
> A new RPC message `FetchShuffleBlocks` is built when `OneForOneBlockFetcher` 
> is initialized, in place of `OpenBlocks`, to use the adaptive feature. This 
> introduces the following problem.
> `OneForOneBlockFetcher` initializes a `blockIds` String array to track chunk 
> fetch successes; it uses the index into `blockIds` to fetch blocks and to 
> match the blockId when chunk data is returned. So the order of `blockIds` must 
> be consistent with the fetchChunk index, but the new `FetchShuffleBlocks` 
> returns chunks in an order that differs from `blockIds`.
> As a result, the returned data does not match the blockId, which can lead to 
> data correctness issues when retrying a fetch after a block chunk fetch fails.
> The chunk fetch order code, and the code that matches the blockId when data 
> is returned, are as follows:
> !image-2021-02-25-11-27-34-429.png|width=446,height=251!!image-2021-02-25-11-28-31-255.png|width=445,height=159!
> However, the fetch order in the shuffle service is:
> !image-2021-02-25-11-30-03-834.png|width=510,height=361!
> So some wrong block data is fetched when a chunk fetch fails, because of the 
> blocks' wrong order.
> !image-2021-02-25-11-31-59-110.png|width=601,height=204!
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34534) New protocol FetchShuffleBlocks in OneForOneBlockFetcher lead to data loss or correctness

2021-02-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290743#comment-17290743
 ] 

Apache Spark commented on SPARK-34534:
--

User 'seayoun' has created a pull request for this issue:
https://github.com/apache/spark/pull/31643

> New protocol FetchShuffleBlocks in OneForOneBlockFetcher lead to data loss or 
> correctness
> -
>
> Key: SPARK-34534
> URL: https://issues.apache.org/jira/browse/SPARK-34534
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.0.0, 3.0.1, 3.0.2
>Reporter: haiyangyu
>Priority: Major
>  Labels: Correctness, data-loss
> Attachments: image-2021-02-25-11-17-12-714.png, 
> image-2021-02-25-11-27-34-429.png, image-2021-02-25-11-28-31-255.png, 
> image-2021-02-25-11-30-03-834.png, image-2021-02-25-11-31-59-110.png
>
>
> A new RPC message `FetchShuffleBlocks` is built when `OneForOneBlockFetcher` 
> is initialized, in place of `OpenBlocks`, to use the adaptive feature. This 
> introduces the following problem.
> `OneForOneBlockFetcher` initializes a `blockIds` String array to track chunk 
> fetch successes; it uses the index into `blockIds` to fetch blocks and to 
> match the blockId when chunk data is returned. So the order of `blockIds` must 
> be consistent with the fetchChunk index, but the new `FetchShuffleBlocks` 
> returns chunks in an order that differs from `blockIds`.
> As a result, the returned data does not match the blockId, which can lead to 
> data correctness issues when retrying a fetch after a block chunk fetch fails.
> The chunk fetch order code, and the code that matches the blockId when data 
> is returned, are as follows:
> !image-2021-02-25-11-27-34-429.png|width=446,height=251!!image-2021-02-25-11-28-31-255.png|width=445,height=159!
> However, the fetch order in the shuffle service is:
> !image-2021-02-25-11-30-03-834.png|width=510,height=361!
> So some wrong block data is fetched when a chunk fetch fails, because of the 
> blocks' wrong order.
> !image-2021-02-25-11-31-59-110.png|width=601,height=204!
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile

2021-02-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290726#comment-17290726
 ] 

Apache Spark commented on SPARK-33212:
--

User 'sunchao' has created a pull request for this issue:
https://github.com/apache/spark/pull/31642

> Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
> -
>
> Key: SPARK-33212
> URL: https://issues.apache.org/jira/browse/SPARK-33212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit, SQL, YARN
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.2.0
>
>
> Hadoop 3.x+ offers shaded client jars: hadoop-client-api and 
> hadoop-client-runtime, which shade 3rd-party dependencies such as Guava, 
> protobuf, jetty etc. This Jira switches Spark to use these jars instead of 
> hadoop-common, hadoop-client etc. Benefits include:
>  * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer 
> versions of Hadoop have migrated to Guava 27.0+, and in order to resolve Guava 
> conflicts, Spark depends on Hadoop not leaking these dependencies.
>  * It makes the Spark/Hadoop dependency graph cleaner. Currently Spark uses both 
> client-side and server-side Hadoop APIs from modules such as hadoop-common, 
> hadoop-yarn-server-common etc. Moving to hadoop-client-api allows us to use only 
> the public/client API from the Hadoop side.
>  * It provides better isolation from Hadoop dependencies. In the future, Spark can 
> evolve without worrying about dependencies pulled in from the Hadoop side 
> (which used to be a lot).
> *There are some behavior changes introduced with this JIRA when people use 
> Spark compiled with Hadoop 3.x:*
> - Users now need to make sure the class path contains the `hadoop-client-api` and 
> `hadoop-client-runtime` jars when they deploy Spark with the 
> `hadoop-provided` option. In addition, it is highly recommended that they put 
> these two jars before other Hadoop jars in the class path. Otherwise, 
> conflicts such as those from Guava could happen if classes are loaded from the 
> other, non-shaded Hadoop jars.
> - Since the new shaded Hadoop clients no longer include 3rd-party 
> dependencies, users who used to depend on these now need to explicitly put 
> the jars on their class path.
> Ideally the above should go into the release notes.
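
As a rough illustration of the class-path advice above, here is a hypothetical 
Scala snippet (all jar paths are placeholders, and `spark.driver.extraClassPath` 
normally has to be supplied before the driver JVM starts, e.g. in 
spark-defaults.conf or via --conf on spark-submit). The only point it shows is 
the ordering: the shaded client jars come before any non-shaded Hadoop jars.

{code:scala}
import org.apache.spark.SparkConf

// Hypothetical jar locations; only the ordering matters here:
// shaded client jars first, any other (non-shaded) Hadoop jars after them.
val shadedFirst = Seq(
  "/opt/hadoop/client/hadoop-client-api-3.2.2.jar",
  "/opt/hadoop/client/hadoop-client-runtime-3.2.2.jar",
  "/opt/hadoop/share/hadoop/common/*"
).mkString(":")

val conf = new SparkConf()
  .set("spark.executor.extraClassPath", shadedFirst)
  .set("spark.driver.extraClassPath", shadedFirst) // effective only if set before launch
{code}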



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile

2021-02-24 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290707#comment-17290707
 ] 

Chao Sun commented on SPARK-33212:
--

Yes. I think the only class Spark needs from this jar is 
{{org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter}}, which, together 
with the two other classes it depends on from the same package, has no Guava 
dependency except {{VisibleForTesting}}.

> Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
> -
>
> Key: SPARK-33212
> URL: https://issues.apache.org/jira/browse/SPARK-33212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit, SQL, YARN
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.2.0
>
>
> Hadoop 3.x+ offers shaded client jars: hadoop-client-api and 
> hadoop-client-runtime, which shade 3rd-party dependencies such as Guava, 
> protobuf, jetty etc. This Jira switches Spark to use these jars instead of 
> hadoop-common, hadoop-client etc. Benefits include:
>  * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer 
> versions of Hadoop have migrated to Guava 27.0+, and in order to resolve Guava 
> conflicts, Spark depends on Hadoop not leaking these dependencies.
>  * It makes the Spark/Hadoop dependency graph cleaner. Currently Spark uses both 
> client-side and server-side Hadoop APIs from modules such as hadoop-common, 
> hadoop-yarn-server-common etc. Moving to hadoop-client-api allows us to use only 
> the public/client API from the Hadoop side.
>  * It provides better isolation from Hadoop dependencies. In the future, Spark can 
> evolve without worrying about dependencies pulled in from the Hadoop side 
> (which used to be a lot).
> *There are some behavior changes introduced with this JIRA when people use 
> Spark compiled with Hadoop 3.x:*
> - Users now need to make sure the class path contains the `hadoop-client-api` and 
> `hadoop-client-runtime` jars when they deploy Spark with the 
> `hadoop-provided` option. In addition, it is highly recommended that they put 
> these two jars before other Hadoop jars in the class path. Otherwise, 
> conflicts such as those from Guava could happen if classes are loaded from the 
> other, non-shaded Hadoop jars.
> - Since the new shaded Hadoop clients no longer include 3rd-party 
> dependencies, users who used to depend on these now need to explicitly put 
> the jars on their class path.
> Ideally the above should go into the release notes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34533) Eliminate LEFT ANTI join to empty relation in AQE

2021-02-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290676#comment-17290676
 ] 

Apache Spark commented on SPARK-34533:
--

User 'c21' has created a pull request for this issue:
https://github.com/apache/spark/pull/31641

> Eliminate LEFT ANTI join to empty relation in AQE
> -
>
> Key: SPARK-34533
> URL: https://issues.apache.org/jira/browse/SPARK-34533
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Priority: Minor
>
> I discovered from the review discussion 
> [https://github.com/apache/spark/pull/31630#discussion_r581774000] that we 
> can eliminate a LEFT ANTI join (with no join condition) to an empty relation if 
> the right side is known to be non-empty. So with AQE this is doable, similar 
> to [https://github.com/apache/spark/pull/29484].
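
A minimal, runnable Scala sketch of the semantics this relies on (an 
illustration with made-up data, not the AQE rule itself): a condition that 
always matches is what "no join condition" amounts to, so a LEFT ANTI join 
returns nothing whenever the right side is non-empty, and returns the left side 
unchanged when the right side is empty.

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

object AntiJoinEliminationDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("anti-join-demo").getOrCreate()
    import spark.implicits._

    val left  = Seq(1, 2, 3).toDF("a")
    val right = Seq(10).toDF("b") // right side known to be non-empty

    // Every left row matches, so LEFT ANTI keeps nothing: once the right side
    // is observed to be non-empty, the subtree is equivalent to an empty relation.
    left.join(right, lit(true), "left_anti").show()           // 0 rows

    // With an empty right side nothing matches, and the left side passes through.
    left.join(right.limit(0), lit(true), "left_anti").show()  // 3 rows

    spark.stop()
  }
}
{code}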



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34533) Eliminate LEFT ANTI join to empty relation in AQE

2021-02-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290675#comment-17290675
 ] 

Apache Spark commented on SPARK-34533:
--

User 'c21' has created a pull request for this issue:
https://github.com/apache/spark/pull/31641

> Eliminate LEFT ANTI join to empty relation in AQE
> -
>
> Key: SPARK-34533
> URL: https://issues.apache.org/jira/browse/SPARK-34533
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Priority: Minor
>
> I discovered from the review discussion 
> [https://github.com/apache/spark/pull/31630#discussion_r581774000] that we 
> can eliminate a LEFT ANTI join (with no join condition) to an empty relation if 
> the right side is known to be non-empty. So with AQE this is doable, similar 
> to [https://github.com/apache/spark/pull/29484].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34533) Eliminate LEFT ANTI join to empty relation in AQE

2021-02-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34533:


Assignee: Apache Spark

> Eliminate LEFT ANTI join to empty relation in AQE
> -
>
> Key: SPARK-34533
> URL: https://issues.apache.org/jira/browse/SPARK-34533
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Assignee: Apache Spark
>Priority: Minor
>
> I discovered from the review discussion 
> [https://github.com/apache/spark/pull/31630#discussion_r581774000] that we 
> can eliminate a LEFT ANTI join (with no join condition) to an empty relation if 
> the right side is known to be non-empty. So with AQE this is doable, similar 
> to [https://github.com/apache/spark/pull/29484].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34533) Eliminate LEFT ANTI join to empty relation in AQE

2021-02-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34533:


Assignee: (was: Apache Spark)

> Eliminate LEFT ANTI join to empty relation in AQE
> -
>
> Key: SPARK-34533
> URL: https://issues.apache.org/jira/browse/SPARK-34533
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Cheng Su
>Priority: Minor
>
> I discovered from the review discussion 
> [https://github.com/apache/spark/pull/31630#discussion_r581774000] that we 
> can eliminate a LEFT ANTI join (with no join condition) to an empty relation if 
> the right side is known to be non-empty. So with AQE this is doable, similar 
> to [https://github.com/apache/spark/pull/29484].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34520) Remove unused SecurityManager references

2021-02-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-34520:
-

Assignee: Hyukjin Kwon

> Remove unused SecurityManager references
> 
>
> Key: SPARK-34520
> URL: https://issues.apache.org/jira/browse/SPARK-34520
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> SecurityManager is no longer used in many places. Most of these usages were 
> introduced in SPARK-1189, but they were removed in SPARK-27004 and SPARK-33925 
> as they are stale.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34520) Remove unused SecurityManager references

2021-02-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-34520.
---
Fix Version/s: 3.2.0
   Resolution: Fixed

Issue resolved by pull request 31636
[https://github.com/apache/spark/pull/31636]

> Remove unused SecurityManager references
> 
>
> Key: SPARK-34520
> URL: https://issues.apache.org/jira/browse/SPARK-34520
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.2.0
>
>
> SecurityManager is no longer used in many places. Most of these usages were 
> introduced in SPARK-1189, but they were removed in SPARK-27004 and SPARK-33925 
> as they are stale.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile

2021-02-24 Thread Xiaochen Ouyang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290656#comment-17290656
 ] 

Xiaochen Ouyang commented on SPARK-33212:
-

Maybe we should confirm there are no direct Guava references in the 
hadoop-yarn-server-web-proxy module. Otherwise it will bring some Guava 
conflicts.

> Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
> -
>
> Key: SPARK-33212
> URL: https://issues.apache.org/jira/browse/SPARK-33212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit, SQL, YARN
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.2.0
>
>
> Hadoop 3.x+ offers shaded client jars: hadoop-client-api and 
> hadoop-client-runtime, which shade 3rd-party dependencies such as Guava, 
> protobuf, jetty etc. This Jira switches Spark to use these jars instead of 
> hadoop-common, hadoop-client etc. Benefits include:
>  * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer 
> versions of Hadoop have migrated to Guava 27.0+, and in order to resolve Guava 
> conflicts, Spark depends on Hadoop not leaking these dependencies.
>  * It makes the Spark/Hadoop dependency graph cleaner. Currently Spark uses both 
> client-side and server-side Hadoop APIs from modules such as hadoop-common, 
> hadoop-yarn-server-common etc. Moving to hadoop-client-api allows us to use only 
> the public/client API from the Hadoop side.
>  * It provides better isolation from Hadoop dependencies. In the future, Spark can 
> evolve without worrying about dependencies pulled in from the Hadoop side 
> (which used to be a lot).
> *There are some behavior changes introduced with this JIRA when people use 
> Spark compiled with Hadoop 3.x:*
> - Users now need to make sure the class path contains the `hadoop-client-api` and 
> `hadoop-client-runtime` jars when they deploy Spark with the 
> `hadoop-provided` option. In addition, it is highly recommended that they put 
> these two jars before other Hadoop jars in the class path. Otherwise, 
> conflicts such as those from Guava could happen if classes are loaded from the 
> other, non-shaded Hadoop jars.
> - Since the new shaded Hadoop clients no longer include 3rd-party 
> dependencies, users who used to depend on these now need to explicitly put 
> the jars on their class path.
> Ideally the above should go into the release notes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34529) spark.read.csv is throwing exception "'lineSep' can contain only 1 character" when parsing Windows line feed (CR LF)

2021-02-24 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290652#comment-17290652
 ] 

Yang Jie commented on SPARK-34529:
--

There seems to be some discussion 
[before|https://github.com/apache/spark/pull/23080/files#r272690095]

> spark.read.csv is throwing exception "'lineSep' can contain only 1 character" 
> when parsing Windows line feed (CR LF)
> 
>
> Key: SPARK-34529
> URL: https://issues.apache.org/jira/browse/SPARK-34529
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.2.0, 3.1.1, 3.0.3
>Reporter: Shanmugavel Kuttiyandi Chandrakasu
>Priority: Minor
>
> The `lineSep` documentation says:
> `lineSep` (default covers all `\r`, `\r\n` and `\n`): defines the line 
> separator that should be used for parsing. Maximum length is 1 character.
> Reference: 
> [https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader]
> When reading a CSV file using Spark:
> src_df = (spark.read
>   .option("header", "true")
>   .option("multiLine", "true")
>   .option("escape", "ǁ")
>   .option("lineSep", "\r\n")
>   .schema(materialusetype_Schema)
>   .option("badRecordsPath", "/fh_badfile")
>   .csv("/crlf.csv")
> )
> Below is the stack trace:
> java.lang.IllegalArgumentException: requirement failed: 'lineSep' can contain 
> only 1 character.java.lang.IllegalArgumentException: requirement failed: 
> 'lineSep' can contain only 1 character. at 
> scala.Predef$.require(Predef.scala:281) at 
> org.apache.spark.sql.catalyst.csv.CSVOptions.$anonfun$lineSeparator$1(CSVOptions.scala:209)
>  at scala.Option.map(Option.scala:230) at 
> org.apache.spark.sql.catalyst.csv.CSVOptions.<init>(CSVOptions.scala:207) at 
> org.apache.spark.sql.catalyst.csv.CSVOptions.<init>(CSVOptions.scala:58) at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.buildReader(CSVFileFormat.scala:108)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormat.buildReaderWithPartitionValues(FileFormat.scala:132)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormat.buildReaderWithPartitionValues$(FileFormat.scala:123)
>  at 
> org.apache.spark.sql.execution.datasources.TextBasedFileFormat.buildReaderWithPartitionValues(FileFormat.scala:162)
>  at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:510)
>  at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:497)
>  at 
> org.apache.spark.sql.execution.FileSourceScanExec.doExecute(DataSourceScanExec.scala:692)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:196)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:240)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:236) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:192) at 
> org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:79) 
> at 
> org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:88) 
> at 
> org.apache.spark.sql.execution.collect.InternalRowFormat$.collect(cachedSparkResults.scala:61)
>  at 
> org.apache.spark.sql.execution.collect.InternalRowFormat$.collect(cachedSparkResults.scala:57)
>  at 
> org.apache.spark.sql.execution.ResultCacheManager.$anonfun$getOrComputeResultInternal$1(ResultCacheManager.scala:483)
>  at scala.Option.getOrElse(Option.scala:189) at 
> org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResultInternal(ResultCacheManager.scala:483)
>  at 
> org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:427)
>  at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollectResult(limit.scala:58)
>  at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:3013) at 
> org.apache.spark.sql.Dataset.$anonfun$collectResult$1(Dataset.scala:3004) at 
> org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3728) at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$5(SQLExecution.scala:116)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:248)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:101)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:841) at 
> org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:77)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:198)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3726) 

[jira] [Updated] (SPARK-34534) New protocol FetchShuffleBlocks in OneForOneBlockFetcher lead to data loss or correctness

2021-02-24 Thread haiyangyu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haiyangyu updated SPARK-34534:
--
Description: 
A new RPC message `FetchShuffleBlocks` is built when `OneForOneBlockFetcher` is 
initialized, in place of `OpenBlocks`, to use the adaptive feature. This 
introduces the following problem.

`OneForOneBlockFetcher` initializes a `blockIds` String array to track chunk 
fetch successes; it uses the index into `blockIds` to fetch blocks and to match 
the blockId when chunk data is returned. So the order of `blockIds` must be 
consistent with the fetchChunk index, but the new `FetchShuffleBlocks` returns 
chunks in an order that differs from `blockIds`.

As a result, the returned data does not match the blockId, which can lead to 
data correctness issues when retrying a fetch after a block chunk fetch fails.

The chunk fetch order code, and the code that matches the blockId when data is 
returned, are as follows:

!image-2021-02-25-11-27-34-429.png|width=446,height=251!!image-2021-02-25-11-28-31-255.png|width=445,height=159!

However, the fetch order in the shuffle service is:

!image-2021-02-25-11-30-03-834.png|width=510,height=361!

So some wrong block data is fetched when a chunk fetch fails, because of the 
blocks' wrong order.

!image-2021-02-25-11-31-59-110.png|width=601,height=204!

 

 

  was:
We will build a new rpc message `FetchShuffleBlocks` when 
`OneForOneBlockFetcher` init in replace of

`OpenBlocks` to use adaptive feature, this introduce additional problems as 
follows.

`OneForOneBlockFetcher` will init a `blockIds` String array to catch chunk 
fetch success, it will use index in `blockIds` to fetch blocks and match 
blockId in `blockIds` when chunk data return. So the `blockIds` 's order must 
be consistent with fetchChunk index, but the new `FetchShuffleBlocks` return 
chunk order is not same as `blockIds`.

This will lead to the return data not match the blockId,  and this can lead to 
data corretness when retry to fetch after fetch block chunk failed.

Fetch chunk orker code and match blockId when rerun data code as follows: 

!image-2021-02-25-11-27-34-429.png|width=446,height=251!!image-2021-02-25-11-28-31-255.png|width=445,height=159!

Howerver, the fetch order in shuffle service,

!image-2021-02-25-11-30-03-834.png|width=510,height=361!

So, it will fetch some wrong block data when chunk fetch failed beause the 
blocks's wrong order.

!image-2021-02-25-11-31-59-110.png|width=601,height=204!

 

 


> New protocol FetchShuffleBlocks in OneForOneBlockFetcher lead to data loss or 
> correctness
> -
>
> Key: SPARK-34534
> URL: https://issues.apache.org/jira/browse/SPARK-34534
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.0.0, 3.0.1, 3.0.2
>Reporter: haiyangyu
>Priority: Major
>  Labels: Correctness, data-loss
> Attachments: image-2021-02-25-11-17-12-714.png, 
> image-2021-02-25-11-27-34-429.png, image-2021-02-25-11-28-31-255.png, 
> image-2021-02-25-11-30-03-834.png, image-2021-02-25-11-31-59-110.png
>
>
> A new RPC message `FetchShuffleBlocks` is built when `OneForOneBlockFetcher` 
> is initialized, in place of `OpenBlocks`, to use the adaptive feature. This 
> introduces the following problem.
> `OneForOneBlockFetcher` initializes a `blockIds` String array to track chunk 
> fetch successes; it uses the index into `blockIds` to fetch blocks and to 
> match the blockId when chunk data is returned. So the order of `blockIds` must 
> be consistent with the fetchChunk index, but the new `FetchShuffleBlocks` 
> returns chunks in an order that differs from `blockIds`.
> As a result, the returned data does not match the blockId, which can lead to 
> data correctness issues when retrying a fetch after a block chunk fetch fails.
> The chunk fetch order code, and the code that matches the blockId when data 
> is returned, are as follows:
> !image-2021-02-25-11-27-34-429.png|width=446,height=251!!image-2021-02-25-11-28-31-255.png|width=445,height=159!
> However, the fetch order in the shuffle service is:
> !image-2021-02-25-11-30-03-834.png|width=510,height=361!
> So some wrong block data is fetched when a chunk fetch fails, because of the 
> blocks' wrong order.
> !image-2021-02-25-11-31-59-110.png|width=601,height=204!
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34534) New protocol FetchShuffleBlocks in OneForOneBlockFetcher lead to data loss or correctness

2021-02-24 Thread haiyangyu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haiyangyu updated SPARK-34534:
--
Description: 
A new RPC message `FetchShuffleBlocks` is built when `OneForOneBlockFetcher` is 
initialized, in place of `OpenBlocks`, to use the adaptive feature. This 
introduces the following problem.

`OneForOneBlockFetcher` initializes a `blockIds` String array to track chunk 
fetch successes; it uses the index into `blockIds` to fetch blocks and to match 
the blockId when chunk data is returned. So the order of `blockIds` must be 
consistent with the fetchChunk index, but the new `FetchShuffleBlocks` returns 
chunks in an order that differs from `blockIds`.

As a result, the returned data does not match the blockId, which can lead to 
data correctness issues when retrying a fetch after a block chunk fetch fails.

The chunk fetch order code, and the code that matches the blockId when data is 
returned, are as follows:

!image-2021-02-25-11-27-34-429.png|width=446,height=251!!image-2021-02-25-11-28-31-255.png|width=445,height=159!

However, the fetch order in the shuffle service is:

!image-2021-02-25-11-30-03-834.png|width=510,height=361!

So some wrong block data is fetched when a chunk fetch fails, because of the 
blocks' wrong order.

!image-2021-02-25-11-31-59-110.png|width=601,height=204!

 

 

  was:
We will build a new rpc message `FetchShuffleBlocks` when 
`OneForOneBlockFetcher` init in replace of

`OpenBlocks` to use adaptive feature, this introduce additional problems as 
follows.

`OneForOneBlockFetcher` will init a `blockIds` String array to catch chunk 
fetch success, it will use index in `blockIds` to fetch blocks and match 
blockId in `blockIds` when chunk data return. So the `blockIds` 's order must 
be consistent with fetchChunk index.

!image-2021-02-25-11-17-12-714.png|width=875,height=502!


> New protocol FetchShuffleBlocks in OneForOneBlockFetcher lead to data loss or 
> correctness
> -
>
> Key: SPARK-34534
> URL: https://issues.apache.org/jira/browse/SPARK-34534
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.0.0, 3.0.1, 3.0.2
>Reporter: haiyangyu
>Priority: Major
>  Labels: Correctness, data-loss
> Attachments: image-2021-02-25-11-17-12-714.png, 
> image-2021-02-25-11-27-34-429.png, image-2021-02-25-11-28-31-255.png, 
> image-2021-02-25-11-30-03-834.png, image-2021-02-25-11-31-59-110.png
>
>
> A new RPC message `FetchShuffleBlocks` is built when `OneForOneBlockFetcher` 
> is initialized, in place of `OpenBlocks`, to use the adaptive feature. This 
> introduces the following problem.
> `OneForOneBlockFetcher` initializes a `blockIds` String array to track chunk 
> fetch successes; it uses the index into `blockIds` to fetch blocks and to 
> match the blockId when chunk data is returned. So the order of `blockIds` must 
> be consistent with the fetchChunk index, but the new `FetchShuffleBlocks` 
> returns chunks in an order that differs from `blockIds`.
> As a result, the returned data does not match the blockId, which can lead to 
> data correctness issues when retrying a fetch after a block chunk fetch fails.
> The chunk fetch order code, and the code that matches the blockId when data 
> is returned, are as follows:
> !image-2021-02-25-11-27-34-429.png|width=446,height=251!!image-2021-02-25-11-28-31-255.png|width=445,height=159!
> However, the fetch order in the shuffle service is:
> !image-2021-02-25-11-30-03-834.png|width=510,height=361!
> So some wrong block data is fetched when a chunk fetch fails, because of the 
> blocks' wrong order.
> !image-2021-02-25-11-31-59-110.png|width=601,height=204!
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34534) New protocol FetchShuffleBlocks in OneForOneBlockFetcher lead to data loss or correctness

2021-02-24 Thread haiyangyu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haiyangyu updated SPARK-34534:
--
Attachment: image-2021-02-25-11-31-59-110.png

> New protocol FetchShuffleBlocks in OneForOneBlockFetcher lead to data loss or 
> correctness
> -
>
> Key: SPARK-34534
> URL: https://issues.apache.org/jira/browse/SPARK-34534
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.0.0, 3.0.1, 3.0.2
>Reporter: haiyangyu
>Priority: Major
>  Labels: Correctness, data-loss
> Attachments: image-2021-02-25-11-17-12-714.png, 
> image-2021-02-25-11-27-34-429.png, image-2021-02-25-11-28-31-255.png, 
> image-2021-02-25-11-30-03-834.png, image-2021-02-25-11-31-59-110.png
>
>
> A new RPC message `FetchShuffleBlocks` is built when `OneForOneBlockFetcher` 
> is initialized, in place of `OpenBlocks`, to use the adaptive feature. This 
> introduces the following problem.
> `OneForOneBlockFetcher` initializes a `blockIds` String array to track chunk 
> fetch successes; it uses the index into `blockIds` to fetch blocks and to 
> match the blockId when chunk data is returned. So the order of `blockIds` must 
> be consistent with the fetchChunk index.
> !image-2021-02-25-11-17-12-714.png|width=875,height=502!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34534) New protocol FetchShuffleBlocks in OneForOneBlockFetcher lead to data loss or correctness

2021-02-24 Thread haiyangyu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haiyangyu updated SPARK-34534:
--
Attachment: image-2021-02-25-11-30-03-834.png

> New protocol FetchShuffleBlocks in OneForOneBlockFetcher lead to data loss or 
> correctness
> -
>
> Key: SPARK-34534
> URL: https://issues.apache.org/jira/browse/SPARK-34534
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.0.0, 3.0.1, 3.0.2
>Reporter: haiyangyu
>Priority: Major
>  Labels: Correctness, data-loss
> Attachments: image-2021-02-25-11-17-12-714.png, 
> image-2021-02-25-11-27-34-429.png, image-2021-02-25-11-28-31-255.png, 
> image-2021-02-25-11-30-03-834.png
>
>
> A new RPC message `FetchShuffleBlocks` is built when `OneForOneBlockFetcher` 
> is initialized, in place of `OpenBlocks`, to use the adaptive feature. This 
> introduces the following problem.
> `OneForOneBlockFetcher` initializes a `blockIds` String array to track chunk 
> fetch successes; it uses the index into `blockIds` to fetch blocks and to 
> match the blockId when chunk data is returned. So the order of `blockIds` must 
> be consistent with the fetchChunk index.
> !image-2021-02-25-11-17-12-714.png|width=875,height=502!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile

2021-02-24 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290650#comment-17290650
 ] 

Dongjoon Hyun commented on SPARK-33212:
---

Thanks!

> Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
> -
>
> Key: SPARK-33212
> URL: https://issues.apache.org/jira/browse/SPARK-33212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit, SQL, YARN
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.2.0
>
>
> Hadoop 3.x+ offers shaded client jars: hadoop-client-api and 
> hadoop-client-runtime, which shade 3rd-party dependencies such as Guava, 
> protobuf, jetty etc. This Jira switches Spark to use these jars instead of 
> hadoop-common, hadoop-client etc. Benefits include:
>  * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer 
> versions of Hadoop have migrated to Guava 27.0+, and in order to resolve Guava 
> conflicts, Spark depends on Hadoop not leaking these dependencies.
>  * It makes the Spark/Hadoop dependency graph cleaner. Currently Spark uses both 
> client-side and server-side Hadoop APIs from modules such as hadoop-common, 
> hadoop-yarn-server-common etc. Moving to hadoop-client-api allows us to use only 
> the public/client API from the Hadoop side.
>  * It provides better isolation from Hadoop dependencies. In the future, Spark can 
> evolve without worrying about dependencies pulled in from the Hadoop side 
> (which used to be a lot).
> *There are some behavior changes introduced with this JIRA when people use 
> Spark compiled with Hadoop 3.x:*
> - Users now need to make sure the class path contains the `hadoop-client-api` and 
> `hadoop-client-runtime` jars when they deploy Spark with the 
> `hadoop-provided` option. In addition, it is highly recommended that they put 
> these two jars before other Hadoop jars in the class path. Otherwise, 
> conflicts such as those from Guava could happen if classes are loaded from the 
> other, non-shaded Hadoop jars.
> - Since the new shaded Hadoop clients no longer include 3rd-party 
> dependencies, users who used to depend on these now need to explicitly put 
> the jars on their class path.
> Ideally the above should go into the release notes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34534) New protocol FetchShuffleBlocks in OneForOneBlockFetcher lead to data loss or correctness

2021-02-24 Thread haiyangyu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haiyangyu updated SPARK-34534:
--
Attachment: image-2021-02-25-11-28-31-255.png

> New protocol FetchShuffleBlocks in OneForOneBlockFetcher lead to data loss or 
> correctness
> -
>
> Key: SPARK-34534
> URL: https://issues.apache.org/jira/browse/SPARK-34534
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.0.0, 3.0.1, 3.0.2
>Reporter: haiyangyu
>Priority: Major
>  Labels: Correctness, data-loss
> Attachments: image-2021-02-25-11-17-12-714.png, 
> image-2021-02-25-11-27-34-429.png, image-2021-02-25-11-28-31-255.png
>
>
> A new RPC message `FetchShuffleBlocks` is built when `OneForOneBlockFetcher` 
> is initialized, in place of `OpenBlocks`, to use the adaptive feature. This 
> introduces the following problem.
> `OneForOneBlockFetcher` initializes a `blockIds` String array to track chunk 
> fetch successes; it uses the index into `blockIds` to fetch blocks and to 
> match the blockId when chunk data is returned. So the order of `blockIds` must 
> be consistent with the fetchChunk index.
> !image-2021-02-25-11-17-12-714.png|width=875,height=502!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34534) New protocol FetchShuffleBlocks in OneForOneBlockFetcher lead to data loss or correctness

2021-02-24 Thread haiyangyu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haiyangyu updated SPARK-34534:
--
Attachment: image-2021-02-25-11-27-34-429.png

> New protocol FetchShuffleBlocks in OneForOneBlockFetcher lead to data loss or 
> correctness
> -
>
> Key: SPARK-34534
> URL: https://issues.apache.org/jira/browse/SPARK-34534
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.0.0, 3.0.1, 3.0.2
>Reporter: haiyangyu
>Priority: Major
>  Labels: Correctness, data-loss
> Attachments: image-2021-02-25-11-17-12-714.png, 
> image-2021-02-25-11-27-34-429.png
>
>
> A new RPC message `FetchShuffleBlocks` is built when `OneForOneBlockFetcher` 
> is initialized, in place of `OpenBlocks`, to use the adaptive feature. This 
> introduces the following problem.
> `OneForOneBlockFetcher` initializes a `blockIds` String array to track chunk 
> fetch successes; it uses the index into `blockIds` to fetch blocks and to 
> match the blockId when chunk data is returned. So the order of `blockIds` must 
> be consistent with the fetchChunk index.
> !image-2021-02-25-11-17-12-714.png|width=875,height=502!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34532) IntervalUtils.add() may result in 'long overflow'

2021-02-24 Thread Ted Yu (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290646#comment-17290646
 ] 

Ted Yu commented on SPARK-34532:


Included the test command and some more information in the description.

You should see these errors when you run the command.

> IntervalUtils.add() may result in 'long overflow'
> -
>
> Key: SPARK-34532
> URL: https://issues.apache.org/jira/browse/SPARK-34532
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.2
>Reporter: Ted Yu
>Priority: Major
>
> I noticed the following when running the test suite:
> build/sbt "sql/testOnly *SQLQueryTestSuite"
> {code}
> 19:10:17.977 ERROR org.apache.spark.scheduler.TaskSetManager: Task 1 in stage 
> 6416.0 failed 1 times; aborting job
> [info] - postgreSQL/int4.sql (2 seconds, 543 milliseconds)
> 19:10:20.994 ERROR org.apache.spark.executor.Executor: Exception in task 1.0 
> in stage 6476.0 (TID 7789)
> java.lang.ArithmeticException: long overflow
> at java.lang.Math.multiplyExact(Math.java:892)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
> at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
> at 
> org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
> {code}
> {code}
> 19:15:38.255 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 
> in stage 14744.0 (TID 16705)
> java.lang.ArithmeticException: long overflow
> at java.lang.Math.addExact(Math.java:809)
> at org.apache.spark.sql.types.LongExactNumeric$.plus(numerics.scala:105)
> at org.apache.spark.sql.types.LongExactNumeric$.plus(numerics.scala:104)
> at 
> org.apache.spark.sql.catalyst.expressions.Add.nullSafeEval(arithmetic.scala:268)
> at 
> org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:573)
> at 
> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(InterpretedMutableProjection.scala:97)
> {code}
> This likely was caused by the following line:
> {code}
> val microseconds = left.microseconds + right.microseconds
> {code}
> We should check whether the addition would produce overflow before adding.
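
A hedged Scala sketch of the kind of check suggested above (the `Interval` case 
class below merely mirrors the months/days/microseconds fields of Spark's 
CalendarInterval; this is not the actual patch): either test the range 
explicitly before adding, or use `Math.addExact` so the overflow surfaces as a 
clear error instead of silently wrapping.

{code:scala}
final case class Interval(months: Int, days: Int, microseconds: Long)

object IntervalAddSketch {
  // Explicit pre-check: would a + b overflow a Long?
  def wouldOverflowOnAdd(a: Long, b: Long): Boolean =
    (b > 0 && a > Long.MaxValue - b) || (b < 0 && a < Long.MinValue - b)

  def add(left: Interval, right: Interval): Interval = {
    require(!wouldOverflowOnAdd(left.microseconds, right.microseconds),
      "interval microseconds overflow")
    // Math.addExact throws ArithmeticException instead of silently wrapping.
    Interval(
      Math.addExact(left.months, right.months),
      Math.addExact(left.days, right.days),
      left.microseconds + right.microseconds) // already range-checked above
  }

  def main(args: Array[String]): Unit = {
    println(add(Interval(0, 0, 1L), Interval(0, 0, 2L)))
    try add(Interval(0, 0, Long.MaxValue), Interval(0, 0, 1L))
    catch { case e: IllegalArgumentException => println(s"rejected: ${e.getMessage}") }
  }
}
{code}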



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34532) IntervalUtils.add() may result in 'long overflow'

2021-02-24 Thread Ted Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ted Yu updated SPARK-34532:
---
Description: 
I noticed the following when running the test suite:

build/sbt "sql/testOnly *SQLQueryTestSuite"
{code}
19:10:17.977 ERROR org.apache.spark.scheduler.TaskSetManager: Task 1 in stage 
6416.0 failed 1 times; aborting job
[info] - postgreSQL/int4.sql (2 seconds, 543 milliseconds)
19:10:20.994 ERROR org.apache.spark.executor.Executor: Exception in task 1.0 in 
stage 6476.0 (TID 7789)
java.lang.ArithmeticException: long overflow
at java.lang.Math.multiplyExact(Math.java:892)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown
 Source)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
at 
org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
{code}
{code}
19:15:38.255 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in 
stage 14744.0 (TID 16705)
java.lang.ArithmeticException: long overflow
at java.lang.Math.addExact(Math.java:809)
at org.apache.spark.sql.types.LongExactNumeric$.plus(numerics.scala:105)
at org.apache.spark.sql.types.LongExactNumeric$.plus(numerics.scala:104)
at 
org.apache.spark.sql.catalyst.expressions.Add.nullSafeEval(arithmetic.scala:268)
at 
org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:573)
at 
org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(InterpretedMutableProjection.scala:97)
{code}
This likely was caused by the following line:
{code}
val microseconds = left.microseconds + right.microseconds
{code}
We should check whether the addition would produce overflow before adding.

  was:
I noticed the following when running test suite:
{code}
19:15:38.255 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in 
stage 14744.0 (TID 16705)
java.lang.ArithmeticException: long overflow
at java.lang.Math.addExact(Math.java:809)
at org.apache.spark.sql.types.LongExactNumeric$.plus(numerics.scala:105)
at org.apache.spark.sql.types.LongExactNumeric$.plus(numerics.scala:104)
at 
org.apache.spark.sql.catalyst.expressions.Add.nullSafeEval(arithmetic.scala:268)
at 
org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:573)
at 
org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(InterpretedMutableProjection.scala:97)
{code}
This likely was caused by the following line:
{code}
val microseconds = left.microseconds + right.microseconds
{code}
We should check whether the addition would produce overflow before adding.


> IntervalUtils.add() may result in 'long overflow'
> -
>
> Key: SPARK-34532
> URL: https://issues.apache.org/jira/browse/SPARK-34532
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.2
>Reporter: Ted Yu
>Priority: Major
>
> I noticed the following when running the test suite:
> build/sbt "sql/testOnly *SQLQueryTestSuite"
> {code}
> 19:10:17.977 ERROR org.apache.spark.scheduler.TaskSetManager: Task 1 in stage 
> 6416.0 failed 1 times; aborting job
> [info] - postgreSQL/int4.sql (2 seconds, 543 milliseconds)
> 19:10:20.994 ERROR org.apache.spark.executor.Executor: Exception in task 1.0 
> in stage 6476.0 (TID 7789)
> java.lang.ArithmeticException: long overflow
> at java.lang.Math.multiplyExact(Math.java:892)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown
>  Source)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
> at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
> at 
> 

[jira] [Updated] (SPARK-34534) New protocol FetchShuffleBlocks in OneForOneBlockFetcher lead to data loss or correctness

2021-02-24 Thread haiyangyu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haiyangyu updated SPARK-34534:
--
Description: 
A new RPC message `FetchShuffleBlocks` is built when `OneForOneBlockFetcher` is 
initialized, in place of `OpenBlocks`, to use the adaptive feature. This 
introduces the following problem.

`OneForOneBlockFetcher` initializes a `blockIds` String array to track chunk 
fetch successes; it uses the index into `blockIds` to fetch blocks and to match 
the blockId when chunk data is returned. So the order of `blockIds` must be 
consistent with the fetchChunk index.

!image-2021-02-25-11-17-12-714.png|width=875,height=502!

  was:
We will build a new rpc message 
{code:java}
FetchShuffleBlocks{code}
when
{code:java}
OneForOneBlockFetcher{code}
init in replace of
{code:java}
OpenBlocks{code}
to use adaptive feature


> New protocol FetchShuffleBlocks in OneForOneBlockFetcher lead to data loss or 
> correctness
> -
>
> Key: SPARK-34534
> URL: https://issues.apache.org/jira/browse/SPARK-34534
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.0.0, 3.0.1, 3.0.2
>Reporter: haiyangyu
>Priority: Major
>  Labels: Correctness, data-loss
> Attachments: image-2021-02-25-11-17-12-714.png
>
>
> A new RPC message `FetchShuffleBlocks` is built when `OneForOneBlockFetcher` 
> is initialized, in place of `OpenBlocks`, to use the adaptive feature. This 
> introduces the following problem.
> `OneForOneBlockFetcher` initializes a `blockIds` String array to track chunk 
> fetch successes; it uses the index into `blockIds` to fetch blocks and to 
> match the blockId when chunk data is returned. So the order of `blockIds` must 
> be consistent with the fetchChunk index.
> !image-2021-02-25-11-17-12-714.png|width=875,height=502!



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34534) New protocol FetchShuffleBlocks in OneForOneBlockFetcher lead to data loss or correctness

2021-02-24 Thread haiyangyu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haiyangyu updated SPARK-34534:
--
Attachment: image-2021-02-25-11-17-12-714.png

> New protocol FetchShuffleBlocks in OneForOneBlockFetcher lead to data loss or 
> correctness
> -
>
> Key: SPARK-34534
> URL: https://issues.apache.org/jira/browse/SPARK-34534
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.0.0, 3.0.1, 3.0.2
>Reporter: haiyangyu
>Priority: Major
>  Labels: Correctness, data-loss
> Attachments: image-2021-02-25-11-17-12-714.png
>
>
> We will build a new rpc message 
> {code:java}
> FetchShuffleBlocks{code}
> when
> {code:java}
> OneForOneBlockFetcher{code}
> init in replace of
> {code:java}
> OpenBlocks{code}
> to use adaptive feature



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34448) Binary logistic regression incorrectly computes the intercept and coefficients when data is not centered

2021-02-24 Thread zhengruifeng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290642#comment-17290642
 ] 

zhengruifeng commented on SPARK-34448:
--

[~srowen] Thanks for pinging me, I am going to look into this issue

> Binary logistic regression incorrectly computes the intercept and 
> coefficients when data is not centered
> 
>
> Key: SPARK-34448
> URL: https://issues.apache.org/jira/browse/SPARK-34448
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Yakov Kerzhner
>Priority: Major
>  Labels: correctness
>
> I have written up a fairly detailed gist that includes code to reproduce the 
> bug, as well as the output of the code and some commentary:
> [https://gist.github.com/ykerzhner/51358780a6a4cc33266515f17bf98a96]
> To summarize: under certain conditions, the minimization that fits a binary 
> logistic regression contains a bug that pulls the intercept value towards the 
> log(odds) of the target data.  This is mathematically only correct when the 
> data comes from distributions with zero means.  In general, this gives 
> incorrect intercept values, and consequently incorrect coefficients as well.
> As I am not so familiar with the spark code base, I have not been able to 
> find this bug within the spark code itself.  A hint to this bug is here: 
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L894-L904]
> based on the code, I don't believe that the features have zero means at this 
> point, and so this heuristic is incorrect.  But an incorrect starting point 
> does not explain this bug.  The minimizer should drift to the correct place.  
> I was not able to find the code of the actual objective function that is 
> being minimized.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34534) New protocol FetchShuffleBlocks in OneForOneBlockFetcher lead to data loss or correctness

2021-02-24 Thread haiyangyu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haiyangyu updated SPARK-34534:
--
Description: 
We will build a new rpc message 
{code:java}
FetchShuffleBlocks{code}
when
{code:java}
OneForOneBlockFetcher{code}
init in replace of
{code:java}
OpenBlocks{code}
to use adaptive feature

> New protocol FetchShuffleBlocks in OneForOneBlockFetcher lead to data loss or 
> correctness
> -
>
> Key: SPARK-34534
> URL: https://issues.apache.org/jira/browse/SPARK-34534
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.0.0, 3.0.1, 3.0.2
>Reporter: haiyangyu
>Priority: Major
>  Labels: Correctness, data-loss
>
> We will build a new rpc message 
> {code:java}
> FetchShuffleBlocks{code}
> when
> {code:java}
> OneForOneBlockFetcher{code}
> init in replace of
> {code:java}
> OpenBlocks{code}
> to use adaptive feature



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34534) New protocol FetchShuffleBlocks in OneForOneBlockFetcher lead to data loss or correctness

2021-02-24 Thread haiyangyu (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haiyangyu updated SPARK-34534:
--
Summary: New protocol FetchShuffleBlocks in OneForOneBlockFetcher lead to 
data loss or correctness  (was: FetchShuffleBlocks in OneForOneBlockFetcher 
lead to data loss or correctness)

> New protocol FetchShuffleBlocks in OneForOneBlockFetcher lead to data loss or 
> correctness
> -
>
> Key: SPARK-34534
> URL: https://issues.apache.org/jira/browse/SPARK-34534
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 3.0.0, 3.0.1, 3.0.2
>Reporter: haiyangyu
>Priority: Major
>  Labels: Correctness, data-loss
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34534) FetchShuffleBlocks in OneForOneBlockFetcher lead to data loss or correctness

2021-02-24 Thread haiyangyu (Jira)
haiyangyu created SPARK-34534:
-

 Summary: FetchShuffleBlocks in OneForOneBlockFetcher lead to data 
loss or correctness
 Key: SPARK-34534
 URL: https://issues.apache.org/jira/browse/SPARK-34534
 Project: Spark
  Issue Type: Bug
  Components: Shuffle
Affects Versions: 3.0.2, 3.0.1, 3.0.0
Reporter: haiyangyu






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34530) logError for interrupting block migrations is too high

2021-02-24 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290633#comment-17290633
 ] 

Yang Jie commented on SPARK-34530:
--

[~holden] Can you add some descriptions?

> logError for interrupting block migrations is too high
> --
>
> Key: SPARK-34530
> URL: https://issues.apache.org/jira/browse/SPARK-34530
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.0, 3.2.0, 3.1.1
>Reporter: Holden Karau
>Priority: Trivial
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34532) IntervalUtils.add() may result in 'long overflow'

2021-02-24 Thread Yang Jie (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290632#comment-17290632
 ] 

Yang Jie commented on SPARK-34532:
--

Which case has this problem?

> IntervalUtils.add() may result in 'long overflow'
> -
>
> Key: SPARK-34532
> URL: https://issues.apache.org/jira/browse/SPARK-34532
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.2
>Reporter: Ted Yu
>Priority: Major
>
> I noticed the following when running test suite:
> {code}
> 19:15:38.255 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 
> in stage 14744.0 (TID 16705)
> java.lang.ArithmeticException: long overflow
> at java.lang.Math.addExact(Math.java:809)
> at org.apache.spark.sql.types.LongExactNumeric$.plus(numerics.scala:105)
> at org.apache.spark.sql.types.LongExactNumeric$.plus(numerics.scala:104)
> at 
> org.apache.spark.sql.catalyst.expressions.Add.nullSafeEval(arithmetic.scala:268)
> at 
> org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:573)
> at 
> org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(InterpretedMutableProjection.scala:97)
> {code}
> This likely was caused by the following line:
> {code}
> val microseconds = left.microseconds + right.microseconds
> {code}
> We should check whether the addition would produce overflow before adding.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-34523) JDK-8194653: Deadlock involving FileSystems.getDefault and System.loadLibrary call

2021-02-24 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290631#comment-17290631
 ] 

Kent Yao edited comment on SPARK-34523 at 2/25/21, 2:56 AM:


Hi [~dongjoon], thanks for your suggestions. When the problem lies in the JDK, the 
solution is often simply to upgrade the JDK and be done with it. But I suspect 
the hardest part for users is collecting clues and tracing them back to the 
corresponding JDK problem. A documentation PR is a good choice, and the detailed 
JIRA also helps.

BTW,

> Java 8 prior to version 8u92 support is deprecated as of Spark 3.0.0.  

This statement is specific to a Spark version and too brief to draw much 
attention from users.




was (Author: qin yao):

Hi [~dongjoon], thanks for your suggestions.  When the problem goes to JDK, the 
solution is often to simply upgrade the JDK and be done with it. But I guess 
the hardest part for users may be to collect clues and find the corresponding 
problem. A documentation PR is a good choice and the detailed JIRA also helps.

> Java 8 prior to version 8u92 support is deprecated as of Spark 3.0.0.  

This statement is spark-version specific and too brief to get much users' 
attention.



> JDK-8194653:  Deadlock involving FileSystems.getDefault and 
> System.loadLibrary call
> ---
>
> Key: SPARK-34523
> URL: https://issues.apache.org/jira/browse/SPARK-34523
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.7, 3.0.2, 3.1.1
>Reporter: Kent Yao
>Priority: Major
> Attachments: 4303.log, screenshot-1.png, screenshot-2.png
>
>
> h2. Instruction
> This will cause deadlock and hangs concurrent tasks forever on the same 
> executor. for example,
>  
> In the Spark UI stage tab, you may find some of the tasks hang for hours and 
> all others complete without delay.
>  
> !screenshot-2.png! 
> Also, you may find that these hanging tasks belong to the same executors.
> Usually, in this case, you will also get nothing helpful from the executor 
> log.
> If you print the executor jstack or you check the ThreadDump via SparkUI 
> executor tab and you find some task thread blocked like below, you are very 
> likely to hit the JDK-8194653 issue.
> !screenshot-1.png! 
> h2. Solutions
> Here are some options to circumvent this problem:
> 1. For the cluster managers side, you can update the JDK version according to 
> https://bugs.openjdk.java.net/browse/JDK-8194653
> 2. If you are not able to update the JDK version for the cluster entirely, 
> you can use `spark.executorEnv.JAVA_HOME` to specify a suitable JRE for your 
> apps
> 2. Also, turn on `spark.speculation` may let spark automatically re-run the 
> hanging tasks and bypass the problem



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34523) JDK-8194653: Deadlock involving FileSystems.getDefault and System.loadLibrary call

2021-02-24 Thread Kent Yao (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290631#comment-17290631
 ] 

Kent Yao commented on SPARK-34523:
--


Hi [~dongjoon], thanks for your suggestions. When the problem lies in the JDK, the 
solution is often simply to upgrade the JDK and be done with it. But I suspect 
the hardest part for users is collecting clues and tracing them back to the 
corresponding JDK problem. A documentation PR is a good choice, and the detailed 
JIRA also helps.

> Java 8 prior to version 8u92 support is deprecated as of Spark 3.0.0.  

This statement is specific to a Spark version and too brief to draw much 
attention from users.



> JDK-8194653:  Deadlock involving FileSystems.getDefault and 
> System.loadLibrary call
> ---
>
> Key: SPARK-34523
> URL: https://issues.apache.org/jira/browse/SPARK-34523
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.7, 3.0.2, 3.1.1
>Reporter: Kent Yao
>Priority: Major
> Attachments: 4303.log, screenshot-1.png, screenshot-2.png
>
>
> h2. Instruction
> This will cause deadlock and hangs concurrent tasks forever on the same 
> executor. for example,
>  
> In the Spark UI stage tab, you may find some of the tasks hang for hours and 
> all others complete without delay.
>  
> !screenshot-2.png! 
> Also, you may find that these hanging tasks belong to the same executors.
> Usually, in this case, you will also get nothing helpful from the executor 
> log.
> If you print the executor jstack or you check the ThreadDump via SparkUI 
> executor tab and you find some task thread blocked like below, you are very 
> likely to hit the JDK-8194653 issue.
> !screenshot-1.png! 
> h2. Solutions
> Here are some options to circumvent this problem:
> 1. For the cluster managers side, you can update the JDK version according to 
> https://bugs.openjdk.java.net/browse/JDK-8194653
> 2. If you are not able to update the JDK version for the cluster entirely, 
> you can use `spark.executorEnv.JAVA_HOME` to specify a suitable JRE for your 
> apps
> 2. Also, turn on `spark.speculation` may let spark automatically re-run the 
> hanging tasks and bypass the problem



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34533) Eliminate LEFT ANTI join to empty relation in AQE

2021-02-24 Thread Cheng Su (Jira)
Cheng Su created SPARK-34533:


 Summary: Eliminate LEFT ANTI join to empty relation in AQE
 Key: SPARK-34533
 URL: https://issues.apache.org/jira/browse/SPARK-34533
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Cheng Su


I discovered from review discussion - 
[https://github.com/apache/spark/pull/31630#discussion_r581774000] , that we 
can eliminate LEFT ANTI join (with no join condition) to empty relation, if the 
right side is known to be non-empty. So with AQE, this is doable similar to 
[https://github.com/apache/spark/pull/29484] .



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile

2021-02-24 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290613#comment-17290613
 ] 

Chao Sun edited comment on SPARK-33212 at 2/25/21, 2:21 AM:


I was able to reproduce the error in my local environment and found a potential 
fix in Spark. I think only {{hadoop-yarn-server-web-proxy}} is needed by Spark 
- all the other YARN jars are already covered by {{hadoop-client-api}} and 
{{hadoop-client-runtime}}. I'll open a PR for this soon.
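
As a rough illustration of the dependency shape under discussion, a hypothetical sbt sketch (the version string is a placeholder and this is not the actual PR):
{code}
// Hypothetical sbt sketch: the shaded client jars plus the single YARN server
// module being discussed; "3.2.2" is a placeholder Hadoop version.
libraryDependencies ++= Seq(
  "org.apache.hadoop" % "hadoop-client-api"            % "3.2.2",
  "org.apache.hadoop" % "hadoop-client-runtime"        % "3.2.2",
  "org.apache.hadoop" % "hadoop-yarn-server-web-proxy" % "3.2.2"
)
{code}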


was (Author: csun):
I was able to reproduce the error in my local environment, and find a potential 
fix in Spark. I think {{hadoop-yarn-server-web-proxy}} is needed by Spark - all 
the other YARN jars are already covered by {{hadoop-client-api}} and 
{{hadoop-client-runtime}}. I'll open a PR for this soon.

> Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
> -
>
> Key: SPARK-33212
> URL: https://issues.apache.org/jira/browse/SPARK-33212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit, SQL, YARN
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.2.0
>
>
> Hadoop 3.x+ offers shaded client jars: hadoop-client-api and 
> hadoop-client-runtime, which shade 3rd party dependencies such as Guava, 
> protobuf, jetty etc. This Jira switches Spark to use these jars instead of 
> hadoop-common, hadoop-client etc. Benefits include:
>  * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer 
> versions of Hadoop have migrated to Guava 27.0+ and, in order to resolve Guava 
> conflicts, Spark depends on Hadoop not leaking dependencies.
>  * It makes Spark/Hadoop dependency cleaner. Currently Spark uses both 
> client-side and server-side Hadoop APIs from modules such as hadoop-common, 
> hadoop-yarn-server-common etc. Moving to hadoop-client-api allows us to use 
> only the public/client API from the Hadoop side.
>  * Provides a better isolation from Hadoop dependencies. In future Spark can 
> better evolve without worrying about dependencies pulled from Hadoop side 
> (which used to be a lot).
> *There are some behavior changes introduced with this JIRA, when people use 
> Spark compiled with Hadoop 3.x:*
> - Users now need to make sure class path contains `hadoop-client-api` and 
> `hadoop-client-runtime` jars when they deploy Spark with the 
> `hadoop-provided` option. In addition, it is highly recommended that they put 
> these two jars before other Hadoop jars in the class path. Otherwise, 
> conflicts such as from Guava could happen if classes are loaded from the 
> other non-shaded Hadoop jars.
> - Since the new shaded Hadoop clients no longer include 3rd party 
> dependencies, users who used to depend on these now need to explicitly put 
> the jars in their class path.
> Ideally the above should go to release notes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34497) JDBC connection provider is not removing kerberos credentials from JVM security context

2021-02-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34497:
--
Affects Version/s: (was: 3.1.2)
   3.1.1

> JDBC connection provider is not removing kerberos credentials from JVM 
> security context
> ---
>
> Key: SPARK-34497
> URL: https://issues.apache.org/jira/browse/SPARK-34497
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.2.0, 3.1.1
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Major
> Fix For: 3.1.2
>
>
> Some of the built-in JDBC connection providers are changing the JVM security 
> context to do the authentication which is fine. The problematic part is that 
> executors can be reused by another query. The following situation leads to 
> incorrect behaviour:
>  * Query1 opens JDBC connection and changes JVM security context in Executor1
>  * Query2 tries to open JDBC connection but it realizes there is already an 
> entry for that DB type in Executor1
>  * Query2 is not changing JVM security context and uses Query1 keytab and 
> principal
>  * Query2 fails with authentication error



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile

2021-02-24 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290613#comment-17290613
 ] 

Chao Sun commented on SPARK-33212:
--

I was able to reproduce the error in my local environment and found a potential 
fix in Spark. I think {{hadoop-yarn-server-web-proxy}} is needed by Spark - all 
the other YARN jars are already covered by {{hadoop-client-api}} and 
{{hadoop-client-runtime}}. I'll open a PR for this soon.

> Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
> -
>
> Key: SPARK-33212
> URL: https://issues.apache.org/jira/browse/SPARK-33212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit, SQL, YARN
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.2.0
>
>
> Hadoop 3.x+ offers shaded client jars: hadoop-client-api and 
> hadoop-client-runtime, which shade 3rd party dependencies such as Guava, 
> protobuf, jetty etc. This Jira switches Spark to use these jars instead of 
> hadoop-common, hadoop-client etc. Benefits include:
>  * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer 
> versions of Hadoop have migrated to Guava 27.0+ and, in order to resolve Guava 
> conflicts, Spark depends on Hadoop not leaking dependencies.
>  * It makes Spark/Hadoop dependency cleaner. Currently Spark uses both 
> client-side and server-side Hadoop APIs from modules such as hadoop-common, 
> hadoop-yarn-server-common etc. Moving to hadoop-client-api allows us to use 
> only the public/client API from the Hadoop side.
>  * Provides a better isolation from Hadoop dependencies. In future Spark can 
> better evolve without worrying about dependencies pulled from Hadoop side 
> (which used to be a lot).
> *There are some behavior changes introduced with this JIRA, when people use 
> Spark compiled with Hadoop 3.x:*
> - Users now need to make sure class path contains `hadoop-client-api` and 
> `hadoop-client-runtime` jars when they deploy Spark with the 
> `hadoop-provided` option. In addition, it is highly recommended that they put 
> these two jars before other Hadoop jars in the class path. Otherwise, 
> conflicts such as from Guava could happen if classes are loaded from the 
> other non-shaded Hadoop jars.
> - Since the new shaded Hadoop clients no longer include 3rd party 
> dependencies, users who used to depend on these now need to explicitly put 
> the jars in their class path.
> Ideally the above should go to release notes.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34531) Remove Experimental API tag in PrometheusServlet

2021-02-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34531:
--
Issue Type: Bug  (was: Improvement)

> Remove Experimental API tag in PrometheusServlet
> 
>
> Key: SPARK-34531
> URL: https://issues.apache.org/jira/browse/SPARK-34531
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0, 3.1.2
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.2
>
>
> SPARK-31674 introduced an Experimental tag to PrometheusServlet but this is 
> actually not needed because the class itself isn't an API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34531) Remove Experimental API tag in PrometheusServlet

2021-02-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34531:
--
Affects Version/s: (was: 3.1.2)
   3.1.1

> Remove Experimental API tag in PrometheusServlet
> 
>
> Key: SPARK-34531
> URL: https://issues.apache.org/jira/browse/SPARK-34531
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.2, 3.2.0, 3.1.1
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.2
>
>
> SPARK-31674 introduced an Experimental tag to PrometheusServlet but this is 
> actually not needed because the class itself isn't an API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34531) Remove Experimental API tag in PrometheusServlet

2021-02-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34531:
--
Affects Version/s: 3.0.2

> Remove Experimental API tag in PrometheusServlet
> 
>
> Key: SPARK-34531
> URL: https://issues.apache.org/jira/browse/SPARK-34531
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.2, 3.2.0, 3.1.2
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.2
>
>
> SPARK-31674 introduced an Experimental tag to PrometheusServlet but this is 
> actually not needed because the class itself isn't an API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34531) Remove Experimental API tag in PrometheusServlet

2021-02-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-34531:
--
Affects Version/s: 3.1.2

> Remove Experimental API tag in PrometheusServlet
> 
>
> Key: SPARK-34531
> URL: https://issues.apache.org/jira/browse/SPARK-34531
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0, 3.1.2
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.2
>
>
> SPARK-31674 introduced an Experimental tag to PrometheusServlet but this is 
> actually not needed because the class itself isn't an API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34531) Remove Experimental API tag in PrometheusServlet

2021-02-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reassigned SPARK-34531:
-

Assignee: Hyukjin Kwon

> Remove Experimental API tag in PrometheusServlet
> 
>
> Key: SPARK-34531
> URL: https://issues.apache.org/jira/browse/SPARK-34531
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> SPARK-31674 introduced an Experimental tag to PrometheusServlet but this is 
> actually not needed because the class itself isn't an API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34531) Remove Experimental API tag in PrometheusServlet

2021-02-24 Thread Dongjoon Hyun (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-34531.
---
Fix Version/s: 3.1.2
   Resolution: Fixed

Issue resolved by pull request 31640
[https://github.com/apache/spark/pull/31640]

> Remove Experimental API tag in PrometheusServlet
> 
>
> Key: SPARK-34531
> URL: https://issues.apache.org/jira/browse/SPARK-34531
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 3.1.2
>
>
> SPARK-31674 introduced an Experimental tag to PrometheusServlet but this is 
> actually not needed because the class itself isn't an API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34523) JDK-8194653: Deadlock involving FileSystems.getDefault and System.loadLibrary call

2021-02-24 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-34523:
-
Description: 
h2. Instruction

This will cause a deadlock and hang concurrent tasks forever on the same 
executor. For example,
 
In the Spark UI stage tab, you may find that some of the tasks hang for hours 
while all the others complete without delay.
 
!screenshot-2.png! 

You may also find that these hanging tasks belong to the same executors.
Usually, in this case, you will also get nothing helpful from the executor log.

If you print the executor jstack, or you check the thread dump via the Spark UI 
executor tab, and you find some task threads blocked like below, you have very 
likely hit the JDK-8194653 issue.
!screenshot-1.png! 

h2. Solutions

Here are some options to circumvent this problem:

1. On the cluster manager side, you can update the JDK version according to 
https://bugs.openjdk.java.net/browse/JDK-8194653
2. If you are not able to update the JDK version for the cluster entirely, you 
can use `spark.executorEnv.JAVA_HOME` to specify a suitable JRE for your apps 
(see the sketch below)
3. Also, turning on `spark.speculation` may let Spark automatically re-run the 
hanging tasks and bypass the problem
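
A minimal sketch of the Spark-side workarounds in items 2 and 3 (the JAVA_HOME path is a placeholder; this is not taken from the issue itself):
{code}
// Hypothetical sketch: set the two workaround configs programmatically.
// The JAVA_HOME value is a placeholder for a JDK/JRE that contains the
// JDK-8194653 fix.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executorEnv.JAVA_HOME", "/opt/patched-jdk") // placeholder path
  .set("spark.speculation", "true")
{code}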

  was:
h2. Instruction

This will cause deadlock and hangs concurrent tasks forever on the same 
executor. for example,
 
In the Spark UI stage tab, you may find some of the tasks hang for hours and 
all others complete without delay.
 
!screenshot-2.png! 

Also, you may find that these hanging tasks belong to the same executors.
Usually, in this case, you will also get nothing helpful from the executor log.

If you print the executor jstack or you check the ThreadDump via SparkUI 
executor tab and you find some task thread blocked like below, you are very 
likely to hit the JDK-8194653 issue.
!screenshot-1.png! 

h2. Solutions

1. Update JDK version 


> JDK-8194653:  Deadlock involving FileSystems.getDefault and 
> System.loadLibrary call
> ---
>
> Key: SPARK-34523
> URL: https://issues.apache.org/jira/browse/SPARK-34523
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.7, 3.0.2, 3.1.1
>Reporter: Kent Yao
>Priority: Major
> Attachments: 4303.log, screenshot-1.png, screenshot-2.png
>
>
> h2. Instruction
> This will cause deadlock and hangs concurrent tasks forever on the same 
> executor. for example,
>  
> In the Spark UI stage tab, you may find some of the tasks hang for hours and 
> all others complete without delay.
>  
> !screenshot-2.png! 
> Also, you may find that these hanging tasks belong to the same executors.
> Usually, in this case, you will also get nothing helpful from the executor 
> log.
> If you print the executor jstack or you check the ThreadDump via SparkUI 
> executor tab and you find some task thread blocked like below, you are very 
> likely to hit the JDK-8194653 issue.
> !screenshot-1.png! 
> h2. Solutions
> Here are some options to circumvent this problem:
> 1. For the cluster managers side, you can update the JDK version according to 
> https://bugs.openjdk.java.net/browse/JDK-8194653
> 2. If you are not able to update the JDK version for the cluster entirely, 
> you can use `spark.executorEnv.JAVA_HOME` to specify a suitable JRE for your 
> apps
> 2. Also, turn on `spark.speculation` may let spark automatically re-run the 
> hanging tasks and bypass the problem



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34516) Spark 3.0.1 encounter parquet PageHerder IO issue

2021-02-24 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290599#comment-17290599
 ] 

Dongjoon Hyun commented on SPARK-34516:
---

Hi, [~angerszhuuu]. Could you provide a reproducer? For now, there is nothing 
much we can do.

> Spark 3.0.1 encounter parquet PageHerder IO issue
> -
>
> Key: SPARK-34516
> URL: https://issues.apache.org/jira/browse/SPARK-34516
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.1
>Reporter: angerszhu
>Priority: Major
>
> {code:java}
> Caused by: java.io.IOException: can not read class 
> org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' 
> was not found in serialized data! Struct: 
> org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@42a9002d
>   at org.apache.parquet.format.Util.read(Util.java:216)
>   at org.apache.parquet.format.Util.readPageHeader(Util.java:65)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$WorkaroundChunk.readPageHeader(ParquetFileReader.java:1064)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:950)
>   at 
> org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:807)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:313)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:268)
>   at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:171)
>   at 
> org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173)
>   at 
> org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93)
>   at 
> org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:491)
>   at 
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34523) JDK-8194653: Deadlock involving FileSystems.getDefault and System.loadLibrary call

2021-02-24 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-34523:
-
Description: 
h2. Instruction

This will cause deadlock and hangs concurrent tasks forever on the same 
executor. for example,
 
In the Spark UI stage tab, you may find some of the tasks hang for hours and 
all others complete without delay.
 
!screenshot-2.png! 

Also, you may find that these hanging tasks belong to the same executors.
Usually, in this case, you will also get nothing helpful from the executor log.

If you print the executor jstack or you check the ThreadDump via SparkUI 
executor tab and you find some task thread blocked like below, you are very 
likely to hit the JDK-8194653 issue.
!screenshot-1.png! 

h2. Solutions

1. Update JDK version 

  was:
This will cause deadlock and hangs concurrent tasks forever on the same 
executor. for example,
 
In the Spark UI stage tab, you may find some of the tasks hang for hours and 
all others complete without delay.
 
!screenshot-2.png! 

Also, you may find that these hanging tasks belong to the same executors.
Usually, in this case, you will also get nothing helpful from the executor log.

If you print the executor jstack or you check the ThreadDump via SparkUI 
executor tab and you find some task thread blocked like below, you are very 
likely to hit the JDK
!screenshot-1.png! 


> JDK-8194653:  Deadlock involving FileSystems.getDefault and 
> System.loadLibrary call
> ---
>
> Key: SPARK-34523
> URL: https://issues.apache.org/jira/browse/SPARK-34523
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.7, 3.0.2, 3.1.1
>Reporter: Kent Yao
>Priority: Major
> Attachments: 4303.log, screenshot-1.png, screenshot-2.png
>
>
> h2. Instruction
> This will cause deadlock and hangs concurrent tasks forever on the same 
> executor. for example,
>  
> In the Spark UI stage tab, you may find some of the tasks hang for hours and 
> all others complete without delay.
>  
> !screenshot-2.png! 
> Also, you may find that these hanging tasks belong to the same executors.
> Usually, in this case, you will also get nothing helpful from the executor 
> log.
> If you print the executor jstack or you check the ThreadDump via SparkUI 
> executor tab and you find some task thread blocked like below, you are very 
> likely to hit the JDK-8194653 issue.
> !screenshot-1.png! 
> h2. Solutions
> 1. Update JDK version 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34523) JDK-8194653: Deadlock involving FileSystems.getDefault and System.loadLibrary call

2021-02-24 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-34523:
-
Description: 
This will cause deadlock and hangs concurrent tasks forever on the same 
executor. for example,
 
In the Spark UI stage tab, you may find some of the tasks hang for hours and 
all others complete without delay.
 
!screenshot-2.png! 

Also, you may find that these hanging tasks belong to the same executors.
Usually, in this case, you will also get nothing helpful from the executor log.

If you print the executor jstack or you check the ThreadDump via SparkUI 
executor tab and you find some task thread blocked like below, you are very 
likely to hit the JDK
!screenshot-1.png! 

  was:
This will cause deadlock and hangs concurrent tasks forever on the same 
executor. for example,
 
In the Spark UI stage tab, you may find some task hang for hours
 !screenshot-2.png! 

!screenshot-1.png! 


> JDK-8194653:  Deadlock involving FileSystems.getDefault and 
> System.loadLibrary call
> ---
>
> Key: SPARK-34523
> URL: https://issues.apache.org/jira/browse/SPARK-34523
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.7, 3.0.2, 3.1.1
>Reporter: Kent Yao
>Priority: Major
> Attachments: 4303.log, screenshot-1.png, screenshot-2.png
>
>
> This will cause deadlock and hangs concurrent tasks forever on the same 
> executor. for example,
>  
> In the Spark UI stage tab, you may find some of the tasks hang for hours and 
> all others complete without delay.
>  
> !screenshot-2.png! 
> Also, you may find that these hanging tasks belong to the same executors.
> Usually, in this case, you will also get nothing helpful from the executor 
> log.
> If you print the executor jstack or you check the ThreadDump via SparkUI 
> executor tab and you find some task thread blocked like below, you are very 
> likely to hit the JDK
> !screenshot-1.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34523) JDK-8194653: Deadlock involving FileSystems.getDefault and System.loadLibrary call

2021-02-24 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-34523:
-
Description: 
This will cause deadlock and hangs concurrent tasks forever on the same 
executor. for example,
 
In the Spark UI stage tab, you may find some task hang for hours
 !screenshot-2.png! 

!screenshot-1.png! 

  was:
This will cause deadlock and hangs concurrent tasks forever on the same 
executor. for example,
 
!screenshot-1.png! 


> JDK-8194653:  Deadlock involving FileSystems.getDefault and 
> System.loadLibrary call
> ---
>
> Key: SPARK-34523
> URL: https://issues.apache.org/jira/browse/SPARK-34523
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.7, 3.0.2, 3.1.1
>Reporter: Kent Yao
>Priority: Major
> Attachments: 4303.log, screenshot-1.png, screenshot-2.png
>
>
> This will cause deadlock and hangs concurrent tasks forever on the same 
> executor. for example,
>  
> In the Spark UI stage tab, you may find some task hang for hours
>  !screenshot-2.png! 
> !screenshot-1.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34523) JDK-8194653: Deadlock involving FileSystems.getDefault and System.loadLibrary call

2021-02-24 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-34523:
-
Attachment: screenshot-2.png

> JDK-8194653:  Deadlock involving FileSystems.getDefault and 
> System.loadLibrary call
> ---
>
> Key: SPARK-34523
> URL: https://issues.apache.org/jira/browse/SPARK-34523
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.7, 3.0.2, 3.1.1
>Reporter: Kent Yao
>Priority: Major
> Attachments: 4303.log, screenshot-1.png, screenshot-2.png
>
>
> This will cause deadlock and hangs concurrent tasks forever on the same 
> executor. for example,
>  
> !screenshot-1.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34497) JDBC connection provider is not removing kerberos credentials from JVM security context

2021-02-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon reassigned SPARK-34497:


Assignee: Gabor Somogyi

> JDBC connection provider is not removing kerberos credentials from JVM 
> security context
> ---
>
> Key: SPARK-34497
> URL: https://issues.apache.org/jira/browse/SPARK-34497
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.2.0, 3.1.2
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Major
>
> Some of the built-in JDBC connection providers are changing the JVM security 
> context to do the authentication which is fine. The problematic part is that 
> executors can be reused by another query. The following situation leads to 
> incorrect behaviour:
>  * Query1 opens JDBC connection and changes JVM security context in Executor1
>  * Query2 tries to open JDBC connection but it realizes there is already an 
> entry for that DB type in Executor1
>  * Query2 is not changing JVM security context and uses Query1 keytab and 
> principal
>  * Query2 fails with authentication error



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-34497) JDBC connection provider is not removing kerberos credentials from JVM security context

2021-02-24 Thread Hyukjin Kwon (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-34497.
--
Fix Version/s: 3.1.2
   Resolution: Fixed

Issue resolved by pull request 31622
[https://github.com/apache/spark/pull/31622]

> JDBC connection provider is not removing kerberos credentials from JVM 
> security context
> ---
>
> Key: SPARK-34497
> URL: https://issues.apache.org/jira/browse/SPARK-34497
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.2, 3.2.0, 3.1.2
>Reporter: Gabor Somogyi
>Assignee: Gabor Somogyi
>Priority: Major
> Fix For: 3.1.2
>
>
> Some of the built-in JDBC connection providers are changing the JVM security 
> context to do the authentication which is fine. The problematic part is that 
> executors can be reused by another query. The following situation leads to 
> incorrect behaviour:
>  * Query1 opens JDBC connection and changes JVM security context in Executor1
>  * Query2 tries to open JDBC connection but it realizes there is already an 
> entry for that DB type in Executor1
>  * Query2 is not changing JVM security context and uses Query1 keytab and 
> principal
>  * Query2 fails with authentication error



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34532) IntervalUtils.add() may result in 'long overflow'

2021-02-24 Thread Ted Yu (Jira)
Ted Yu created SPARK-34532:
--

 Summary: IntervalUtils.add() may result in 'long overflow'
 Key: SPARK-34532
 URL: https://issues.apache.org/jira/browse/SPARK-34532
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.2
Reporter: Ted Yu


I noticed the following when running test suite:
{code}
19:15:38.255 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in 
stage 14744.0 (TID 16705)
java.lang.ArithmeticException: long overflow
at java.lang.Math.addExact(Math.java:809)
at org.apache.spark.sql.types.LongExactNumeric$.plus(numerics.scala:105)
at org.apache.spark.sql.types.LongExactNumeric$.plus(numerics.scala:104)
at 
org.apache.spark.sql.catalyst.expressions.Add.nullSafeEval(arithmetic.scala:268)
at 
org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:573)
at 
org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(InterpretedMutableProjection.scala:97)
{code}
This likely was caused by the following line:
{code}
val microseconds = left.microseconds + right.microseconds
{code}
We should check whether the addition would produce overflow before adding.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34523) JDK-8194653: Deadlock involving FileSystems.getDefault and System.loadLibrary call

2021-02-24 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-34523:
-
Description: 
This will cause deadlock and hangs concurrent tasks forever on the same 
executor. for example,
 
!screenshot-1.png! 

> JDK-8194653:  Deadlock involving FileSystems.getDefault and 
> System.loadLibrary call
> ---
>
> Key: SPARK-34523
> URL: https://issues.apache.org/jira/browse/SPARK-34523
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.7, 3.0.2, 3.1.1
>Reporter: Kent Yao
>Priority: Major
> Attachments: 4303.log, screenshot-1.png
>
>
> This will cause deadlock and hangs concurrent tasks forever on the same 
> executor. for example,
>  
> !screenshot-1.png! 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34523) JDK-8194653: Deadlock involving FileSystems.getDefault and System.loadLibrary call

2021-02-24 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-34523:
-
Attachment: screenshot-1.png

> JDK-8194653:  Deadlock involving FileSystems.getDefault and 
> System.loadLibrary call
> ---
>
> Key: SPARK-34523
> URL: https://issues.apache.org/jira/browse/SPARK-34523
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.7, 3.0.2, 3.1.1
>Reporter: Kent Yao
>Priority: Major
> Attachments: 4303.log, screenshot-1.png
>
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34531) Remove Experimental API tag in PrometheusServlet

2021-02-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34531:


Assignee: (was: Apache Spark)

> Remove Experimental API tag in PrometheusServlet
> 
>
> Key: SPARK-34531
> URL: https://issues.apache.org/jira/browse/SPARK-34531
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> SPARK-31674 introduced an Experimental tag to PrometheusServlet but this is 
> actually not needed because the class itself isn't an API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34531) Remove Experimental API tag in PrometheusServlet

2021-02-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290561#comment-17290561
 ] 

Apache Spark commented on SPARK-34531:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/31640

> Remove Experimental API tag in PrometheusServlet
> 
>
> Key: SPARK-34531
> URL: https://issues.apache.org/jira/browse/SPARK-34531
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> SPARK-31674 introduced an Experimental tag to PrometheusServlet but this is 
> actually not needed because the class itself isn't an API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34531) Remove Experimental API tag in PrometheusServlet

2021-02-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34531:


Assignee: Apache Spark

> Remove Experimental API tag in PrometheusServlet
> 
>
> Key: SPARK-34531
> URL: https://issues.apache.org/jira/browse/SPARK-34531
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> SPARK-31674 introduced an Experimental tag to PrometheusServlet but this is 
> actually not needed because the class itself isn't an API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34531) Remove Experimental API tag in PrometheusServlet

2021-02-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290563#comment-17290563
 ] 

Apache Spark commented on SPARK-34531:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/31640

> Remove Experimental API tag in PrometheusServlet
> 
>
> Key: SPARK-34531
> URL: https://issues.apache.org/jira/browse/SPARK-34531
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> SPARK-31674 introduced an Experimental tag to PrometheusServlet but this is 
> actually not needed because the class itself isn't an API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34531) Remove Experimental API tag in PrometheusServlet

2021-02-24 Thread Hyukjin Kwon (Jira)
Hyukjin Kwon created SPARK-34531:


 Summary: Remove Experimental API tag in PrometheusServlet
 Key: SPARK-34531
 URL: https://issues.apache.org/jira/browse/SPARK-34531
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 3.2.0
Reporter: Hyukjin Kwon


SPARK-31674 introduced an Experimental tag to PrometheusServlet but this is 
actually not needed because the class itself isn't an API.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34530) logError for interrupting block migrations is too high

2021-02-24 Thread Holden Karau (Jira)
Holden Karau created SPARK-34530:


 Summary: logError for interrupting block migrations is too high
 Key: SPARK-34530
 URL: https://issues.apache.org/jira/browse/SPARK-34530
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.1.0, 3.2.0, 3.1.1
Reporter: Holden Karau






--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34529) spark.read.csv is throwing exception ,"lineSep' can contain only 1 character" when parsing windows line feed (CR LF)

2021-02-24 Thread Takeshi Yamamuro (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290539#comment-17290539
 ] 

Takeshi Yamamuro commented on SPARK-34529:
--

Since I think this is not a bug but an improvement, I changed the type.

> spark.read.csv is throwing exception ,"lineSep' can contain only 1 character" 
> when parsing windows line feed (CR LF)
> 
>
> Key: SPARK-34529
> URL: https://issues.apache.org/jira/browse/SPARK-34529
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.2.0, 3.1.1, 3.0.3
>Reporter: Shanmugavel Kuttiyandi Chandrakasu
>Priority: Minor
>
> lineSep documentation says - 
> `lineSep` (default covers all `\r`, `\r\n` and `\n`): defines the line 
> separator that should be used for parsing. Maximum length is 1 character.
> Reference: 
>  
> [https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader]
> When reading csv file using spark
> src_df = (spark.read
> .option("header", "true")
> .option("multiLine","true")
> .option("escape", "ǁ")
>  .option("lineSep","\r\n")
> .schema(materialusetype_Schema)
> .option("badRecordsPath","/fh_badfile")
> .csv("/crlf.csv")
> )
> Below is the stack trace:
> java.lang.IllegalArgumentException: requirement failed: 'lineSep' can contain 
> only 1 character.java.lang.IllegalArgumentException: requirement failed: 
> 'lineSep' can contain only 1 character. at 
> scala.Predef$.require(Predef.scala:281) at 
> org.apache.spark.sql.catalyst.csv.CSVOptions.$anonfun$lineSeparator$1(CSVOptions.scala:209)
>  at scala.Option.map(Option.scala:230) at 
> org.apache.spark.sql.catalyst.csv.CSVOptions.(CSVOptions.scala:207) at 
> org.apache.spark.sql.catalyst.csv.CSVOptions.(CSVOptions.scala:58) at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.buildReader(CSVFileFormat.scala:108)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormat.buildReaderWithPartitionValues(FileFormat.scala:132)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormat.buildReaderWithPartitionValues$(FileFormat.scala:123)
>  at 
> org.apache.spark.sql.execution.datasources.TextBasedFileFormat.buildReaderWithPartitionValues(FileFormat.scala:162)
>  at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:510)
>  at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:497)
>  at 
> org.apache.spark.sql.execution.FileSourceScanExec.doExecute(DataSourceScanExec.scala:692)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:196)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:240)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:236) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:192) at 
> org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:79) 
> at 
> org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:88) 
> at 
> org.apache.spark.sql.execution.collect.InternalRowFormat$.collect(cachedSparkResults.scala:61)
>  at 
> org.apache.spark.sql.execution.collect.InternalRowFormat$.collect(cachedSparkResults.scala:57)
>  at 
> org.apache.spark.sql.execution.ResultCacheManager.$anonfun$getOrComputeResultInternal$1(ResultCacheManager.scala:483)
>  at scala.Option.getOrElse(Option.scala:189) at 
> org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResultInternal(ResultCacheManager.scala:483)
>  at 
> org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:427)
>  at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollectResult(limit.scala:58)
>  at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:3013) at 
> org.apache.spark.sql.Dataset.$anonfun$collectResult$1(Dataset.scala:3004) at 
> org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3728) at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$5(SQLExecution.scala:116)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:248)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:101)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:841) at 
> org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:77)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:198)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3726) at 
> 

[jira] [Updated] (SPARK-34529) spark.read.csv is throwing exception ,"lineSep' can contain only 1 character" when parsing windows line feed (CR LF)

2021-02-24 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-34529:
-
Component/s: (was: Spark Core)
 SQL

> spark.read.csv is throwing exception ,"lineSep' can contain only 1 character" 
> when parsing windows line feed (CR LF)
> 
>
> Key: SPARK-34529
> URL: https://issues.apache.org/jira/browse/SPARK-34529
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, SQL
>Affects Versions: 3.2.0, 3.1.1, 3.0.3
>Reporter: Shanmugavel Kuttiyandi Chandrakasu
>Priority: Minor
>
> lineSep documentation says - 
> `lineSep` (default covers all `\r`, `\r\n` and `\n`): defines the line 
> separator that should be used for parsing. Maximum length is 1 character.
> Reference: 
>  
> [https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader]
> When reading a CSV file using Spark:
> src_df = (spark.read
> .option("header", "true")
> .option("multiLine","true")
> .option("escape", "ǁ")
>  .option("lineSep","\r\n")
> .schema(materialusetype_Schema)
> .option("badRecordsPath","/fh_badfile")
> .csv("/crlf.csv")
> )
> Below is the stack trace:
> java.lang.IllegalArgumentException: requirement failed: 'lineSep' can contain 
> only 1 character.java.lang.IllegalArgumentException: requirement failed: 
> 'lineSep' can contain only 1 character. at 
> scala.Predef$.require(Predef.scala:281) at 
> org.apache.spark.sql.catalyst.csv.CSVOptions.$anonfun$lineSeparator$1(CSVOptions.scala:209)
>  at scala.Option.map(Option.scala:230) at 
> org.apache.spark.sql.catalyst.csv.CSVOptions.(CSVOptions.scala:207) at 
> org.apache.spark.sql.catalyst.csv.CSVOptions.(CSVOptions.scala:58) at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.buildReader(CSVFileFormat.scala:108)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormat.buildReaderWithPartitionValues(FileFormat.scala:132)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormat.buildReaderWithPartitionValues$(FileFormat.scala:123)
>  at 
> org.apache.spark.sql.execution.datasources.TextBasedFileFormat.buildReaderWithPartitionValues(FileFormat.scala:162)
>  at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:510)
>  at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:497)
>  at 
> org.apache.spark.sql.execution.FileSourceScanExec.doExecute(DataSourceScanExec.scala:692)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:196)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:240)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:236) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:192) at 
> org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:79) 
> at 
> org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:88) 
> at 
> org.apache.spark.sql.execution.collect.InternalRowFormat$.collect(cachedSparkResults.scala:61)
>  at 
> org.apache.spark.sql.execution.collect.InternalRowFormat$.collect(cachedSparkResults.scala:57)
>  at 
> org.apache.spark.sql.execution.ResultCacheManager.$anonfun$getOrComputeResultInternal$1(ResultCacheManager.scala:483)
>  at scala.Option.getOrElse(Option.scala:189) at 
> org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResultInternal(ResultCacheManager.scala:483)
>  at 
> org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:427)
>  at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollectResult(limit.scala:58)
>  at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:3013) at 
> org.apache.spark.sql.Dataset.$anonfun$collectResult$1(Dataset.scala:3004) at 
> org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3728) at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$5(SQLExecution.scala:116)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:248)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:101)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:841) at 
> org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:77)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:198)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3726) at 
> org.apache.spark.sql.Dataset.collectResult(Dataset.scala:3003)



[jira] [Updated] (SPARK-34529) spark.read.csv is throwing exception ,"lineSep' can contain only 1 character" when parsing windows line feed (CR LF)

2021-02-24 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-34529:
-
Affects Version/s: (was: 3.0.1)
   3.0.3
   3.1.1
   3.2.0

> spark.read.csv is throwing exception ,"lineSep' can contain only 1 character" 
> when parsing windows line feed (CR LF)
> 
>
> Key: SPARK-34529
> URL: https://issues.apache.org/jira/browse/SPARK-34529
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 3.2.0, 3.1.1, 3.0.3
>Reporter: Shanmugavel Kuttiyandi Chandrakasu
>Priority: Minor
>
> lineSep documentation says - 
> `lineSep` (default covers all `\r`, `\r\n` and `\n`): defines the line 
> separator that should be used for parsing. Maximum length is 1 character.
> Reference: 
>  
> [https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader]
> When reading a CSV file using Spark:
> src_df = (spark.read
> .option("header", "true")
> .option("multiLine","true")
> .option("escape", "ǁ")
>  .option("lineSep","\r\n")
> .schema(materialusetype_Schema)
> .option("badRecordsPath","/fh_badfile")
> .csv("/crlf.csv")
> )
> Below is the stack trace:
> java.lang.IllegalArgumentException: requirement failed: 'lineSep' can contain 
> only 1 character.java.lang.IllegalArgumentException: requirement failed: 
> 'lineSep' can contain only 1 character. at 
> scala.Predef$.require(Predef.scala:281) at 
> org.apache.spark.sql.catalyst.csv.CSVOptions.$anonfun$lineSeparator$1(CSVOptions.scala:209)
>  at scala.Option.map(Option.scala:230) at 
> org.apache.spark.sql.catalyst.csv.CSVOptions.(CSVOptions.scala:207) at 
> org.apache.spark.sql.catalyst.csv.CSVOptions.(CSVOptions.scala:58) at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.buildReader(CSVFileFormat.scala:108)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormat.buildReaderWithPartitionValues(FileFormat.scala:132)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormat.buildReaderWithPartitionValues$(FileFormat.scala:123)
>  at 
> org.apache.spark.sql.execution.datasources.TextBasedFileFormat.buildReaderWithPartitionValues(FileFormat.scala:162)
>  at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:510)
>  at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:497)
>  at 
> org.apache.spark.sql.execution.FileSourceScanExec.doExecute(DataSourceScanExec.scala:692)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:196)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:240)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:236) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:192) at 
> org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:79) 
> at 
> org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:88) 
> at 
> org.apache.spark.sql.execution.collect.InternalRowFormat$.collect(cachedSparkResults.scala:61)
>  at 
> org.apache.spark.sql.execution.collect.InternalRowFormat$.collect(cachedSparkResults.scala:57)
>  at 
> org.apache.spark.sql.execution.ResultCacheManager.$anonfun$getOrComputeResultInternal$1(ResultCacheManager.scala:483)
>  at scala.Option.getOrElse(Option.scala:189) at 
> org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResultInternal(ResultCacheManager.scala:483)
>  at 
> org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:427)
>  at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollectResult(limit.scala:58)
>  at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:3013) at 
> org.apache.spark.sql.Dataset.$anonfun$collectResult$1(Dataset.scala:3004) at 
> org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3728) at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$5(SQLExecution.scala:116)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:248)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:101)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:841) at 
> org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:77)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:198)
>  at 

[jira] [Updated] (SPARK-34529) spark.read.csv is throwing exception ,"lineSep' can contain only 1 character" when parsing windows line feed (CR LF)

2021-02-24 Thread Takeshi Yamamuro (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Takeshi Yamamuro updated SPARK-34529:
-
Issue Type: Improvement  (was: Bug)

> spark.read.csv is throwing exception ,"lineSep' can contain only 1 character" 
> when parsing windows line feed (CR LF)
> 
>
> Key: SPARK-34529
> URL: https://issues.apache.org/jira/browse/SPARK-34529
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 3.0.1
>Reporter: Shanmugavel Kuttiyandi Chandrakasu
>Priority: Minor
>
> lineSep documentation says - 
> `lineSep` (default covers all `\r`, `\r\n` and `\n`): defines the line 
> separator that should be used for parsing. Maximum length is 1 character.
> Reference: 
>  
> [https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader]
> When reading a CSV file using Spark:
> src_df = (spark.read
> .option("header", "true")
> .option("multiLine","true")
> .option("escape", "ǁ")
>  .option("lineSep","\r\n")
> .schema(materialusetype_Schema)
> .option("badRecordsPath","/fh_badfile")
> .csv("/crlf.csv")
> )
> Below is the stack trace:
> java.lang.IllegalArgumentException: requirement failed: 'lineSep' can contain 
> only 1 character.java.lang.IllegalArgumentException: requirement failed: 
> 'lineSep' can contain only 1 character. at 
> scala.Predef$.require(Predef.scala:281) at 
> org.apache.spark.sql.catalyst.csv.CSVOptions.$anonfun$lineSeparator$1(CSVOptions.scala:209)
>  at scala.Option.map(Option.scala:230) at 
> org.apache.spark.sql.catalyst.csv.CSVOptions.(CSVOptions.scala:207) at 
> org.apache.spark.sql.catalyst.csv.CSVOptions.(CSVOptions.scala:58) at 
> org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.buildReader(CSVFileFormat.scala:108)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormat.buildReaderWithPartitionValues(FileFormat.scala:132)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormat.buildReaderWithPartitionValues$(FileFormat.scala:123)
>  at 
> org.apache.spark.sql.execution.datasources.TextBasedFileFormat.buildReaderWithPartitionValues(FileFormat.scala:162)
>  at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:510)
>  at 
> org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:497)
>  at 
> org.apache.spark.sql.execution.FileSourceScanExec.doExecute(DataSourceScanExec.scala:692)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:196)
>  at 
> org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:240)
>  at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
>  at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:236) at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:192) at 
> org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:79) 
> at 
> org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:88) 
> at 
> org.apache.spark.sql.execution.collect.InternalRowFormat$.collect(cachedSparkResults.scala:61)
>  at 
> org.apache.spark.sql.execution.collect.InternalRowFormat$.collect(cachedSparkResults.scala:57)
>  at 
> org.apache.spark.sql.execution.ResultCacheManager.$anonfun$getOrComputeResultInternal$1(ResultCacheManager.scala:483)
>  at scala.Option.getOrElse(Option.scala:189) at 
> org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResultInternal(ResultCacheManager.scala:483)
>  at 
> org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:427)
>  at 
> org.apache.spark.sql.execution.CollectLimitExec.executeCollectResult(limit.scala:58)
>  at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:3013) at 
> org.apache.spark.sql.Dataset.$anonfun$collectResult$1(Dataset.scala:3004) at 
> org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3728) at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$5(SQLExecution.scala:116)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:248)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:101)
>  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:841) at 
> org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:77)
>  at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:198)
>  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3726) at 
> org.apache.spark.sql.Dataset.collectResult(Dataset.scala:3003)



--
This message was sent 

[jira] [Assigned] (SPARK-34528) View result are not consistent after a modification inside a struct of the table

2021-02-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34528:


Assignee: Apache Spark

> View result are not consistent after a modification inside a struct of the 
> table
> 
>
> Key: SPARK-34528
> URL: https://issues.apache.org/jira/browse/SPARK-34528
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Thomas Prelle
>Assignee: Apache Spark
>Priority: Major
>
> After the work in [https://github.com/apache/spark/pull/31368] to simplify Hive 
> view resolution, I found a bug, since Hive allows you to change the order of the 
> fields inside a struct:
> 1) Create a table in Hive with a struct:
>  CREATE TABLE test_struct (id INT, sub STRUCT<a:INT, b:STRING>);
> 2) Insert data into it:
> INSERT INTO TABLE test_struct SELECT 1, named_struct("a",1,"b","v1");
> 3) Create a view on top of it:
> CREATE VIEW test_view_struct AS SELECT id, sub FROM test_struct;
> 4) Alter the table, reordering the fields inside the struct:
> ALTER TABLE test_struct CHANGE COLUMN sub sub STRUCT<b:STRING, a:INT>;
> 5) Spark can no longer query the view, because struct fields in Spark are 
> resolved by position, not by column name.
> If the change is castable, the failure can even be silent.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34528) View result are not consistent after a modification inside a struct of the table

2021-02-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34528:


Assignee: (was: Apache Spark)

> View result are not consistent after a modification inside a struct of the 
> table
> 
>
> Key: SPARK-34528
> URL: https://issues.apache.org/jira/browse/SPARK-34528
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Thomas Prelle
>Priority: Major
>
> After the work in [https://github.com/apache/spark/pull/31368] to simplify Hive 
> view resolution, I found a bug, since Hive allows you to change the order of the 
> fields inside a struct:
> 1) Create a table in Hive with a struct:
>  CREATE TABLE test_struct (id INT, sub STRUCT<a:INT, b:STRING>);
> 2) Insert data into it:
> INSERT INTO TABLE test_struct SELECT 1, named_struct("a",1,"b","v1");
> 3) Create a view on top of it:
> CREATE VIEW test_view_struct AS SELECT id, sub FROM test_struct;
> 4) Alter the table, reordering the fields inside the struct:
> ALTER TABLE test_struct CHANGE COLUMN sub sub STRUCT<b:STRING, a:INT>;
> 5) Spark can no longer query the view, because struct fields in Spark are 
> resolved by position, not by column name.
> If the change is castable, the failure can even be silent.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34528) View result are not consistent after a modification inside a struct of the table

2021-02-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290275#comment-17290275
 ] 

Apache Spark commented on SPARK-34528:
--

User 'tprelle' has created a pull request for this issue:
https://github.com/apache/spark/pull/31639

> View result are not consistent after a modification inside a struct of the 
> table
> 
>
> Key: SPARK-34528
> URL: https://issues.apache.org/jira/browse/SPARK-34528
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.2.0
>Reporter: Thomas Prelle
>Priority: Major
>
> After the work in [https://github.com/apache/spark/pull/31368] to simplify Hive 
> view resolution, I found a bug, since Hive allows you to change the order of the 
> fields inside a struct:
> 1) Create a table in Hive with a struct:
>  CREATE TABLE test_struct (id INT, sub STRUCT<a:INT, b:STRING>);
> 2) Insert data into it:
> INSERT INTO TABLE test_struct SELECT 1, named_struct("a",1,"b","v1");
> 3) Create a view on top of it:
> CREATE VIEW test_view_struct AS SELECT id, sub FROM test_struct;
> 4) Alter the table, reordering the fields inside the struct:
> ALTER TABLE test_struct CHANGE COLUMN sub sub STRUCT<b:STRING, a:INT>;
> 5) Spark can no longer query the view, because struct fields in Spark are 
> resolved by position, not by column name.
> If the change is castable, the failure can even be silent.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34529) spark.read.csv is throwing exception ,"lineSep' can contain only 1 character" when parsing windows line feed (CR LF)

2021-02-24 Thread Shanmugavel Kuttiyandi Chandrakasu (Jira)
Shanmugavel Kuttiyandi Chandrakasu created SPARK-34529:
--

 Summary: spark.read.csv is throwing exception ,"lineSep' can 
contain only 1 character" when parsing windows line feed (CR LF)
 Key: SPARK-34529
 URL: https://issues.apache.org/jira/browse/SPARK-34529
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core
Affects Versions: 3.0.1
Reporter: Shanmugavel Kuttiyandi Chandrakasu


lineSep documentation says - 

`lineSep` (default covers all `\r`, `\r\n` and `\n`): defines the line 
separator that should be used for parsing. Maximum length is 1 character.

Reference: 

 
[https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader]

When reading a CSV file using Spark:

src_df = (spark.read
.option("header", "true")
.option("multiLine","true")
.option("escape", "ǁ")
 .option("lineSep","\r\n")
.schema(materialusetype_Schema)
.option("badRecordsPath","/fh_badfile")
.csv("/crlf.csv")
)

Below is the stack trace:

java.lang.IllegalArgumentException: requirement failed: 'lineSep' can contain 
only 1 character.java.lang.IllegalArgumentException: requirement failed: 
'lineSep' can contain only 1 character. at 
scala.Predef$.require(Predef.scala:281) at 
org.apache.spark.sql.catalyst.csv.CSVOptions.$anonfun$lineSeparator$1(CSVOptions.scala:209)
 at scala.Option.map(Option.scala:230) at 
org.apache.spark.sql.catalyst.csv.CSVOptions.(CSVOptions.scala:207) at 
org.apache.spark.sql.catalyst.csv.CSVOptions.(CSVOptions.scala:58) at 
org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.buildReader(CSVFileFormat.scala:108)
 at 
org.apache.spark.sql.execution.datasources.FileFormat.buildReaderWithPartitionValues(FileFormat.scala:132)
 at 
org.apache.spark.sql.execution.datasources.FileFormat.buildReaderWithPartitionValues$(FileFormat.scala:123)
 at 
org.apache.spark.sql.execution.datasources.TextBasedFileFormat.buildReaderWithPartitionValues(FileFormat.scala:162)
 at 
org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:510)
 at 
org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:497)
 at 
org.apache.spark.sql.execution.FileSourceScanExec.doExecute(DataSourceScanExec.scala:692)
 at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:196)
 at 
org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:240)
 at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165) 
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:236) 
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:192) at 
org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:79) 
at 
org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:88) 
at 
org.apache.spark.sql.execution.collect.InternalRowFormat$.collect(cachedSparkResults.scala:61)
 at 
org.apache.spark.sql.execution.collect.InternalRowFormat$.collect(cachedSparkResults.scala:57)
 at 
org.apache.spark.sql.execution.ResultCacheManager.$anonfun$getOrComputeResultInternal$1(ResultCacheManager.scala:483)
 at scala.Option.getOrElse(Option.scala:189) at 
org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResultInternal(ResultCacheManager.scala:483)
 at 
org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:427)
 at 
org.apache.spark.sql.execution.CollectLimitExec.executeCollectResult(limit.scala:58)
 at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:3013) at 
org.apache.spark.sql.Dataset.$anonfun$collectResult$1(Dataset.scala:3004) at 
org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3728) at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$5(SQLExecution.scala:116)
 at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:248)
 at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:101)
 at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:841) at 
org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:77)
 at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:198)
 at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3726) at 
org.apache.spark.sql.Dataset.collectResult(Dataset.scala:3003)
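
As a possible workaround, the documented default line-separator detection (which 
already covers \r, \r\n and \n) can be relied on instead of passing an explicit 
`lineSep`. Below is a minimal Scala sketch under that assumption; the input path 
is the one from the report, and the schema/escape options are omitted to keep the 
sketch self-contained.

{code}
// Minimal sketch of the workaround: omit lineSep entirely and let the documented
// default line-separator detection (\r, \r\n and \n) handle CRLF files.
import org.apache.spark.sql.SparkSession

object CrlfCsvRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("crlf-csv-read")
      .getOrCreate()

    val srcDf = spark.read
      .option("header", "true")
      .option("multiLine", "true")
      // No .option("lineSep", "\r\n"): an explicit two-character separator is
      // rejected, but the default already covers CRLF line endings.
      .csv("/crlf.csv") // path taken from the report above

    srcDf.show(truncate = false)
    spark.stop()
  }
}
{code}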



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34528) View result are not consistent after a modification inside a struct of the table

2021-02-24 Thread Thomas Prelle (Jira)
Thomas Prelle created SPARK-34528:
-

 Summary: View result are not consistent after a modification 
inside a struct of the table
 Key: SPARK-34528
 URL: https://issues.apache.org/jira/browse/SPARK-34528
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.2.0
Reporter: Thomas Prelle


After the work in [https://github.com/apache/spark/pull/31368] to simplify Hive 
view resolution, I found a bug, since Hive allows you to change the order of the 
fields inside a struct:

1) Create a table in Hive with a struct:
 CREATE TABLE test_struct (id INT, sub STRUCT<a:INT, b:STRING>);
2) Insert data into it:
INSERT INTO TABLE test_struct SELECT 1, named_struct("a",1,"b","v1");
3) Create a view on top of it:
CREATE VIEW test_view_struct AS SELECT id, sub FROM test_struct;
4) Alter the table, reordering the fields inside the struct:
ALTER TABLE test_struct CHANGE COLUMN sub sub STRUCT<b:STRING, a:INT>;
5) Spark can no longer query the view, because struct fields in Spark are 
resolved by position, not by column name.
If the change is castable, the failure can even be silent.
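
A minimal sketch of the positional behavior from step 5 follows; it does not need 
a Hive metastore, the view/table names are made up, and the struct cast stands in 
for the reordered Hive schema.

{code}
// Minimal sketch (assumptions: local SparkSession, made-up names). It shows that
// Spark matches struct fields by position when casting to a reordered struct type,
// which is the mechanism behind the broken (or silently wrong) view above.
import org.apache.spark.sql.SparkSession

object StructPositionDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("struct-position").getOrCreate()

    // Data written with the original layout STRUCT<a:INT, b:STRING>.
    spark.sql("SELECT named_struct('a', 1, 'b', 'v1') AS sub")
      .createOrReplaceTempView("old_layout")

    // Read it back as the reordered layout STRUCT<b:STRING, a:INT>: the cast is
    // applied by position, so 'b' silently receives the value that was in 'a',
    // and 'a' receives a failed cast of the old 'b'.
    spark.sql("SELECT CAST(sub AS STRUCT<b: STRING, a: INT>) AS sub FROM old_layout")
      .show(truncate = false)

    spark.stop()
  }
}
{code}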



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32617) Upgrade kubernetes client version to support latest minikube version.

2021-02-24 Thread Holden Karau (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Holden Karau resolved SPARK-32617.
--
Fix Version/s: 3.2.0
 Assignee: Attila Zsolt Piros
   Resolution: Fixed

> Upgrade kubernetes client version to support latest minikube version.
> -
>
> Key: SPARK-32617
> URL: https://issues.apache.org/jira/browse/SPARK-32617
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.1.0
>Reporter: Prashant Sharma
>Assignee: Attila Zsolt Piros
>Priority: Major
> Fix For: 3.2.0
>
>
> The following error occurs when the k8s integration tests are run against a 
> minikube cluster with version 1.2.1:
> {code:java}
> Run starting. Expected test count is: 18
> KubernetesSuite:
> org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite *** ABORTED ***
>   io.fabric8.kubernetes.client.KubernetesClientException: An error has 
> occurred.
>   at 
> io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
>   at 
> io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:53)
>   at 
> io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:196)
>   at 
> io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:62)
>   at io.fabric8.kubernetes.client.BaseClient.(BaseClient.java:51)
>   at 
> io.fabric8.kubernetes.client.DefaultKubernetesClient.(DefaultKubernetesClient.java:105)
>   at 
> org.apache.spark.deploy.k8s.integrationtest.backend.minikube.Minikube$.getKubernetesClient(Minikube.scala:81)
>   at 
> org.apache.spark.deploy.k8s.integrationtest.backend.minikube.MinikubeTestBackend$.initialize(MinikubeTestBackend.scala:33)
>   at 
> org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.beforeAll(KubernetesSuite.scala:131)
>   at 
> org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
>   ...
>   Cause: java.nio.file.NoSuchFileException: /root/.minikube/apiserver.crt
>   at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
>   at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
>   at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
>   at 
> sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
>   at java.nio.file.Files.newByteChannel(Files.java:361)
>   at java.nio.file.Files.newByteChannel(Files.java:407)
>   at java.nio.file.Files.readAllBytes(Files.java:3152)
>   at 
> io.fabric8.kubernetes.client.internal.CertUtils.getInputStreamFromDataOrFile(CertUtils.java:72)
>   at 
> io.fabric8.kubernetes.client.internal.CertUtils.createKeyStore(CertUtils.java:242)
>   at 
> io.fabric8.kubernetes.client.internal.SSLUtils.keyManagers(SSLUtils.java:128)
>   ...
> Run completed in 1 second, 821 milliseconds.
> Total number of tests run: 0
> Suites: completed 1, aborted 1
> Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0
> *** 1 SUITE ABORTED ***
> [INFO] 
> 
> [INFO] Reactor Summary for Spark Project Parent POM 3.1.0-SNAPSHOT:
> [INFO] 
> [INFO] Spark Project Parent POM ... SUCCESS [  4.454 
> s]
> [INFO] Spark Project Tags . SUCCESS [  4.768 
> s]
> [INFO] Spark Project Local DB . SUCCESS [  2.961 
> s]
> [INFO] Spark Project Networking ... SUCCESS [  4.258 
> s]
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [  5.703 
> s]
> [INFO] Spark Project Unsafe ... SUCCESS [  3.239 
> s]
> [INFO] Spark Project Launcher . SUCCESS [  3.224 
> s]
> [INFO] Spark Project Core . SUCCESS [02:25 
> min]
> [INFO] Spark Project Kubernetes Integration Tests . FAILURE [ 17.244 
> s]
> [INFO] 
> 
> [INFO] BUILD FAILURE
> [INFO] 
> 
> [INFO] Total time:  03:12 min
> [INFO] Finished at: 2020-08-11T06:26:15-05:00
> [INFO] 
> 
> [ERROR] Failed to execute goal 
> org.scalatest:scalatest-maven-plugin:2.0.0:test (integration-test) on project 
> spark-kubernetes-integration-tests_2.12: There are test failures -> [Help 1]
> [ERROR] 
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
> switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR] 
> [ERROR] For more information about the 

[jira] [Commented] (SPARK-34527) De-duplicated common columns cannot be resolved from USING/NATURAL JOIN

2021-02-24 Thread Karen Feng (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290211#comment-17290211
 ] 

Karen Feng commented on SPARK-34527:


I've implemented a fix for this and will push a PR.

> De-duplicated common columns cannot be resolved from USING/NATURAL JOIN
> ---
>
> Key: SPARK-34527
> URL: https://issues.apache.org/jira/browse/SPARK-34527
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Karen Feng
>Priority: Minor
>
> USING/NATURAL JOINS today have unexpectedly asymmetric behavior when 
> resolving the duplicated common columns. For example, the left key columns 
> can be resolved from a USING INNER JOIN, but the right key columns cannot. 
> This is due to the Analyzer's 
> [rewrite|https://github.com/apache/spark/blob/999d3b89b6df14a5ccb94ffc2ffadb82964e9f7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L3397]
>  of NATURAL/USING JOINs, which uses Project to remove the duplicated common 
> columns.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34527) De-duplicated common columns cannot be resolved from USING/NATURAL JOIN

2021-02-24 Thread Karen Feng (Jira)
Karen Feng created SPARK-34527:
--

 Summary: De-duplicated common columns cannot be resolved from 
USING/NATURAL JOIN
 Key: SPARK-34527
 URL: https://issues.apache.org/jira/browse/SPARK-34527
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Karen Feng


USING/NATURAL JOINS today have unexpectedly asymmetric behavior when resolving 
the duplicated common columns. For example, the left key columns can be 
resolved from a USING INNER JOIN, but the right key columns cannot. This is due 
to the Analyzer's 
[rewrite|https://github.com/apache/spark/blob/999d3b89b6df14a5ccb94ffc2ffadb82964e9f7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L3397]
 of NATURAL/USING JOINs, which uses Project to remove the duplicated common 
columns.
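
A minimal sketch of the asymmetry follows, with made-up table and column names; 
the second query is the one that fails analysis in the affected versions.

{code}
// Minimal sketch (assumptions: local SparkSession, made-up tables t1/t2).
// After a USING join the left key column still resolves, but the right key column
// does not, because the Analyzer's rewrite projects the duplicated column away.
import org.apache.spark.sql.SparkSession

object UsingJoinAsymmetry {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("using-join").getOrCreate()
    import spark.implicits._

    Seq((1, "left")).toDF("key", "l").createOrReplaceTempView("t1")
    Seq((1, "right")).toDF("key", "r").createOrReplaceTempView("t2")

    // Resolves: the left key column survives the de-duplicating Project.
    spark.sql("SELECT t1.key, l, r FROM t1 JOIN t2 USING (key)").show()

    // Throws an AnalysisException in affected versions: the right key column was
    // removed by the Project that the Analyzer inserts for USING/NATURAL joins.
    spark.sql("SELECT t2.key, l, r FROM t1 JOIN t2 USING (key)").show()

    spark.stop()
  }
}
{code}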



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34526) Add a flag to skip checking file sink format and handle glob path

2021-02-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34526:


Assignee: (was: Apache Spark)

> Add a flag to skip checking file sink format and handle glob path
> -
>
> Key: SPARK-34526
> URL: https://issues.apache.org/jira/browse/SPARK-34526
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Yuanjian Li
>Priority: Major
>
> This ticket fixes the following issues related to file sink format checking 
> together:
>  * Some users may use a very long glob path to read and `isDirectory` may 
> fail when the path is too long. We should ignore the error when the path is a 
> glob path since the file streaming sink doesn’t support glob paths.
>  * Checking whether a directory is outputted by File Streaming Sink may fail 
> for various issues happening in the storage. We should add a flag to allow 
> users to disable the checking logic and read the directory as a batch output.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34526) Add a flag to skip checking file sink format and handle glob path

2021-02-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290177#comment-17290177
 ] 

Apache Spark commented on SPARK-34526:
--

User 'xuanyuanking' has created a pull request for this issue:
https://github.com/apache/spark/pull/31638

> Add a flag to skip checking file sink format and handle glob path
> -
>
> Key: SPARK-34526
> URL: https://issues.apache.org/jira/browse/SPARK-34526
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Yuanjian Li
>Priority: Major
>
> This ticket fixes the following issues related to file sink format checking 
> together:
>  * Some users may use a very long glob path to read and `isDirectory` may 
> fail when the path is too long. We should ignore the error when the path is a 
> glob path since the file streaming sink doesn’t support glob paths.
>  * Checking whether a directory is outputted by File Streaming Sink may fail 
> for various issues happening in the storage. We should add a flag to allow 
> users to disable the checking logic and read the directory as a batch output.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34526) Add a flag to skip checking file sink format and handle glob path

2021-02-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34526:


Assignee: Apache Spark

> Add a flag to skip checking file sink format and handle glob path
> -
>
> Key: SPARK-34526
> URL: https://issues.apache.org/jira/browse/SPARK-34526
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Yuanjian Li
>Assignee: Apache Spark
>Priority: Major
>
> This ticket fixes the following issues related to file sink format checking 
> together:
>  * Some users may use a very long glob path to read and `isDirectory` may 
> fail when the path is too long. We should ignore the error when the path is a 
> glob path since the file streaming sink doesn’t support glob paths.
>  * Checking whether a directory is outputted by File Streaming Sink may fail 
> for various issues happening in the storage. We should add a flag to allow 
> users to disable the checking logic and read the directory as a batch output.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34526) Add a flag to skip checking file sink format and handle glob path

2021-02-24 Thread Yuanjian Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yuanjian Li updated SPARK-34526:

Description: 
This ticket fixes the following issues related to file sink format checking 
together:
 * Some users may use a very long glob path to read and `isDirectory` may fail 
when the path is too long. We should ignore the error when the path is a glob 
path since the file streaming sink doesn’t support glob paths.
 * Checking whether a directory is outputted by File Streaming Sink may fail 
for various issues happening in the storage. We should add a flag to allow 
users to disable the checking logic and read the directory as a batch output.

  was:
This ticket fixes the following issues related to file sink format checking 
together:
 * Some users may use a very long glob path to read and `isDirectory` may 
fail when the path is too long.  We should ignore the error when the path is a 
glob path since file streaming sink doesn’t support glob paths.
 * Checking whether a directory is outputted by File Streaming Sink may fail 
for various issues happening in the storage. We should add a flag to allow 
users to disable it.


> Add a flag to skip checking file sink format and handle glob path
> -
>
> Key: SPARK-34526
> URL: https://issues.apache.org/jira/browse/SPARK-34526
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 3.1.0
>Reporter: Yuanjian Li
>Priority: Major
>
> This ticket fixes the following issues related to file sink format checking 
> together:
>  * Some users may use a very long glob path to read and `isDirectory` may 
> fail when the path is too long. We should ignore the error when the path is a 
> glob path since the file streaming sink doesn’t support glob paths.
>  * Checking whether a directory is outputted by File Streaming Sink may fail 
> for various issues happening in the storage. We should add a flag to allow 
> users to disable the checking logic and read the directory as a batch output.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34526) Add a flag to skip checking file sink format and handle glob path

2021-02-24 Thread Yuanjian Li (Jira)
Yuanjian Li created SPARK-34526:
---

 Summary: Add a flag to skip checking file sink format and handle 
glob path
 Key: SPARK-34526
 URL: https://issues.apache.org/jira/browse/SPARK-34526
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 3.1.0
Reporter: Yuanjian Li


This ticket fixes the following issues related to file sink format checking 
together:
 * Some users may use a very long glob path to read and `isDirectory` may 
fail when the path is too long.  We should ignore the error when the path is a 
glob path since file streaming sink doesn’t support glob paths.
 * Checking whether a directory is outputted by File Streaming Sink may fail 
for various issues happening in the storage. We should add a flag to allow 
users to disable it.
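
A minimal sketch of the intended usage follows, assuming a local session; the glob 
path is made up, and the config key is hypothetical, since this ticket does not 
name the flag.

{code}
// Minimal sketch (assumptions: local SparkSession; the glob path is made up and
// the config key below is hypothetical -- this ticket does not name the flag).
import org.apache.spark.sql.SparkSession

object GlobPathBatchRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("glob-batch-read").getOrCreate()

    // A very long glob path: the current file-sink format check may call
    // isDirectory on it and fail, even though the file streaming sink never
    // supports glob paths in the first place.
    val globPath = "s3a://bucket/table/year=*/month=*/day=*/hour=*/part-*"

    // Hypothetical flag to skip the "was this directory written by the file
    // streaming sink?" check and read the directory as plain batch output.
    spark.conf.set("spark.sql.streaming.fileSink.formatCheck.enabled", "false")

    spark.read.parquet(globPath).show(truncate = false)
    spark.stop()
  }
}
{code}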



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34448) Binary logistic regression incorrectly computes the intercept and coefficients when data is not centered

2021-02-24 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290158#comment-17290158
 ] 

Sean R. Owen commented on SPARK-34448:
--

I crudely ported the test setup to a Scala test, and tried a 0 initial 
intercept in the LR implementation. It still gets the -3.5 intercept in the 
case where the 'const_feature' column is added, but -4 without. So, I'm not 
sure that's it.

Let me ping [~podongfeng] or maybe even [~sethah], who have worked on that code 
a bit and might have more of an idea about why the intercept wouldn't quite fit 
right in this case. I'm wondering if there is some issue in 
LogisticAggregator's treatment of the intercept? No idea; this is outside my 
expertise.

https://github.com/apache/spark/blob/3ce4ab545bfc28db7df2c559726b887b0c8c33b7/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregator.scala#L244

BTW here's my hacked up test: 

{code}
  test("BLR") {
val centered = false
val regParam = 1.0e-8
val num_distribution_samplings = 1000
val num_rows_per_sampling = 1000
val theta_1 = 0.3f
val theta_2 = 0.2f
val intercept = -4.0f

val (feature1, feature2, target) = generate_blr_data(theta_1, theta_2, 
intercept, centered,
  num_distribution_samplings, num_rows_per_sampling)

val num_rows = num_distribution_samplings * num_rows_per_sampling

val const_feature = Array.fill(num_rows)(1.0f)
(0 until num_rows / 10).foreach { i => const_feature(i) = 0.9f }


val data = (0 until num_rows).map { i =>
  (feature1(i), feature2(i), const_feature(i), target(i))
}

val spark_df = spark.createDataFrame(data).toDF("feature1", "feature2", 
"const_feature", "label").cache()

val vec = new VectorAssembler().setInputCols(Array("feature1", 
"feature2")).setOutputCol(("features"))
val spark_df1 = vec.transform(spark_df).cache()

val lr = new LogisticRegression().
  
setMaxIter(100).setRegParam(regParam).setElasticNetParam(0.5).setFitIntercept(true)
val lrModel = lr.fit(spark_df1)
println("Just the blr data")
println("Coefficients: " + lrModel.coefficients)
println("Intercept: " + lrModel.intercept)

val vec2 = new VectorAssembler().setInputCols(Array("feature1", "feature2", 
"const_feature")).
  setOutputCol(("features"))
val spark_df2 = vec2.transform(spark_df).cache()

val lrModel2 = lr.fit(spark_df2)
println("blr data plus one vector that is filled with 1's and .9's")
println("Coefficients: " + lrModel2.coefficients)
println("Intercept: " + lrModel2.intercept)

  }

  def generate_blr_data(theta_1: Float,
theta_2: Float,
intercept: Float,
centered: Boolean,
num_distribution_samplings: Int,
num_rows_per_sampling: Int): (Array[Float], 
Array[Float], Array[Int]) = {
val random = new Random(12345L)
val uniforms = Array.fill(num_distribution_samplings)(random.nextFloat())
val uniforms2 = Array.fill(num_distribution_samplings)(random.nextFloat())

if (centered) {
  uniforms.transform(f => f - 0.5f)
  uniforms2.transform(f => 2.0f * f - 1.0f)
} else {
  uniforms2.transform(f => f + 1.0f)
}

val h_theta = uniforms.zip(uniforms2).map { case (a, b) => intercept + 
theta_1 * a + theta_2 * b }
val prob = h_theta.map(t => 1.0 / (1.0 + math.exp(-t)))
val array = Array.ofDim[Int](num_distribution_samplings, 
num_rows_per_sampling)
array.indices.foreach { i =>
  (0 until math.round(num_rows_per_sampling * prob(i)).toInt).foreach { j =>
array(i)(j) = 1
  }
}

val num_rows = num_distribution_samplings * num_rows_per_sampling

val feature_1 = uniforms.map(f => 
Array.fill(num_rows_per_sampling)(f)).flatten
val feature_2 = uniforms2.map(f => 
Array.fill(num_rows_per_sampling)(f)).flatten
val target = array.flatten

return (feature_1, feature_2, target)
  }
{code}

> Binary logistic regression incorrectly computes the intercept and 
> coefficients when data is not centered
> 
>
> Key: SPARK-34448
> URL: https://issues.apache.org/jira/browse/SPARK-34448
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Yakov Kerzhner
>Priority: Major
>  Labels: correctness
>
> I have written up a fairly detailed gist that includes code to reproduce the 
> bug, as well as the output of the code and some commentary:
> [https://gist.github.com/ykerzhner/51358780a6a4cc33266515f17bf98a96]
> To summarize: under certain conditions, the minimization that fits a binary 
> logistic regression contains a bug that pulls the intercept value towards 

[jira] [Commented] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile

2021-02-24 Thread Chao Sun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290127#comment-17290127
 ] 

Chao Sun commented on SPARK-33212:
--

Thanks again [~ouyangxc.zte]. 
{{org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter}} was not included 
in the {{hadoop-client}} jars since it is a server-side class and ideally 
should not be exposed to client applications such as Spark. 

[~dongjoon] Let me see how we can fix this either in Spark or Hadoop.

> Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
> -
>
> Key: SPARK-33212
> URL: https://issues.apache.org/jira/browse/SPARK-33212
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Spark Submit, SQL, YARN
>Affects Versions: 3.0.1
>Reporter: Chao Sun
>Assignee: Chao Sun
>Priority: Major
>  Labels: releasenotes
> Fix For: 3.2.0
>
>
> Hadoop 3.x+ offers shaded client jars: hadoop-client-api and 
> hadoop-client-runtime, which shade 3rd party dependencies such as Guava, 
> protobuf, jetty etc. This Jira switches Spark to use these jars instead of 
> hadoop-common, hadoop-client etc. Benefits include:
>  * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer 
> versions of Hadoop have migrated to Guava 27.0+, and in order to resolve Guava 
> conflicts, Spark depends on Hadoop not leaking its dependencies.
>  * It makes the Spark/Hadoop dependency cleaner. Currently Spark uses both 
> client-side and server-side Hadoop APIs from modules such as hadoop-common, 
> hadoop-yarn-server-common etc. Moving to hadoop-client-api allows us to use 
> only the public/client API from the Hadoop side.
>  * It provides better isolation from Hadoop dependencies. In the future Spark 
> can evolve more easily without worrying about dependencies pulled in from the 
> Hadoop side (which used to be a lot).
> *There are some behavior changes introduced with this JIRA when people use 
> Spark compiled with Hadoop 3.x:*
> - Users now need to make sure the class path contains the `hadoop-client-api` 
> and `hadoop-client-runtime` jars when they deploy Spark with the 
> `hadoop-provided` option. In addition, it is highly recommended that they put 
> these two jars before other Hadoop jars in the class path. Otherwise, 
> conflicts such as those from Guava could happen if classes are loaded from the 
> other, non-shaded Hadoop jars.
> - Since the new shaded Hadoop clients no longer include 3rd-party dependencies, 
> users who used to depend on those dependencies now need to explicitly put the 
> corresponding jars in their class path.
> Ideally the above should go to release notes.
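
For the `hadoop-provided` deployment note above, a minimal Scala sketch of the 
classpath settings follows. The jar paths are made up, and in practice these 
entries would normally go into spark-defaults.conf or onto the spark-submit 
command line, since the driver class path must be set before the driver JVM 
starts.

{code}
// Minimal sketch (assumptions: made-up jar locations; Hadoop 3.2.2). The point is
// that hadoop-client-api and hadoop-client-runtime are *prepended* via the
// extraClassPath settings, so they take precedence over any non-shaded Hadoop jars.
// In practice these would usually live in spark-defaults.conf or be passed to
// spark-submit rather than being set programmatically.
import org.apache.spark.SparkConf

object ShadedHadoopClasspath {
  def buildConf(): SparkConf = {
    val shadedClientJars = Seq(
      "/opt/hadoop/share/hadoop/client/hadoop-client-api-3.2.2.jar",    // assumed path
      "/opt/hadoop/share/hadoop/client/hadoop-client-runtime-3.2.2.jar" // assumed path
    ).mkString(":")

    new SparkConf()
      .set("spark.driver.extraClassPath", shadedClientJars)
      .set("spark.executor.extraClassPath", shadedClientJars)
  }
}
{code}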



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34524) simplify v2 partition commands resolution

2021-02-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290120#comment-17290120
 ] 

Apache Spark commented on SPARK-34524:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/31637

> simplify v2 partition commands resolution
> -
>
> Key: SPARK-34524
> URL: https://issues.apache.org/jira/browse/SPARK-34524
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34524) simplify v2 partition commands resolution

2021-02-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34524:


Assignee: (was: Apache Spark)

> simplify v2 partition commands resolution
> -
>
> Key: SPARK-34524
> URL: https://issues.apache.org/jira/browse/SPARK-34524
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-34524) simplify v2 partition commands resolution

2021-02-24 Thread Apache Spark (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-34524:


Assignee: Apache Spark

> simplify v2 partition commands resolution
> -
>
> Key: SPARK-34524
> URL: https://issues.apache.org/jira/browse/SPARK-34524
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-34524) simplify v2 partition commands resolution

2021-02-24 Thread Apache Spark (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290119#comment-17290119
 ] 

Apache Spark commented on SPARK-34524:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/31637

> simplify v2 partition commands resolution
> -
>
> Key: SPARK-34524
> URL: https://issues.apache.org/jira/browse/SPARK-34524
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0
>Reporter: Wenchen Fan
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-34525) Update Spark Create Table DDL Docs

2021-02-24 Thread Xiao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-34525:

Labels: starter  (was: )

> Update Spark Create Table DDL Docs
> --
>
> Key: SPARK-34525
> URL: https://issues.apache.org/jira/browse/SPARK-34525
> Project: Spark
>  Issue Type: Improvement
>  Components: docs, Documentation
>Affects Versions: 3.0.3
>Reporter: Miklos Christine
>Priority: Major
>  Labels: starter
>
> Within the `CREATE TABLE` docs, the `OPTIONS` and `TBLPROPERTIES` clauses 
> specify `key=value` parameters with `=` as the delimiter between the key-value 
> pairs. 
> The `=` is optional, and the pairs can also be space-delimited. We should 
> document that both forms are supported when defining these parameters.
>  
> One location within the current docs page that should be updated: 
> [https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-create-table-datasource.html]
>  
> Code reference showing equal as an optional parameter:
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L401



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-34525) Update Spark Create Table DDL Docs

2021-02-24 Thread Miklos Christine (Jira)
Miklos Christine created SPARK-34525:


 Summary: Update Spark Create Table DDL Docs
 Key: SPARK-34525
 URL: https://issues.apache.org/jira/browse/SPARK-34525
 Project: Spark
  Issue Type: Improvement
  Components: docs, Documentation
Affects Versions: 3.0.3
Reporter: Miklos Christine


Within the `CREATE TABLE` docs, the `OPTIONS` and `TBLPROPERTIES` clauses specify 
`key=value` parameters with `=` as the delimiter between the key-value pairs. 
The `=` is optional, and the pairs can also be space-delimited. We should 
document that both forms are supported when defining these parameters.

 

One location within the current docs page that should be updated: 

[https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-create-table-datasource.html]

 

Code reference showing equal as an optional parameter:

https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L401
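
A minimal sketch of the two equivalent spellings follows, using made-up table 
names; per the grammar reference above, the `=` between a key and its value is 
optional.

{code}
// Minimal sketch (assumptions: local SparkSession, made-up table names). Both
// statements should parse identically: the '=' between key and value in OPTIONS
// and TBLPROPERTIES is optional, so the pairs may also be space-delimited.
import org.apache.spark.sql.SparkSession

object CreateTableOptionForms {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("ddl-option-forms").getOrCreate()

    // key=value form
    spark.sql("""
      CREATE TABLE t_eq (id INT) USING parquet
      OPTIONS (compression = 'snappy')
      TBLPROPERTIES ('created.by' = 'docs-example')
    """)

    // space-delimited form: the '=' is simply omitted
    spark.sql("""
      CREATE TABLE t_space (id INT) USING parquet
      OPTIONS (compression 'snappy')
      TBLPROPERTIES ('created.by' 'docs-example')
    """)

    spark.stop()
  }
}
{code}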






[jira] [Commented] (SPARK-34523) JDK-8194653: Deadlock involving FileSystems.getDefault and System.loadLibrary call

2021-02-24 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290087#comment-17290087
 ] 

Dongjoon Hyun commented on SPARK-34523:
---

I'd recommend making a documentation PR instead. We already have the 
following guide on our website; you could update it from 8u92 to 8u231.

- https://spark.apache.org/docs/latest/

> Java 8 prior to version 8u92 support is deprecated as of Spark 3.0.0. 

> JDK-8194653:  Deadlock involving FileSystems.getDefault and 
> System.loadLibrary call
> ---
>
> Key: SPARK-34523
> URL: https://issues.apache.org/jira/browse/SPARK-34523
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.7, 3.0.2, 3.1.1
>Reporter: Kent Yao
>Priority: Major
> Attachments: 4303.log
>
>







[jira] [Created] (SPARK-34524) simplify v2 partition commands resolution

2021-02-24 Thread Wenchen Fan (Jira)
Wenchen Fan created SPARK-34524:
---

 Summary: simplify v2 partition commands resolution
 Key: SPARK-34524
 URL: https://issues.apache.org/jira/browse/SPARK-34524
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0
Reporter: Wenchen Fan









[jira] [Commented] (SPARK-34523) JDK-8194653: Deadlock involving FileSystems.getDefault and System.loadLibrary call

2021-02-24 Thread Dongjoon Hyun (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290084#comment-17290084
 ] 

Dongjoon Hyun commented on SPARK-34523:
---

Hi, [~Qin Yao].

This looks like duplicated JDK information. Technically, for JDK issues, 
Spark's affected versions (2.4 ~ 3.x) look meaningless and misleading to me. 
Also, it is already fixed via 
[8u231|https://bugs.openjdk.java.net/issues/?jql=project+%3D+JDK+AND+fixVersion+%3D+8u231].
Other than upgrading the JDK, is there something for us to do?

 

cc [~srowen] and [~hyukjin.kwon]

 

> JDK-8194653:  Deadlock involving FileSystems.getDefault and 
> System.loadLibrary call
> ---
>
> Key: SPARK-34523
> URL: https://issues.apache.org/jira/browse/SPARK-34523
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.7, 3.0.2, 3.1.1
>Reporter: Kent Yao
>Priority: Major
> Attachments: 4303.log
>
>







[jira] [Commented] (SPARK-34448) Binary logistic regression incorrectly computes the intercept and coefficients when data is not centered

2021-02-24 Thread Sean R. Owen (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17290052#comment-17290052
 ] 

Sean R. Owen commented on SPARK-34448:
--

Yes, I believe you're correct that there's a problem here. [~dbtsai], can I 
add you in here? I think you worked on the LR solver many years ago.

I skimmed the sklearn source code, and it looks like the SAG solver starts with 
a 0 intercept:
https://github.com/scikit-learn/scikit-learn/blob/638b7689bbbfae4bcc4592c6f8a43ce86b571f0b/sklearn/linear_model/tests/test_sag.py#L73

Maybe this is the issue? I can try porting your test case to Scala to see 
whether it fixes it. But the existing test suites seem to pass with a 0 initial 
intercept, at least.
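
To make the zero-vs-log(odds) starting point concrete, here is a small hedged 
sketch (not from the gist or from Spark's tests; the synthetic data and the use 
of sklearn's LogisticRegression are assumptions of this example) comparing the 
label log-odds with the intercept an essentially unregularized solver converges 
to on non-centered data:

{code:python}
# Hedged sketch (not Spark or sklearn internals): generate a 1-feature logistic
# model whose feature is deliberately NOT zero-mean, then compare the label
# log-odds (the heuristic starting intercept discussed above) with the
# intercept a nearly unregularized solver converges to.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=1.0, size=20_000)        # mean 5, not centered
p = 1.0 / (1.0 + np.exp(-(0.5 * x - 3.0)))             # true coef 0.5, true intercept -3.0
y = (rng.random(x.shape[0]) < p).astype(int)

log_odds_start = np.log(y.mean() / (1.0 - y.mean()))   # heuristic start: label log-odds
model = LogisticRegression(C=1e6, max_iter=1000).fit(x.reshape(-1, 1), y)

print("label log-odds (heuristic start):", round(log_odds_start, 3))
print("converged intercept:", round(float(model.intercept_[0]), 3))
print("converged coefficient:", round(float(model.coef_[0, 0]), 3))
{code}

With these parameters the label log-odds sits roughly around -0.5, while the 
fitted intercept should land near the true value of -3.0, which is the kind of 
gap the issue description is pointing at.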

> Binary logistic regression incorrectly computes the intercept and 
> coefficients when data is not centered
> 
>
> Key: SPARK-34448
> URL: https://issues.apache.org/jira/browse/SPARK-34448
> Project: Spark
>  Issue Type: Bug
>  Components: ML, MLlib
>Affects Versions: 2.4.5, 3.0.0
>Reporter: Yakov Kerzhner
>Priority: Major
>  Labels: correctness
>
> I have written up a fairly detailed gist that includes code to reproduce the 
> bug, as well as the output of the code and some commentary:
> [https://gist.github.com/ykerzhner/51358780a6a4cc33266515f17bf98a96]
> To summarize: under certain conditions, the minimization that fits a binary 
> logistic regression contains a bug that pulls the intercept value towards the 
> log(odds) of the target data.  This is mathematically only correct when the 
> data comes from distributions with zero means.  In general, this gives 
> incorrect intercept values, and consequently incorrect coefficients as well.
> As I am not so familiar with the Spark code base, I have not been able to 
> find this bug within the Spark code itself.  A hint to this bug is here: 
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L894-L904]
> Based on the code, I don't believe that the features have zero means at this 
> point, and so this heuristic is incorrect.  But an incorrect starting point 
> does not explain this bug.  The minimizer should drift to the correct place.  
> I was not able to find the code of the actual objective function that is 
> being minimized.






[jira] [Issue Comment Deleted] (SPARK-34521) spark.createDataFrame does not support Pandas StringDtype extension type

2021-02-24 Thread Pavel Ganelin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Pavel Ganelin updated SPARK-34521:
--
Comment: was deleted

(was: Originally submitted to ARROW: 
[ARROW-11747|https://issues.apache.org/jira/browse/ARROW-11747])

> spark.createDataFrame does not support Pandas StringDtype extension type
> 
>
> Key: SPARK-34521
> URL: https://issues.apache.org/jira/browse/SPARK-34521
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.1
>Reporter: Pavel Ganelin
>Priority: Major
>
> The following test case demonstrates the problem:
> {code:java}
> import pandas as pd
> from pyspark.sql import SparkSession, types
> spark = SparkSession.builder.appName(__file__)\
> .config("spark.sql.execution.arrow.pyspark.enabled","true") \
> .getOrCreate()
> good = pd.DataFrame([["abc"]], columns=["col"])
> schema = types.StructType([types.StructField("col", types.StringType(), 
> True)])
> df = spark.createDataFrame(good, schema=schema)
> df.show()
> bad = good.copy()
> bad["col"]=bad["col"].astype("string")
> schema = types.StructType([types.StructField("col", types.StringType(), 
> True)])
> df = spark.createDataFrame(bad, schema=schema)
> df.show(){code}
> The error:
> {code:java}
> C:\Python\3.8.3\lib\site-packages\pyspark\sql\pandas\conversion.py:289: 
> UserWarning: createDataFrame attempted Arrow optimization because 
> 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed 
> by the reason below:
>   Cannot specify a mask or a size when passing an object that is converted 
> with the __arrow_array__ protocol.
> Attempting non-optimization as 
> 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
>   warnings.warn(msg)
> {code}






[jira] [Created] (SPARK-34523) JDK-8194653

2021-02-24 Thread Kent Yao (Jira)
Kent Yao created SPARK-34523:


 Summary: JDK-8194653
 Key: SPARK-34523
 URL: https://issues.apache.org/jira/browse/SPARK-34523
 Project: Spark
  Issue Type: Sub-task
  Components: Spark Core
Affects Versions: 3.0.2, 2.4.7, 3.1.1
Reporter: Kent Yao









[jira] [Resolved] (SPARK-34523) JDK-8194653: Deadlock involving FileSystems.getDefault and System.loadLibrary call

2021-02-24 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao resolved SPARK-34523.
--
Resolution: Information Provided

> JDK-8194653:  Deadlock involving FileSystems.getDefault and 
> System.loadLibrary call
> ---
>
> Key: SPARK-34523
> URL: https://issues.apache.org/jira/browse/SPARK-34523
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.7, 3.0.2, 3.1.1
>Reporter: Kent Yao
>Priority: Major
> Attachments: 4303.log
>
>







[jira] [Updated] (SPARK-34523) JDK-8194653: JDK-8194653 Deadlock involving FileSystems.getDefault and System.loadLibrary call

2021-02-24 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-34523:
-
Summary: JDK-8194653: JDK-8194653 Deadlock involving FileSystems.getDefault 
and System.loadLibrary call  (was: JDK-8194653)

> JDK-8194653: JDK-8194653 Deadlock involving FileSystems.getDefault and 
> System.loadLibrary call
> --
>
> Key: SPARK-34523
> URL: https://issues.apache.org/jira/browse/SPARK-34523
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.7, 3.0.2, 3.1.1
>Reporter: Kent Yao
>Priority: Major
> Attachments: 4303.log
>
>







[jira] [Updated] (SPARK-34523) JDK-8194653: Deadlock involving FileSystems.getDefault and System.loadLibrary call

2021-02-24 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-34523:
-
Summary: JDK-8194653:  Deadlock involving FileSystems.getDefault and 
System.loadLibrary call  (was: JDK-8194653: JDK-8194653 Deadlock involving 
FileSystems.getDefault and System.loadLibrary call)

> JDK-8194653:  Deadlock involving FileSystems.getDefault and 
> System.loadLibrary call
> ---
>
> Key: SPARK-34523
> URL: https://issues.apache.org/jira/browse/SPARK-34523
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.7, 3.0.2, 3.1.1
>Reporter: Kent Yao
>Priority: Major
> Attachments: 4303.log
>
>







[jira] [Updated] (SPARK-34523) JDK-8194653

2021-02-24 Thread Kent Yao (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kent Yao updated SPARK-34523:
-
Attachment: 4303.log

> JDK-8194653
> ---
>
> Key: SPARK-34523
> URL: https://issues.apache.org/jira/browse/SPARK-34523
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.4.7, 3.0.2, 3.1.1
>Reporter: Kent Yao
>Priority: Major
> Attachments: 4303.log
>
>







[jira] [Created] (SPARK-34522) Issue Tracker for JDK related Bugs

2021-02-24 Thread Kent Yao (Jira)
Kent Yao created SPARK-34522:


 Summary: Issue Tracker for JDK related Bugs
 Key: SPARK-34522
 URL: https://issues.apache.org/jira/browse/SPARK-34522
 Project: Spark
  Issue Type: Documentation
  Components: Documentation
Affects Versions: 3.0.2, 2.4.7, 3.1.2
Reporter: Kent Yao


This JIRA is used to track JDK-related issues that often cause Spark to throw 
strange runtime exceptions or to hang indefinitely.

For Spark users, these issues tend to be common but difficult to diagnose. 
When users run into such problems, this JIRA may help them find quick answers 
when googling; often the answer is simply to upgrade their JDK version.

These issues are also difficult for the community to handle in Spark's code, 
and even maintaining troubleshooting documentation in the codebase can be a 
challenge. Since Spark is a distributed JVM application, JDK problems can take 
many forms.

So JIRA might be a good place to document these problems.






[jira] [Commented] (SPARK-34521) spark.createDataFrame does not support Pandas StringDtype extension type

2021-02-24 Thread Pavel Ganelin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17289983#comment-17289983
 ] 

Pavel Ganelin commented on SPARK-34521:
---

Originally submitted to ARROW: 
[ARROW-11747|https://issues.apache.org/jira/browse/ARROW-11747]

> spark.createDataFrame does not support Pandas StringDtype extension type
> 
>
> Key: SPARK-34521
> URL: https://issues.apache.org/jira/browse/SPARK-34521
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 3.0.1
>Reporter: Pavel Ganelin
>Priority: Major
>
> The following test case demonstrates the problem:
> {code:java}
> import pandas as pd
> from pyspark.sql import SparkSession, types
> spark = SparkSession.builder.appName(__file__)\
> .config("spark.sql.execution.arrow.pyspark.enabled","true") \
> .getOrCreate()
> good = pd.DataFrame([["abc"]], columns=["col"])
> schema = types.StructType([types.StructField("col", types.StringType(), 
> True)])
> df = spark.createDataFrame(good, schema=schema)
> df.show()
> bad = good.copy()
> bad["col"]=bad["col"].astype("string")
> schema = types.StructType([types.StructField("col", types.StringType(), 
> True)])
> df = spark.createDataFrame(bad, schema=schema)
> df.show(){code}
> The error:
> {code:java}
> C:\Python\3.8.3\lib\site-packages\pyspark\sql\pandas\conversion.py:289: 
> UserWarning: createDataFrame attempted Arrow optimization because 
> 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed 
> by the reason below:
>   Cannot specify a mask or a size when passing an object that is converted 
> with the __arrow_array__ protocol.
> Attempting non-optimization as 
> 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
>   warnings.warn(msg)
> {code}






[jira] [Created] (SPARK-34521) spark.createDataFrame does not support Pandas StringDtype extension type

2021-02-24 Thread Pavel Ganelin (Jira)
Pavel Ganelin created SPARK-34521:
-

 Summary: spark.createDataFrame does not support Pandas StringDtype 
extension type
 Key: SPARK-34521
 URL: https://issues.apache.org/jira/browse/SPARK-34521
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 3.0.1
Reporter: Pavel Ganelin


The following test case demonstrates the problem:
{code:java}
import pandas as pd
from pyspark.sql import SparkSession, types

spark = SparkSession.builder.appName(__file__)\
.config("spark.sql.execution.arrow.pyspark.enabled","true") \
.getOrCreate()

good = pd.DataFrame([["abc"]], columns=["col"])

schema = types.StructType([types.StructField("col", types.StringType(), True)])
df = spark.createDataFrame(good, schema=schema)

df.show()

bad = good.copy()
bad["col"]=bad["col"].astype("string")

schema = types.StructType([types.StructField("col", types.StringType(), True)])
df = spark.createDataFrame(bad, schema=schema)

df.show(){code}
The error:
{code:java}
C:\Python\3.8.3\lib\site-packages\pyspark\sql\pandas\conversion.py:289: 
UserWarning: createDataFrame attempted Arrow optimization because 
'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by 
the reason below:
  Cannot specify a mask or a size when passing an object that is converted with 
the __arrow_array__ protocol.
Attempting non-optimization as 
'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warnings.warn(msg)
{code}
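
For completeness, a possible user-side workaround (a hedged sketch, not an 
official fix) is to cast the extension-typed column back to plain object dtype 
before calling createDataFrame, reusing `bad`, `schema`, and `spark` from the 
reproduction above:

{code:python}
# Hedged workaround sketch: convert the Pandas "string" extension dtype back to
# object dtype (plain Python str values) before handing the frame to Spark, so
# the Arrow conversion path sees a dtype it already handles.
bad_compat = bad.copy()
bad_compat["col"] = bad_compat["col"].astype(object)

df = spark.createDataFrame(bad_compat, schema=schema)
df.show()
{code}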






[jira] [Resolved] (SPARK-34515) Fix NPE if InSet contains null value during getPartitionsByFilter

2021-02-24 Thread Wenchen Fan (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-34515.
-
Fix Version/s: 3.1.2
   3.2.0
   Resolution: Fixed

Issue resolved by pull request 31632
[https://github.com/apache/spark/pull/31632]

> Fix NPE if InSet contains null value during getPartitionsByFilter
> -
>
> Key: SPARK-34515
> URL: https://issues.apache.org/jira/browse/SPARK-34515
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.2.0, 3.1.2
>Reporter: ulysses you
>Assignee: ulysses you
>Priority: Minor
> Fix For: 3.2.0, 3.1.2
>
>
> Spark will convert an InSet to `>= and <=` if its number of values is over 
> `spark.sql.hive.metastorePartitionPruningInSetThreshold` during partition 
> pruning. In this case, if the values contain a null, we will get an exception 
> such as: 
>  
> {code:java}
> java.lang.NullPointerException
>  at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:1389)
>  at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:50)
>  at 
> scala.math.LowPriorityOrderingImplicits$$anon$3.compare(Ordering.scala:153)
>  at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355)
>  at java.util.TimSort.sort(TimSort.java:220)
>  at java.util.Arrays.sort(Arrays.java:1438)
>  at scala.collection.SeqLike.sorted(SeqLike.scala:659)
>  at scala.collection.SeqLike.sorted$(SeqLike.scala:647)
>  at scala.collection.AbstractSeq.sorted(Seq.scala:45)
>  at org.apache.spark.sql.hive.client.Shim_v0_13.convert$1(HiveShim.scala:772)
>  at 
> org.apache.spark.sql.hive.client.Shim_v0_13.$anonfun$convertFilters$4(HiveShim.scala:826)
>  at scala.collection.immutable.Stream.flatMap(Stream.scala:489)
>  at 
> org.apache.spark.sql.hive.client.Shim_v0_13.convertFilters(HiveShim.scala:826)
>  at 
> org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:848)
>  at 
> org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getPartitionsByFilter$1(HiveClientImpl.scala:750)
> {code}
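
A hedged reproduction sketch of how this path can be hit from PySpark; the table 
name `events`, the partition column `dt`, and the assumption that the threshold 
can be lowered at the session level are illustrative only:

{code:python}
# Hedged sketch. Assumes a Hive-backed table `events` partitioned by a string
# column `dt`. The IN list needs to be long enough to be optimized into an
# InSet and to exceed metastorePartitionPruningInSetThreshold, so that
# partition pruning rewrites it into `dt >= min AND dt <= max` and sorts the
# values, which is where a NULL value hits the NullPointerException above.
spark.conf.set("spark.sql.hive.metastorePartitionPruningInSetThreshold", "10")

values = [f"'2021-02-{d:02d}'" for d in range(1, 15)] + ["NULL"]
query = f"SELECT * FROM events WHERE dt IN ({', '.join(values)})"
spark.sql(query).show()
{code}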





