[jira] [Created] (SPARK-34535) Clean up unused symbols in ORC-related code
Yang Jie created SPARK-34535: Summary: Clean up unused symbols in ORC-related code Key: SPARK-34535 URL: https://issues.apache.org/jira/browse/SPARK-34535 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Yang Jie Clean up unused symbols in ORC-related code, including `OrcDeserializer`, `OrcFilters`, and `OrcPartitionReaderFactory`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34534) New protocol FetchShuffleBlocks in OneForOneBlockFetcher leads to data loss or correctness issues
[ https://issues.apache.org/jira/browse/SPARK-34534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290744#comment-17290744 ] Apache Spark commented on SPARK-34534: -- User 'seayoun' has created a pull request for this issue: https://github.com/apache/spark/pull/31643 > New protocol FetchShuffleBlocks in OneForOneBlockFetcher leads to data loss or > correctness issues > - > > Key: SPARK-34534 > URL: https://issues.apache.org/jira/browse/SPARK-34534 > Project: Spark > Issue Type: Bug > Components: Shuffle > Affects Versions: 3.0.0, 3.0.1, 3.0.2 > Reporter: haiyangyu > Priority: Major > Labels: Correctness, data-loss > Attachments: image-2021-02-25-11-17-12-714.png, > image-2021-02-25-11-27-34-429.png, image-2021-02-25-11-28-31-255.png, > image-2021-02-25-11-30-03-834.png, image-2021-02-25-11-31-59-110.png > > > We build a new RPC message `FetchShuffleBlocks` when `OneForOneBlockFetcher` is > initialized, in place of `OpenBlocks`, to use the adaptive feature; this > introduces an additional problem, as follows. > `OneForOneBlockFetcher` initializes a `blockIds` String array to track chunk > fetch successes; it uses the index into `blockIds` to fetch blocks and to match > the blockId in `blockIds` when chunk data returns. So the order of `blockIds` > must be consistent with the fetchChunk index, but the chunk order returned by the > new `FetchShuffleBlocks` is not the same as `blockIds`. > This leads to the returned data not matching the blockId, which can cause > correctness issues when retrying the fetch after a block chunk fetch has failed. > The chunk fetch code, and the blockId matching when data returns, are as follows: > !image-2021-02-25-11-27-34-429.png|width=446,height=251!!image-2021-02-25-11-28-31-255.png|width=445,height=159! > However, the fetch order in the shuffle service: > !image-2021-02-25-11-30-03-834.png|width=510,height=361! > So it will fetch the wrong block data when a chunk fetch fails, because of the > blocks' wrong order. > !image-2021-02-25-11-31-59-110.png|width=601,height=204! > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
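For readers outside the shuffle code, a minimal sketch of the positional invariant the report describes. The class and method names below are illustrative only, not the real org.apache.spark.network.shuffle API; the sketch assumes, as the report states, that the fetcher resolves a returned chunk index back to a blockId purely by position.

{code:scala}
// Hypothetical, simplified sketch -- not the actual OneForOneBlockFetcher.
class ChunkCallbackSketch(blockIds: Array[String]) {

  // Invoked when the shuffle service returns the chunk at `chunkIndex`.
  def onChunkFetchSuccess(chunkIndex: Int, data: Array[Byte]): Unit = {
    // The only link from chunk to block is positional: blockIds(i) is assumed
    // to be the block the server streams as chunk i. If FetchShuffleBlocks
    // makes the server order chunks differently from `blockIds`, this lookup
    // names the wrong block, and a retry after a failed chunk then
    // re-fetches (and returns) data under the wrong blockId.
    val blockId = blockIds(chunkIndex)
    deliver(blockId, data)
  }

  private def deliver(blockId: String, data: Array[Byte]): Unit = {
    // hand the buffer to the fetch listener under `blockId`
  }
}
{code}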
[jira] [Assigned] (SPARK-34534) New protocol FetchShuffleBlocks in OneForOneBlockFetcher leads to data loss or correctness issues
[ https://issues.apache.org/jira/browse/SPARK-34534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34534: Assignee: (was: Apache Spark) > New protocol FetchShuffleBlocks in OneForOneBlockFetcher leads to data loss or > correctness issues > - > > Key: SPARK-34534 > URL: https://issues.apache.org/jira/browse/SPARK-34534 > Project: Spark > Issue Type: Bug > Components: Shuffle > Affects Versions: 3.0.0, 3.0.1, 3.0.2 > Reporter: haiyangyu > Priority: Major > Labels: Correctness, data-loss > Attachments: image-2021-02-25-11-17-12-714.png, > image-2021-02-25-11-27-34-429.png, image-2021-02-25-11-28-31-255.png, > image-2021-02-25-11-30-03-834.png, image-2021-02-25-11-31-59-110.png > > > We build a new RPC message `FetchShuffleBlocks` when `OneForOneBlockFetcher` is > initialized, in place of `OpenBlocks`, to use the adaptive feature; this > introduces an additional problem, as follows. > `OneForOneBlockFetcher` initializes a `blockIds` String array to track chunk > fetch successes; it uses the index into `blockIds` to fetch blocks and to match > the blockId in `blockIds` when chunk data returns. So the order of `blockIds` > must be consistent with the fetchChunk index, but the chunk order returned by the > new `FetchShuffleBlocks` is not the same as `blockIds`. > This leads to the returned data not matching the blockId, which can cause > correctness issues when retrying the fetch after a block chunk fetch has failed. > The chunk fetch code, and the blockId matching when data returns, are as follows: > !image-2021-02-25-11-27-34-429.png|width=446,height=251!!image-2021-02-25-11-28-31-255.png|width=445,height=159! > However, the fetch order in the shuffle service: > !image-2021-02-25-11-30-03-834.png|width=510,height=361! > So it will fetch the wrong block data when a chunk fetch fails, because of the > blocks' wrong order. > !image-2021-02-25-11-31-59-110.png|width=601,height=204! > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34534) New protocol FetchShuffleBlocks in OneForOneBlockFetcher leads to data loss or correctness issues
[ https://issues.apache.org/jira/browse/SPARK-34534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34534: Assignee: Apache Spark > New protocol FetchShuffleBlocks in OneForOneBlockFetcher leads to data loss or > correctness issues > - > > Key: SPARK-34534 > URL: https://issues.apache.org/jira/browse/SPARK-34534 > Project: Spark > Issue Type: Bug > Components: Shuffle > Affects Versions: 3.0.0, 3.0.1, 3.0.2 > Reporter: haiyangyu > Assignee: Apache Spark > Priority: Major > Labels: Correctness, data-loss > Attachments: image-2021-02-25-11-17-12-714.png, > image-2021-02-25-11-27-34-429.png, image-2021-02-25-11-28-31-255.png, > image-2021-02-25-11-30-03-834.png, image-2021-02-25-11-31-59-110.png > > > We build a new RPC message `FetchShuffleBlocks` when `OneForOneBlockFetcher` is > initialized, in place of `OpenBlocks`, to use the adaptive feature; this > introduces an additional problem, as follows. > `OneForOneBlockFetcher` initializes a `blockIds` String array to track chunk > fetch successes; it uses the index into `blockIds` to fetch blocks and to match > the blockId in `blockIds` when chunk data returns. So the order of `blockIds` > must be consistent with the fetchChunk index, but the chunk order returned by the > new `FetchShuffleBlocks` is not the same as `blockIds`. > This leads to the returned data not matching the blockId, which can cause > correctness issues when retrying the fetch after a block chunk fetch has failed. > The chunk fetch code, and the blockId matching when data returns, are as follows: > !image-2021-02-25-11-27-34-429.png|width=446,height=251!!image-2021-02-25-11-28-31-255.png|width=445,height=159! > However, the fetch order in the shuffle service: > !image-2021-02-25-11-30-03-834.png|width=510,height=361! > So it will fetch the wrong block data when a chunk fetch fails, because of the > blocks' wrong order. > !image-2021-02-25-11-31-59-110.png|width=601,height=204! > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34534) New protocol FetchShuffleBlocks in OneForOneBlockFetcher leads to data loss or correctness issues
[ https://issues.apache.org/jira/browse/SPARK-34534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290743#comment-17290743 ] Apache Spark commented on SPARK-34534: -- User 'seayoun' has created a pull request for this issue: https://github.com/apache/spark/pull/31643 > New protocol FetchShuffleBlocks in OneForOneBlockFetcher leads to data loss or > correctness issues > - > > Key: SPARK-34534 > URL: https://issues.apache.org/jira/browse/SPARK-34534 > Project: Spark > Issue Type: Bug > Components: Shuffle > Affects Versions: 3.0.0, 3.0.1, 3.0.2 > Reporter: haiyangyu > Priority: Major > Labels: Correctness, data-loss > Attachments: image-2021-02-25-11-17-12-714.png, > image-2021-02-25-11-27-34-429.png, image-2021-02-25-11-28-31-255.png, > image-2021-02-25-11-30-03-834.png, image-2021-02-25-11-31-59-110.png > > > We build a new RPC message `FetchShuffleBlocks` when `OneForOneBlockFetcher` is > initialized, in place of `OpenBlocks`, to use the adaptive feature; this > introduces an additional problem, as follows. > `OneForOneBlockFetcher` initializes a `blockIds` String array to track chunk > fetch successes; it uses the index into `blockIds` to fetch blocks and to match > the blockId in `blockIds` when chunk data returns. So the order of `blockIds` > must be consistent with the fetchChunk index, but the chunk order returned by the > new `FetchShuffleBlocks` is not the same as `blockIds`. > This leads to the returned data not matching the blockId, which can cause > correctness issues when retrying the fetch after a block chunk fetch has failed. > The chunk fetch code, and the blockId matching when data returns, are as follows: > !image-2021-02-25-11-27-34-429.png|width=446,height=251!!image-2021-02-25-11-28-31-255.png|width=445,height=159! > However, the fetch order in the shuffle service: > !image-2021-02-25-11-30-03-834.png|width=510,height=361! > So it will fetch the wrong block data when a chunk fetch fails, because of the > blocks' wrong order. > !image-2021-02-25-11-31-59-110.png|width=601,height=204! > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
[ https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290726#comment-17290726 ] Apache Spark commented on SPARK-33212: -- User 'sunchao' has created a pull request for this issue: https://github.com/apache/spark/pull/31642 > Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile > - > > Key: SPARK-33212 > URL: https://issues.apache.org/jira/browse/SPARK-33212 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Spark Submit, SQL, YARN > Affects Versions: 3.0.1 > Reporter: Chao Sun > Assignee: Chao Sun > Priority: Major > Labels: releasenotes > Fix For: 3.2.0 > > > Hadoop 3.x+ offers shaded client jars: hadoop-client-api and > hadoop-client-runtime, which shade 3rd-party dependencies such as Guava, > protobuf, Jetty, etc. This Jira switches Spark to use these jars instead of > hadoop-common, hadoop-client, etc. Benefits include: > * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer > versions of Hadoop have migrated to Guava 27.0+, and in order to resolve Guava > conflicts, Spark depends on Hadoop not leaking these dependencies. > * It makes the Spark/Hadoop dependency relationship cleaner. Currently Spark uses both > client-side and server-side Hadoop APIs from modules such as hadoop-common, > hadoop-yarn-server-common, etc. Moving to hadoop-client-api allows us to only > use the public/client API from the Hadoop side. > * It provides better isolation from Hadoop dependencies. In the future, Spark can > evolve without worrying about dependencies pulled in from the Hadoop side > (which used to be a lot). > *There are some behavior changes introduced with this JIRA, when people use > Spark compiled with Hadoop 3.x:* > - Users now need to make sure the class path contains the `hadoop-client-api` and > `hadoop-client-runtime` jars when they deploy Spark with the > `hadoop-provided` option. In addition, it is highly recommended that they put > these two jars before other Hadoop jars in the class path. Otherwise, > conflicts such as those from Guava could happen if classes are loaded from the > other, non-shaded Hadoop jars. > - Since the new shaded Hadoop clients no longer include 3rd-party > dependencies, users who used to depend on these now need to explicitly put > the jars in their class path. > Ideally the above should go into the release notes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
[ https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290707#comment-17290707 ] Chao Sun commented on SPARK-33212: -- Yes. I think the only class Spark needs from this jar is {{org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter}}, which, together with the two other classes it depends on from the same package, does not have a Guava dependency except {{VisibleForTesting}}. > Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile > - > > Key: SPARK-33212 > URL: https://issues.apache.org/jira/browse/SPARK-33212 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Spark Submit, SQL, YARN > Affects Versions: 3.0.1 > Reporter: Chao Sun > Assignee: Chao Sun > Priority: Major > Labels: releasenotes > Fix For: 3.2.0 > > > Hadoop 3.x+ offers shaded client jars: hadoop-client-api and > hadoop-client-runtime, which shade 3rd-party dependencies such as Guava, > protobuf, Jetty, etc. This Jira switches Spark to use these jars instead of > hadoop-common, hadoop-client, etc. Benefits include: > * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer > versions of Hadoop have migrated to Guava 27.0+, and in order to resolve Guava > conflicts, Spark depends on Hadoop not leaking these dependencies. > * It makes the Spark/Hadoop dependency relationship cleaner. Currently Spark uses both > client-side and server-side Hadoop APIs from modules such as hadoop-common, > hadoop-yarn-server-common, etc. Moving to hadoop-client-api allows us to only > use the public/client API from the Hadoop side. > * It provides better isolation from Hadoop dependencies. In the future, Spark can > evolve without worrying about dependencies pulled in from the Hadoop side > (which used to be a lot). > *There are some behavior changes introduced with this JIRA, when people use > Spark compiled with Hadoop 3.x:* > - Users now need to make sure the class path contains the `hadoop-client-api` and > `hadoop-client-runtime` jars when they deploy Spark with the > `hadoop-provided` option. In addition, it is highly recommended that they put > these two jars before other Hadoop jars in the class path. Otherwise, > conflicts such as those from Guava could happen if classes are loaded from the > other, non-shaded Hadoop jars. > - Since the new shaded Hadoop clients no longer include 3rd-party > dependencies, users who used to depend on these now need to explicitly put > the jars in their class path. > Ideally the above should go into the release notes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34533) Eliminate LEFT ANTI join to empty relation in AQE
[ https://issues.apache.org/jira/browse/SPARK-34533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290676#comment-17290676 ] Apache Spark commented on SPARK-34533: -- User 'c21' has created a pull request for this issue: https://github.com/apache/spark/pull/31641 > Eliminate LEFT ANTI join to empty relation in AQE > - > > Key: SPARK-34533 > URL: https://issues.apache.org/jira/browse/SPARK-34533 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.2.0 > Reporter: Cheng Su > Priority: Minor > > I discovered from the review discussion - > [https://github.com/apache/spark/pull/31630#discussion_r581774000] - that we > can eliminate a LEFT ANTI join (with no join condition) to an empty relation if > the right side is known to be non-empty. So with AQE, this is doable, similar > to [https://github.com/apache/spark/pull/29484]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
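To see why this rewrite is safe, here is a small illustration of the join semantics involved, using the plain DataFrame API with an assumed active SparkSession named `spark` (the actual change lives in AQE's logical-plan rewrites, not in user code): LEFT ANTI keeps left rows with no match, and with no real join condition every left row matches any right row, so a non-empty right side forces an empty result.

{code:scala}
import org.apache.spark.sql.functions.lit

// Semantics sketch only; assumes a SparkSession named `spark`.
val left = spark.range(5).toDF("id")
val right = spark.range(1).toDF("value") // known non-empty

// LEFT ANTI with an always-true condition, i.e. effectively no condition:
// every left row "matches" some right row, so nothing survives the anti join.
val result = left.join(right, lit(true), "left_anti")
assert(result.isEmpty) // provably empty whenever `right` is non-empty
{code}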
[jira] [Commented] (SPARK-34533) Eliminate LEFT ANTI join to empty relation in AQE
[ https://issues.apache.org/jira/browse/SPARK-34533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290675#comment-17290675 ] Apache Spark commented on SPARK-34533: -- User 'c21' has created a pull request for this issue: https://github.com/apache/spark/pull/31641 > Eliminate LEFT ANTI join to empty relation in AQE > - > > Key: SPARK-34533 > URL: https://issues.apache.org/jira/browse/SPARK-34533 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.2.0 > Reporter: Cheng Su > Priority: Minor > > I discovered from the review discussion - > [https://github.com/apache/spark/pull/31630#discussion_r581774000] - that we > can eliminate a LEFT ANTI join (with no join condition) to an empty relation if > the right side is known to be non-empty. So with AQE, this is doable, similar > to [https://github.com/apache/spark/pull/29484]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34533) Eliminate LEFT ANTI join to empty relation in AQE
[ https://issues.apache.org/jira/browse/SPARK-34533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34533: Assignee: Apache Spark > Eliminate LEFT ANTI join to empty relation in AQE > - > > Key: SPARK-34533 > URL: https://issues.apache.org/jira/browse/SPARK-34533 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.2.0 > Reporter: Cheng Su > Assignee: Apache Spark > Priority: Minor > > I discovered from the review discussion - > [https://github.com/apache/spark/pull/31630#discussion_r581774000] - that we > can eliminate a LEFT ANTI join (with no join condition) to an empty relation if > the right side is known to be non-empty. So with AQE, this is doable, similar > to [https://github.com/apache/spark/pull/29484]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34533) Eliminate LEFT ANTI join to empty relation in AQE
[ https://issues.apache.org/jira/browse/SPARK-34533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34533: Assignee: (was: Apache Spark) > Eliminate LEFT ANTI join to empty relation in AQE > - > > Key: SPARK-34533 > URL: https://issues.apache.org/jira/browse/SPARK-34533 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 3.2.0 > Reporter: Cheng Su > Priority: Minor > > I discovered from the review discussion - > [https://github.com/apache/spark/pull/31630#discussion_r581774000] - that we > can eliminate a LEFT ANTI join (with no join condition) to an empty relation if > the right side is known to be non-empty. So with AQE, this is doable, similar > to [https://github.com/apache/spark/pull/29484]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34520) Remove unused SecurityManager references
[ https://issues.apache.org/jira/browse/SPARK-34520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-34520: - Assignee: Hyukjin Kwon > Remove unused SecurityManager references > > > Key: SPARK-34520 > URL: https://issues.apache.org/jira/browse/SPARK-34520 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 3.2.0 > Reporter: Hyukjin Kwon > Assignee: Hyukjin Kwon > Priority: Major > > Many SecurityManager references are not used anymore. Most of them were > introduced in SPARK-1189, but the underlying usage was removed in SPARK-27004 and > SPARK-33925, so they are now stale. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34520) Remove unused SecurityManager references
[ https://issues.apache.org/jira/browse/SPARK-34520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-34520. --- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 31636 [https://github.com/apache/spark/pull/31636] > Remove unused SecurityManager references > > > Key: SPARK-34520 > URL: https://issues.apache.org/jira/browse/SPARK-34520 > Project: Spark > Issue Type: Improvement > Components: Spark Core > Affects Versions: 3.2.0 > Reporter: Hyukjin Kwon > Assignee: Hyukjin Kwon > Priority: Major > Fix For: 3.2.0 > > > Many SecurityManager references are not used anymore. Most of them were > introduced in SPARK-1189, but the underlying usage was removed in SPARK-27004 and > SPARK-33925, so they are now stale. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
[ https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290656#comment-17290656 ] Xiaochen Ouyang commented on SPARK-33212: - Maybe we should confirm there are no direct Guava references in the hadoop-yarn-server-web-proxy module. Otherwise it will bring some Guava conflicts. > Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile > - > > Key: SPARK-33212 > URL: https://issues.apache.org/jira/browse/SPARK-33212 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Spark Submit, SQL, YARN > Affects Versions: 3.0.1 > Reporter: Chao Sun > Assignee: Chao Sun > Priority: Major > Labels: releasenotes > Fix For: 3.2.0 > > > Hadoop 3.x+ offers shaded client jars: hadoop-client-api and > hadoop-client-runtime, which shade 3rd-party dependencies such as Guava, > protobuf, Jetty, etc. This Jira switches Spark to use these jars instead of > hadoop-common, hadoop-client, etc. Benefits include: > * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer > versions of Hadoop have migrated to Guava 27.0+, and in order to resolve Guava > conflicts, Spark depends on Hadoop not leaking these dependencies. > * It makes the Spark/Hadoop dependency relationship cleaner. Currently Spark uses both > client-side and server-side Hadoop APIs from modules such as hadoop-common, > hadoop-yarn-server-common, etc. Moving to hadoop-client-api allows us to only > use the public/client API from the Hadoop side. > * It provides better isolation from Hadoop dependencies. In the future, Spark can > evolve without worrying about dependencies pulled in from the Hadoop side > (which used to be a lot). > *There are some behavior changes introduced with this JIRA, when people use > Spark compiled with Hadoop 3.x:* > - Users now need to make sure the class path contains the `hadoop-client-api` and > `hadoop-client-runtime` jars when they deploy Spark with the > `hadoop-provided` option. In addition, it is highly recommended that they put > these two jars before other Hadoop jars in the class path. Otherwise, > conflicts such as those from Guava could happen if classes are loaded from the > other, non-shaded Hadoop jars. > - Since the new shaded Hadoop clients no longer include 3rd-party > dependencies, users who used to depend on these now need to explicitly put > the jars in their class path. > Ideally the above should go into the release notes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34529) spark.read.csv throws the exception "'lineSep' can contain only 1 character" when parsing Windows line endings (CR LF)
[ https://issues.apache.org/jira/browse/SPARK-34529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290652#comment-17290652 ] Yang Jie commented on SPARK-34529: -- There seems to have been some discussion [before|https://github.com/apache/spark/pull/23080/files#r272690095]. > spark.read.csv throws the exception "'lineSep' can contain only 1 character" > when parsing Windows line endings (CR LF) > > > Key: SPARK-34529 > URL: https://issues.apache.org/jira/browse/SPARK-34529 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL > Affects Versions: 3.2.0, 3.1.1, 3.0.3 > Reporter: Shanmugavel Kuttiyandi Chandrakasu > Priority: Minor > > The lineSep documentation says: > `lineSep` (default covers all `\r`, `\r\n` and `\n`): defines the line > separator that should be used for parsing. Maximum length is 1 character. > Reference: > > [https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader] > When reading a CSV file using Spark: > src_df = (spark.read > .option("header", "true") > .option("multiLine","true") > .option("escape", "ǁ") > .option("lineSep","\r\n") > .schema(materialusetype_Schema) > .option("badRecordsPath","/fh_badfile") > .csv("/crlf.csv") > ) > Below is the stack trace: > java.lang.IllegalArgumentException: requirement failed: 'lineSep' can contain > only 1 character. at > scala.Predef$.require(Predef.scala:281) at > org.apache.spark.sql.catalyst.csv.CSVOptions.$anonfun$lineSeparator$1(CSVOptions.scala:209) > at scala.Option.map(Option.scala:230) at > org.apache.spark.sql.catalyst.csv.CSVOptions.<init>(CSVOptions.scala:207) at > org.apache.spark.sql.catalyst.csv.CSVOptions.<init>(CSVOptions.scala:58) at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.buildReader(CSVFileFormat.scala:108) > at > org.apache.spark.sql.execution.datasources.FileFormat.buildReaderWithPartitionValues(FileFormat.scala:132) > at > org.apache.spark.sql.execution.datasources.FileFormat.buildReaderWithPartitionValues$(FileFormat.scala:123) > at > org.apache.spark.sql.execution.datasources.TextBasedFileFormat.buildReaderWithPartitionValues(FileFormat.scala:162) > at > org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:510) > at > org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:497) > at > org.apache.spark.sql.execution.FileSourceScanExec.doExecute(DataSourceScanExec.scala:692) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:196) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:236) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:192) at > org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:79) > at > org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:88) > at > org.apache.spark.sql.execution.collect.InternalRowFormat$.collect(cachedSparkResults.scala:61) > at > org.apache.spark.sql.execution.collect.InternalRowFormat$.collect(cachedSparkResults.scala:57) > at > org.apache.spark.sql.execution.ResultCacheManager.$anonfun$getOrComputeResultInternal$1(ResultCacheManager.scala:483) > at scala.Option.getOrElse(Option.scala:189) at > 
org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResultInternal(ResultCacheManager.scala:483) > at > org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:427) > at > org.apache.spark.sql.execution.CollectLimitExec.executeCollectResult(limit.scala:58) > at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:3013) at > org.apache.spark.sql.Dataset.$anonfun$collectResult$1(Dataset.scala:3004) at > org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3728) at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$5(SQLExecution.scala:116) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:248) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:101) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:841) at > org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:77) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:198) > at org.apache.spark.sql.Dataset.withAction(Datas
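Separately from the 1-character limit itself, the documentation quoted in the description suggests a possible workaround: the default separator already covers \r, \r\n and \n, so simply omitting lineSep should parse CRLF files. A hedged sketch follows (Scala API; the reporter used PySpark, but the options are the same):

{code:scala}
// Workaround sketch only -- it relies on the documented default lineSep
// behavior and does not change the 1-character validation in CSVOptions.
// Assumes an active SparkSession named `spark`.
val srcDf = spark.read
  .option("header", "true")
  .option("multiLine", "true")
  .option("escape", "ǁ")
  // no .option("lineSep", "\r\n"): the default already handles CRLF
  .csv("/crlf.csv")
{code}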
[jira] [Updated] (SPARK-34534) New protocol FetchShuffleBlocks in OneForOneBlockFetcher leads to data loss or correctness issues
[ https://issues.apache.org/jira/browse/SPARK-34534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] haiyangyu updated SPARK-34534: -- Description: We build a new RPC message `FetchShuffleBlocks` when `OneForOneBlockFetcher` is initialized, in place of `OpenBlocks`, to use the adaptive feature; this introduces an additional problem, as follows. `OneForOneBlockFetcher` initializes a `blockIds` String array to track chunk fetch successes; it uses the index into `blockIds` to fetch blocks and to match the blockId in `blockIds` when chunk data returns. So the order of `blockIds` must be consistent with the fetchChunk index, but the chunk order returned by the new `FetchShuffleBlocks` is not the same as `blockIds`. This leads to the returned data not matching the blockId, which can cause correctness issues when retrying the fetch after a block chunk fetch has failed. The chunk fetch code, and the blockId matching when data returns, are as follows: !image-2021-02-25-11-27-34-429.png|width=446,height=251!!image-2021-02-25-11-28-31-255.png|width=445,height=159! However, the fetch order in the shuffle service: !image-2021-02-25-11-30-03-834.png|width=510,height=361! So it will fetch the wrong block data when a chunk fetch fails, because of the blocks' wrong order. !image-2021-02-25-11-31-59-110.png|width=601,height=204! was: We build a new RPC message `FetchShuffleBlocks` when `OneForOneBlockFetcher` is initialized, in place of `OpenBlocks`, to use the adaptive feature; this introduces an additional problem, as follows. `OneForOneBlockFetcher` initializes a `blockIds` String array to track chunk fetch successes; it uses the index into `blockIds` to fetch blocks and to match the blockId in `blockIds` when chunk data returns. So the order of `blockIds` must be consistent with the fetchChunk index, but the chunk order returned by the new `FetchShuffleBlocks` is not the same as `blockIds`. This leads to the returned data not matching the blockId, which can cause correctness issues when retrying the fetch after a block chunk fetch has failed. The chunk fetch code, and the blockId matching when data returns, are as follows: !image-2021-02-25-11-27-34-429.png|width=446,height=251!!image-2021-02-25-11-28-31-255.png|width=445,height=159! However, the fetch order in the shuffle service: !image-2021-02-25-11-30-03-834.png|width=510,height=361! So it will fetch the wrong block data when a chunk fetch fails, because of the blocks' wrong order. !image-2021-02-25-11-31-59-110.png|width=601,height=204! > New protocol FetchShuffleBlocks in OneForOneBlockFetcher leads to data loss or > correctness issues > - > > Key: SPARK-34534 > URL: https://issues.apache.org/jira/browse/SPARK-34534 > Project: Spark > Issue Type: Bug > Components: Shuffle > Affects Versions: 3.0.0, 3.0.1, 3.0.2 > Reporter: haiyangyu > Priority: Major > Labels: Correctness, data-loss > Attachments: image-2021-02-25-11-17-12-714.png, > image-2021-02-25-11-27-34-429.png, image-2021-02-25-11-28-31-255.png, > image-2021-02-25-11-30-03-834.png, image-2021-02-25-11-31-59-110.png > > > We build a new RPC message `FetchShuffleBlocks` when `OneForOneBlockFetcher` is > initialized, in place of `OpenBlocks`, to use the adaptive feature; this > introduces an additional problem, as follows. > `OneForOneBlockFetcher` initializes a `blockIds` String array to track chunk > fetch successes; it uses the index into `blockIds` to fetch blocks and to match > the blockId in `blockIds` when chunk data returns. So the order of `blockIds` > must be consistent with the fetchChunk index, but the chunk order returned by the > new `FetchShuffleBlocks` is not the same as `blockIds`. > This leads to the returned data not matching the blockId, which can cause > correctness issues when retrying the fetch after a block chunk fetch has failed. > The chunk fetch code, and the blockId matching when data returns, are as follows: > !image-2021-02-25-11-27-34-429.png|width=446,height=251!!image-2021-02-25-11-28-31-255.png|width=445,height=159! > However, the fetch order in the shuffle service: > !image-2021-02-25-11-30-03-834.png|width=510,height=361! > So it will fetch the wrong block data when a chunk fetch fails, because of the > blocks' wrong order. > !image-2021-02-25-11-31-59-110.png|width=601,height=204! > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34534) New protocol FetchShuffleBlocks in OneForOneBlockFetcher leads to data loss or correctness issues
[ https://issues.apache.org/jira/browse/SPARK-34534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] haiyangyu updated SPARK-34534: -- Description: We build a new RPC message `FetchShuffleBlocks` when `OneForOneBlockFetcher` is initialized, in place of `OpenBlocks`, to use the adaptive feature; this introduces an additional problem, as follows. `OneForOneBlockFetcher` initializes a `blockIds` String array to track chunk fetch successes; it uses the index into `blockIds` to fetch blocks and to match the blockId in `blockIds` when chunk data returns. So the order of `blockIds` must be consistent with the fetchChunk index, but the chunk order returned by the new `FetchShuffleBlocks` is not the same as `blockIds`. This leads to the returned data not matching the blockId, which can cause correctness issues when retrying the fetch after a block chunk fetch has failed. The chunk fetch code, and the blockId matching when data returns, are as follows: !image-2021-02-25-11-27-34-429.png|width=446,height=251!!image-2021-02-25-11-28-31-255.png|width=445,height=159! However, the fetch order in the shuffle service: !image-2021-02-25-11-30-03-834.png|width=510,height=361! So it will fetch the wrong block data when a chunk fetch fails, because of the blocks' wrong order. !image-2021-02-25-11-31-59-110.png|width=601,height=204! was: We build a new RPC message `FetchShuffleBlocks` when `OneForOneBlockFetcher` is initialized, in place of `OpenBlocks`, to use the adaptive feature; this introduces an additional problem, as follows. `OneForOneBlockFetcher` initializes a `blockIds` String array to track chunk fetch successes; it uses the index into `blockIds` to fetch blocks and to match the blockId in `blockIds` when chunk data returns. So the order of `blockIds` must be consistent with the fetchChunk index. !image-2021-02-25-11-17-12-714.png|width=875,height=502! > New protocol FetchShuffleBlocks in OneForOneBlockFetcher leads to data loss or > correctness issues > - > > Key: SPARK-34534 > URL: https://issues.apache.org/jira/browse/SPARK-34534 > Project: Spark > Issue Type: Bug > Components: Shuffle > Affects Versions: 3.0.0, 3.0.1, 3.0.2 > Reporter: haiyangyu > Priority: Major > Labels: Correctness, data-loss > Attachments: image-2021-02-25-11-17-12-714.png, > image-2021-02-25-11-27-34-429.png, image-2021-02-25-11-28-31-255.png, > image-2021-02-25-11-30-03-834.png, image-2021-02-25-11-31-59-110.png > > > We build a new RPC message `FetchShuffleBlocks` when `OneForOneBlockFetcher` is > initialized, in place of `OpenBlocks`, to use the adaptive feature; this > introduces an additional problem, as follows. > `OneForOneBlockFetcher` initializes a `blockIds` String array to track chunk > fetch successes; it uses the index into `blockIds` to fetch blocks and to match > the blockId in `blockIds` when chunk data returns. So the order of `blockIds` > must be consistent with the fetchChunk index, but the chunk order returned by the > new `FetchShuffleBlocks` is not the same as `blockIds`. > This leads to the returned data not matching the blockId, which can cause > correctness issues when retrying the fetch after a block chunk fetch has failed. > The chunk fetch code, and the blockId matching when data returns, are as follows: > !image-2021-02-25-11-27-34-429.png|width=446,height=251!!image-2021-02-25-11-28-31-255.png|width=445,height=159! > However, the fetch order in the shuffle service: > !image-2021-02-25-11-30-03-834.png|width=510,height=361! > So it will fetch the wrong block data when a chunk fetch fails, because of the > blocks' wrong order. > !image-2021-02-25-11-31-59-110.png|width=601,height=204! 
> > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34534) New protocol FetchShuffleBlocks in OneForOneBlockFetcher leads to data loss or correctness issues
[ https://issues.apache.org/jira/browse/SPARK-34534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] haiyangyu updated SPARK-34534: -- Attachment: image-2021-02-25-11-31-59-110.png > New protocol FetchShuffleBlocks in OneForOneBlockFetcher leads to data loss or > correctness issues > - > > Key: SPARK-34534 > URL: https://issues.apache.org/jira/browse/SPARK-34534 > Project: Spark > Issue Type: Bug > Components: Shuffle > Affects Versions: 3.0.0, 3.0.1, 3.0.2 > Reporter: haiyangyu > Priority: Major > Labels: Correctness, data-loss > Attachments: image-2021-02-25-11-17-12-714.png, > image-2021-02-25-11-27-34-429.png, image-2021-02-25-11-28-31-255.png, > image-2021-02-25-11-30-03-834.png, image-2021-02-25-11-31-59-110.png > > > We build a new RPC message `FetchShuffleBlocks` when `OneForOneBlockFetcher` is > initialized, in place of `OpenBlocks`, to use the adaptive feature; this > introduces an additional problem, as follows. > `OneForOneBlockFetcher` initializes a `blockIds` String array to track chunk > fetch successes; it uses the index into `blockIds` to fetch blocks and to match > the blockId in `blockIds` when chunk data returns. So the order of `blockIds` > must be consistent with the fetchChunk index. > !image-2021-02-25-11-17-12-714.png|width=875,height=502! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34534) New protocol FetchShuffleBlocks in OneForOneBlockFetcher leads to data loss or correctness issues
[ https://issues.apache.org/jira/browse/SPARK-34534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] haiyangyu updated SPARK-34534: -- Attachment: image-2021-02-25-11-30-03-834.png > New protocol FetchShuffleBlocks in OneForOneBlockFetcher leads to data loss or > correctness issues > - > > Key: SPARK-34534 > URL: https://issues.apache.org/jira/browse/SPARK-34534 > Project: Spark > Issue Type: Bug > Components: Shuffle > Affects Versions: 3.0.0, 3.0.1, 3.0.2 > Reporter: haiyangyu > Priority: Major > Labels: Correctness, data-loss > Attachments: image-2021-02-25-11-17-12-714.png, > image-2021-02-25-11-27-34-429.png, image-2021-02-25-11-28-31-255.png, > image-2021-02-25-11-30-03-834.png > > > We build a new RPC message `FetchShuffleBlocks` when `OneForOneBlockFetcher` is > initialized, in place of `OpenBlocks`, to use the adaptive feature; this > introduces an additional problem, as follows. > `OneForOneBlockFetcher` initializes a `blockIds` String array to track chunk > fetch successes; it uses the index into `blockIds` to fetch blocks and to match > the blockId in `blockIds` when chunk data returns. So the order of `blockIds` > must be consistent with the fetchChunk index. > !image-2021-02-25-11-17-12-714.png|width=875,height=502! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
[ https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290650#comment-17290650 ] Dongjoon Hyun commented on SPARK-33212: --- Thanks! > Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile > - > > Key: SPARK-33212 > URL: https://issues.apache.org/jira/browse/SPARK-33212 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Spark Submit, SQL, YARN > Affects Versions: 3.0.1 > Reporter: Chao Sun > Assignee: Chao Sun > Priority: Major > Labels: releasenotes > Fix For: 3.2.0 > > > Hadoop 3.x+ offers shaded client jars: hadoop-client-api and > hadoop-client-runtime, which shade 3rd-party dependencies such as Guava, > protobuf, Jetty, etc. This Jira switches Spark to use these jars instead of > hadoop-common, hadoop-client, etc. Benefits include: > * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer > versions of Hadoop have migrated to Guava 27.0+, and in order to resolve Guava > conflicts, Spark depends on Hadoop not leaking these dependencies. > * It makes the Spark/Hadoop dependency relationship cleaner. Currently Spark uses both > client-side and server-side Hadoop APIs from modules such as hadoop-common, > hadoop-yarn-server-common, etc. Moving to hadoop-client-api allows us to only > use the public/client API from the Hadoop side. > * It provides better isolation from Hadoop dependencies. In the future, Spark can > evolve without worrying about dependencies pulled in from the Hadoop side > (which used to be a lot). > *There are some behavior changes introduced with this JIRA, when people use > Spark compiled with Hadoop 3.x:* > - Users now need to make sure the class path contains the `hadoop-client-api` and > `hadoop-client-runtime` jars when they deploy Spark with the > `hadoop-provided` option. In addition, it is highly recommended that they put > these two jars before other Hadoop jars in the class path. Otherwise, > conflicts such as those from Guava could happen if classes are loaded from the > other, non-shaded Hadoop jars. > - Since the new shaded Hadoop clients no longer include 3rd-party > dependencies, users who used to depend on these now need to explicitly put > the jars in their class path. > Ideally the above should go into the release notes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34534) New protocol FetchShuffleBlocks in OneForOneBlockFetcher leads to data loss or correctness issues
[ https://issues.apache.org/jira/browse/SPARK-34534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] haiyangyu updated SPARK-34534: -- Attachment: image-2021-02-25-11-28-31-255.png > New protocol FetchShuffleBlocks in OneForOneBlockFetcher leads to data loss or > correctness issues > - > > Key: SPARK-34534 > URL: https://issues.apache.org/jira/browse/SPARK-34534 > Project: Spark > Issue Type: Bug > Components: Shuffle > Affects Versions: 3.0.0, 3.0.1, 3.0.2 > Reporter: haiyangyu > Priority: Major > Labels: Correctness, data-loss > Attachments: image-2021-02-25-11-17-12-714.png, > image-2021-02-25-11-27-34-429.png, image-2021-02-25-11-28-31-255.png > > > We build a new RPC message `FetchShuffleBlocks` when `OneForOneBlockFetcher` is > initialized, in place of `OpenBlocks`, to use the adaptive feature; this > introduces an additional problem, as follows. > `OneForOneBlockFetcher` initializes a `blockIds` String array to track chunk > fetch successes; it uses the index into `blockIds` to fetch blocks and to match > the blockId in `blockIds` when chunk data returns. So the order of `blockIds` > must be consistent with the fetchChunk index. > !image-2021-02-25-11-17-12-714.png|width=875,height=502! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34534) New protocol FetchShuffleBlocks in OneForOneBlockFetcher leads to data loss or correctness issues
[ https://issues.apache.org/jira/browse/SPARK-34534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] haiyangyu updated SPARK-34534: -- Attachment: image-2021-02-25-11-27-34-429.png > New protocol FetchShuffleBlocks in OneForOneBlockFetcher leads to data loss or > correctness issues > - > > Key: SPARK-34534 > URL: https://issues.apache.org/jira/browse/SPARK-34534 > Project: Spark > Issue Type: Bug > Components: Shuffle > Affects Versions: 3.0.0, 3.0.1, 3.0.2 > Reporter: haiyangyu > Priority: Major > Labels: Correctness, data-loss > Attachments: image-2021-02-25-11-17-12-714.png, > image-2021-02-25-11-27-34-429.png > > > We build a new RPC message `FetchShuffleBlocks` when `OneForOneBlockFetcher` is > initialized, in place of `OpenBlocks`, to use the adaptive feature; this > introduces an additional problem, as follows. > `OneForOneBlockFetcher` initializes a `blockIds` String array to track chunk > fetch successes; it uses the index into `blockIds` to fetch blocks and to match > the blockId in `blockIds` when chunk data returns. So the order of `blockIds` > must be consistent with the fetchChunk index. > !image-2021-02-25-11-17-12-714.png|width=875,height=502! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34532) IntervalUtils.add() may result in 'long overflow'
[ https://issues.apache.org/jira/browse/SPARK-34532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290646#comment-17290646 ] Ted Yu commented on SPARK-34532: I included the test command and some more information in the description. You should see these errors when you run the command. > IntervalUtils.add() may result in 'long overflow' > - > > Key: SPARK-34532 > URL: https://issues.apache.org/jira/browse/SPARK-34532 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 3.0.2 > Reporter: Ted Yu > Priority: Major > > I noticed the following when running the test suite: > build/sbt "sql/testOnly *SQLQueryTestSuite" > {code} > 19:10:17.977 ERROR org.apache.spark.scheduler.TaskSetManager: Task 1 in stage > 6416.0 failed 1 times; aborting job > [info] - postgreSQL/int4.sql (2 seconds, 543 milliseconds) > 19:10:20.994 ERROR org.apache.spark.executor.Executor: Exception in task 1.0 > in stage 6476.0 (TID 7789) > java.lang.ArithmeticException: long overflow > at java.lang.Math.multiplyExact(Math.java:892) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) > at > org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) > {code} > {code} > 19:15:38.255 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 > in stage 14744.0 (TID 16705) > java.lang.ArithmeticException: long overflow > at java.lang.Math.addExact(Math.java:809) > at org.apache.spark.sql.types.LongExactNumeric$.plus(numerics.scala:105) > at org.apache.spark.sql.types.LongExactNumeric$.plus(numerics.scala:104) > at > org.apache.spark.sql.catalyst.expressions.Add.nullSafeEval(arithmetic.scala:268) > at > org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:573) > at > org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(InterpretedMutableProjection.scala:97) > {code} > This likely was caused by the following line: > {code} > val microseconds = left.microseconds + right.microseconds > {code} > We should check whether the addition would produce overflow before adding. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
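For concreteness, a minimal sketch of the pre-addition overflow check the description asks for (a hypothetical helper, not the actual IntervalUtils patch). java.lang.Math.addExact, which raised the "long overflow" in the traces above, performs exactly this test internally:

{code:scala}
// Hypothetical helper -- not Spark's IntervalUtils.add itself.
def addMicrosChecked(left: Long, right: Long): Long = {
  val sum = left + right
  // Overflow happened iff both operands share a sign and the sum's sign
  // differs -- the same check java.lang.Math.addExact uses.
  if (((left ^ sum) & (right ^ sum)) < 0) {
    throw new ArithmeticException("long overflow adding interval microseconds")
  }
  sum
}
// Equivalent, surfacing the JDK's own message: Math.addExact(left, right)
{code}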
[jira] [Updated] (SPARK-34532) IntervalUtils.add() may result in 'long overflow'
[ https://issues.apache.org/jira/browse/SPARK-34532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ted Yu updated SPARK-34532: --- Description: I noticed the following when running the test suite: build/sbt "sql/testOnly *SQLQueryTestSuite" {code} 19:10:17.977 ERROR org.apache.spark.scheduler.TaskSetManager: Task 1 in stage 6416.0 failed 1 times; aborting job [info] - postgreSQL/int4.sql (2 seconds, 543 milliseconds) 19:10:20.994 ERROR org.apache.spark.executor.Executor: Exception in task 1.0 in stage 6476.0 (TID 7789) java.lang.ArithmeticException: long overflow at java.lang.Math.multiplyExact(Math.java:892) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown Source) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) {code} {code} 19:15:38.255 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 14744.0 (TID 16705) java.lang.ArithmeticException: long overflow at java.lang.Math.addExact(Math.java:809) at org.apache.spark.sql.types.LongExactNumeric$.plus(numerics.scala:105) at org.apache.spark.sql.types.LongExactNumeric$.plus(numerics.scala:104) at org.apache.spark.sql.catalyst.expressions.Add.nullSafeEval(arithmetic.scala:268) at org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:573) at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(InterpretedMutableProjection.scala:97) {code} This likely was caused by the following line: {code} val microseconds = left.microseconds + right.microseconds {code} We should check whether the addition would produce overflow before adding. was: I noticed the following when running the test suite: {code} 19:15:38.255 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 14744.0 (TID 16705) java.lang.ArithmeticException: long overflow at java.lang.Math.addExact(Math.java:809) at org.apache.spark.sql.types.LongExactNumeric$.plus(numerics.scala:105) at org.apache.spark.sql.types.LongExactNumeric$.plus(numerics.scala:104) at org.apache.spark.sql.catalyst.expressions.Add.nullSafeEval(arithmetic.scala:268) at org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:573) at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(InterpretedMutableProjection.scala:97) {code} This likely was caused by the following line: {code} val microseconds = left.microseconds + right.microseconds {code} We should check whether the addition would produce overflow before adding. 
> IntervalUtils.add() may result in 'long overflow' > - > > Key: SPARK-34532 > URL: https://issues.apache.org/jira/browse/SPARK-34532 > Project: Spark > Issue Type: Bug > Components: Spark Core > Affects Versions: 3.0.2 > Reporter: Ted Yu > Priority: Major > > I noticed the following when running the test suite: > build/sbt "sql/testOnly *SQLQueryTestSuite" > {code} > 19:10:17.977 ERROR org.apache.spark.scheduler.TaskSetManager: Task 1 in stage > 6416.0 failed 1 times; aborting job > [info] - postgreSQL/int4.sql (2 seconds, 543 milliseconds) > 19:10:20.994 ERROR org.apache.spark.executor.Executor: Exception in task 1.0 > in stage 6476.0 (TID 7789) > java.lang.ArithmeticException: long overflow > at java.lang.Math.multiplyExact(Math.java:892) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345) > at > or
[jira] [Updated] (SPARK-34534) New protocol FetchShuffleBlocks in OneForOneBlockFetcher leads to data loss or correctness issues
[ https://issues.apache.org/jira/browse/SPARK-34534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] haiyangyu updated SPARK-34534: -- Description: We build a new RPC message `FetchShuffleBlocks` when `OneForOneBlockFetcher` is initialized, in place of `OpenBlocks`, to use the adaptive feature; this introduces an additional problem, as follows. `OneForOneBlockFetcher` initializes a `blockIds` String array to track chunk fetch successes; it uses the index into `blockIds` to fetch blocks and to match the blockId in `blockIds` when chunk data returns. So the order of `blockIds` must be consistent with the fetchChunk index. !image-2021-02-25-11-17-12-714.png|width=875,height=502! was: We build a new RPC message {code:java} FetchShuffleBlocks{code} when {code:java} OneForOneBlockFetcher{code} is initialized, in place of {code:java} OpenBlocks{code} to use the adaptive feature > New protocol FetchShuffleBlocks in OneForOneBlockFetcher leads to data loss or > correctness issues > - > > Key: SPARK-34534 > URL: https://issues.apache.org/jira/browse/SPARK-34534 > Project: Spark > Issue Type: Bug > Components: Shuffle > Affects Versions: 3.0.0, 3.0.1, 3.0.2 > Reporter: haiyangyu > Priority: Major > Labels: Correctness, data-loss > Attachments: image-2021-02-25-11-17-12-714.png > > > We build a new RPC message `FetchShuffleBlocks` when `OneForOneBlockFetcher` is > initialized, in place of > `OpenBlocks`, to use the adaptive feature; this introduces an additional problem, as > follows. > `OneForOneBlockFetcher` initializes a `blockIds` String array to track chunk > fetch successes; it uses the index into `blockIds` to fetch blocks and to match > the blockId in `blockIds` when chunk data returns. So the order of `blockIds` > must be consistent with the fetchChunk index. > !image-2021-02-25-11-17-12-714.png|width=875,height=502! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34534) New protocol FetchShuffleBlocks in OneForOneBlockFetcher leads to data loss or correctness issues
[ https://issues.apache.org/jira/browse/SPARK-34534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] haiyangyu updated SPARK-34534: -- Attachment: image-2021-02-25-11-17-12-714.png > New protocol FetchShuffleBlocks in OneForOneBlockFetcher leads to data loss or > correctness issues > - > > Key: SPARK-34534 > URL: https://issues.apache.org/jira/browse/SPARK-34534 > Project: Spark > Issue Type: Bug > Components: Shuffle > Affects Versions: 3.0.0, 3.0.1, 3.0.2 > Reporter: haiyangyu > Priority: Major > Labels: Correctness, data-loss > Attachments: image-2021-02-25-11-17-12-714.png > > > We build a new RPC message > {code:java} > FetchShuffleBlocks{code} > when > {code:java} > OneForOneBlockFetcher{code} > is initialized, in place of > {code:java} > OpenBlocks{code} > to use the adaptive feature -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34448) Binary logistic regression incorrectly computes the intercept and coefficients when data is not centered
[ https://issues.apache.org/jira/browse/SPARK-34448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290642#comment-17290642 ] zhengruifeng commented on SPARK-34448: -- [~srowen] Thanks for pinging me, I am going to look into this issue > Binary logistic regression incorrectly computes the intercept and > coefficients when data is not centered > > > Key: SPARK-34448 > URL: https://issues.apache.org/jira/browse/SPARK-34448 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.4.5, 3.0.0 >Reporter: Yakov Kerzhner >Priority: Major > Labels: correctness > > I have written up a fairly detailed gist that includes code to reproduce the > bug, as well as the output of the code and some commentary: > [https://gist.github.com/ykerzhner/51358780a6a4cc33266515f17bf98a96] > To summarize: under certain conditions, the minimization that fits a binary > logistic regression contains a bug that pulls the intercept value towards the > log(odds) of the target data. This is mathematically only correct when the > data comes from distributions with zero means. In general, this gives > incorrect intercept values, and consequently incorrect coefficients as well. > As I am not so familiar with the spark code base, I have not been able to > find this bug within the spark code itself. A hint to this bug is here: > [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L894-L904] > based on the code, I don't believe that the features have zero means at this > point, and so this heuristic is incorrect. But an incorrect starting point > does not explain this bug. The minimizer should drift to the correct place. > I was not able to find the code of the actual objective function that is > being minimized. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
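For background on the heuristic referenced in the issue above, a standard warm-start fact (stated here as context, not as a restatement of the gist): with all coefficients held at zero, the binary logistic log-likelihood is maximized by setting the intercept to the empirical log-odds of the labels.

{code}
% With coefficients beta fixed at 0, the log-likelihood depends only on the
% intercept b and is maximized at the empirical log-odds of the labels:
\hat{b}\,\big|_{\beta = 0}
  = \log\frac{P(y = 1)}{P(y = 0)}
  = \log\frac{\#\{i : y_i = 1\}}{\#\{i : y_i = 0\}}
{code}

The report's claim, then, is that the optimizer stays anchored near this warm start even when non-centered features should move it away, which would make both the fitted intercept and the coefficients incorrect.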
[jira] [Updated] (SPARK-34534) New protocol FetchShuffleBlocks in OneForOneBlockFetcher leads to data loss or correctness issues
[ https://issues.apache.org/jira/browse/SPARK-34534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] haiyangyu updated SPARK-34534: -- Description: A new RPC message {code:java} FetchShuffleBlocks{code} is built when {code:java} OneForOneBlockFetcher{code} is initialized, replacing {code:java} OpenBlocks{code} to support the adaptive execution feature > New protocol FetchShuffleBlocks in OneForOneBlockFetcher leads to data loss or > correctness issues > - > > Key: SPARK-34534 > URL: https://issues.apache.org/jira/browse/SPARK-34534 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 3.0.0, 3.0.1, 3.0.2 >Reporter: haiyangyu >Priority: Major > Labels: Correctness, data-loss > > A new RPC message > {code:java} > FetchShuffleBlocks{code} > is built when > {code:java} > OneForOneBlockFetcher{code} > is initialized, replacing > {code:java} > OpenBlocks{code} > to support the adaptive execution feature -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34534) New protocol FetchShuffleBlocks in OneForOneBlockFetcher leads to data loss or correctness issues
[ https://issues.apache.org/jira/browse/SPARK-34534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] haiyangyu updated SPARK-34534: -- Summary: New protocol FetchShuffleBlocks in OneForOneBlockFetcher leads to data loss or correctness issues (was: FetchShuffleBlocks in OneForOneBlockFetcher leads to data loss or correctness issues) > New protocol FetchShuffleBlocks in OneForOneBlockFetcher leads to data loss or > correctness issues > - > > Key: SPARK-34534 > URL: https://issues.apache.org/jira/browse/SPARK-34534 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 3.0.0, 3.0.1, 3.0.2 >Reporter: haiyangyu >Priority: Major > Labels: Correctness, data-loss > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34534) FetchShuffleBlocks in OneForOneBlockFetcher leads to data loss or correctness issues
haiyangyu created SPARK-34534: - Summary: FetchShuffleBlocks in OneForOneBlockFetcher leads to data loss or correctness issues Key: SPARK-34534 URL: https://issues.apache.org/jira/browse/SPARK-34534 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 3.0.2, 3.0.1, 3.0.0 Reporter: haiyangyu -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34530) logError for interrupting block migrations is too high
[ https://issues.apache.org/jira/browse/SPARK-34530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290633#comment-17290633 ] Yang Jie commented on SPARK-34530: -- [~holden] Can you add a description? > logError for interrupting block migrations is too high > -- > > Key: SPARK-34530 > URL: https://issues.apache.org/jira/browse/SPARK-34530 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.0, 3.2.0, 3.1.1 >Reporter: Holden Karau >Priority: Trivial > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34532) IntervalUtils.add() may result in 'long overflow'
[ https://issues.apache.org/jira/browse/SPARK-34532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290632#comment-17290632 ] Yang Jie commented on SPARK-34532: -- Which case has this problem? > IntervalUtils.add() may result in 'long overflow' > - > > Key: SPARK-34532 > URL: https://issues.apache.org/jira/browse/SPARK-34532 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.2 >Reporter: Ted Yu >Priority: Major > > I noticed the following when running the test suite: > {code} > 19:15:38.255 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 > in stage 14744.0 (TID 16705) > java.lang.ArithmeticException: long overflow > at java.lang.Math.addExact(Math.java:809) > at org.apache.spark.sql.types.LongExactNumeric$.plus(numerics.scala:105) > at org.apache.spark.sql.types.LongExactNumeric$.plus(numerics.scala:104) > at > org.apache.spark.sql.catalyst.expressions.Add.nullSafeEval(arithmetic.scala:268) > at > org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:573) > at > org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(InterpretedMutableProjection.scala:97) > {code} > This was likely caused by the following line: > {code} > val microseconds = left.microseconds + right.microseconds > {code} > We should check whether the addition would produce overflow before adding. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
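A minimal sketch of the overflow-checked addition suggested in the issue above (illustrative only, not the actual IntervalUtils patch): `Math.addExact` already detects the overflow, and wrapping it lets the caller raise a clearer error.

{code:scala}
// Sketch (not the actual IntervalUtils fix): add interval microseconds with
// an explicit overflow check instead of letting the bare exception escape.
def addMicrosChecked(left: Long, right: Long): Long =
  try {
    Math.addExact(left, right) // throws ArithmeticException("long overflow")
  } catch {
    case _: ArithmeticException =>
      throw new ArithmeticException(
        s"interval addition overflows: $left + $right is out of Long range")
  }
{code}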
[jira] [Comment Edited] (SPARK-34523) JDK-8194653: Deadlock involving FileSystems.getDefault and System.loadLibrary call
[ https://issues.apache.org/jira/browse/SPARK-34523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290631#comment-17290631 ] Kent Yao edited comment on SPARK-34523 at 2/25/21, 2:56 AM: Hi [~dongjoon], thanks for your suggestions. When the problem lies in the JDK, the solution is often simply to upgrade the JDK and be done with it. But I guess the hardest part for users may be collecting the clues and finding the corresponding problem. A documentation PR is a good choice, and the detailed JIRA also helps. BTW, > Java 8 prior to version 8u92 support is deprecated as of Spark 3.0.0. This statement is Spark-version-specific and too brief to get users' attention. was (Author: qin yao): Hi [~dongjoon], thanks for your suggestions. When the problem lies in the JDK, the solution is often simply to upgrade the JDK and be done with it. But I guess the hardest part for users may be collecting the clues and finding the corresponding problem. A documentation PR is a good choice, and the detailed JIRA also helps. > Java 8 prior to version 8u92 support is deprecated as of Spark 3.0.0. This statement is Spark-version-specific and too brief to get users' attention. > JDK-8194653: Deadlock involving FileSystems.getDefault and > System.loadLibrary call > --- > > Key: SPARK-34523 > URL: https://issues.apache.org/jira/browse/SPARK-34523 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.4.7, 3.0.2, 3.1.1 >Reporter: Kent Yao >Priority: Major > Attachments: 4303.log, screenshot-1.png, screenshot-2.png > > > h2. Introduction > This will cause a deadlock and hang concurrent tasks forever on the same > executor. For example: > > In the Spark UI stage tab, you may find some of the tasks hang for hours while > all others complete without delay. > > !screenshot-2.png! > Also, you may find that these hanging tasks belong to the same executors. > Usually, in this case, you will also get nothing helpful from the executor > log. > If you print the executor jstack, or check the thread dump via the Spark UI > executor tab, and find some task thread blocked like below, you have very > likely hit the JDK-8194653 issue. > !screenshot-1.png! > h2. Solutions > Here are some options to circumvent this problem: > 1. On the cluster manager side, you can update the JDK according to > https://bugs.openjdk.java.net/browse/JDK-8194653 > 2. If you are not able to update the JDK for the entire cluster, > you can use `spark.executorEnv.JAVA_HOME` to specify a suitable JRE for your > apps > 3. Also, turning on `spark.speculation` may let Spark automatically re-run the > hanging tasks and bypass the problem -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
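The two Spark-side mitigations listed above map onto real configuration keys (`spark.executorEnv.JAVA_HOME` and `spark.speculation`); a brief sketch with illustrative values follows.

{code:scala}
import org.apache.spark.SparkConf

// Example mitigations for JDK-8194653 from the list above (the values are
// illustrative): point executors at a JRE containing the fix, and enable
// speculation so hanging tasks are re-launched elsewhere.
val conf = new SparkConf()
  .set("spark.executorEnv.JAVA_HOME", "/opt/jdk8u191") // example path to a patched JRE
  .set("spark.speculation", "true")                    // re-run suspiciously slow or hanging tasks
{code}

The same keys can equally be passed as `--conf` options to spark-submit.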
[jira] [Commented] (SPARK-34523) JDK-8194653: Deadlock involving FileSystems.getDefault and System.loadLibrary call
[ https://issues.apache.org/jira/browse/SPARK-34523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290631#comment-17290631 ] Kent Yao commented on SPARK-34523: -- Hi [~dongjoon], thanks for your suggestions. When the problem lies in the JDK, the solution is often simply to upgrade the JDK and be done with it. But I guess the hardest part for users may be collecting the clues and finding the corresponding problem. A documentation PR is a good choice, and the detailed JIRA also helps. > Java 8 prior to version 8u92 support is deprecated as of Spark 3.0.0. This statement is Spark-version-specific and too brief to get users' attention. > JDK-8194653: Deadlock involving FileSystems.getDefault and > System.loadLibrary call > --- > > Key: SPARK-34523 > URL: https://issues.apache.org/jira/browse/SPARK-34523 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.4.7, 3.0.2, 3.1.1 >Reporter: Kent Yao >Priority: Major > Attachments: 4303.log, screenshot-1.png, screenshot-2.png > > > h2. Introduction > This will cause a deadlock and hang concurrent tasks forever on the same > executor. For example: > > In the Spark UI stage tab, you may find some of the tasks hang for hours while > all others complete without delay. > > !screenshot-2.png! > Also, you may find that these hanging tasks belong to the same executors. > Usually, in this case, you will also get nothing helpful from the executor > log. > If you print the executor jstack, or check the thread dump via the Spark UI > executor tab, and find some task thread blocked like below, you have very > likely hit the JDK-8194653 issue. > !screenshot-1.png! > h2. Solutions > Here are some options to circumvent this problem: > 1. On the cluster manager side, you can update the JDK according to > https://bugs.openjdk.java.net/browse/JDK-8194653 > 2. If you are not able to update the JDK for the entire cluster, > you can use `spark.executorEnv.JAVA_HOME` to specify a suitable JRE for your > apps > 3. Also, turning on `spark.speculation` may let Spark automatically re-run the > hanging tasks and bypass the problem -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34533) Eliminate LEFT ANTI join to empty relation in AQE
Cheng Su created SPARK-34533: Summary: Eliminate LEFT ANTI join to empty relation in AQE Key: SPARK-34533 URL: https://issues.apache.org/jira/browse/SPARK-34533 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Cheng Su From the review discussion - [https://github.com/apache/spark/pull/31630#discussion_r581774000] - I found that we can eliminate a LEFT ANTI join (with no join condition) to an empty relation if the right side is known to be non-empty. With AQE, this is doable, similar to [https://github.com/apache/spark/pull/29484]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
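A small, hedged demonstration of the join semantics that make this elimination sound (a DataFrame-level semantics demo only, not the AQE optimizer rule; `lit(true)` stands in for "no join condition"):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.lit

// Semantics demo only (not the AQE rule): under an always-true condition,
// every left row "matches" some right row, so LEFT ANTI keeps none of them
// once the right side is known to be non-empty.
val spark = SparkSession.builder().master("local[*]").appName("leftAntiDemo").getOrCreate()
import spark.implicits._

val left  = Seq(1, 2, 3).toDF("a")
val right = Seq(10).toDF("b")

left.join(right, lit(true), "left_anti").show() // prints an empty result
{code}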
[jira] [Comment Edited] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
[ https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290613#comment-17290613 ] Chao Sun edited comment on SPARK-33212 at 2/25/21, 2:21 AM: I was able to reproduce the error in my local environment, and found a potential fix in Spark. I think only {{hadoop-yarn-server-web-proxy}} is needed by Spark - all the other YARN jars are already covered by {{hadoop-client-api}} and {{hadoop-client-runtime}}. I'll open a PR for this soon. was (Author: csun): I was able to reproduce the error in my local environment, and found a potential fix in Spark. I think {{hadoop-yarn-server-web-proxy}} is needed by Spark - all the other YARN jars are already covered by {{hadoop-client-api}} and {{hadoop-client-runtime}}. I'll open a PR for this soon. > Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile > - > > Key: SPARK-33212 > URL: https://issues.apache.org/jira/browse/SPARK-33212 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Spark Submit, SQL, YARN >Affects Versions: 3.0.1 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Labels: releasenotes > Fix For: 3.2.0 > > > Hadoop 3.x+ offers shaded client jars: hadoop-client-api and > hadoop-client-runtime, which shade 3rd-party dependencies such as Guava, > protobuf, Jetty, etc. This Jira switches Spark to use these jars instead of > hadoop-common, hadoop-client, etc. Benefits include: > * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer > versions of Hadoop have migrated to Guava 27.0+, and in order to resolve Guava > conflicts, Spark depends on Hadoop not leaking these dependencies. > * It makes the Spark/Hadoop dependencies cleaner. Currently Spark uses both > client-side and server-side Hadoop APIs from modules such as hadoop-common, > hadoop-yarn-server-common, etc. Moving to hadoop-client-api allows us to use only > the public/client API from the Hadoop side. > * It provides better isolation from Hadoop dependencies. In the future Spark can > evolve without worrying about dependencies pulled in from the Hadoop side > (which used to be a lot). > *There are some behavior changes introduced with this JIRA, when people use > Spark compiled with Hadoop 3.x:* > - Users now need to make sure the class path contains the `hadoop-client-api` and > `hadoop-client-runtime` jars when they deploy Spark with the > `hadoop-provided` option. In addition, it is highly recommended that they put > these two jars before other Hadoop jars in the class path. Otherwise, > conflicts such as those from Guava could happen if classes are loaded from the > other, non-shaded Hadoop jars. > - Since the new shaded Hadoop clients no longer include 3rd-party > dependencies, users who used to depend on these now need to explicitly put > the jars in their class path. > Ideally the above should go into the release notes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
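For build setups consuming these clients directly, a hedged sbt sketch of the class-path guidance above (the version `3.2.2` is illustrative; adjust to your Hadoop distribution):

{code:scala}
// build.sbt sketch (illustrative version): depend on the shaded Hadoop
// clients rather than hadoop-common / hadoop-client, and keep them ahead of
// any other Hadoop jars on the class path.
libraryDependencies ++= Seq(
  "org.apache.hadoop" % "hadoop-client-api"     % "3.2.2",
  "org.apache.hadoop" % "hadoop-client-runtime" % "3.2.2"
)
{code}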
[jira] [Updated] (SPARK-34497) JDBC connection provider is not removing kerberos credentials from JVM security context
[ https://issues.apache.org/jira/browse/SPARK-34497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-34497: -- Affects Version/s: (was: 3.1.2) 3.1.1 > JDBC connection provider is not removing kerberos credentials from JVM > security context > --- > > Key: SPARK-34497 > URL: https://issues.apache.org/jira/browse/SPARK-34497 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.2.0, 3.1.1 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Major > Fix For: 3.1.2 > > > Some of the built-in JDBC connection providers are changing the JVM security > context to do the authentication which is fine. The problematic part is that > executors can be reused by another query. The following situation leads to > incorrect behaviour: > * Query1 opens JDBC connection and changes JVM security context in Executor1 > * Query2 tries to open JDBC connection but it realizes there is already an > entry for that DB type in Executor1 > * Query2 is not changing JVM security context and uses Query1 keytab and > principal > * Query2 fails with authentication error -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
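A minimal sketch of the kind of cleanup this issue calls for, assuming the provider swaps the JVM-wide JAAS configuration during authentication (an assumed shape for illustration, not Spark's actual JDBC provider code):

{code:scala}
import javax.security.auth.login.Configuration

// Sketch only (not Spark's JDBC provider implementation): scope a JAAS
// configuration change to one connection attempt and always restore the
// previous JVM security context afterwards, so a later query on the same
// executor does not inherit this query's keytab and principal.
def withJaasConfig[T](conf: Configuration)(connect: => T): T = {
  val previous = Configuration.getConfiguration
  Configuration.setConfiguration(conf)
  try connect
  finally Configuration.setConfiguration(previous) // remove this query's credentials
}
{code}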
[jira] [Commented] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
[ https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290613#comment-17290613 ] Chao Sun commented on SPARK-33212: -- I was able to reproduce the error in my local environment, and found a potential fix in Spark. I think {{hadoop-yarn-server-web-proxy}} is needed by Spark - all the other YARN jars are already covered by {{hadoop-client-api}} and {{hadoop-client-runtime}}. I'll open a PR for this soon. > Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile > - > > Key: SPARK-33212 > URL: https://issues.apache.org/jira/browse/SPARK-33212 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Spark Submit, SQL, YARN >Affects Versions: 3.0.1 >Reporter: Chao Sun >Assignee: Chao Sun >Priority: Major > Labels: releasenotes > Fix For: 3.2.0 > > > Hadoop 3.x+ offers shaded client jars: hadoop-client-api and > hadoop-client-runtime, which shade 3rd-party dependencies such as Guava, > protobuf, Jetty, etc. This Jira switches Spark to use these jars instead of > hadoop-common, hadoop-client, etc. Benefits include: > * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer > versions of Hadoop have migrated to Guava 27.0+, and in order to resolve Guava > conflicts, Spark depends on Hadoop not leaking these dependencies. > * It makes the Spark/Hadoop dependencies cleaner. Currently Spark uses both > client-side and server-side Hadoop APIs from modules such as hadoop-common, > hadoop-yarn-server-common, etc. Moving to hadoop-client-api allows us to use only > the public/client API from the Hadoop side. > * It provides better isolation from Hadoop dependencies. In the future Spark can > evolve without worrying about dependencies pulled in from the Hadoop side > (which used to be a lot). > *There are some behavior changes introduced with this JIRA, when people use > Spark compiled with Hadoop 3.x:* > - Users now need to make sure the class path contains the `hadoop-client-api` and > `hadoop-client-runtime` jars when they deploy Spark with the > `hadoop-provided` option. In addition, it is highly recommended that they put > these two jars before other Hadoop jars in the class path. Otherwise, > conflicts such as those from Guava could happen if classes are loaded from the > other, non-shaded Hadoop jars. > - Since the new shaded Hadoop clients no longer include 3rd-party > dependencies, users who used to depend on these now need to explicitly put > the jars in their class path. > Ideally the above should go into the release notes. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34531) Remove Experimental API tag in PrometheusServlet
[ https://issues.apache.org/jira/browse/SPARK-34531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-34531: -- Issue Type: Bug (was: Improvement) > Remove Experimental API tag in PrometheusServlet > > > Key: SPARK-34531 > URL: https://issues.apache.org/jira/browse/SPARK-34531 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.2.0, 3.1.2 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.1.2 > > > SPARK-31674 introduced an Experimental tag to PrometheusServlet but this is > actually not needed because the class itself isn't an API. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34531) Remove Experimental API tag in PrometheusServlet
[ https://issues.apache.org/jira/browse/SPARK-34531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-34531: -- Affects Version/s: (was: 3.1.2) 3.1.1 > Remove Experimental API tag in PrometheusServlet > > > Key: SPARK-34531 > URL: https://issues.apache.org/jira/browse/SPARK-34531 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.2, 3.2.0, 3.1.1 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.1.2 > > > SPARK-31674 introduced an Experimental tag to PrometheusServlet but this is > actually not needed because the class itself isn't an API. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34531) Remove Experimental API tag in PrometheusServlet
[ https://issues.apache.org/jira/browse/SPARK-34531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-34531: -- Affects Version/s: 3.0.2 > Remove Experimental API tag in PrometheusServlet > > > Key: SPARK-34531 > URL: https://issues.apache.org/jira/browse/SPARK-34531 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.2, 3.2.0, 3.1.2 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.1.2 > > > SPARK-31674 introduced an Experimental tag to PrometheusServlet but this is > actually not needed because the class itself isn't an API. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34531) Remove Experimental API tag in PrometheusServlet
[ https://issues.apache.org/jira/browse/SPARK-34531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-34531: -- Affects Version/s: 3.1.2 > Remove Experimental API tag in PrometheusServlet > > > Key: SPARK-34531 > URL: https://issues.apache.org/jira/browse/SPARK-34531 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0, 3.1.2 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.1.2 > > > SPARK-31674 introduced an Experimental tag to PrometheusServlet but this is > actually not needed because the class itself isn't an API. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34531) Remove Experimental API tag in PrometheusServlet
[ https://issues.apache.org/jira/browse/SPARK-34531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-34531: - Assignee: Hyukjin Kwon > Remove Experimental API tag in PrometheusServlet > > > Key: SPARK-34531 > URL: https://issues.apache.org/jira/browse/SPARK-34531 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > SPARK-31674 introduced an Experimental tag to PrometheusServlet but this is > actually not needed because the class itself isn't an API. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34531) Remove Experimental API tag in PrometheusServlet
[ https://issues.apache.org/jira/browse/SPARK-34531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-34531. --- Fix Version/s: 3.1.2 Resolution: Fixed Issue resolved by pull request 31640 [https://github.com/apache/spark/pull/31640] > Remove Experimental API tag in PrometheusServlet > > > Key: SPARK-34531 > URL: https://issues.apache.org/jira/browse/SPARK-34531 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.1.2 > > > SPARK-31674 introduced an Experimental tag to PrometheusServlet but this is > actually not needed because the class itself isn't an API. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34523) JDK-8194653: Deadlock involving FileSystems.getDefault and System.loadLibrary call
[ https://issues.apache.org/jira/browse/SPARK-34523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-34523: - Description: h2. Introduction This will cause a deadlock and hang concurrent tasks forever on the same executor. For example: In the Spark UI stage tab, you may find some of the tasks hang for hours while all others complete without delay. !screenshot-2.png! Also, you may find that these hanging tasks belong to the same executors. Usually, in this case, you will also get nothing helpful from the executor log. If you print the executor jstack, or check the thread dump via the Spark UI executor tab, and find some task thread blocked like below, you have very likely hit the JDK-8194653 issue. !screenshot-1.png! h2. Solutions Here are some options to circumvent this problem: 1. On the cluster manager side, you can update the JDK according to https://bugs.openjdk.java.net/browse/JDK-8194653 2. If you are not able to update the JDK for the entire cluster, you can use `spark.executorEnv.JAVA_HOME` to specify a suitable JRE for your apps 3. Also, turning on `spark.speculation` may let Spark automatically re-run the hanging tasks and bypass the problem was: h2. Introduction This will cause a deadlock and hang concurrent tasks forever on the same executor. For example: In the Spark UI stage tab, you may find some of the tasks hang for hours while all others complete without delay. !screenshot-2.png! Also, you may find that these hanging tasks belong to the same executors. Usually, in this case, you will also get nothing helpful from the executor log. If you print the executor jstack, or check the thread dump via the Spark UI executor tab, and find some task thread blocked like below, you have very likely hit the JDK-8194653 issue. !screenshot-1.png! h2. Solutions 1. Update the JDK version > JDK-8194653: Deadlock involving FileSystems.getDefault and > System.loadLibrary call > --- > > Key: SPARK-34523 > URL: https://issues.apache.org/jira/browse/SPARK-34523 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.4.7, 3.0.2, 3.1.1 >Reporter: Kent Yao >Priority: Major > Attachments: 4303.log, screenshot-1.png, screenshot-2.png > > > h2. Introduction > This will cause a deadlock and hang concurrent tasks forever on the same > executor. For example: > > In the Spark UI stage tab, you may find some of the tasks hang for hours while > all others complete without delay. > > !screenshot-2.png! > Also, you may find that these hanging tasks belong to the same executors. > Usually, in this case, you will also get nothing helpful from the executor > log. > If you print the executor jstack, or check the thread dump via the Spark UI > executor tab, and find some task thread blocked like below, you are very > likely to have hit the JDK-8194653 issue. > !screenshot-1.png! > h2. Solutions > Here are some options to circumvent this problem: > 1. On the cluster manager side, you can update the JDK according to > https://bugs.openjdk.java.net/browse/JDK-8194653 > 2. If you are not able to update the JDK for the entire cluster, > you can use `spark.executorEnv.JAVA_HOME` to specify a suitable JRE for your > apps > 3. Also, turning on `spark.speculation` may let Spark automatically re-run the > hanging tasks and bypass the problem -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34516) Spark 3.0.1 encounters parquet PageHeader IO issue
[ https://issues.apache.org/jira/browse/SPARK-34516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290599#comment-17290599 ] Dongjoon Hyun commented on SPARK-34516: --- Hi, [~angerszhuuu]. Could you provide a reproducer? For now, there is not much we can do. > Spark 3.0.1 encounters parquet PageHeader IO issue > - > > Key: SPARK-34516 > URL: https://issues.apache.org/jira/browse/SPARK-34516 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.1 >Reporter: angerszhu >Priority: Major > > {code:java} > Caused by: java.io.IOException: can not read class > org.apache.parquet.format.PageHeader: Required field 'uncompressed_page_size' > was not found in serialized data! Struct: > org.apache.parquet.format.PageHeader$PageHeaderStandardScheme@42a9002d > at org.apache.parquet.format.Util.read(Util.java:216) > at org.apache.parquet.format.Util.readPageHeader(Util.java:65) > at > org.apache.parquet.hadoop.ParquetFileReader$WorkaroundChunk.readPageHeader(ParquetFileReader.java:1064) > at > org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:950) > at > org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:807) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:313) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:268) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:171) > at > org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:173) > at > org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:93) > at > org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:491) > at > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34523) JDK-8194653: Deadlock involving FileSystems.getDefault and System.loadLibrary call
[ https://issues.apache.org/jira/browse/SPARK-34523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-34523: - Description: h2. Introduction This will cause a deadlock and hang concurrent tasks forever on the same executor. For example: In the Spark UI stage tab, you may find some of the tasks hang for hours while all others complete without delay. !screenshot-2.png! Also, you may find that these hanging tasks belong to the same executors. Usually, in this case, you will also get nothing helpful from the executor log. If you print the executor jstack, or check the thread dump via the Spark UI executor tab, and find some task thread blocked like below, you have very likely hit the JDK-8194653 issue. !screenshot-1.png! h2. Solutions 1. Update the JDK version was: This will cause a deadlock and hang concurrent tasks forever on the same executor. For example: In the Spark UI stage tab, you may find some of the tasks hang for hours while all others complete without delay. !screenshot-2.png! Also, you may find that these hanging tasks belong to the same executors. Usually, in this case, you will also get nothing helpful from the executor log. If you print the executor jstack, or check the thread dump via the Spark UI executor tab, and find some task thread blocked like below, you have very likely hit the JDK !screenshot-1.png! > JDK-8194653: Deadlock involving FileSystems.getDefault and > System.loadLibrary call > --- > > Key: SPARK-34523 > URL: https://issues.apache.org/jira/browse/SPARK-34523 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.4.7, 3.0.2, 3.1.1 >Reporter: Kent Yao >Priority: Major > Attachments: 4303.log, screenshot-1.png, screenshot-2.png > > > h2. Introduction > This will cause a deadlock and hang concurrent tasks forever on the same > executor. For example: > > In the Spark UI stage tab, you may find some of the tasks hang for hours while > all others complete without delay. > > !screenshot-2.png! > Also, you may find that these hanging tasks belong to the same executors. > Usually, in this case, you will also get nothing helpful from the executor > log. > If you print the executor jstack, or check the thread dump via the Spark UI > executor tab, and find some task thread blocked like below, you have very > likely hit the JDK-8194653 issue. > !screenshot-1.png! > h2. Solutions > 1. Update the JDK version -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34523) JDK-8194653: Deadlock involving FileSystems.getDefault and System.loadLibrary call
[ https://issues.apache.org/jira/browse/SPARK-34523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-34523: - Description: This will cause a deadlock and hang concurrent tasks forever on the same executor. For example: In the Spark UI stage tab, you may find some of the tasks hang for hours while all others complete without delay. !screenshot-2.png! Also, you may find that these hanging tasks belong to the same executors. Usually, in this case, you will also get nothing helpful from the executor log. If you print the executor jstack, or check the thread dump via the Spark UI executor tab, and find some task thread blocked like below, you have very likely hit the JDK !screenshot-1.png! was: This will cause a deadlock and hang concurrent tasks forever on the same executor. For example: In the Spark UI stage tab, you may find some tasks hang for hours !screenshot-2.png! !screenshot-1.png! > JDK-8194653: Deadlock involving FileSystems.getDefault and > System.loadLibrary call > --- > > Key: SPARK-34523 > URL: https://issues.apache.org/jira/browse/SPARK-34523 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.4.7, 3.0.2, 3.1.1 >Reporter: Kent Yao >Priority: Major > Attachments: 4303.log, screenshot-1.png, screenshot-2.png > > > This will cause a deadlock and hang concurrent tasks forever on the same > executor. For example: > > In the Spark UI stage tab, you may find some of the tasks hang for hours while > all others complete without delay. > > !screenshot-2.png! > Also, you may find that these hanging tasks belong to the same executors. > Usually, in this case, you will also get nothing helpful from the executor > log. > If you print the executor jstack, or check the thread dump via the Spark UI > executor tab, and find some task thread blocked like below, you have very > likely hit the JDK > !screenshot-1.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34523) JDK-8194653: Deadlock involving FileSystems.getDefault and System.loadLibrary call
[ https://issues.apache.org/jira/browse/SPARK-34523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-34523: - Description: This will cause a deadlock and hang concurrent tasks forever on the same executor. For example: In the Spark UI stage tab, you may find some tasks hang for hours !screenshot-2.png! !screenshot-1.png! was: This will cause a deadlock and hang concurrent tasks forever on the same executor. For example: !screenshot-1.png! > JDK-8194653: Deadlock involving FileSystems.getDefault and > System.loadLibrary call > --- > > Key: SPARK-34523 > URL: https://issues.apache.org/jira/browse/SPARK-34523 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.4.7, 3.0.2, 3.1.1 >Reporter: Kent Yao >Priority: Major > Attachments: 4303.log, screenshot-1.png, screenshot-2.png > > > This will cause a deadlock and hang concurrent tasks forever on the same > executor. For example: > > In the Spark UI stage tab, you may find some tasks hang for hours > !screenshot-2.png! > !screenshot-1.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34523) JDK-8194653: Deadlock involving FileSystems.getDefault and System.loadLibrary call
[ https://issues.apache.org/jira/browse/SPARK-34523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-34523: - Attachment: screenshot-2.png > JDK-8194653: Deadlock involving FileSystems.getDefault and > System.loadLibrary call > --- > > Key: SPARK-34523 > URL: https://issues.apache.org/jira/browse/SPARK-34523 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.4.7, 3.0.2, 3.1.1 >Reporter: Kent Yao >Priority: Major > Attachments: 4303.log, screenshot-1.png, screenshot-2.png > > > This will cause a deadlock and hang concurrent tasks forever on the same > executor. For example: > > !screenshot-1.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34497) JDBC connection provider is not removing kerberos credentials from JVM security context
[ https://issues.apache.org/jira/browse/SPARK-34497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-34497: Assignee: Gabor Somogyi > JDBC connection provider is not removing kerberos credentials from JVM > security context > --- > > Key: SPARK-34497 > URL: https://issues.apache.org/jira/browse/SPARK-34497 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.2.0, 3.1.2 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Major > > Some of the built-in JDBC connection providers are changing the JVM security > context to do the authentication which is fine. The problematic part is that > executors can be reused by another query. The following situation leads to > incorrect behaviour: > * Query1 opens JDBC connection and changes JVM security context in Executor1 > * Query2 tries to open JDBC connection but it realizes there is already an > entry for that DB type in Executor1 > * Query2 is not changing JVM security context and uses Query1 keytab and > principal > * Query2 fails with authentication error -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34497) JDBC connection provider is not removing kerberos credentials from JVM security context
[ https://issues.apache.org/jira/browse/SPARK-34497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-34497. -- Fix Version/s: 3.1.2 Resolution: Fixed Issue resolved by pull request 31622 [https://github.com/apache/spark/pull/31622] > JDBC connection provider is not removing kerberos credentials from JVM > security context > --- > > Key: SPARK-34497 > URL: https://issues.apache.org/jira/browse/SPARK-34497 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2, 3.2.0, 3.1.2 >Reporter: Gabor Somogyi >Assignee: Gabor Somogyi >Priority: Major > Fix For: 3.1.2 > > > Some of the built-in JDBC connection providers are changing the JVM security > context to do the authentication which is fine. The problematic part is that > executors can be reused by another query. The following situation leads to > incorrect behaviour: > * Query1 opens JDBC connection and changes JVM security context in Executor1 > * Query2 tries to open JDBC connection but it realizes there is already an > entry for that DB type in Executor1 > * Query2 is not changing JVM security context and uses Query1 keytab and > principal > * Query2 fails with authentication error -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34532) IntervalUtils.add() may result in 'long overflow'
Ted Yu created SPARK-34532: -- Summary: IntervalUtils.add() may result in 'long overflow' Key: SPARK-34532 URL: https://issues.apache.org/jira/browse/SPARK-34532 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.2 Reporter: Ted Yu I noticed the following when running the test suite: {code} 19:15:38.255 ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 14744.0 (TID 16705) java.lang.ArithmeticException: long overflow at java.lang.Math.addExact(Math.java:809) at org.apache.spark.sql.types.LongExactNumeric$.plus(numerics.scala:105) at org.apache.spark.sql.types.LongExactNumeric$.plus(numerics.scala:104) at org.apache.spark.sql.catalyst.expressions.Add.nullSafeEval(arithmetic.scala:268) at org.apache.spark.sql.catalyst.expressions.BinaryExpression.eval(Expression.scala:573) at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(InterpretedMutableProjection.scala:97) {code} This was likely caused by the following line: {code} val microseconds = left.microseconds + right.microseconds {code} We should check whether the addition would produce overflow before adding. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34523) JDK-8194653: Deadlock involving FileSystems.getDefault and System.loadLibrary call
[ https://issues.apache.org/jira/browse/SPARK-34523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-34523: - Description: This will cause a deadlock and hang concurrent tasks forever on the same executor. For example: !screenshot-1.png! > JDK-8194653: Deadlock involving FileSystems.getDefault and > System.loadLibrary call > --- > > Key: SPARK-34523 > URL: https://issues.apache.org/jira/browse/SPARK-34523 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.4.7, 3.0.2, 3.1.1 >Reporter: Kent Yao >Priority: Major > Attachments: 4303.log, screenshot-1.png > > > This will cause a deadlock and hang concurrent tasks forever on the same > executor. For example: > > !screenshot-1.png! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34523) JDK-8194653: Deadlock involving FileSystems.getDefault and System.loadLibrary call
[ https://issues.apache.org/jira/browse/SPARK-34523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-34523: - Attachment: screenshot-1.png > JDK-8194653: Deadlock involving FileSystems.getDefault and > System.loadLibrary call > --- > > Key: SPARK-34523 > URL: https://issues.apache.org/jira/browse/SPARK-34523 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.4.7, 3.0.2, 3.1.1 >Reporter: Kent Yao >Priority: Major > Attachments: 4303.log, screenshot-1.png > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34531) Remove Experimental API tag in PrometheusServlet
[ https://issues.apache.org/jira/browse/SPARK-34531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34531: Assignee: Apache Spark > Remove Experimental API tag in PrometheusServlet > > > Key: SPARK-34531 > URL: https://issues.apache.org/jira/browse/SPARK-34531 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > SPARK-31674 introduced an Experimental tag to PrometheusServlet but this is > actually not needed because the class itself isn't an API. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34531) Remove Experimental API tag in PrometheusServlet
[ https://issues.apache.org/jira/browse/SPARK-34531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34531: Assignee: (was: Apache Spark) > Remove Experimental API tag in PrometheusServlet > > > Key: SPARK-34531 > URL: https://issues.apache.org/jira/browse/SPARK-34531 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Priority: Major > > SPARK-31674 introduced an Experimental tag to PrometheusServlet but this is > actually not needed because the class itself isn't an API. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34531) Remove Experimental API tag in PrometheusServlet
[ https://issues.apache.org/jira/browse/SPARK-34531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290561#comment-17290561 ] Apache Spark commented on SPARK-34531: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/31640 > Remove Experimental API tag in PrometheusServlet > > > Key: SPARK-34531 > URL: https://issues.apache.org/jira/browse/SPARK-34531 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Priority: Major > > SPARK-31674 introduced an Experimental tag to PrometheusServlet but this is > actually not needed because the class itself isn't an API. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34531) Remove Experimental API tag in PrometheusServlet
[ https://issues.apache.org/jira/browse/SPARK-34531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290563#comment-17290563 ] Apache Spark commented on SPARK-34531: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/31640 > Remove Experimental API tag in PrometheusServlet > > > Key: SPARK-34531 > URL: https://issues.apache.org/jira/browse/SPARK-34531 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.2.0 >Reporter: Hyukjin Kwon >Priority: Major > > SPARK-31674 introduced an Experimental tag to PrometheusServlet but this is > actually not needed because the class itself isn't an API. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34531) Remove Experimental API tag in PrometheusServlet
Hyukjin Kwon created SPARK-34531: Summary: Remove Experimental API tag in PrometheusServlet Key: SPARK-34531 URL: https://issues.apache.org/jira/browse/SPARK-34531 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.2.0 Reporter: Hyukjin Kwon SPARK-31674 introduced an Experimental tag to PrometheusServlet but this is actually not needed because the class itself isn't an API. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34530) logError for interrupting block migrations is too high
Holden Karau created SPARK-34530: Summary: logError for interrupting block migrations is too high Key: SPARK-34530 URL: https://issues.apache.org/jira/browse/SPARK-34530 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.1.0, 3.2.0, 3.1.1 Reporter: Holden Karau -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34529) spark.read.csv is throwing the exception "'lineSep' can contain only 1 character" when parsing the Windows line feed (CR LF)
[ https://issues.apache.org/jira/browse/SPARK-34529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290539#comment-17290539 ] Takeshi Yamamuro commented on SPARK-34529: -- Since I think this is not a bug but an improvement, I changed the type. > spark.read.csv is throwing the exception "'lineSep' can contain only 1 character" > when parsing the Windows line feed (CR LF) > > > Key: SPARK-34529 > URL: https://issues.apache.org/jira/browse/SPARK-34529 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.2.0, 3.1.1, 3.0.3 >Reporter: Shanmugavel Kuttiyandi Chandrakasu >Priority: Minor > > The `lineSep` documentation says: > `lineSep` (default covers all `\r`, `\r\n` and `\n`): defines the line > separator that should be used for parsing. Maximum length is 1 character. > Reference: > > [https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader] > When reading a CSV file using Spark: > src_df = (spark.read > .option("header", "true") > .option("multiLine","true") > .option("escape", "ǁ") > .option("lineSep","\r\n") > .schema(materialusetype_Schema) > .option("badRecordsPath","/fh_badfile") > .csv("/crlf.csv") > ) > Below is the stack trace: > java.lang.IllegalArgumentException: requirement failed: 'lineSep' can contain > only 1 character.java.lang.IllegalArgumentException: requirement failed: > 'lineSep' can contain only 1 character. at > scala.Predef$.require(Predef.scala:281) at > org.apache.spark.sql.catalyst.csv.CSVOptions.$anonfun$lineSeparator$1(CSVOptions.scala:209) > at scala.Option.map(Option.scala:230) at > org.apache.spark.sql.catalyst.csv.CSVOptions.<init>(CSVOptions.scala:207) at > org.apache.spark.sql.catalyst.csv.CSVOptions.<init>(CSVOptions.scala:58) at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.buildReader(CSVFileFormat.scala:108) > at > org.apache.spark.sql.execution.datasources.FileFormat.buildReaderWithPartitionValues(FileFormat.scala:132) > at > org.apache.spark.sql.execution.datasources.FileFormat.buildReaderWithPartitionValues$(FileFormat.scala:123) > at > org.apache.spark.sql.execution.datasources.TextBasedFileFormat.buildReaderWithPartitionValues(FileFormat.scala:162) > at > org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:510) > at > org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:497) > at > org.apache.spark.sql.execution.FileSourceScanExec.doExecute(DataSourceScanExec.scala:692) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:196) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:236) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:192) at > org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:79) > at > org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:88) > at > org.apache.spark.sql.execution.collect.InternalRowFormat$.collect(cachedSparkResults.scala:61) > at > org.apache.spark.sql.execution.collect.InternalRowFormat$.collect(cachedSparkResults.scala:57) > at > org.apache.spark.sql.execution.ResultCacheManager.$anonfun$getOrComputeResultInternal$1(ResultCacheManager.scala:483) > at scala.Option.getOrElse(Option.scala:189) at >
org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResultInternal(ResultCacheManager.scala:483) > at > org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:427) > at > org.apache.spark.sql.execution.CollectLimitExec.executeCollectResult(limit.scala:58) > at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:3013) at > org.apache.spark.sql.Dataset.$anonfun$collectResult$1(Dataset.scala:3004) at > org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3728) at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$5(SQLExecution.scala:116) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:248) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:101) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:841) at > org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:77) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:198) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3726) at
[jira] [Updated] (SPARK-34529) spark.read.csv is throwing the exception "'lineSep' can contain only 1 character" when parsing the Windows line feed (CR LF)
[ https://issues.apache.org/jira/browse/SPARK-34529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-34529: - Component/s: (was: Spark Core) SQL > spark.read.csv is throwing the exception "'lineSep' can contain only 1 character" > when parsing the Windows line feed (CR LF) > > > Key: SPARK-34529 > URL: https://issues.apache.org/jira/browse/SPARK-34529 > Project: Spark > Issue Type: Improvement > Components: PySpark, SQL >Affects Versions: 3.2.0, 3.1.1, 3.0.3 >Reporter: Shanmugavel Kuttiyandi Chandrakasu >Priority: Minor > > The `lineSep` documentation says: > `lineSep` (default covers all `\r`, `\r\n` and `\n`): defines the line > separator that should be used for parsing. Maximum length is 1 character. > Reference: > > [https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader] > When reading a CSV file using Spark: > src_df = (spark.read > .option("header", "true") > .option("multiLine","true") > .option("escape", "ǁ") > .option("lineSep","\r\n") > .schema(materialusetype_Schema) > .option("badRecordsPath","/fh_badfile") > .csv("/crlf.csv") > ) > Below is the stack trace: > java.lang.IllegalArgumentException: requirement failed: 'lineSep' can contain > only 1 character.java.lang.IllegalArgumentException: requirement failed: > 'lineSep' can contain only 1 character. at > scala.Predef$.require(Predef.scala:281) at > org.apache.spark.sql.catalyst.csv.CSVOptions.$anonfun$lineSeparator$1(CSVOptions.scala:209) > at scala.Option.map(Option.scala:230) at > org.apache.spark.sql.catalyst.csv.CSVOptions.<init>(CSVOptions.scala:207) at > org.apache.spark.sql.catalyst.csv.CSVOptions.<init>(CSVOptions.scala:58) at > org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.buildReader(CSVFileFormat.scala:108) > at > org.apache.spark.sql.execution.datasources.FileFormat.buildReaderWithPartitionValues(FileFormat.scala:132) > at > org.apache.spark.sql.execution.datasources.FileFormat.buildReaderWithPartitionValues$(FileFormat.scala:123) > at > org.apache.spark.sql.execution.datasources.TextBasedFileFormat.buildReaderWithPartitionValues(FileFormat.scala:162) > at > org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:510) > at > org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:497) > at > org.apache.spark.sql.execution.FileSourceScanExec.doExecute(DataSourceScanExec.scala:692) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:196) > at > org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:236) at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:192) at > org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:79) > at > org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:88) > at > org.apache.spark.sql.execution.collect.InternalRowFormat$.collect(cachedSparkResults.scala:61) > at > org.apache.spark.sql.execution.collect.InternalRowFormat$.collect(cachedSparkResults.scala:57) > at > org.apache.spark.sql.execution.ResultCacheManager.$anonfun$getOrComputeResultInternal$1(ResultCacheManager.scala:483) > at scala.Option.getOrElse(Option.scala:189) at > org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResultInternal(ResultCacheManager.scala:483) > at >
org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:427) > at > org.apache.spark.sql.execution.CollectLimitExec.executeCollectResult(limit.scala:58) > at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:3013) at > org.apache.spark.sql.Dataset.$anonfun$collectResult$1(Dataset.scala:3004) at > org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3728) at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$5(SQLExecution.scala:116) > at > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:248) > at > org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:101) > at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:841) at > org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:77) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:198) > at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3726) at > org.apache.spark.sql.Dataset.collectResult(Dataset.scala:3003)
[jira] [Updated] (SPARK-34529) spark.read.csv throws the exception "'lineSep' can contain only 1 character" when parsing Windows line feeds (CR LF)
[ https://issues.apache.org/jira/browse/SPARK-34529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-34529: - Affects Version/s: (was: 3.0.1) 3.0.3 3.1.1 3.2.0

> spark.read.csv throws the exception "'lineSep' can contain only 1 character" when parsing Windows line feeds (CR LF)
>
> Key: SPARK-34529
> URL: https://issues.apache.org/jira/browse/SPARK-34529
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, Spark Core
> Affects Versions: 3.2.0, 3.1.1, 3.0.3
> Reporter: Shanmugavel Kuttiyandi Chandrakasu
> Priority: Minor

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34529) spark.read.csv throws the exception "'lineSep' can contain only 1 character" when parsing Windows line feeds (CR LF)
[ https://issues.apache.org/jira/browse/SPARK-34529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro updated SPARK-34529: - Issue Type: Improvement (was: Bug)

> spark.read.csv throws the exception "'lineSep' can contain only 1 character" when parsing Windows line feeds (CR LF)
>
> Key: SPARK-34529
> URL: https://issues.apache.org/jira/browse/SPARK-34529
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, Spark Core
> Affects Versions: 3.0.1
> Reporter: Shanmugavel Kuttiyandi Chandrakasu
> Priority: Minor

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34528) View results are not consistent after a modification inside a struct of the table
[ https://issues.apache.org/jira/browse/SPARK-34528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34528: Assignee: Apache Spark

> View results are not consistent after a modification inside a struct of the table
>
> Key: SPARK-34528
> URL: https://issues.apache.org/jira/browse/SPARK-34528
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.2.0
> Reporter: Thomas Prelle
> Assignee: Apache Spark
> Priority: Major

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34528) View results are not consistent after a modification inside a struct of the table
[ https://issues.apache.org/jira/browse/SPARK-34528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34528: Assignee: (was: Apache Spark)

> View results are not consistent after a modification inside a struct of the table
>
> Key: SPARK-34528
> URL: https://issues.apache.org/jira/browse/SPARK-34528
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.2.0
> Reporter: Thomas Prelle
> Priority: Major

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34528) View results are not consistent after a modification inside a struct of the table
[ https://issues.apache.org/jira/browse/SPARK-34528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290275#comment-17290275 ] Apache Spark commented on SPARK-34528: -- User 'tprelle' has created a pull request for this issue: https://github.com/apache/spark/pull/31639

> View results are not consistent after a modification inside a struct of the table
>
> Key: SPARK-34528
> URL: https://issues.apache.org/jira/browse/SPARK-34528
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.2.0
> Reporter: Thomas Prelle
> Priority: Major

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34529) spark.read.csv throws the exception "'lineSep' can contain only 1 character" when parsing Windows line feeds (CR LF)
Shanmugavel Kuttiyandi Chandrakasu created SPARK-34529: -- Summary: spark.read.csv throws the exception "'lineSep' can contain only 1 character" when parsing Windows line feeds (CR LF) Key: SPARK-34529 URL: https://issues.apache.org/jira/browse/SPARK-34529 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 3.0.1 Reporter: Shanmugavel Kuttiyandi Chandrakasu

The lineSep documentation says: `lineSep` (default covers all `\r`, `\r\n` and `\n`): defines the line separator that should be used for parsing. Maximum length is 1 character. Reference: [https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameReader]

When reading a CSV file using Spark:

src_df = (spark.read
  .option("header", "true")
  .option("multiLine", "true")
  .option("escape", "ǁ")
  .option("lineSep", "\r\n")
  .schema(materialusetype_Schema)
  .option("badRecordsPath", "/fh_badfile")
  .csv("/crlf.csv")
)

Below is the stack trace:

java.lang.IllegalArgumentException: requirement failed: 'lineSep' can contain only 1 character.
 at scala.Predef$.require(Predef.scala:281)
 at org.apache.spark.sql.catalyst.csv.CSVOptions.$anonfun$lineSeparator$1(CSVOptions.scala:209)
 at scala.Option.map(Option.scala:230)
 at org.apache.spark.sql.catalyst.csv.CSVOptions.<init>(CSVOptions.scala:207)
 at org.apache.spark.sql.catalyst.csv.CSVOptions.<init>(CSVOptions.scala:58)
 at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.buildReader(CSVFileFormat.scala:108)
 at org.apache.spark.sql.execution.datasources.FileFormat.buildReaderWithPartitionValues(FileFormat.scala:132)
 at org.apache.spark.sql.execution.datasources.FileFormat.buildReaderWithPartitionValues$(FileFormat.scala:123)
 at org.apache.spark.sql.execution.datasources.TextBasedFileFormat.buildReaderWithPartitionValues(FileFormat.scala:162)
 at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:510)
 at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:497)
 at org.apache.spark.sql.execution.FileSourceScanExec.doExecute(DataSourceScanExec.scala:692)
 at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:196)
 at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:240)
 at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:165)
 at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:236)
 at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:192)
 at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:79)
 at org.apache.spark.sql.execution.collect.Collector$.collect(Collector.scala:88)
 at org.apache.spark.sql.execution.collect.InternalRowFormat$.collect(cachedSparkResults.scala:61)
 at org.apache.spark.sql.execution.collect.InternalRowFormat$.collect(cachedSparkResults.scala:57)
 at org.apache.spark.sql.execution.ResultCacheManager.$anonfun$getOrComputeResultInternal$1(ResultCacheManager.scala:483)
 at scala.Option.getOrElse(Option.scala:189)
 at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResultInternal(ResultCacheManager.scala:483)
 at org.apache.spark.sql.execution.ResultCacheManager.getOrComputeResult(ResultCacheManager.scala:427)
 at org.apache.spark.sql.execution.CollectLimitExec.executeCollectResult(limit.scala:58)
 at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:3013)
 at org.apache.spark.sql.Dataset.$anonfun$collectResult$1(Dataset.scala:3004)
 at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3728)
 at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$5(SQLExecution.scala:116)
 at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:248)
 at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withCustomExecutionEnv$1(SQLExecution.scala:101)
 at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:841)
 at org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:77)
 at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:198)
 at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3726)
 at org.apache.spark.sql.Dataset.collectResult(Dataset.scala:3003)

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
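For context on the report above: `CSVOptions` enforces the single-character limit on `lineSep`, so an explicit `"\r\n"` separator always fails that check. A minimal workaround sketch follows (Scala API; only the `/crlf.csv` path comes from the report, and an active `spark` session is assumed). Since the default line-separator handling already covers `\r`, `\r\n` and `\n`, omitting the option lets CRLF files parse:

{code:scala}
// Workaround sketch: drop the explicit lineSep and rely on the default,
// which already treats \r, \r\n and \n as record delimiters.
val df = spark.read
  .option("header", "true")
  .option("multiLine", "true")
  .csv("/crlf.csv")
{code}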
[jira] [Created] (SPARK-34528) View results are not consistent after a modification inside a struct of the table
Thomas Prelle created SPARK-34528: - Summary: View results are not consistent after a modification inside a struct of the table Key: SPARK-34528 URL: https://issues.apache.org/jira/browse/SPARK-34528 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.2.0 Reporter: Thomas Prelle

After [https://github.com/apache/spark/pull/31368] (work to simplify Hive view resolution), I found a bug, because Hive allows you to change the field order inside a struct:

1) Create a table in Hive with a struct: CREATE TABLE test_struct (id INT, sub STRUCT<a:INT, b:STRING>);
2) Insert data into it: INSERT INTO TABLE test_struct SELECT 1, named_struct("a", 1, "b", "v1");
3) Create a view on top of it: CREATE VIEW test_view_struct AS SELECT id, sub FROM test_struct;
4) Reorder the fields of the struct: ALTER TABLE test_struct CHANGE COLUMN sub sub STRUCT<b:STRING, a:INT>;
5) Spark can no longer query the view, because struct fields in Spark are resolved by position, not by column name.

If the reordered struct is castable, the query can even silently return wrong results.

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
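The reported steps condense into the following hedged sketch (Scala, against a Hive-enabled SparkSession; the ALTER statement has to be run through Hive itself, and the table/view names are the ones from the report):

{code:scala}
spark.sql("CREATE TABLE test_struct (id INT, sub STRUCT<a: INT, b: STRING>) USING hive")
spark.sql("INSERT INTO TABLE test_struct SELECT 1, named_struct('a', 1, 'b', 'v1')")
spark.sql("CREATE VIEW test_view_struct AS SELECT id, sub FROM test_struct")

// In Hive (not Spark), reorder the struct fields:
//   ALTER TABLE test_struct CHANGE COLUMN sub sub STRUCT<b: STRING, a: INT>;

// Spark resolves struct fields by position, so this now either fails or,
// when the positional cast succeeds, silently binds values to the wrong fields.
spark.sql("SELECT * FROM test_view_struct").show()
{code}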
[jira] [Resolved] (SPARK-32617) Upgrade kubernetes client version to support latest minikube version.
[ https://issues.apache.org/jira/browse/SPARK-32617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau resolved SPARK-32617. -- Fix Version/s: 3.2.0 Assignee: Attila Zsolt Piros Resolution: Fixed

> Upgrade kubernetes client version to support latest minikube version.
> -
>
> Key: SPARK-32617
> URL: https://issues.apache.org/jira/browse/SPARK-32617
> Project: Spark
> Issue Type: Bug
> Components: Kubernetes
> Affects Versions: 3.1.0
> Reporter: Prashant Sharma
> Assignee: Attila Zsolt Piros
> Priority: Major
> Fix For: 3.2.0
>
> The following error occurs when the k8s integration tests are run against a minikube cluster with version 1.2.1:
> {code:java}
> Run starting. Expected test count is: 18
> KubernetesSuite:
> org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite *** ABORTED ***
> io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred.
> at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
> at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:53)
> at io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:196)
> at io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:62)
> at io.fabric8.kubernetes.client.BaseClient.<init>(BaseClient.java:51)
> at io.fabric8.kubernetes.client.DefaultKubernetesClient.<init>(DefaultKubernetesClient.java:105)
> at org.apache.spark.deploy.k8s.integrationtest.backend.minikube.Minikube$.getKubernetesClient(Minikube.scala:81)
> at org.apache.spark.deploy.k8s.integrationtest.backend.minikube.MinikubeTestBackend$.initialize(MinikubeTestBackend.scala:33)
> at org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.beforeAll(KubernetesSuite.scala:131)
> at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
> ...
> Cause: java.nio.file.NoSuchFileException: /root/.minikube/apiserver.crt
> at sun.nio.fs.UnixException.translateToIOException(UnixException.java:86)
> at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
> at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
> at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
> at java.nio.file.Files.newByteChannel(Files.java:361)
> at java.nio.file.Files.newByteChannel(Files.java:407)
> at java.nio.file.Files.readAllBytes(Files.java:3152)
> at io.fabric8.kubernetes.client.internal.CertUtils.getInputStreamFromDataOrFile(CertUtils.java:72)
> at io.fabric8.kubernetes.client.internal.CertUtils.createKeyStore(CertUtils.java:242)
> at io.fabric8.kubernetes.client.internal.SSLUtils.keyManagers(SSLUtils.java:128)
> ...
> Run completed in 1 second, 821 milliseconds.
> Total number of tests run: 0
> Suites: completed 1, aborted 1
> Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0
> *** 1 SUITE ABORTED ***
> [INFO]
> [INFO] Reactor Summary for Spark Project Parent POM 3.1.0-SNAPSHOT:
> [INFO]
> [INFO] Spark Project Parent POM ... SUCCESS [ 4.454 s]
> [INFO] Spark Project Tags . SUCCESS [ 4.768 s]
> [INFO] Spark Project Local DB . SUCCESS [ 2.961 s]
> [INFO] Spark Project Networking ... SUCCESS [ 4.258 s]
> [INFO] Spark Project Shuffle Streaming Service  SUCCESS [ 5.703 s]
> [INFO] Spark Project Unsafe ... SUCCESS [ 3.239 s]
> [INFO] Spark Project Launcher . SUCCESS [ 3.224 s]
> [INFO] Spark Project Core . SUCCESS [02:25 min]
> [INFO] Spark Project Kubernetes Integration Tests . FAILURE [ 17.244 s]
> [INFO]
> [INFO] BUILD FAILURE
> [INFO]
> [INFO] Total time: 03:12 min
> [INFO] Finished at: 2020-08-11T06:26:15-05:00
> [INFO]
> [ERROR] Failed to execute goal org.scalatest:scalatest-maven-plugin:2.0.0:test (integration-test) on project spark-kubernetes-integration-tests_2.12: There are test failures -> [Help 1]
> [ERROR]
> [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
> [ERROR] Re-run Maven using the -X switch to enable full debug logging.
> [ERROR]
> [ERROR] For more information about the errors and possible solutions, please read the following articles:
> {code}

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34527) De-duplicated common columns cannot be resolved from USING/NATURAL JOIN
[ https://issues.apache.org/jira/browse/SPARK-34527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290211#comment-17290211 ] Karen Feng commented on SPARK-34527: I've implemented a fix for this, will push a PR. > De-duplicated common columns cannot be resolved from USING/NATURAL JOIN > --- > > Key: SPARK-34527 > URL: https://issues.apache.org/jira/browse/SPARK-34527 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Karen Feng >Priority: Minor > > USING/NATURAL JOINS today have unexpectedly asymmetric behavior when > resolving the duplicated common columns. For example, the left key columns > can be resolved from a USING INNER JOIN, but the right key columns cannot. > This is due to the Analyzer's > [rewrite|https://github.com/apache/spark/blob/999d3b89b6df14a5ccb94ffc2ffadb82964e9f7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L3397] > of NATURAL/USING JOINs, which uses Project to remove the duplicated common > columns. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34527) De-duplicated common columns cannot be resolved from USING/NATURAL JOIN
Karen Feng created SPARK-34527: -- Summary: De-duplicated common columns cannot be resolved from USING/NATURAL JOIN Key: SPARK-34527 URL: https://issues.apache.org/jira/browse/SPARK-34527 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Karen Feng USING/NATURAL JOINS today have unexpectedly asymmetric behavior when resolving the duplicated common columns. For example, the left key columns can be resolved from a USING INNER JOIN, but the right key columns cannot. This is due to the Analyzer's [rewrite|https://github.com/apache/spark/blob/999d3b89b6df14a5ccb94ffc2ffadb82964e9f7d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L3397] of NATURAL/USING JOINs, which uses Project to remove the duplicated common columns. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
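A small illustration of the asymmetry described above (hypothetical table names; assumes an active `spark` session):

{code:scala}
import spark.implicits._

Seq((1, "a")).toDF("key", "v1").createOrReplaceTempView("t1")
Seq((1, "b")).toDF("key", "v2").createOrReplaceTempView("t2")

// The left-side reference to the de-duplicated common column resolves fine...
spark.sql("SELECT t1.key FROM t1 JOIN t2 USING (key)").show()

// ...while the equivalent right-side reference fails to resolve, because the
// analyzer's rewrite projects the common column away from the right child.
spark.sql("SELECT t2.key FROM t1 JOIN t2 USING (key)").show()  // AnalysisException
{code}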
[jira] [Updated] (SPARK-34484) Rename `map` to `mapAttr` in Catalyst DSL
[ https://issues.apache.org/jira/browse/SPARK-34484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-34484: --- Summary: Rename `map` to `mapAttr` in Catalyst DSL (was: Introduce a new syntax to represent map types with the Catalyst DSL) > Rename `map` to `mapAttr` in Catalyst DSL > - > > Key: SPARK-34484 > URL: https://issues.apache.org/jira/browse/SPARK-34484 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta >Priority: Major > > With the Catalyst DSL (dsl/package.scala), we have two ways to represent > attributes. > 1. Symbol literals (`'` syntax) > 2. `$""` syntax which is defined in `sql/catalyst` module using string > context. > But they have problems. > Regarding symbol literals, the scala community deprecates the symbol literals > in Scala 2.13. We could alternatively use `Symbol` constructor but what is > worse, Scala will completely remove `Symbol` in the future > (https://scalacenter.github.io/scala-3-migration-guide/docs/incompatibilities/dropped-features.html). > {code} > Although scala.Symbol is useful for migration, beware that it is deprecated > and that it will be removed from the scala-library. You are recommended, as a > second step, to replace them with plain string literals "xwy" or a dedicated > class. > {code} > Regarding `$""` syntax, this has two problems. > The first problem is that the syntax conflicts with another `$""` syntax > defined in `sql/core` module. > You can easily see the problem with the Spark Shell. > {code} > import org.apache.spark.sql.catalyst.dsl.expressions._ > val attr1 = $"attr1" >error: type mismatch; > found : StringContext > required: ?{def $: ?} >Note that implicit conversions are not applicable because they are > ambiguous: > both method StringToColumn in class SQLImplicits of type (sc: > StringContext): spark.implicits.StringToColumn > and method StringToAttributeConversionHelper in trait > ExpressionConversions of type (sc: StringContext): > org.apache.spark.sql.catalyst.dsl.expressions.StringToAttributeConversionHelper > are possible conversion functions from StringContext to ?{def $: ?} > {code} > The second problem is that we can't write like `$"attr".map(StringType, > StringType)`, though we can write `'attr.map(StringType, StringType)`. > This seems to be a bug of the Scala compiler and will be fixed in neither > `2.12` nor `2.13` (https://github.com/scala/scala/pull/7396). > Actually, I'm working on replacing all the symbol literals with `$""` syntax > in SPARK-34443 and I found this problem in the following test code. > * EncoderResolutionSuite.scala > * ComplexTypeSuite.scala > * ObjectExpressionsSuite.scala > * NestedColumnAliasingSuite.scala > * ReplaceNullWithFalseInPredicateSuite.scala > * SimplifyCastsSuite.scala > * SimplifyConditionalSuite.scala > {code} > [error] > /home/kou/work/oss/spark-scala-2.13/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/EncoderResolutionSuite.scala:212:28: > too many arguments (found 2, expected 1) for method map: (f: > org.apache.spark.sql.catalyst.expressions.Expression => A): Seq[A] > [error] $"a".map(StringType, StringType)).foreach { attr => > {code} > So, it's better to have another way to represent attributes with DSL. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
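For illustration only, a dedicated helper of the shape the new name suggests could look like the sketch below; this is an assumption based on the ticket summary, not the merged API:

{code:scala}
import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.types.{DataType, MapType, StringType}

// Hypothetical sketch: builds a map-typed attribute directly, avoiding both
// the deprecated 'attr symbol syntax and the $"" overload that trips over
// the Scala compiler bug mentioned above.
def mapAttr(name: String, keyType: DataType, valueType: DataType): AttributeReference =
  AttributeReference(name, MapType(keyType, valueType))()

val attr = mapAttr("attr", StringType, StringType)
{code}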
[jira] [Assigned] (SPARK-34526) Add a flag to skip checking file sink format and handle glob path
[ https://issues.apache.org/jira/browse/SPARK-34526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34526: Assignee: (was: Apache Spark) > Add a flag to skip checking file sink format and handle glob path > - > > Key: SPARK-34526 > URL: https://issues.apache.org/jira/browse/SPARK-34526 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Yuanjian Li >Priority: Major > > This ticket fixes the following issues related to file sink format checking > together: > * Some users may use a very long glob path to read and `isDirectory` may > fail when the path is too long. We should ignore the error when the path is a > glob path since the file streaming sink doesn’t support glob paths. > * Checking whether a directory is outputted by File Streaming Sink may fail > for various issues happening in the storage. We should add a flag to allow > users to disable the checking logic and read the directory as a batch output. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34526) Add a flag to skip checking file sink format and handle glob path
[ https://issues.apache.org/jira/browse/SPARK-34526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290177#comment-17290177 ] Apache Spark commented on SPARK-34526: -- User 'xuanyuanking' has created a pull request for this issue: https://github.com/apache/spark/pull/31638 > Add a flag to skip checking file sink format and handle glob path > - > > Key: SPARK-34526 > URL: https://issues.apache.org/jira/browse/SPARK-34526 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Yuanjian Li >Priority: Major > > This ticket fixes the following issues related to file sink format checking > together: > * Some users may use a very long glob path to read and `isDirectory` may > fail when the path is too long. We should ignore the error when the path is a > glob path since the file streaming sink doesn’t support glob paths. > * Checking whether a directory is outputted by File Streaming Sink may fail > for various issues happening in the storage. We should add a flag to allow > users to disable the checking logic and read the directory as a batch output. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34526) Add a flag to skip checking file sink format and handle glob path
[ https://issues.apache.org/jira/browse/SPARK-34526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34526: Assignee: Apache Spark > Add a flag to skip checking file sink format and handle glob path > - > > Key: SPARK-34526 > URL: https://issues.apache.org/jira/browse/SPARK-34526 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Yuanjian Li >Assignee: Apache Spark >Priority: Major > > This ticket fixes the following issues related to file sink format checking > together: > * Some users may use a very long glob path to read and `isDirectory` may > fail when the path is too long. We should ignore the error when the path is a > glob path since the file streaming sink doesn’t support glob paths. > * Checking whether a directory is outputted by File Streaming Sink may fail > for various issues happening in the storage. We should add a flag to allow > users to disable the checking logic and read the directory as a batch output. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34526) Add a flag to skip checking file sink format and handle glob path
[ https://issues.apache.org/jira/browse/SPARK-34526?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuanjian Li updated SPARK-34526: Description: This ticket fixes the following issues related to file sink format checking together: * Some users may use a very long glob path to read and `isDirectory` may fail when the path is too long. We should ignore the error when the path is a glob path since the file streaming sink doesn’t support glob paths. * Checking whether a directory is outputted by File Streaming Sink may fail for various issues happening in the storage. We should add a flag to allow users to disable the checking logic and read the directory as a batch output. was: This ticket fixes the following issues related to file sink format checking together: * Some users may use a very long glob path to read and `isDirectory`{{}} may fail when the path is too long. We should ignore the error when the path is a glob path since file streaming sink doesn’t support glob paths. * Checking whether a directory is outputted by File Streaming Sink may fail for various issues happening in the storage. We should add a flag to allow users to disable it. > Add a flag to skip checking file sink format and handle glob path > - > > Key: SPARK-34526 > URL: https://issues.apache.org/jira/browse/SPARK-34526 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.1.0 >Reporter: Yuanjian Li >Priority: Major > > This ticket fixes the following issues related to file sink format checking > together: > * Some users may use a very long glob path to read and `isDirectory` may > fail when the path is too long. We should ignore the error when the path is a > glob path since the file streaming sink doesn’t support glob paths. > * Checking whether a directory is outputted by File Streaming Sink may fail > for various issues happening in the storage. We should add a flag to allow > users to disable the checking logic and read the directory as a batch output. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34526) Add a flag to skip checking file sink format and handle glob path
Yuanjian Li created SPARK-34526: --- Summary: Add a flag to skip checking file sink format and handle glob path Key: SPARK-34526 URL: https://issues.apache.org/jira/browse/SPARK-34526 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 3.1.0 Reporter: Yuanjian Li

This ticket fixes the following issues related to file sink format checking together:
* Some users may use a very long glob path to read, and `isDirectory` may fail when the path is too long. We should ignore the error when the path is a glob path, since the file streaming sink doesn't support glob paths.
* Checking whether a directory is outputted by the File Streaming Sink may fail for various issues happening in the storage. We should add a flag to allow users to disable it.

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
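A hedged sketch of how the proposed escape hatch might be used; the configuration key below is a placeholder invented for illustration, not the flag the linked PR actually adds:

{code:scala}
// Hypothetical flag name: disable the file-sink metadata check so the
// directory is read as plain batch output even if the check would fail.
spark.conf.set("spark.sql.streaming.fileStreamSink.formatCheck.enabled", "false")

// A glob path like this is never produced by the file streaming sink, so the
// isDirectory-based check should be skipped for it entirely.
val df = spark.read.parquet("/warehouse/events/date=2021-02-*")
{code}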
[jira] [Commented] (SPARK-34448) Binary logistic regression incorrectly computes the intercept and coefficients when data is not centered
[ https://issues.apache.org/jira/browse/SPARK-34448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290158#comment-17290158 ] Sean R. Owen commented on SPARK-34448: -- I crudely ported the test setup to a Scala test and tried a 0 initial intercept in the LR implementation. It still gets the -3.5 intercept in the case where the 'const_feature' column is added, but -4 without. So I'm not sure that's it. Let me ping [~podongfeng] or maybe even [~sethah], who have worked on that code a bit and might have more of an idea about why the intercept wouldn't quite fit right in this case. I'm wondering if there is some issue in LogisticAggregator's treatment of the intercept? No idea; this is outside my expertise. https://github.com/apache/spark/blob/3ce4ab545bfc28db7df2c559726b887b0c8c33b7/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/LogisticAggregator.scala#L244 BTW, here's my hacked-up test:

{code}
test("BLR") {
  val centered = false
  val regParam = 1.0e-8
  val num_distribution_samplings = 1000
  val num_rows_per_sampling = 1000
  val theta_1 = 0.3f
  val theta_2 = 0.2f
  val intercept = -4.0f
  val (feature1, feature2, target) = generate_blr_data(theta_1, theta_2, intercept,
    centered, num_distribution_samplings, num_rows_per_sampling)
  val num_rows = num_distribution_samplings * num_rows_per_sampling
  val const_feature = Array.fill(num_rows)(1.0f)
  (0 until num_rows / 10).foreach { i => const_feature(i) = 0.9f }
  val data = (0 until num_rows).map { i =>
    (feature1(i), feature2(i), const_feature(i), target(i))
  }
  val spark_df = spark.createDataFrame(data)
    .toDF("feature1", "feature2", "const_feature", "label").cache()

  val vec = new VectorAssembler()
    .setInputCols(Array("feature1", "feature2"))
    .setOutputCol("features")
  val spark_df1 = vec.transform(spark_df).cache()
  val lr = new LogisticRegression()
    .setMaxIter(100).setRegParam(regParam).setElasticNetParam(0.5).setFitIntercept(true)
  val lrModel = lr.fit(spark_df1)
  println("Just the blr data")
  println("Coefficients: " + lrModel.coefficients)
  println("Intercept: " + lrModel.intercept)

  val vec2 = new VectorAssembler()
    .setInputCols(Array("feature1", "feature2", "const_feature"))
    .setOutputCol("features")
  val spark_df2 = vec2.transform(spark_df).cache()
  val lrModel2 = lr.fit(spark_df2)
  println("blr data plus one vector that is filled with 1's and .9's")
  println("Coefficients: " + lrModel2.coefficients)
  println("Intercept: " + lrModel2.intercept)
}

def generate_blr_data(
    theta_1: Float,
    theta_2: Float,
    intercept: Float,
    centered: Boolean,
    num_distribution_samplings: Int,
    num_rows_per_sampling: Int): (Array[Float], Array[Float], Array[Int]) = {
  val random = new Random(12345L)
  val uniforms = Array.fill(num_distribution_samplings)(random.nextFloat())
  val uniforms2 = Array.fill(num_distribution_samplings)(random.nextFloat())
  if (centered) {
    uniforms.transform(f => f - 0.5f)
    uniforms2.transform(f => 2.0f * f - 1.0f)
  } else {
    uniforms2.transform(f => f + 1.0f)
  }
  val h_theta = uniforms.zip(uniforms2).map { case (a, b) => intercept + theta_1 * a + theta_2 * b }
  val prob = h_theta.map(t => 1.0 / (1.0 + math.exp(-t)))
  val array = Array.ofDim[Int](num_distribution_samplings, num_rows_per_sampling)
  array.indices.foreach { i =>
    (0 until math.round(num_rows_per_sampling * prob(i)).toInt).foreach { j =>
      array(i)(j) = 1
    }
  }
  val feature_1 = uniforms.flatMap(f => Array.fill(num_rows_per_sampling)(f))
  val feature_2 = uniforms2.flatMap(f => Array.fill(num_rows_per_sampling)(f))
  val target = array.flatten
  (feature_1, feature_2, target)
}
{code}

> Binary logistic regression incorrectly computes the intercept and coefficients when data is not centered
>
> Key: SPARK-34448
> URL: https://issues.apache.org/jira/browse/SPARK-34448
> Project: Spark
> Issue Type: Bug
> Components: ML, MLlib
> Affects Versions: 2.4.5, 3.0.0
> Reporter: Yakov Kerzhner
> Priority: Major
> Labels: correctness
>
> I have written up a fairly detailed gist that includes code to reproduce the bug, as well as the output of the code and some commentary: [https://gist.github.com/ykerzhner/51358780a6a4cc33266515f17bf98a96]
> To summarize: under certain conditions, the minimization that fits a binary logistic regression contains a bug that pulls the intercept value towards the log(odds) of the target data. This is mathematically only correct when the data comes from distributions with zero means. In general, this gives incorrect intercept values, and consequently incorrect coefficients as well.
> As I am not so familiar with the spark code base, I have not been able to find this bug within the spark code itself. A hint to this bug is here: [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L894-L904]
> Based on the code, I don't believe that the features have zero means at this point, and so this heuristic is incorrect. But an incorrect starting point does not explain this bug. The minimizer should drift to the correct place. I was not able to find the code of the actual objective function that is being minimized.

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33212) Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
[ https://issues.apache.org/jira/browse/SPARK-33212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290127#comment-17290127 ] Chao Sun commented on SPARK-33212: -- Thanks again [~ouyangxc.zte]. {{org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter}} was not included in the {{hadoop-client}} jars since it is a server-side class and ideally should not be exposed to client applications such as Spark. [~dongjoon] Let me see how we can fix this either in Spark or Hadoop.

> Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
> -
>
> Key: SPARK-33212
> URL: https://issues.apache.org/jira/browse/SPARK-33212
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core, Spark Submit, SQL, YARN
> Affects Versions: 3.0.1
> Reporter: Chao Sun
> Assignee: Chao Sun
> Priority: Major
> Labels: releasenotes
> Fix For: 3.2.0
>
> Hadoop 3.x+ offers shaded client jars: hadoop-client-api and hadoop-client-runtime, which shade 3rd-party dependencies such as Guava, protobuf, jetty, etc. This Jira switches Spark to use these jars instead of hadoop-common, hadoop-client, etc. Benefits include:
> * It unblocks Spark from upgrading to Hadoop 3.2.2/3.3.0+. The newer versions of Hadoop have migrated to Guava 27.0+, and in order to resolve Guava conflicts, Spark depends on Hadoop not leaking its dependencies.
> * It makes the Spark/Hadoop dependency relationship cleaner. Currently Spark uses both client-side and server-side Hadoop APIs from modules such as hadoop-common, hadoop-yarn-server-common, etc. Moving to hadoop-client-api allows us to use only the public/client API from the Hadoop side.
> * It provides better isolation from Hadoop dependencies. In the future, Spark can evolve without worrying about dependencies pulled in from the Hadoop side (which used to be a lot).
> *There are some behavior changes introduced with this JIRA, when people use Spark compiled with Hadoop 3.x:*
> - Users now need to make sure the class path contains the `hadoop-client-api` and `hadoop-client-runtime` jars when they deploy Spark with the `hadoop-provided` option. In addition, it is highly recommended that they put these two jars before other Hadoop jars in the class path. Otherwise, conflicts such as those from Guava could happen if classes are loaded from the other non-shaded Hadoop jars.
> - Since the new shaded Hadoop clients no longer include 3rd-party dependencies, users who used to depend on these now need to explicitly put the jars in their class path.
> Ideally the above should go to the release notes.

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34524) simplify v2 partition commands resolution
[ https://issues.apache.org/jira/browse/SPARK-34524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290120#comment-17290120 ] Apache Spark commented on SPARK-34524: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/31637 > simplify v2 partition commands resolution > - > > Key: SPARK-34524 > URL: https://issues.apache.org/jira/browse/SPARK-34524 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34524) simplify v2 partition commands resolution
[ https://issues.apache.org/jira/browse/SPARK-34524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34524: Assignee: (was: Apache Spark) > simplify v2 partition commands resolution > - > > Key: SPARK-34524 > URL: https://issues.apache.org/jira/browse/SPARK-34524 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-34524) simplify v2 partition commands resolution
[ https://issues.apache.org/jira/browse/SPARK-34524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-34524: Assignee: Apache Spark > simplify v2 partition commands resolution > - > > Key: SPARK-34524 > URL: https://issues.apache.org/jira/browse/SPARK-34524 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34524) simplify v2 partition commands resolution
[ https://issues.apache.org/jira/browse/SPARK-34524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290119#comment-17290119 ] Apache Spark commented on SPARK-34524: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/31637 > simplify v2 partition commands resolution > - > > Key: SPARK-34524 > URL: https://issues.apache.org/jira/browse/SPARK-34524 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34525) Update Spark Create Table DDL Docs
[ https://issues.apache.org/jira/browse/SPARK-34525?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-34525: Labels: starter (was: )

> Update Spark Create Table DDL Docs
> --
>
> Key: SPARK-34525
> URL: https://issues.apache.org/jira/browse/SPARK-34525
> Project: Spark
> Issue Type: Improvement
> Components: docs, Documentation
> Affects Versions: 3.0.3
> Reporter: Miklos Christine
> Priority: Major
> Labels: starter
>
> Within the `CREATE TABLE` docs, the `OPTIONS` and `TBLPROPERTIES` clauses specify `key=value` parameters with an `=` as the delimiter between the key-value pairs.
> The `=` is optional: the key and value can also be space-delimited. We should document that both forms are supported when defining these parameters.
> One location within the current docs page that should be updated: [https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-create-table-datasource.html]
> Code reference showing the equals sign as an optional token: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L401

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34525) Update Spark Create Table DDL Docs
Miklos Christine created SPARK-34525: Summary: Update Spark Create Table DDL Docs Key: SPARK-34525 URL: https://issues.apache.org/jira/browse/SPARK-34525 Project: Spark Issue Type: Improvement Components: docs, Documentation Affects Versions: 3.0.3 Reporter: Miklos Christine

Within the `CREATE TABLE` docs, the `OPTIONS` and `TBLPROPERTIES` clauses specify `key=value` parameters with an `=` as the delimiter between the key-value pairs. The `=` is optional: the key and value can also be space-delimited. We should document that both forms are supported when defining these parameters.

One location within the current docs page that should be updated: [https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-create-table-datasource.html]

Code reference showing the equals sign as an optional token: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4#L401

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
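Both delimiter forms already parse today, per the `tableProperty` rule in SqlBase.g4 referenced above (the `EQ?` token makes the equals sign optional). A quick sketch with illustrative table names:

{code:scala}
// Key-value pair with the '=' delimiter.
spark.sql("CREATE TABLE t1 (id INT) USING csv OPTIONS (header = 'true')")

// Same option, space-delimited: also accepted by the parser.
spark.sql("CREATE TABLE t2 (id INT) USING csv OPTIONS (header 'true')")
{code}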
[jira] [Commented] (SPARK-34523) JDK-8194653: Deadlock involving FileSystems.getDefault and System.loadLibrary call
[ https://issues.apache.org/jira/browse/SPARK-34523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290087#comment-17290087 ] Dongjoon Hyun commented on SPARK-34523: --- I'd like to recommend making a documentation PR instead. We already have the following guide on our website; you can update it from 8u92 to 8u231. - https://spark.apache.org/docs/latest/ > Java 8 prior to version 8u92 support is deprecated as of Spark 3.0.0.

> JDK-8194653: Deadlock involving FileSystems.getDefault and System.loadLibrary call
> ---
>
> Key: SPARK-34523
> URL: https://issues.apache.org/jira/browse/SPARK-34523
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core
> Affects Versions: 2.4.7, 3.0.2, 3.1.1
> Reporter: Kent Yao
> Priority: Major
> Attachments: 4303.log

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34524) simplify v2 partition commands resolution
Wenchen Fan created SPARK-34524: --- Summary: simplify v2 partition commands resolution Key: SPARK-34524 URL: https://issues.apache.org/jira/browse/SPARK-34524 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.2.0 Reporter: Wenchen Fan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34523) JDK-8194653: Deadlock involving FileSystems.getDefault and System.loadLibrary call
[ https://issues.apache.org/jira/browse/SPARK-34523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290084#comment-17290084 ] Dongjoon Hyun commented on SPARK-34523: --- Hi, [~Qin Yao]. This looks like duplicated JDK information. Technically, for JDK issues, Spark's affected versions (2.4 ~ 3.x) look meaningless and misleading to me. Also, it's already fixed via [8u231|https://bugs.openjdk.java.net/issues/?jql=project+%3D+JDK+AND+fixVersion+%3D+8u231]. Instead of upgrading the JDK, is there something for us to do? cc [~srowen] and [~hyukjin.kwon]

> JDK-8194653: Deadlock involving FileSystems.getDefault and System.loadLibrary call
> ---
>
> Key: SPARK-34523
> URL: https://issues.apache.org/jira/browse/SPARK-34523
> Project: Spark
> Issue Type: Sub-task
> Components: Spark Core
> Affects Versions: 2.4.7, 3.0.2, 3.1.1
> Reporter: Kent Yao
> Priority: Major
> Attachments: 4303.log

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34448) Binary logistic regression incorrectly computes the intercept and coefficients when data is not centered
[ https://issues.apache.org/jira/browse/SPARK-34448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17290052#comment-17290052 ] Sean R. Owen commented on SPARK-34448: -- Yes I believe you're definitely correct there's a problem here. [~dbtsai] can I add you in here? I think you worked on the LR solver many years ago. I skimmed the source code in sklearn and looks like the SAG solver starts with a 0 intercept: https://github.com/scikit-learn/scikit-learn/blob/638b7689bbbfae4bcc4592c6f8a43ce86b571f0b/sklearn/linear_model/tests/test_sag.py#L73 Maybe ... this is the issue? I can try porting your test case to Scala to see if it fixes it. But the existing test suites seem to pass with a 0 initial intercept, at least. > Binary logistic regression incorrectly computes the intercept and > coefficients when data is not centered > > > Key: SPARK-34448 > URL: https://issues.apache.org/jira/browse/SPARK-34448 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.4.5, 3.0.0 >Reporter: Yakov Kerzhner >Priority: Major > Labels: correctness > > I have written up a fairly detailed gist that includes code to reproduce the > bug, as well as the output of the code and some commentary: > [https://gist.github.com/ykerzhner/51358780a6a4cc33266515f17bf98a96] > To summarize: under certain conditions, the minimization that fits a binary > logistic regression contains a bug that pulls the intercept value towards the > log(odds) of the target data. This is mathematically only correct when the > data comes from distributions with zero means. In general, this gives > incorrect intercept values, and consequently incorrect coefficients as well. > As I am not so familiar with the spark code base, I have not been able to > find this bug within the spark code itself. A hint to this bug is here: > [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala#L894-L904] > based on the code, I don't believe that the features have zero means at this > point, and so this heuristic is incorrect. But an incorrect starting point > does not explain this bug. The minimizer should drift to the correct place. > I was not able to find the code of the actual objective function that is > being minimized. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
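Until the root cause is pinned down, one hedged mitigation sketch consistent with the zero-mean observation in the gist is to explicitly center the feature vectors before the fit. The sketch below assumes a `trainDF` with a dense `features` vector column and a `label` column; those names are illustrative:

{code:scala}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.StandardScaler

// Subtract the per-column means (dense vectors only), restoring the
// zero-mean setting in which the intercept heuristic is valid.
val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("centeredFeatures")
  .setWithMean(true)
  .setWithStd(false)

val centered = scaler.fit(trainDF).transform(trainDF)

val lr = new LogisticRegression()
  .setFeaturesCol("centeredFeatures")
  .setFitIntercept(true)
val lrModel = lr.fit(centered)
{code}

Without regularization, coefficients fit on centered data should match the uncentered fit, so only the intercept needs adjusting when applying the model to raw features.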
[jira] [Issue Comment Deleted] (SPARK-34521) spark.createDataFrame does not support Pandas StringDtype extension type
[ https://issues.apache.org/jira/browse/SPARK-34521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Pavel Ganelin updated SPARK-34521: -- Comment: was deleted (was: Originally submitted to ARROW: [ARROW-11747|https://issues.apache.org/jira/browse/ARROW-11747])
> spark.createDataFrame does not support Pandas StringDtype extension type
> 
>
> Key: SPARK-34521
> URL: https://issues.apache.org/jira/browse/SPARK-34521
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.0.1
> Reporter: Pavel Ganelin
> Priority: Major
>
> The following test case demonstrates the problem:
> {code:java}
> import pandas as pd
> from pyspark.sql import SparkSession, types
>
> spark = SparkSession.builder.appName(__file__)\
>     .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
>     .getOrCreate()
>
> good = pd.DataFrame([["abc"]], columns=["col"])
> schema = types.StructType([types.StructField("col", types.StringType(), True)])
> df = spark.createDataFrame(good, schema=schema)
> df.show()
>
> bad = good.copy()
> bad["col"] = bad["col"].astype("string")
> schema = types.StructType([types.StructField("col", types.StringType(), True)])
> df = spark.createDataFrame(bad, schema=schema)
> df.show()
> {code}
> The error:
> {code:java}
> C:\Python\3.8.3\lib\site-packages\pyspark\sql\pandas\conversion.py:289: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
>   Cannot specify a mask or a size when passing an object that is converted with the __arrow_array__ protocol.
> Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
>   warnings.warn(msg)
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34523) JDK-8194653
Kent Yao created SPARK-34523: Summary: JDK-8194653 Key: SPARK-34523 URL: https://issues.apache.org/jira/browse/SPARK-34523 Project: Spark Issue Type: Sub-task Components: Spark Core Affects Versions: 3.0.2, 2.4.7, 3.1.1 Reporter: Kent Yao -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-34523) JDK-8194653: Deadlock involving FileSystems.getDefault and System.loadLibrary call
[ https://issues.apache.org/jira/browse/SPARK-34523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao resolved SPARK-34523. -- Resolution: Information Provided > JDK-8194653: Deadlock involving FileSystems.getDefault and > System.loadLibrary call > --- > > Key: SPARK-34523 > URL: https://issues.apache.org/jira/browse/SPARK-34523 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.4.7, 3.0.2, 3.1.1 >Reporter: Kent Yao >Priority: Major > Attachments: 4303.log > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34523) JDK-8194653: JDK-8194653 Deadlock involving FileSystems.getDefault and System.loadLibrary call
[ https://issues.apache.org/jira/browse/SPARK-34523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-34523: - Summary: JDK-8194653: JDK-8194653 Deadlock involving FileSystems.getDefault and System.loadLibrary call (was: JDK-8194653) > JDK-8194653: JDK-8194653 Deadlock involving FileSystems.getDefault and > System.loadLibrary call > -- > > Key: SPARK-34523 > URL: https://issues.apache.org/jira/browse/SPARK-34523 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.4.7, 3.0.2, 3.1.1 >Reporter: Kent Yao >Priority: Major > Attachments: 4303.log > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34523) JDK-8194653: Deadlock involving FileSystems.getDefault and System.loadLibrary call
[ https://issues.apache.org/jira/browse/SPARK-34523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-34523: - Summary: JDK-8194653: Deadlock involving FileSystems.getDefault and System.loadLibrary call (was: JDK-8194653: JDK-8194653 Deadlock involving FileSystems.getDefault and System.loadLibrary call) > JDK-8194653: Deadlock involving FileSystems.getDefault and > System.loadLibrary call > --- > > Key: SPARK-34523 > URL: https://issues.apache.org/jira/browse/SPARK-34523 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.4.7, 3.0.2, 3.1.1 >Reporter: Kent Yao >Priority: Major > Attachments: 4303.log > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-34523) JDK-8194653
[ https://issues.apache.org/jira/browse/SPARK-34523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kent Yao updated SPARK-34523: - Attachment: 4303.log > JDK-8194653 > --- > > Key: SPARK-34523 > URL: https://issues.apache.org/jira/browse/SPARK-34523 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.4.7, 3.0.2, 3.1.1 >Reporter: Kent Yao >Priority: Major > Attachments: 4303.log > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34522) Issue Tracker for JDK related Bugs
Kent Yao created SPARK-34522: Summary: Issue Tracker for JDK related Bugs Key: SPARK-34522 URL: https://issues.apache.org/jira/browse/SPARK-34522 Project: Spark Issue Type: Documentation Components: Documentation Affects Versions: 3.0.2, 2.4.7, 3.1.2 Reporter: Kent Yao This JIRA is used to log JDK-related issues that often cause Spark to throw strange runtime exceptions or become permanently unresponsive. For Spark users, these issues tend to be common but difficult to pin down. When users run into such a problem, this JIRA may help them find a quick answer when searching online, and the answer is often simply to upgrade the JDK. These issues are also difficult for the community to address in Spark's own code, and even maintaining troubleshooting documentation in the code base is a challenge. Because Spark is a distributed JVM application, JDK problems can take many forms, so JIRA may be a good place to document them. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34521) spark.createDataFrame does not support Pandas StringDtype extension type
[ https://issues.apache.org/jira/browse/SPARK-34521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289983#comment-17289983 ] Pavel Ganelin commented on SPARK-34521: --- Originally submitted to ARROW: [ARROW-11747|https://issues.apache.org/jira/browse/ARROW-11747]
> spark.createDataFrame does not support Pandas StringDtype extension type
> 
>
> Key: SPARK-34521
> URL: https://issues.apache.org/jira/browse/SPARK-34521
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 3.0.1
> Reporter: Pavel Ganelin
> Priority: Major
>
> The following test case demonstrates the problem:
> {code:java}
> import pandas as pd
> from pyspark.sql import SparkSession, types
>
> spark = SparkSession.builder.appName(__file__)\
>     .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
>     .getOrCreate()
>
> good = pd.DataFrame([["abc"]], columns=["col"])
> schema = types.StructType([types.StructField("col", types.StringType(), True)])
> df = spark.createDataFrame(good, schema=schema)
> df.show()
>
> bad = good.copy()
> bad["col"] = bad["col"].astype("string")
> schema = types.StructType([types.StructField("col", types.StringType(), True)])
> df = spark.createDataFrame(bad, schema=schema)
> df.show()
> {code}
> The error:
> {code:java}
> C:\Python\3.8.3\lib\site-packages\pyspark\sql\pandas\conversion.py:289: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
>   Cannot specify a mask or a size when passing an object that is converted with the __arrow_array__ protocol.
> Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
>   warnings.warn(msg)
> {code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-34521) spark.createDataFrame does not support Pandas StringDtype extension type
Pavel Ganelin created SPARK-34521: - Summary: spark.createDataFrame does not support Pandas StringDtype extension type Key: SPARK-34521 URL: https://issues.apache.org/jira/browse/SPARK-34521 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 3.0.1 Reporter: Pavel Ganelin The following test case demonstrates the problem:
{code:java}
import pandas as pd
from pyspark.sql import SparkSession, types

spark = SparkSession.builder.appName(__file__)\
    .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
    .getOrCreate()

good = pd.DataFrame([["abc"]], columns=["col"])
schema = types.StructType([types.StructField("col", types.StringType(), True)])
df = spark.createDataFrame(good, schema=schema)
df.show()

bad = good.copy()
bad["col"] = bad["col"].astype("string")
schema = types.StructType([types.StructField("col", types.StringType(), True)])
df = spark.createDataFrame(bad, schema=schema)
df.show()
{code}
The error:
{code:java}
C:\Python\3.8.3\lib\site-packages\pyspark\sql\pandas\conversion.py:289: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
  Cannot specify a mask or a size when passing an object that is converted with the __arrow_array__ protocol.
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warnings.warn(msg)
{code}
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
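Until StringDtype is handled, one possible (untested) workaround sketch is to cast pandas extension-typed columns back to plain object dtype before calling createDataFrame, so that Arrow's default conversion path applies. The helper below is hypothetical, not part of PySpark:
{code:python}
import pandas as pd

def to_arrow_friendly(pdf: pd.DataFrame) -> pd.DataFrame:
    """Cast pandas extension-typed columns (e.g. StringDtype) back to object dtype."""
    out = pdf.copy()
    for col in out.columns:
        if pd.api.types.is_extension_array_dtype(out[col].dtype):
            out[col] = out[col].astype(object)
    return out

# With the reproduction above, this should take the Arrow path without the warning:
# df = spark.createDataFrame(to_arrow_friendly(bad), schema=schema)
# df.show()
{code}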