[jira] [Assigned] (SPARK-22713) OOM caused by the memory contention and memory leak in TaskMemoryManager
[ https://issues.apache.org/jira/browse/SPARK-22713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22713: Assignee: (was: Apache Spark)

> OOM caused by the memory contention and memory leak in TaskMemoryManager
>
> Key: SPARK-22713
> URL: https://issues.apache.org/jira/browse/SPARK-22713
> Project: Spark
> Issue Type: Bug
> Components: Shuffle, Spark Core
> Affects Versions: 2.1.1, 2.1.2
> Reporter: Lijie Xu
> Priority: Critical
>
> The pdf version of this issue with high-quality figures is available at
> https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/report/OOM-TaskMemoryManager.pdf.
>
> *[Abstract]*
> I recently encountered an OOM error in a PageRank application
> (_org.apache.spark.examples.SparkPageRank_). After profiling the application,
> I found that the OOM error is related to memory contention in the shuffle spill
> phase. Here, memory contention means that a task tries to release some
> old memory consumers in order to make room for new memory consumers. After
> analyzing the OOM heap dump, I found that the root cause is a memory leak in
> _TaskMemoryManager_. Since memory contention is common in the shuffle phase, this
> is a critical bug. In the following sections, I use the application dataflow,
> execution log, heap dump, and source code to identify the root cause.
>
> *[Application]*
> This is a PageRank application from Spark’s example library. The following
> figure shows the application dataflow. The source code is available at \[1\].
> !https://raw.githubusercontent.com/JerryLead/Misc/master/OOM-TasksMemoryManager/figures/PageRankDataflow.png|width=100%!
>
> *[Failure symptoms]*
> This application has a map stage and many iterative reduce stages. An OOM
> error occurs in a reduce task (Task-28) as follows.
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/Stage.png?raw=true|width=100%!
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/task.png?raw=true|width=100%!
>
> *[OOM root cause identification]*
> Each executor has 1 CPU core and 6.5GB memory, so it only runs one task at a
> time. After analyzing the application dataflow, error log, heap dump, and
> source code, I found that the following steps lead to the OOM error.
> => The MemoryManager finds that there is not enough memory to cache the
> _links:ShuffledRDD_ (rdd-5-28, red circles in the dataflow figure).
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/ShuffledRDD.png?raw=true|width=100%!
> => The task needs to shuffle twice (1st shuffle and 2nd shuffle in the
> dataflow figure).
> => The task generates two _ExternalAppendOnlyMap_ instances (E1 for the 1st
> shuffle and E2 for the 2nd shuffle) in sequence.
> => The 1st shuffle begins and ends. E1 aggregates all the shuffled data of
> the 1st shuffle and reaches 3.3 GB.
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/FirstShuffle.png?raw=true|width=100%!
> => The 2nd shuffle begins. E2 aggregates the shuffled data of the 2nd
> shuffle and finds that there is not enough memory left. This triggers
> memory contention.
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/SecondShuffle.png?raw=true|width=100%!
> => To handle the memory contention, the _TaskMemoryManager_ releases E1
> (spills it onto disk) and assumes that the 3.3GB space is now free.
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/MemoryContention.png?raw=true|width=100%!
> => E2 continues to aggregate the shuffled records of the 2nd shuffle. However,
> E2 encounters an OOM error while shuffling.
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/OOMbefore.png?raw=true|width=100%!
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/OOMError.png?raw=true|width=100%!
>
> *[Guess]*
> The task memory usage below reveals that there is no drop in memory usage. So
> the cause may be that the 3.3GB _ExternalAppendOnlyMap_ (E1) is not actually
> released by the _TaskMemoryManager_.
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/GCFigure.png?raw=true|width=100%!
>
> *[Root cause]*
> After analyzing the heap dump, I found that the guess is right (the 3.3GB
> _ExternalAppendOnlyMap_ is indeed not released). The 1.6GB object is
> _ExternalAppendOnlyMap_ (E2).
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/heapdump.png?raw=true|width=100%!
>
> *[Question]*
> Why is the released _ExternalAppendOnlyMap_ still in memory?
> The source code of _ExternalAppendOnlyMap_ shows that the _currentMap_
> (_AppendOnlyMap_) has been set to _null_
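The contention-and-spill sequence described in the report boils down to the manager decrementing its memory bookkeeping while keeping a strong reference to the spilled consumer, so the JVM can never reclaim it. Below is a minimal, hypothetical Java model of that leak pattern; the class and field names (`Consumer`, `LeakyManager`) are illustrative, not Spark's actual _TaskMemoryManager_ API.

```java
import java.util.ArrayList;
import java.util.List;

// A consumer that can spill its in-memory data to disk, freeing its bytes.
class Consumer {
    final String name;
    long usedBytes;
    Consumer(String name, long usedBytes) { this.name = name; this.usedBytes = usedBytes; }
    long spill() { long freed = usedBytes; usedBytes = 0; return freed; }
}

// Simplified manager exhibiting the leak pattern: accounting is updated on
// spill, but the consumer object itself stays strongly referenced.
class LeakyManager {
    final List<Consumer> consumers = new ArrayList<>();
    long accountedBytes = 0;

    void register(Consumer c) { consumers.add(c); accountedBytes += c.usedBytes; }

    long spillOldest() {
        Consumer oldest = consumers.get(0);
        long freed = oldest.spill();
        accountedBytes -= freed;
        // BUG (the leak): consumers.remove(oldest) is never called, so the
        // spilled consumer's object graph remains reachable and un-collectable,
        // even though the accounting claims its bytes are free.
        return freed;
    }
}

class LeakDemo {
    public static void main(String[] args) {
        LeakyManager m = new LeakyManager();
        m.register(new Consumer("E1", 3_300L));   // think: 3.3GB map from 1st shuffle
        m.register(new Consumer("E2", 1_600L));   // think: growing map of 2nd shuffle
        long freed = m.spillOldest();
        System.out.println("freed=" + freed + " accounted=" + m.accountedBytes);
        // Accounting says the space is free, yet E1 is still referenced:
        System.out.println("still referenced: " + m.consumers.size() + " consumers");
    }
}
```

In this model, a subsequent allocation trusting `accountedBytes` will over-commit the heap, which matches the OOM symptom described above.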
[jira] [Commented] (SPARK-22713) OOM caused by the memory contention and memory leak in TaskMemoryManager
[ https://issues.apache.org/jira/browse/SPARK-22713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481475#comment-16481475 ] Apache Spark commented on SPARK-22713: -- User 'eyalfa' has created a pull request for this issue: https://github.com/apache/spark/pull/21369
[jira] [Assigned] (SPARK-22713) OOM caused by the memory contention and memory leak in TaskMemoryManager
[ https://issues.apache.org/jira/browse/SPARK-22713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22713: Assignee: Apache Spark
[jira] [Commented] (SPARK-22713) OOM caused by the memory contention and memory leak in TaskMemoryManager
[ https://issues.apache.org/jira/browse/SPARK-22713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481468#comment-16481468 ] Eyal Farago commented on SPARK-22713: - [~jerrylead], excellent investigation and description of the issue, I'll open a PR shortly.
[jira] [Resolved] (SPARK-23503) continuous execution should sequence committed epochs
[ https://issues.apache.org/jira/browse/SPARK-23503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-23503. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 20936 [https://github.com/apache/spark/pull/20936]

> continuous execution should sequence committed epochs
>
> Key: SPARK-23503
> URL: https://issues.apache.org/jira/browse/SPARK-23503
> Project: Spark
> Issue Type: Sub-task
> Components: Structured Streaming
> Affects Versions: 2.4.0
> Reporter: Jose Torres
> Priority: Major
> Fix For: 3.0.0
>
> Currently, the EpochCoordinator doesn't enforce a commit order. If a message
> for epoch n gets lost in the ether, and epoch n + 1 happens to be ready for
> commit earlier, epoch n + 1 will be committed.
> This is either incorrect or needlessly confusing, because it's not safe to
> start from the end offset of epoch n + 1 until epoch n is committed.
> EpochCoordinator should enforce this sequencing.
> Note that this is not actually a problem right now, because the commit
> messages go through the same RPC channel from the same place. But we
> shouldn't implicitly bake this assumption in.

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
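The sequencing requirement in SPARK-23503 can be sketched as a coordinator that buffers out-of-order "ready to commit" messages and only ever commits epochs in strictly increasing order. This is a hedged sketch under assumed names (`SequencingCoordinator`, `epochReady`), not Spark's actual EpochCoordinator RPC interface.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Buffers commit-readiness notifications and commits epochs strictly in
// order: epoch n+1 is never committed while epoch n is still outstanding.
class SequencingCoordinator {
    private final TreeSet<Long> ready = new TreeSet<>();   // epochs ready but not yet committable
    private long nextToCommit;
    final List<Long> committed = new ArrayList<>();

    SequencingCoordinator(long firstEpoch) { this.nextToCommit = firstEpoch; }

    // Called when an epoch reports readiness, possibly out of order
    // (e.g. because the message for an earlier epoch was delayed or lost).
    void epochReady(long epoch) {
        ready.add(epoch);
        // Drain only the contiguous prefix starting at nextToCommit.
        while (ready.contains(nextToCommit)) {
            ready.remove(nextToCommit);
            committed.add(nextToCommit);
            nextToCommit++;
        }
    }
}
```

With this structure, if epoch 2 becomes ready first, nothing is committed until epoch 1's message arrives, at which point both commit in order, which is exactly the guarantee the issue asks EpochCoordinator to enforce.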
[jira] [Assigned] (SPARK-23850) We should not redact username|user|url from UI by default
[ https://issues.apache.org/jira/browse/SPARK-23850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-23850: -- Assignee: Marcelo Vanzin

> We should not redact username|user|url from UI by default
>
> Key: SPARK-23850
> URL: https://issues.apache.org/jira/browse/SPARK-23850
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 2.2.1
> Reporter: Thomas Graves
> Assignee: Marcelo Vanzin
> Priority: Major
> Fix For: 2.2.2, 2.3.1, 2.4.0
>
> SPARK-22479 was filed to not print the jdbc credentials in logs, but there
> the username and url were also added to the redaction list. I'm not sure why
> these were added; to me, by default these do not have security concerns. It
> makes Spark more usable by default to be able to see these things. Users with
> high security concerns can simply add them in their configs.
> Also, on yarn, just redacting url doesn't secure anything, because if you go
> to the environment UI page you see all sorts of paths, and really it's just
> confusing that some of it is redacted and other parts aren't. If this was
> specifically for jdbc, I think it needs to be applied just there and not
> broadly.
> If we remove these, we need to test what the jdbc driver is going to log from
> SPARK-22479.
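The redaction mechanism being discussed matches configuration key names against a regex and replaces the values of matching keys before they are shown in the UI. The sketch below is a simplified, hypothetical model of that behavior; the config keys, the placeholder string, and the exact patterns are illustrative assumptions, not Spark's actual defaults. It shows how adding `user|url` to the pattern also hides harmless entries, which is the complaint in this issue.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Key-based redaction model: any config entry whose KEY matches the pattern
// has its VALUE replaced with a placeholder before display.
class Redactor {
    static final String PLACEHOLDER = "*********(redacted)";
    private final Pattern keyPattern;

    Redactor(String regex) { this.keyPattern = Pattern.compile(regex); }

    Map<String, String> redact(Map<String, String> conf) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : conf.entrySet()) {
            boolean hit = keyPattern.matcher(e.getKey()).find();
            out.put(e.getKey(), hit ? PLACEHOLDER : e.getValue());
        }
        return out;
    }
}
```

With a narrow pattern like `(?i)secret|password|token`, only credential-like keys are hidden; widening it to `(?i)secret|password|token|user|url` also blanks out entries such as a `user.name` property or a cluster URL, making the environment page harder to read without adding real security, as the reporter argues.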
[jira] [Resolved] (SPARK-23850) We should not redact username|user|url from UI by default
[ https://issues.apache.org/jira/browse/SPARK-23850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-23850. Resolution: Fixed Fix Version/s: 2.2.2, 2.3.1, 2.4.0 Issue resolved by pull request 21365 [https://github.com/apache/spark/pull/21365]
[jira] [Commented] (SPARK-16451) Spark-shell / pyspark should finish gracefully when "SaslException: GSS initiate failed" is hit
[ https://issues.apache.org/jira/browse/SPARK-16451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481307#comment-16481307 ] Apache Spark commented on SPARK-16451: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/21368

> Spark-shell / pyspark should finish gracefully when "SaslException: GSS
> initiate failed" is hit
>
> Key: SPARK-16451
> URL: https://issues.apache.org/jira/browse/SPARK-16451
> Project: Spark
> Issue Type: Bug
> Affects Versions: 1.6.1
> Reporter: Yesha Vora
> Priority: Major
>
> Steps to reproduce (secure cluster):
> * kdestroy
> * spark-shell --master yarn-client
> If no valid keytab is set while running spark-shell/pyspark, the spark client
> never exits. It keeps printing the error messages below.
> spark-client should call the shutdown hook immediately and exit with a proper
> error code. Currently, the user needs to explicitly shut down the process
> (using Ctrl+C).
> {code}
> 16/07/08 20:53:10 WARN Client: Exception encountered while connecting to the server :
> javax.security.sasl.SaslException: GSS initiate failed [Caused by GSSException: No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)]
> at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
> at org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:413)
> at org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:595)
> at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:397)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:761)
> at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:757)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
> at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:756)
> at org.apache.hadoop.ipc.Client$Connection.access$3200(Client.java:397)
> at org.apache.hadoop.ipc.Client.getConnection(Client.java:1617)
> at org.apache.hadoop.ipc.Client.call(Client.java:1448)
> at org.apache.hadoop.ipc.Client.call(Client.java:1395)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
> at com.sun.proxy.$Proxy25.getFileInfo(Unknown Source)
> at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:816)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
> at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
> at com.sun.proxy.$Proxy26.getFileInfo(Unknown Source)
> at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2151)
> at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1408)
> at org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1404)
> at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
> at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1404)
> at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1437)
> at org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter.<init>(FileSystemTimelineWriter.java:124)
> at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.createTimelineWriter(TimelineClientImpl.java:316)
> at org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.serviceStart(TimelineClientImpl.java:308)
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceStart(YarnClientImpl.java:194)
> at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
> at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:127)
> at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
> at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
> at
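The behavior the report asks for, exiting with a proper error code instead of retrying the failed Kerberos handshake forever, can be sketched as a bounded retry loop around the connect attempt. The names (`BoundedConnector`, `connectOrFail`) and the fixed-attempt policy are illustrative assumptions, not Spark's or Hadoop's actual client code.

```java
import java.util.concurrent.Callable;

// Bounded retry: after maxAttempts failures the error is surfaced to the
// caller, which can then run shutdown hooks and exit nonzero, rather than
// looping on "GSS initiate failed" until the user presses Ctrl+C.
class BoundedConnector {
    // Returns the attempt number on which the connection succeeded;
    // throws after maxAttempts consecutive failures.
    static int connectOrFail(Callable<Boolean> connect, int maxAttempts) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                if (connect.call()) {
                    return attempt;                    // connected
                }
            } catch (Exception e) {
                last = e;                              // e.g. SaslException: GSS initiate failed
            }
        }
        // Give up: make the failure visible instead of retrying forever.
        throw new RuntimeException("giving up after " + maxAttempts + " attempts", last);
    }
}
```

A real fix might additionally distinguish unrecoverable auth errors (no TGT at all) from transient ones and fail immediately on the former, which is closer to what the linked pull request aims for.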
[jira] [Assigned] (SPARK-16451) Spark-shell / pyspark should finish gracefully when "SaslException: GSS initiate failed" is hit
[ https://issues.apache.org/jira/browse/SPARK-16451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16451: Assignee: (was: Apache Spark) > Spark-shell / pyspark should finish gracefully when "SaslException: GSS > initiate failed" is hit > --- > > Key: SPARK-16451 > URL: https://issues.apache.org/jira/browse/SPARK-16451 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1 >Reporter: Yesha Vora >Priority: Major > > Steps to reproduce: (secure cluster) > * kdestroy > * spark-shell --master yarn-client > If no valid keytab is set while running spark-shell/pyspark, the spark client > never exits. It keep printing below error messages. > spark-client should call shutdown hook immediately and exit with proper error > code. > Currently, user need to explicitly shutdown process. (using cntrl+c) > {code} > 16/07/08 20:53:10 WARN Client: Exception encountered while connecting to the > server : > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > at > com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211) > at > org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:413) > at > org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:595) > at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:397) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:761) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:757) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:756) > at org.apache.hadoop.ipc.Client$Connection.access$3200(Client.java:397) > at 
org.apache.hadoop.ipc.Client.getConnection(Client.java:1617) > at org.apache.hadoop.ipc.Client.call(Client.java:1448) > at org.apache.hadoop.ipc.Client.call(Client.java:1395) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) > at com.sun.proxy.$Proxy25.getFileInfo(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:816) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) > at com.sun.proxy.$Proxy26.getFileInfo(Unknown Source) > at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2151) > at > org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1408) > at > org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1404) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1404) > at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1437) > at > org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter.(FileSystemTimelineWriter.java:124) > at > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.createTimelineWriter(TimelineClientImpl.java:316) > at > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.serviceStart(TimelineClientImpl.java:308) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > at > 
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceStart(YarnClientImpl.java:194) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > at > org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:127) > at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56) > at > org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144) > at org.apache.spark.SparkContext.<init>(SparkContext.scala:530) > at >
[jira] [Assigned] (SPARK-16451) Spark-shell / pyspark should finish gracefully when "SaslException: GSS initiate failed" is hit
[ https://issues.apache.org/jira/browse/SPARK-16451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-16451: Assignee: Apache Spark > Spark-shell / pyspark should finish gracefully when "SaslException: GSS > initiate failed" is hit > --- > > Key: SPARK-16451 > URL: https://issues.apache.org/jira/browse/SPARK-16451 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1 >Reporter: Yesha Vora >Assignee: Apache Spark >Priority: Major > > Steps to reproduce: (secure cluster) > * kdestroy > * spark-shell --master yarn-client > If no valid keytab is set while running spark-shell/pyspark, the spark client > never exits. It keeps printing the error messages below. > The spark client should call its shutdown hook immediately and exit with a > proper error code. > Currently, the user needs to shut down the process explicitly (using Ctrl+C). > {code} > 16/07/08 20:53:10 WARN Client: Exception encountered while connecting to the > server : > javax.security.sasl.SaslException: GSS initiate failed [Caused by > GSSException: No valid credentials provided (Mechanism level: Failed to find > any Kerberos tgt)] > at > com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211) > at > org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:413) > at > org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:595) > at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:397) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:761) > at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:757) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at > org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724) > at > org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:756) > at org.apache.hadoop.ipc.Client$Connection.access$3200(Client.java:397) > at 
org.apache.hadoop.ipc.Client.getConnection(Client.java:1617) > at org.apache.hadoop.ipc.Client.call(Client.java:1448) > at org.apache.hadoop.ipc.Client.call(Client.java:1395) > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233) > at com.sun.proxy.$Proxy25.getFileInfo(Unknown Source) > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:816) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256) > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104) > at com.sun.proxy.$Proxy26.getFileInfo(Unknown Source) > at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2151) > at > org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1408) > at > org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1404) > at > org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81) > at > org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1404) > at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1437) > at > org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter.(FileSystemTimelineWriter.java:124) > at > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.createTimelineWriter(TimelineClientImpl.java:316) > at > org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.serviceStart(TimelineClientImpl.java:308) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > at > 
org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceStart(YarnClientImpl.java:194) > at > org.apache.hadoop.service.AbstractService.start(AbstractService.java:193) > at > org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:127) > at > org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56) > at > org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144) > at org.apache.spark.SparkContext.<init>(SparkContext.scala:530) > at >
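The behavior requested above, exiting instead of retrying forever, comes down to classifying the SaslException in the trace as an unrecoverable authentication failure. A minimal sketch of such a check follows; the helper name and message matching are assumptions for illustration, not Spark's actual client code:

```java
import javax.security.sasl.SaslException;

/** Sketch: decide whether a connection failure is a fatal auth error
 *  that should stop retries and terminate the client with an error code. */
public class AuthFailureClassifier {
    // Walk the cause chain looking for a SASL/GSS initiation failure.
    public static boolean isFatalAuthFailure(Throwable t) {
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            if (cur instanceof SaslException
                    && String.valueOf(cur.getMessage()).contains("GSS initiate failed")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Throwable failure = new RuntimeException(
                new SaslException("GSS initiate failed [Caused by GSSException: ...]"));
        System.out.println(isFatalAuthFailure(failure));                     // true
        System.out.println(isFatalAuthFailure(new RuntimeException("io")));  // false
    }
}
```

A caller that hits a fatal failure would then run its shutdown hooks and exit with a nonzero code rather than looping.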
[jira] [Assigned] (SPARK-24321) Extract common code from Divide/Remainder to a base trait
[ https://issues.apache.org/jira/browse/SPARK-24321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24321: Assignee: (was: Apache Spark) > Extract common code from Divide/Remainder to a base trait > - > > Key: SPARK-24321 > URL: https://issues.apache.org/jira/browse/SPARK-24321 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kris Mok >Priority: Minor > > There's a lot of code duplication between {{Divide}} and {{Remainder}} > expression types. The duplication is mostly in the codegen template (which is exactly the > same, with just cosmetic differences), the eval function structure, etc. > It is tedious to have to update multiple places whenever we make improvements to > the codegen templates in the future. This ticket proposes to refactor the > duplicate code into a common base trait for these two classes. > Non-goal: There is another class, {{Pmod}}, which is also similar to {{Divide}} > and {{Remainder}}, so in theory we could make a deeper refactoring to > accommodate this class as well. But the "operation" part of its codegen > template is harder to factor into the base trait, so this ticket only handles > {{Divide}} and {{Remainder}} for now. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24321) Extract common code from Divide/Remainder to a base trait
[ https://issues.apache.org/jira/browse/SPARK-24321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481247#comment-16481247 ] Apache Spark commented on SPARK-24321: -- User 'rednaxelafx' has created a pull request for this issue: https://github.com/apache/spark/pull/21367 > Extract common code from Divide/Remainder to a base trait > - > > Key: SPARK-24321 > URL: https://issues.apache.org/jira/browse/SPARK-24321 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kris Mok >Priority: Minor > > There's a lot of code duplication between {{Divide}} and {{Remainder}} > expression types. The duplication is mostly in the codegen template (which is exactly the > same, with just cosmetic differences), the eval function structure, etc. > It is tedious to have to update multiple places whenever we make improvements to > the codegen templates in the future. This ticket proposes to refactor the > duplicate code into a common base trait for these two classes. > Non-goal: There is another class, {{Pmod}}, which is also similar to {{Divide}} > and {{Remainder}}, so in theory we could make a deeper refactoring to > accommodate this class as well. But the "operation" part of its codegen > template is harder to factor into the base trait, so this ticket only handles > {{Divide}} and {{Remainder}} for now. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24321) Extract common code from Divide/Remainder to a base trait
[ https://issues.apache.org/jira/browse/SPARK-24321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24321: Assignee: Apache Spark > Extract common code from Divide/Remainder to a base trait > - > > Key: SPARK-24321 > URL: https://issues.apache.org/jira/browse/SPARK-24321 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kris Mok >Assignee: Apache Spark >Priority: Minor > > There's a lot of code duplication between {{Divide}} and {{Remainder}} > expression types. The duplication is mostly in the codegen template (which is exactly the > same, with just cosmetic differences), the eval function structure, etc. > It is tedious to have to update multiple places whenever we make improvements to > the codegen templates in the future. This ticket proposes to refactor the > duplicate code into a common base trait for these two classes. > Non-goal: There is another class, {{Pmod}}, which is also similar to {{Divide}} > and {{Remainder}}, so in theory we could make a deeper refactoring to > accommodate this class as well. But the "operation" part of its codegen > template is harder to factor into the base trait, so this ticket only handles > {{Divide}} and {{Remainder}} for now. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18778) Fix the Scala classpath in the spark-shell
[ https://issues.apache.org/jira/browse/SPARK-18778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-18778: -- Assignee: (was: Marcelo Vanzin) > Fix the Scala classpath in the spark-shell > -- > > Key: SPARK-18778 > URL: https://issues.apache.org/jira/browse/SPARK-18778 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.1, 2.0.2 >Reporter: DjvuLee >Priority: Major > > Failed to initialize compiler: object scala.runtime in compiler mirror not > found. > ** Note that as of 2.8 scala does not assume use of the java classpath. > ** For the old behavior pass -usejavacp to scala, or if using a Settings > ** object programatically, settings.usejavacp.value = true. > Exception in thread "main" java.lang.AssertionError: assertion failed: null > at scala.Predef$.assert(Predef.scala:179) > at > org.apache.spark.repl.SparkIMain.initializeSynchronous(SparkIMain.scala:247) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:990) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) > at > scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) > at > org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945) > at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059) > at org.apache.spark.repl.Main$.main(Main.scala:31) > at org.apache.spark.repl.Main.main(Main.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:498) 
> at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-18778) Fix the Scala classpath in the spark-shell
[ https://issues.apache.org/jira/browse/SPARK-18778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-18778: -- Assignee: Marcelo Vanzin > Fix the Scala classpath in the spark-shell > -- > > Key: SPARK-18778 > URL: https://issues.apache.org/jira/browse/SPARK-18778 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.1, 2.0.2 >Reporter: DjvuLee >Assignee: Marcelo Vanzin >Priority: Major > > Failed to initialize compiler: object scala.runtime in compiler mirror not > found. > ** Note that as of 2.8 scala does not assume use of the java classpath. > ** For the old behavior pass -usejavacp to scala, or if using a Settings > ** object programatically, settings.usejavacp.value = true. > Exception in thread "main" java.lang.AssertionError: assertion failed: null > at scala.Predef$.assert(Predef.scala:179) > at > org.apache.spark.repl.SparkIMain.initializeSynchronous(SparkIMain.scala:247) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:990) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) > at > org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945) > at > scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135) > at > org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945) > at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059) > at org.apache.spark.repl.Main$.main(Main.scala:31) > at org.apache.spark.repl.Main.main(Main.scala) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at 
java.lang.reflect.Method.invoke(Method.java:498) > at > org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731) > at > org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181) > at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206) > at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121) > at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24321) Extract common code from Divide/Remainder to a base trait
Kris Mok created SPARK-24321: Summary: Extract common code from Divide/Remainder to a base trait Key: SPARK-24321 URL: https://issues.apache.org/jira/browse/SPARK-24321 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.4.0 Reporter: Kris Mok There's a lot of code duplication between {{Divide}} and {{Remainder}} expression types. The duplication is mostly in the codegen template (which is exactly the same, with just cosmetic differences), the eval function structure, etc. It is tedious to have to update multiple places whenever we make improvements to the codegen templates in the future. This ticket proposes to refactor the duplicate code into a common base trait for these two classes. Non-goal: There is another class, {{Pmod}}, which is also similar to {{Divide}} and {{Remainder}}, so in theory we could make a deeper refactoring to accommodate this class as well. But the "operation" part of its codegen template is harder to factor into the base trait, so this ticket only handles {{Divide}} and {{Remainder}} for now. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
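The shape of the refactoring described above can be illustrated outside Spark: a shared base holds the common eval structure (including the shared null-on-divide-by-zero behavior), and only the arithmetic operation differs per subclass. The class names below are illustrative, not Spark's actual trait or expression classes:

```java
/** Sketch of extracting the common Divide/Remainder structure into a base type.
 *  Illustrative only; Spark's real expressions work over typed internal rows
 *  and generate code, but the shared-skeleton idea is the same. */
public class DivModSketch {
    public abstract static class DivModBase {
        // The only part that differs between Divide and Remainder.
        protected abstract long operate(long dividend, long divisor);

        // Shared eval structure: null inputs or a zero divisor yield null.
        public Long eval(Long dividend, Long divisor) {
            if (dividend == null || divisor == null || divisor == 0L) return null;
            return operate(dividend, divisor);
        }
    }

    public static class Divide extends DivModBase {
        @Override protected long operate(long a, long b) { return a / b; }
    }

    public static class Remainder extends DivModBase {
        @Override protected long operate(long a, long b) { return a % b; }
    }

    public static void main(String[] args) {
        System.out.println(new Divide().eval(7L, 2L));     // 3
        System.out.println(new Remainder().eval(7L, 2L));  // 1
        System.out.println(new Divide().eval(7L, 0L));     // null
    }
}
```

With this split, an improvement to the shared skeleton is made once in the base instead of twice in each subclass.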
[jira] [Resolved] (SPARK-17723) "log4j:WARN No appenders could be found for logger" for spark-shell --proxy-user user
[ https://issues.apache.org/jira/browse/SPARK-17723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-17723. Resolution: Cannot Reproduce This should have been fixed by SPARK-21728 (which initializes the logging system before this code runs). I tried locally and don't see that message. Please re-open if it's still an issue. > "log4j:WARN No appenders could be found for logger" for spark-shell > --proxy-user user > - > > Key: SPARK-17723 > URL: https://issues.apache.org/jira/browse/SPARK-17723 > Project: Spark > Issue Type: Improvement > Components: Spark Submit >Affects Versions: 2.1.0 >Reporter: Jacek Laskowski >Priority: Minor > > WARN messages are printed out when {{spark-shell}} starts with > {{--proxy-user}} command-line option. > {code} > $ ./bin/spark-shell --proxy-user user > log4j:WARN No appenders could be found for logger > (org.apache.hadoop.metrics2.lib.MutableMetricsFactory). > log4j:WARN Please initialize the log4j system properly. > log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more > info. > Using Spark's default log4j profile: > org/apache/spark/log4j-defaults.properties > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > Spark context Web UI available at http://192.168.65.1:4040 > Spark context available as 'sc' (master = local[*], app id = > local-1475152321458). > Spark session available as 'spark'. > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.1.0-SNAPSHOT > /_/ > Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_102) > Type in expressions to have them evaluated. > Type :help for more information. 
> scala> :quit > $ ./bin/spark-shell --version > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.1.0-SNAPSHOT > /_/ > Branch master > Compiled by user jacek on 2016-09-29T07:33:19Z > Revision 37eb9184f1e9f1c07142c66936671f4711ef407d > Url https://github.com/apache/spark.git > Type --help for more information. > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24308) Handle DataReaderFactory to InputPartition renames in left over classes
[ https://issues.apache.org/jira/browse/SPARK-24308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-24308: --- Assignee: Arun Mahadevan > Handle DataReaderFactory to InputPartition renames in left over classes > --- > > Key: SPARK-24308 > URL: https://issues.apache.org/jira/browse/SPARK-24308 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Arun Mahadevan >Assignee: Arun Mahadevan >Priority: Major > Fix For: 2.4.0 > > > SPARK-24073 renames DataReaderFactory -> InputPartition and DataReader -> > InputPartitionReader. Some classes still reflect the old name and cause > confusion. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24308) Handle DataReaderFactory to InputPartition renames in left over classes
[ https://issues.apache.org/jira/browse/SPARK-24308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-24308. - Resolution: Fixed Fix Version/s: 2.4.0 > Handle DataReaderFactory to InputPartition renames in left over classes > --- > > Key: SPARK-24308 > URL: https://issues.apache.org/jira/browse/SPARK-24308 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.4.0 >Reporter: Arun Mahadevan >Priority: Major > Fix For: 2.4.0 > > > SPARK-24073 renames DataReaderFactory -> InputPartition and DataReader -> > InputPartitionReader. Some classes still reflect the old name and cause > confusion. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24248) [K8S] Use the Kubernetes cluster as the backing store for the state of pods
[ https://issues.apache.org/jira/browse/SPARK-24248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24248: Assignee: Apache Spark > [K8S] Use the Kubernetes cluster as the backing store for the state of pods > --- > > Key: SPARK-24248 > URL: https://issues.apache.org/jira/browse/SPARK-24248 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Matt Cheah >Assignee: Apache Spark >Priority: Major > > We have a number of places in KubernetesClusterSchedulerBackend right now > that maintain the state of pods in memory. However, the Kubernetes API can > always give us the most up-to-date and correct view of what our executors are > doing. We should consider moving away from in-memory state as much as we can in > favor of using the Kubernetes cluster as the source of truth for pod status. > Maintaining less state in memory lowers the chance that we accidentally miss > updating one of these data structures and break the lifecycle of executors. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24248) [K8S] Use the Kubernetes cluster as the backing store for the state of pods
[ https://issues.apache.org/jira/browse/SPARK-24248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481217#comment-16481217 ] Apache Spark commented on SPARK-24248: -- User 'mccheah' has created a pull request for this issue: https://github.com/apache/spark/pull/21366 > [K8S] Use the Kubernetes cluster as the backing store for the state of pods > --- > > Key: SPARK-24248 > URL: https://issues.apache.org/jira/browse/SPARK-24248 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Matt Cheah >Priority: Major > > We have a number of places in KubernetesClusterSchedulerBackend right now > that maintain the state of pods in memory. However, the Kubernetes API can > always give us the most up-to-date and correct view of what our executors are > doing. We should consider moving away from in-memory state as much as we can in > favor of using the Kubernetes cluster as the source of truth for pod status. > Maintaining less state in memory lowers the chance that we accidentally miss > updating one of these data structures and break the lifecycle of executors. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24248) [K8S] Use the Kubernetes cluster as the backing store for the state of pods
[ https://issues.apache.org/jira/browse/SPARK-24248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24248: Assignee: (was: Apache Spark) > [K8S] Use the Kubernetes cluster as the backing store for the state of pods > --- > > Key: SPARK-24248 > URL: https://issues.apache.org/jira/browse/SPARK-24248 > Project: Spark > Issue Type: Improvement > Components: Kubernetes >Affects Versions: 2.3.0 >Reporter: Matt Cheah >Priority: Major > > We have a number of places in KubernetesClusterSchedulerBackend right now > that maintain the state of pods in memory. However, the Kubernetes API can > always give us the most up-to-date and correct view of what our executors are > doing. We should consider moving away from in-memory state as much as we can in > favor of using the Kubernetes cluster as the source of truth for pod status. > Maintaining less state in memory lowers the chance that we accidentally miss > updating one of these data structures and break the lifecycle of executors. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
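The design change proposed in SPARK-24248 is a general pattern: instead of mutating a local cache of pod states on every event (and risking drift), re-read the authoritative view when a decision is made. A tiny sketch of that pattern, with a supplier standing in for the Kubernetes API (names are illustrative, not the scheduler backend's actual code):

```java
import java.util.Map;
import java.util.function.Supplier;

/** Sketch: decisions query a source-of-truth supplier (here, podName -> phase)
 *  rather than a locally maintained mirror that must be kept in sync. */
public class SourceOfTruthSnapshot {
    private final Supplier<Map<String, String>> clusterQuery;

    public SourceOfTruthSnapshot(Supplier<Map<String, String>> clusterQuery) {
        this.clusterQuery = clusterQuery;
    }

    // Every call re-reads the authoritative view; there is no cache to corrupt.
    public long countRunning() {
        return clusterQuery.get().values().stream()
                .filter("Running"::equals)
                .count();
    }

    public static void main(String[] args) {
        SourceOfTruthSnapshot s = new SourceOfTruthSnapshot(
                () -> Map.of("exec-1", "Running", "exec-2", "Pending"));
        System.out.println(s.countRunning()); // 1
    }
}
```

The trade-off is extra API calls against the cluster in exchange for never having stale or missed updates in a local data structure.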
[jira] [Resolved] (SPARK-23856) Spark jdbc setQueryTimeout option
[ https://issues.apache.org/jira/browse/SPARK-23856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-23856. - Resolution: Fixed Assignee: Takeshi Yamamuro Fix Version/s: 2.4.0 > Spark jdbc setQueryTimeout option > - > > Key: SPARK-23856 > URL: https://issues.apache.org/jira/browse/SPARK-23856 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dmitry Mikhailov >Assignee: Takeshi Yamamuro >Priority: Minor > Fix For: 2.4.0 > > > It would be nice if a user could set the JDBC setQueryTimeout option when > running JDBC in Spark. I think it can be easily implemented by adding an option > field to the _JDBCOptions_ class and applying this option when initializing JDBC > statements in the _JDBCRDD_ class. But only some DB vendors support this JDBC > feature. Is it worth starting work on this option? -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
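On the database side this feature is just the standard {{java.sql.Statement.setQueryTimeout(seconds)}} call; the Spark-side work is threading an option through to where statements are created. A rough sketch of the option handling (the option name, default, and helper are assumptions for illustration, not the final API):

```java
import java.util.Map;

/** Sketch of threading a "queryTimeout" option through to java.sql.Statement.
 *  A value of 0 means "no timeout", matching setQueryTimeout's semantics. */
public class JdbcTimeoutOption {
    public static int parseQueryTimeout(Map<String, String> options) {
        String raw = options.getOrDefault("queryTimeout", "0");
        int seconds = Integer.parseInt(raw.trim());
        if (seconds < 0) {
            throw new IllegalArgumentException("queryTimeout must be >= 0");
        }
        return seconds;
    }

    // Where it would be applied when a statement is initialized:
    //   java.sql.Statement stmt = conn.createStatement();
    //   stmt.setQueryTimeout(parseQueryTimeout(options)); // driver support varies by vendor

    public static void main(String[] args) {
        System.out.println(parseQueryTimeout(Map.of("queryTimeout", "30"))); // 30
        System.out.println(parseQueryTimeout(Map.of()));                     // 0
    }
}
```

As the ticket notes, not every JDBC driver honors {{setQueryTimeout}}, so the behavior remains vendor-dependent even once the option exists.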
[jira] [Assigned] (SPARK-24149) Automatic namespaces discovery in HDFS federation
[ https://issues.apache.org/jira/browse/SPARK-24149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin reassigned SPARK-24149: -- Assignee: Marco Gaido > Automatic namespaces discovery in HDFS federation > - > > Key: SPARK-24149 > URL: https://issues.apache.org/jira/browse/SPARK-24149 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Assignee: Marco Gaido >Priority: Minor > Fix For: 2.4.0 > > > Hadoop 3 introduced HDFS federation. > Spark fails to write to different namespaces when Hadoop federation is turned > on and the cluster is secure. This happens because Spark looks for the > delegation token only for the configured defaultFS and not for all the > available namespaces. A workaround is to set the property > {{spark.yarn.access.hadoopFileSystems}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24149) Automatic namespaces discovery in HDFS federation
[ https://issues.apache.org/jira/browse/SPARK-24149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-24149. Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21216 [https://github.com/apache/spark/pull/21216] > Automatic namespaces discovery in HDFS federation > - > > Key: SPARK-24149 > URL: https://issues.apache.org/jira/browse/SPARK-24149 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.4.0 >Reporter: Marco Gaido >Assignee: Marco Gaido >Priority: Minor > Fix For: 2.4.0 > > > Hadoop 3 introduced HDFS federation. > Spark fails to write to different namespaces when Hadoop federation is turned > on and the cluster is secure. This happens because Spark looks for the > delegation token only for the configured defaultFS and not for all the > available namespaces. A workaround is to set the property > {{spark.yarn.access.hadoopFileSystems}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
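For reference, the workaround property takes a comma-separated list of filesystem URIs (for example {{spark.yarn.access.hadoopFileSystems=hdfs://ns1,hdfs://ns2}}) whose delegation tokens Spark should fetch; the fix instead discovers the namespaces automatically from the client configuration. Splitting such a list might look like the sketch below (illustrative, not Spark's actual parsing):

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: split a comma-separated filesystem list into the URIs
 *  to fetch delegation tokens for, ignoring blanks and stray commas. */
public class NamespaceList {
    public static List<String> parse(String value) {
        List<String> out = new ArrayList<>();
        if (value == null) return out;
        for (String part : value.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) out.add(trimmed);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(parse("hdfs://ns1, hdfs://ns2 ,")); // [hdfs://ns1, hdfs://ns2]
    }
}
```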
[jira] [Resolved] (SPARK-24312) Upgrade to 2.3.3 for Hive Metastore Client 2.3
[ https://issues.apache.org/jira/browse/SPARK-24312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-24312. - Resolution: Fixed Assignee: Dongjoon Hyun Fix Version/s: 2.4.0 > Upgrade to 2.3.3 for Hive Metastore Client 2.3 > -- > > Key: SPARK-24312 > URL: https://issues.apache.org/jira/browse/SPARK-24312 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 2.4.0 > > > Hive 2.3.3 is [released on April > 3rd|https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12342162=Text=12310843]. > This issue aims to upgrade Hive Metastore Client 2.3 from 2.3.2 to 2.3.3. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24320) Cannot read file names with spaces
[ https://issues.apache.org/jira/browse/SPARK-24320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zachary Radtka updated SPARK-24320: --- Description: I am trying to read from a file on HDFS that has a space in the file name, e.g. "file 1.csv", and I get a `java.io.FileNotFoundException: File does not exist` error. The versions of software I am using are: * Spark: 2.2.0.2.6.3.0-235 * Scala: version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112) As a reproducible example I have the same file in HDFS named "file.csv" and "file 1.csv": {code:none} $ hdfs dfs -ls /tmp -rw-r--r-- 3 hdfs hdfs 441646 2018-05-18 18:45 /tmp/file 1.csv -rw-r--r-- 3 hdfs hdfs 441646 2018-05-18 18:45 /tmp/file.csv{code} The following script was used to successfully read from the file that does not have a space in the name: {code:java} scala> val if1 = "/tmp/file.csv" if1: String = /tmp/file.csv scala> val origTable = spark.read.format("csv").option("header", "true").option("delimiter", ",").option("multiLine", true).option("escape", "\"").load(if1); origTable: org.apache.spark.sql.DataFrame = [DATA REDACTED] scala> origTable.take(2) res3: Array[org.apache.spark.sql.Row] = Array([DATA REDACTED]) {code} The same script was used to try and read from the file that has a space in the name: {code:java} scala> val if2 = "/tmp/file 1.csv" if2: String = /tmp/file 1.csv scala> val origTable = spark.read.format("csv").option("header", "true").option("delimiter", ",").option("multiLine", true).option("escape", "\"").load(if2); origTable: org.apache.spark.sql.DataFrame = [DATA REDACTED] scala> origTable.take(2) 18/05/18 18:58:40 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 8) java.io.FileNotFoundException: File does not exist: /tmp/file%201.csv at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71) at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61) at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2025) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1996) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1909) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:700) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:377) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347) It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved. 
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:174) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338) at
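The `%20` in the stack trace suggests the supplied path is percent-encoded somewhere on the way to HDFS, and the encoded form is then treated as a literal file name. A minimal sketch of that encode-once behavior using plain `java.net.URI` (the `hdfs`/`namenode` components here are illustrative, not Spark's actual code path):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class PathEncoding {
    public static void main(String[] args) throws URISyntaxException {
        // The multi-argument URI constructor percent-encodes illegal
        // characters such as the space in the path component.
        String path = "/tmp/file 1.csv";
        URI uri = new URI("hdfs", "namenode", path, null);

        // If the raw (encoded) form is later handed to the NameNode as a
        // literal file name, the lookup fails exactly as in the stack trace.
        System.out.println(uri.getRawPath()); // /tmp/file%201.csv
        System.out.println(uri.getPath());    // /tmp/file 1.csv (decoded view)
    }
}
```

Reading the path back with `getPath()` instead of `getRawPath()` undoes the encoding; the bug pattern is mixing the two.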
[jira] [Created] (SPARK-24320) Cannot read file names with spaces
Zachary Radtka created SPARK-24320: -- Summary: Cannot read file names with spaces Key: SPARK-24320 URL: https://issues.apache.org/jira/browse/SPARK-24320 Project: Spark Issue Type: Bug Components: Spark Core, SQL Affects Versions: 2.2.0 Reporter: Zachary Radtka I am trying to read from a file on HDFS that has a space in the file name, e.g. "file 1.csv", and I get a `java.io.FileNotFoundException: File does not exist` error. The versions of software I am using are: * Spark: 2.2.0.2.6.3.0-235 * Scala: version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112) As a reproducible example I have the same file in HDFS named "file.csv" and "file 1.csv": {code:none} $ hdfs dfs -ls /tmp -rw-r--r-- 3 hdfs hdfs 441646 2018-05-18 18:45 /tmp/file 1.csv -rw-r--r-- 3 hdfs hdfs 441646 2018-05-18 18:45 /tmp/file.csv{code} The following script was used to successfully read from the file that does not have a space in the name: {code} scala> val if1 = "/tmp/file.csv" if1: String = /tmp/file.csv scala> val origTable = spark.read.format("csv").option("header", "true").option("delimiter", ",").option("multiLine", true).option("escape", "\"").load(if1); origTable: org.apache.spark.sql.DataFrame = [DATA REDACTED] scala> origTable.take(2) res3: Array[org.apache.spark.sql.Row] = Array([DATA REDACTED]) {code} The same script was used to try and read from the file that has a space in the name: {code} scala> val if2 = "/tmp/file 1.csv" if2: String = /tmp/file 1.csv scala> val origTable = spark.read.format("csv").option("header", "true").option("delimiter", ",").option("multiLine", true).option("escape", "\"").load(if2); origTable: org.apache.spark.sql.DataFrame = [DATA REDACTED] scala> origTable.take(2) 18/05/18 18:58:40 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 8) java.io.FileNotFoundException: File does not exist: /tmp/file%201.csv at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71) at 
org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2025) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1996) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1909) at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:700) at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:377) at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java) at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347) It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved. 
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:174) at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234) at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at
[jira] [Commented] (SPARK-23850) We should not redact username|user|url from UI by default
[ https://issues.apache.org/jira/browse/SPARK-23850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481065#comment-16481065 ] Apache Spark commented on SPARK-23850: -- User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/21365 > We should not redact username|user|url from UI by default > - > > Key: SPARK-23850 > URL: https://issues.apache.org/jira/browse/SPARK-23850 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.2.1 >Reporter: Thomas Graves >Priority: Major > > SPARK-22479 was filed to not log the JDBC credentials, but in there > they also added the username and url to be redacted. I'm not sure why these > were added, and to me by default these do not have security concerns. It > makes it more usable by default to be able to see these things. Users with > high security concerns can simply add them in their configs. > Also on YARN just redacting url doesn't secure anything because if you go to > the environment UI page you see all sorts of paths, and really it's just > confusing that some of it is redacted and other parts aren't. If this was > specifically for JDBC I think it needs to be applied just there and not > broadly. > If we remove these we need to test what the JDBC driver is going to log from > SPARK-22479. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
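A rough sketch of the kind of regex-keyed config redaction under discussion, showing why widening the pattern from secrets to `user|url` masks harmless values too. The `Redactor` class, pattern strings, and mask are illustrative; Spark's actual mechanism is the `spark.redaction.regex` setting, not this code:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class Redactor {
    static final String MASK = "*********(redacted)";

    // Mask the value of every config entry whose KEY matches the pattern.
    static Map<String, String> redact(Pattern keyPattern, Map<String, String> conf) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : conf.entrySet()) {
            boolean hit = keyPattern.matcher(e.getKey()).find();
            out.put(e.getKey(), hit ? MASK : e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("spark.ui.proxyBase.url", "http://gateway/jobs"); // harmless path
        conf.put("spark.yarn.access.user", "alice");               // harmless name
        conf.put("db.password", "hunter2");                        // a real secret

        // Narrow pattern (secrets only) vs. a broad one that also catches user|url.
        Pattern narrow = Pattern.compile("(?i)secret|password|token");
        Pattern broad  = Pattern.compile("(?i)secret|password|token|user|url");

        System.out.println(redact(narrow, conf)); // only db.password masked
        System.out.println(redact(broad, conf));  // all three entries masked
    }
}
```

With the broad pattern, every entry above is masked in the UI, which is the usability complaint in the issue.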
[jira] [Commented] (SPARK-23928) High-order function: shuffle(x) → array
[ https://issues.apache.org/jira/browse/SPARK-23928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481062#comment-16481062 ] H Lu commented on SPARK-23928: -- OK. I finally got it. I read the code and found classOf[Random].getName is what I want!! > High-order function: shuffle(x) → array > --- > > Key: SPARK-23928 > URL: https://issues.apache.org/jira/browse/SPARK-23928 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Generate a random permutation of the given array x. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23928) High-order function: shuffle(x) → array
[ https://issues.apache.org/jira/browse/SPARK-23928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481024#comment-16481024 ] H Lu edited comment on SPARK-23928 at 5/18/18 6:43 PM: --- Dear Watchers ([~mn-mikke] [~kiszk] [~viirya]), Can someone help me with the code here? I would like to use Random. But when running tests, got the error: Caused by: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 45, Column 15: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 45, Column 15: Unknown variable or type "Random" I already imported scala.util.Random. I am not yet an expert on CodeGen. So any comments would be appreciated! {code:java} s""" |for (int k = $length - 1; k >= 1; k--) { | int l = Random.nextInt(k + 1); | $swapAssigments |} """.stripMargin {code} was (Author: hzlu): Dear Watchers ([~mn-mikke] [~kiszk]), Can someone help me with the code here? I would like to use Random. But when running tests, got the error: Caused by: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 45, Column 15: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 45, Column 15: Unknown variable or type "Random" I already imported scala.util.Random. I am not yet an expert on CodeGen. So any comments would be appreciated! {code:java} s""" |for (int k = $length - 1; k >= 1; k--) { | int l = Random.nextInt(k + 1); | $swapAssigments |} """.stripMargin {code} > High-order function: shuffle(x) → array > --- > > Key: SPARK-23928 > URL: https://issues.apache.org/jira/browse/SPARK-23928 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Generate a random permutation of the given array x. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23928) High-order function: shuffle(x) → array
[ https://issues.apache.org/jira/browse/SPARK-23928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481024#comment-16481024 ] H Lu edited comment on SPARK-23928 at 5/18/18 6:40 PM: --- Dear Watchers ([~mn-mikke] [~kiszk]), Can someone help me with the code here? I would like to use Random. But when running tests, got the error: Caused by: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 45, Column 15: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 45, Column 15: Unknown variable or type "Random" I already imported scala.util.Random. I am not yet an expert on CodeGen. So any comments would be appreciated! {code:java} s""" |for (int k = $length - 1; k >= 1; k--) { | int l = Random.nextInt(k + 1); | $swapAssigments |} """.stripMargin {code} was (Author: hzlu): Dear Watchers ([~mn-mikke] [~kiszk]), Can someone help me with the code here? I would like to use Random. But when running tests, got the error: Caused by: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 45, Column 15: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 45, Column 15: Unknown variable or type "Random" I am not yet an expert on CodeGen. So any comments would be appreciated! {code:java} s""" |for (int k = $length - 1; k >= 1; k--) { | int l = Random.nextInt(k + 1); | $swapAssigments |} """.stripMargin {code} > High-order function: shuffle(x) → array > --- > > Key: SPARK-23928 > URL: https://issues.apache.org/jira/browse/SPARK-23928 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Generate a random permutation of the given array x. 
-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23928) High-order function: shuffle(x) → array
[ https://issues.apache.org/jira/browse/SPARK-23928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481024#comment-16481024 ] H Lu commented on SPARK-23928: -- Dear Watchers ([~mn-mikke] [~kiszk]), Can someone help me with the code here? I would like to use Random. But when running tests, got the error: Caused by: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 45, Column 15: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 45, Column 15: Unknown variable or type "Random" I am not yet an expert on CodeGen. So any comments would be appreciated! {code:java} s""" |for (int k = $length - 1; k >= 1; k--) { | int l = Random.nextInt(k + 1); | $swapAssigments |} """.stripMargin {code} > High-order function: shuffle(x) → array > --- > > Key: SPARK-23928 > URL: https://issues.apache.org/jira/browse/SPARK-23928 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Major > > Ref: https://prestodb.io/docs/current/functions/array.html > Generate a random permutation of the given array x. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
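The codegen template in the comment is a Fisher–Yates shuffle. Written out as ordinary Java with `java.util.Random` fully qualified, the same loop compiles and runs fine, which is consistent with the later finding that emitting the fully qualified class name (via `classOf[Random].getName`) resolves the "Unknown variable or type" codegen error. This is a standalone sketch, not Spark's generated code:

```java
public class ShuffleSketch {
    // Fisher-Yates: walk from the end, swapping each element with a
    // uniformly chosen partner at or before it. Note the fully qualified
    // java.util.Random -- a generated class has no scala.util.Random
    // import, which is why the bare name "Random" failed to compile.
    static void shuffle(int[] values, java.util.Random rnd) {
        for (int k = values.length - 1; k >= 1; k--) {
            int l = rnd.nextInt(k + 1);   // partner index in [0, k]
            int tmp = values[k];          // the $swapAssigments step
            values[k] = values[l];
            values[l] = tmp;
        }
    }

    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5};
        shuffle(a, new java.util.Random(42));
        // Prints some permutation of 1..5; which one depends on the seed.
        System.out.println(java.util.Arrays.toString(a));
    }
}
```

A shuffle must preserve the multiset of elements, only their order changes; that invariant is what a unit test of this expression would check.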
[jira] [Commented] (SPARK-14220) Build and test Spark against Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481016#comment-16481016 ] Anthony Cros commented on SPARK-14220: -- > Lack of support for Scala 2.12 is holding back our adoption of Spark at Stripe. Likewise for at least my team here at the Children's Hospital of Philadelphia (CHOP). I was able to build Spark for 2.12 from source but it was a painful experience... I'd love regular updates on the progress of this as well! > Build and test Spark against Scala 2.12 > --- > > Key: SPARK-14220 > URL: https://issues.apache.org/jira/browse/SPARK-14220 > Project: Spark > Issue Type: Umbrella > Components: Build, Project Infra >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Priority: Blocker > > This umbrella JIRA tracks the requirements for building and testing Spark > against the current Scala 2.12 milestone. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24319) run-example cannot print usage
Bryan Cutler created SPARK-24319: Summary: run-example cannot print usage Key: SPARK-24319 URL: https://issues.apache.org/jira/browse/SPARK-24319 Project: Spark Issue Type: Bug Components: Spark Submit Affects Versions: 2.4.0 Reporter: Bryan Cutler Running "bin/run-example" with no args or with "--help" will not print usage and just gives the error {noformat} $ bin/run-example Exception in thread "main" java.lang.IllegalArgumentException: Missing application resource. at org.apache.spark.launcher.CommandBuilderUtils.checkArgument(CommandBuilderUtils.java:241) at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitArgs(SparkSubmitCommandBuilder.java:181) at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitCommand(SparkSubmitCommandBuilder.java:296) at org.apache.spark.launcher.SparkSubmitCommandBuilder.buildCommand(SparkSubmitCommandBuilder.java:162) at org.apache.spark.launcher.Main.main(Main.java:86){noformat} It looks like there is an env var in the script that shows usage, but it's getting preempted by something else. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-14220) Build and test Spark against Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480981#comment-16480981 ] Andy Scott commented on SPARK-14220: Lack of support for Scala 2.12 is holding back our adoption of Spark at Stripe. Is there a good place to go to get an update on the progress and see a timeline for when a release might be available? Additionally, Scala 2.13 will be released soon. What should we expect in terms of timeline for Spark support for Scala 2.13? > Build and test Spark against Scala 2.12 > --- > > Key: SPARK-14220 > URL: https://issues.apache.org/jira/browse/SPARK-14220 > Project: Spark > Issue Type: Umbrella > Components: Build, Project Infra >Affects Versions: 2.1.0 >Reporter: Josh Rosen >Priority: Blocker > > This umbrella JIRA tracks the requirements for building and testing Spark > against the current Scala 2.12 milestone. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24159) Enable no-data micro batches for streaming mapGroupswithState
[ https://issues.apache.org/jira/browse/SPARK-24159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-24159. -- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21345 [https://github.com/apache/spark/pull/21345] > Enable no-data micro batches for streaming mapGroupswithState > - > > Key: SPARK-24159 > URL: https://issues.apache.org/jira/browse/SPARK-24159 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Major > Fix For: 2.4.0 > > > When event-time timeout is enabled, then use watermark updates to decide > whether to run another batch. > When processing-time timeout is enabled, then use the processing time to > decide when to run more batches. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24159) Enable no-data micro batches for streaming mapGroupswithState
[ https://issues.apache.org/jira/browse/SPARK-24159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu reassigned SPARK-24159: Assignee: Tathagata Das > Enable no-data micro batches for streaming mapGroupswithState > - > > Key: SPARK-24159 > URL: https://issues.apache.org/jira/browse/SPARK-24159 > Project: Spark > Issue Type: Sub-task > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Tathagata Das >Assignee: Tathagata Das >Priority: Major > Fix For: 2.4.0 > > > When event-time timeout is enabled, then use watermark updates to decide > whether to run another batch. > When processing-time timeout is enabled, then use the processing time to > decide when to run more batches. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20538) Dataset.reduce operator should use withNewExecutionId (as foreach or foreachPartition)
[ https://issues.apache.org/jira/browse/SPARK-20538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu reassigned SPARK-20538: Assignee: Soham Aurangabadkar > Dataset.reduce operator should use withNewExecutionId (as foreach or > foreachPartition) > -- > > Key: SPARK-20538 > URL: https://issues.apache.org/jira/browse/SPARK-20538 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Assignee: Soham Aurangabadkar >Priority: Trivial > Fix For: 2.4.0 > > > {{Dataset.reduce}} is not tracked using {{executionId}} so it's not displayed > in SQL tab (like {{foreach}} or {{foreachPartition}}). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20538) Dataset.reduce operator should use withNewExecutionId (as foreach or foreachPartition)
[ https://issues.apache.org/jira/browse/SPARK-20538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shixiong Zhu resolved SPARK-20538. -- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21316 [https://github.com/apache/spark/pull/21316] > Dataset.reduce operator should use withNewExecutionId (as foreach or > foreachPartition) > -- > > Key: SPARK-20538 > URL: https://issues.apache.org/jira/browse/SPARK-20538 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Jacek Laskowski >Priority: Trivial > Fix For: 2.4.0 > > > {{Dataset.reduce}} is not tracked using {{executionId}} so it's not displayed > in SQL tab (like {{foreach}} or {{foreachPartition}}). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24303) Update cloudpickle to v0.4.4
[ https://issues.apache.org/jira/browse/SPARK-24303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler resolved SPARK-24303. -- Resolution: Fixed Fix Version/s: 2.4.0 > Update cloudpickle to v0.4.4 > > > Key: SPARK-24303 > URL: https://issues.apache.org/jira/browse/SPARK-24303 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 2.4.0 > > > cloudpickle 0.4.4 is released - > https://github.com/cloudpipe/cloudpickle/releases/tag/v0.4.4 > The main difference is that we are now able to pickle the root logger. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24303) Update cloudpickle to v0.4.4
[ https://issues.apache.org/jira/browse/SPARK-24303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480912#comment-16480912 ] Bryan Cutler commented on SPARK-24303: -- Issue resolved by pull request 21350 https://github.com/apache/spark/pull/21350 > Update cloudpickle to v0.4.4 > > > Key: SPARK-24303 > URL: https://issues.apache.org/jira/browse/SPARK-24303 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 2.4.0 > > > cloudpickle 0.4.4 is released - > https://github.com/cloudpipe/cloudpickle/releases/tag/v0.4.4 > The main difference is that we are now able to pickle the root logger. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24303) Update cloudpickle to v0.4.4
[ https://issues.apache.org/jira/browse/SPARK-24303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Cutler reassigned SPARK-24303: Assignee: Hyukjin Kwon > Update cloudpickle to v0.4.4 > > > Key: SPARK-24303 > URL: https://issues.apache.org/jira/browse/SPARK-24303 > Project: Spark > Issue Type: Improvement > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > > cloudpickle 0.4.4 is released - > https://github.com/cloudpipe/cloudpickle/releases/tag/v0.4.4 > The main difference is that we are now able to pickle the root logger. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24318) Flaky test: SortShuffleSuite
Dongjoon Hyun created SPARK-24318: - Summary: Flaky test: SortShuffleSuite Key: SPARK-24318 URL: https://issues.apache.org/jira/browse/SPARK-24318 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Dongjoon Hyun - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/346/ - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/336/ {code} Error Message java.io.IOException: Failed to delete: /home/jenkins/workspace/spark-branch-2.3-test-sbt-hadoop-2.7/target/tmp/spark-14031101-7989-4fe2-81eb-a394311ab905 Stacktrace sbt.ForkMain$ForkError: java.io.IOException: Failed to delete: /home/jenkins/workspace/spark-branch-2.3-test-sbt-hadoop-2.7/target/tmp/spark-14031101-7989-4fe2-81eb-a394311ab905 at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1073) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
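For context on what `Utils.deleteRecursively` has to accomplish (and why it can fail midway when another thread or process still holds one of the files), here is a minimal standalone sketch in plain NIO; this is an illustration, not Spark's implementation:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

public class RecursiveDelete {
    // Delete a directory tree: children must go before their parent,
    // so walk the tree and delete deepest paths first.
    static void deleteRecursively(Path root) throws IOException {
        if (Files.notExists(root)) return;
        try (Stream<Path> walk = Files.walk(root)) {
            for (Path p : walk.sorted(Comparator.reverseOrder()).toArray(Path[]::new)) {
                // Files.delete throws IOException on the first entry that
                // cannot be removed -- the "Failed to delete" situation.
                Files.delete(p);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("spark-test-");
        Files.createFile(dir.resolve("shuffle.data"));
        deleteRecursively(dir);
        System.out.println(Files.notExists(dir)); // true
    }
}
```

A flaky cleanup like the one in the report typically means some file under the temp directory was still open when this deepest-first pass ran.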
[jira] [Updated] (SPARK-24211) Flaky test: StreamingOuterJoinSuite
[ https://issues.apache.org/jira/browse/SPARK-24211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-24211: -- Description: *windowed left outer join* - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/330/ - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/317/ *windowed right outer join* - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/334/ - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/328/ - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/371/ - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/345/ *left outer join with non-key condition violated* - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/337/ *left outer early state exclusion on left* - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/375 was: *windowed left outer join* - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/330/ - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/317/ *windowed right outer join* - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/334/ - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/328/ - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/371/ - 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/345/ *left outer join with non-key condition violated* - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/337/ > Flaky test: StreamingOuterJoinSuite > --- > > Key: SPARK-24211 > URL: https://issues.apache.org/jira/browse/SPARK-24211 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Priority: Major > > *windowed left outer join* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/330/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/317/ > *windowed right outer join* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/334/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/328/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/371/ > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/345/ > *left outer join with non-key condition violated* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/337/ > *left outer early state exclusion on left* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/375 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24317) Float-point numbers are displayed with different precision in ThriftServer2
[ https://issues.apache.org/jira/browse/SPARK-24317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480836#comment-16480836 ] Apache Spark commented on SPARK-24317: -- User 'cxzl25' has created a pull request for this issue: https://github.com/apache/spark/pull/21364 > Float-point numbers are displayed with different precision in ThriftServer2 > --- > > Key: SPARK-24317 > URL: https://issues.apache.org/jira/browse/SPARK-24317 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0 >Reporter: dzcxzl >Priority: Minor > > When querying float-point numbers , the values displayed on beeline or jdbc > are with different precision. > {code:java} > SELECT CAST(1.23 AS FLOAT) > Result: > 1.230190734863 > {code} > According to these two jira: > [HIVE-11802|https://issues.apache.org/jira/browse/HIVE-11802] > [HIVE-11832|https://issues.apache.org/jira/browse/HIVE-11832] > Make a slight modification to the spark hive thrift server. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24317) Float-point numbers are displayed with different precision in ThriftServer2
[ https://issues.apache.org/jira/browse/SPARK-24317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24317: Assignee: Apache Spark > Float-point numbers are displayed with different precision in ThriftServer2 > --- > > Key: SPARK-24317 > URL: https://issues.apache.org/jira/browse/SPARK-24317 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0 >Reporter: dzcxzl >Assignee: Apache Spark >Priority: Minor > > When querying float-point numbers , the values displayed on beeline or jdbc > are with different precision. > {code:java} > SELECT CAST(1.23 AS FLOAT) > Result: > 1.230190734863 > {code} > According to these two jira: > [HIVE-11802|https://issues.apache.org/jira/browse/HIVE-11802] > [HIVE-11832|https://issues.apache.org/jira/browse/HIVE-11832] > Make a slight modification to the spark hive thrift server. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24317) Float-point numbers are displayed with different precision in ThriftServer2
[ https://issues.apache.org/jira/browse/SPARK-24317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24317: Assignee: (was: Apache Spark) > Float-point numbers are displayed with different precision in ThriftServer2 > --- > > Key: SPARK-24317 > URL: https://issues.apache.org/jira/browse/SPARK-24317 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0 >Reporter: dzcxzl >Priority: Minor > > When querying floating-point numbers, the values displayed in beeline or via JDBC have different precision. > {code:java} > SELECT CAST(1.23 AS FLOAT) > Result: > 1.2300000190734863 > {code} > Following these two Hive JIRAs: > [HIVE-11802|https://issues.apache.org/jira/browse/HIVE-11802] > [HIVE-11832|https://issues.apache.org/jira/browse/HIVE-11832] > a slight modification is made to the Spark Hive Thrift Server. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24317) Float-point numbers are displayed with different precision in ThriftServer2
dzcxzl created SPARK-24317: -- Summary: Float-point numbers are displayed with different precision in ThriftServer2 Key: SPARK-24317 URL: https://issues.apache.org/jira/browse/SPARK-24317 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0, 2.2.0, 2.1.0, 2.0.0 Reporter: dzcxzl When querying floating-point numbers, the values displayed in beeline or via JDBC have different precision. {code:java} SELECT CAST(1.23 AS FLOAT) Result: 1.2300000190734863 {code} Following these two Hive JIRAs: [HIVE-11802|https://issues.apache.org/jira/browse/HIVE-11802] [HIVE-11832|https://issues.apache.org/jira/browse/HIVE-11832] a slight modification is made to the Spark Hive Thrift Server. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
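The extra digits in the displayed value are ordinary IEEE-754 widening: the literal 1.23 is stored as a 32-bit FLOAT, and when a client renders it through a 64-bit double, the inexact low bits become visible. As an illustration only (not Spark or Hive code), the same widening can be reproduced in Python, whose floats are 64-bit doubles:

```python
import struct

# Round-trip 1.23 through a 32-bit IEEE-754 float, then read it back as a
# 64-bit Python float (a double) -- the widening a client performs when it
# renders a FLOAT column at double precision.
as_float32 = struct.pack("<f", 1.23)          # nearest binary32 to 1.23
widened = struct.unpack("<f", as_float32)[0]  # now a binary64 value

print(widened)  # 1.2300000190734863
```

The Hive fixes referenced above took, roughly, the approach of formatting FLOAT columns at float precision rather than double precision before sending them to the client.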
[jira] [Created] (SPARK-24316) Spark sql queries stall for column width more 6k for parquet based table
Bimalendu Choudhary created SPARK-24316: --- Summary: Spark sql queries stall for column width more 6k for parquet based table Key: SPARK-24316 URL: https://issues.apache.org/jira/browse/SPARK-24316 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.2.1, 2.2.0 Reporter: Bimalendu Choudhary When we create a table from a data frame using Spark SQL with around 6k columns or more, even a simple query fetching 70k rows takes 20 minutes, while the same query against the same table created through Hive with the same data takes just 5 minutes. Instrumenting the code, we see that the executors are looping in the while loop of the function initializeInternal(). The majority of the time is spent here and the executor seems to be stalled for a long time. [VectorizedParquetRecordReader.java|http://opengrok.sjc.cloudera.com/source/xref/spark-2.2.0-cloudera1/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java] private void initializeInternal() .. .. for (int i = 0; i < requestedSchema.getFieldCount(); ++i) { ... } } When Spark SQL creates a table, it also stores the metadata in TBLPROPERTIES in JSON format. We see that if we remove this metadata from the table, the queries become fast, which is the case when we create the same table through Hive. The exact same table takes 5 times longer with the JSON metadata than without it. So it looks like as the number of columns grows beyond 5 to 6k, processing and comparing the metadata becomes more and more expensive and performance degrades drastically. To recreate the problem, simply run the following query: import org.apache.spark.sql.SparkSession val resp_data = spark.sql("SELECT * FROM duplicatefgv limit 7") resp_data.write.format("csv").save("/tmp/filename") The table should be created by Spark SQL from a dataframe so that the JSON metadata is stored.
For example: val dff = spark.read.format("csv").load("hdfs:///tmp/test.csv") dff.createOrReplaceTempView("my_temp_table") val tmp = spark.sql("Create table tableName stored as parquet as select * from my_temp_table") from pyspark.sql import SQLContext sqlContext = SQLContext(sc) resp_data = spark.sql("select * from test").limit(2000) print(resp_data.count()) resp_data.write.option('header', False).mode('overwrite').csv('/tmp/2.csv') The performance seems to be even slower in the loop if the schema does not match or the fields are empty, and the code goes into the if condition where the missing column is marked true: missingColumns[i] = true; -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
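The degradation pattern described above (fine at a few hundred columns, stalled around 6k) is what you would expect if initializing each column does work proportional to the schema or metadata size. A hypothetical Python sketch of that quadratic blow-up (illustrative only, not the actual VectorizedParquetRecordReader logic):

```python
# Hypothetical illustration (not the actual Spark code): if initializing each
# of the n requested columns does work proportional to the schema/metadata
# size -- e.g. a linear scan over field names, or re-comparing a large JSON
# schema -- the total cost is O(n^2): negligible at a few hundred columns,
# dominant at 6,000.
def count_lookup_steps(requested, file_fields):
    """Count comparisons done by a naive per-column linear scan."""
    steps = 0
    missing = []
    for name in requested:
        found = False
        for field in file_fields:   # linear scan repeated for every column
            steps += 1
            if field == name:
                found = True
                break
        missing.append(not found)
    return steps, missing

cols = [f"c{i}" for i in range(6000)]
steps, missing = count_lookup_steps(cols, cols)
# steps is ~18 million comparisons for 6,000 columns; a set/dict lookup
# would need only ~6,000.
```

With a hashed lookup the same check is one step per column, which is why per-column processing of the large JSON metadata is the prime suspect as column counts grow.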
[jira] [Commented] (SPARK-17557) SQL query on parquet table java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary
[ https://issues.apache.org/jira/browse/SPARK-17557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480723#comment-16480723 ] Padma commented on SPARK-17557: --- I encounter an issue when data resides in Hive as parquet format and when trying to read from Spark (2.2.1), facing the above issue. I notice that in my case there is date field (contains values as 2018, 2017) which is written as integer. But when reading in spark as - val df = spark.sql("SELECT * FROM db.table") df.show(3, false) java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48) To my surprise when reading same data from s3 location as - val df = spark.read.parquet("s3://path/file") df.show(3, false) // this displays the results. - Padma > SQL query on parquet table java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary > - > > Key: SPARK-17557 > URL: https://issues.apache.org/jira/browse/SPARK-17557 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Egor Pahomov >Priority: Major > > Working on 1.6.2, broken on 2.0 > {code} > select * from logs.a where year=2016 and month=9 and day=14 limit 100 > {code} > {code} > java.lang.UnsupportedOperationException: > org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary > at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48) > at > org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23667) Scala version check will fail due to launcher directory doesn't exist
[ https://issues.apache.org/jira/browse/SPARK-23667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-23667. --- Resolution: Not A Problem > Scala version check will fail due to launcher directory doesn't exist > - > > Key: SPARK-23667 > URL: https://issues.apache.org/jira/browse/SPARK-23667 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Chenzhao Guo >Priority: Major > > In some cases, when an outer project uses pre-built Spark as a dependency, > {{getScalaVersion}} will fail because the {{launcher}} directory doesn't exist. > This PR also checks the {{jars}} directory. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19228) inferSchema function processed csv date column as string and "dateFormat" DataSource option is ignored
[ https://issues.apache.org/jira/browse/SPARK-19228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480582#comment-16480582 ] Apache Spark commented on SPARK-19228: -- User 'sergey-rubtsov' has created a pull request for this issue: https://github.com/apache/spark/pull/21363 > inferSchema function processed csv date column as string and "dateFormat" > DataSource option is ignored > -- > > Key: SPARK-19228 > URL: https://issues.apache.org/jira/browse/SPARK-19228 > Project: Spark > Issue Type: Bug > Components: Input/Output, SQL >Affects Versions: 2.1.0 >Reporter: Sergey Rubtsov >Priority: Major > Labels: easyfix > Original Estimate: 6h > Remaining Estimate: 6h > > The current FastDateFormat can't properly parse dates and timestamps and does not > conform to ISO 8601. > That is why there is no support for inferring DateType or a custom > "dateFormat" option for csv parsing. > For example, I need to process user.csv like this: > {code:java} > id,project,started,ended > sergey.rubtsov,project0,12/12/2012,10/10/2015 > {code} > When I add date format options: > {code:java} > Dataset<Row> users = spark.read().format("csv").option("mode", > "PERMISSIVE").option("header", "true") > .option("inferSchema", > "true").option("dateFormat", > "dd/MM/yyyy").load("src/main/resources/user.csv"); > users.printSchema(); > {code} > the expected schema should be > {code:java} > root > |-- id: string (nullable = true) > |-- project: string (nullable = true) > |-- started: date (nullable = true) > |-- ended: date (nullable = true) > {code} > but the actual result is: > {code:java} > root > |-- id: string (nullable = true) > |-- project: string (nullable = true) > |-- started: string (nullable = true) > |-- ended: string (nullable = true) > {code} > This means that dates are processed as strings and the "dateFormat" option is ignored. 
> If I add the option > {code:java} > .option("timestampFormat", "dd/MM/yyyy") > {code} > the result is: > {code:java} > root > |-- id: string (nullable = true) > |-- project: string (nullable = true) > |-- started: timestamp (nullable = true) > |-- ended: timestamp (nullable = true) > {code} > I think the issue is somewhere in the object CSVInferSchema, function > inferField, lines 80-97: a > method "tryParseDate" needs to be added before/after "tryParseTimestamp", or > the date/timestamp processing logic needs to be changed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
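The proposed ordering (try a date format alongside the timestamp format during inference) can be sketched outside Spark. Assuming the reporter's pattern is dd/MM/yyyy (consistent with the sample dates), the Python analogue is %d/%m/%Y; this toy infer_field mirrors the idea of adding a tryParseDate step:

```python
from datetime import datetime

DATE_FMT = "%d/%m/%Y"   # Python analogue of the Java pattern dd/MM/yyyy

def infer_field(value, date_fmt=DATE_FMT):
    """Toy schema inference: try integer, then the user-supplied date
    format (the proposed tryParseDate step), else fall back to string."""
    try:
        int(value)
        return "integer"
    except ValueError:
        pass
    try:
        datetime.strptime(value, date_fmt)
        return "date"
    except ValueError:
        return "string"

row = ["sergey.rubtsov", "project0", "12/12/2012", "10/10/2015"]
types = [infer_field(v) for v in row]   # ['string', 'string', 'date', 'date']
```

The order of the attempts matters: if the timestamp parser is tried first and accepts the value, the column is inferred as timestamp, which is exactly the behavior the reporter observed with "timestampFormat".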
[jira] [Commented] (SPARK-24197) add array_sort function
[ https://issues.apache.org/jira/browse/SPARK-24197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480514#comment-16480514 ] Apache Spark commented on SPARK-24197: -- User 'mn-mikke' has created a pull request for this issue: https://github.com/apache/spark/pull/21362 > add array_sort function > --- > > Key: SPARK-24197 > URL: https://issues.apache.org/jira/browse/SPARK-24197 > Project: Spark > Issue Type: Sub-task > Components: SparkR >Affects Versions: 2.4.0 >Reporter: Marek Novotny >Assignee: Marek Novotny >Priority: Major > Fix For: 2.4.0 > > > Add a SparkR equivalent function to > [SPARK-23921|https://issues.apache.org/jira/browse/SPARK-23921]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-19228) inferSchema function processed csv date column as string and "dateFormat" DataSource option is ignored
[ https://issues.apache.org/jira/browse/SPARK-19228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sergey Rubtsov updated SPARK-19228: --- Description: The current FastDateFormat can't properly parse dates and timestamps and does not conform to ISO 8601. That is why there is no support for inferring DateType or a custom "dateFormat" option for csv parsing. For example, I need to process user.csv like this: {code:java} id,project,started,ended sergey.rubtsov,project0,12/12/2012,10/10/2015 {code} When I add date format options: {code:java} Dataset<Row> users = spark.read().format("csv").option("mode", "PERMISSIVE").option("header", "true") .option("inferSchema", "true").option("dateFormat", "dd/MM/yyyy").load("src/main/resources/user.csv"); users.printSchema(); {code} the expected schema should be {code:java} root |-- id: string (nullable = true) |-- project: string (nullable = true) |-- started: date (nullable = true) |-- ended: date (nullable = true) {code} but the actual result is: {code:java} root |-- id: string (nullable = true) |-- project: string (nullable = true) |-- started: string (nullable = true) |-- ended: string (nullable = true) {code} This means that dates are processed as strings and the "dateFormat" option is ignored. If I add the option {code:java} .option("timestampFormat", "dd/MM/yyyy") {code} the result is: {code:java} root |-- id: string (nullable = true) |-- project: string (nullable = true) |-- started: timestamp (nullable = true) |-- ended: timestamp (nullable = true) {code} I think the issue is somewhere in the object CSVInferSchema, function inferField, lines 80-97: a method "tryParseDate" needs to be added before/after "tryParseTimestamp", or the date/timestamp processing logic needs to be changed. 
was: I need to process user.csv like this: {code} id,project,started,ended sergey.rubtsov,project0,12/12/2012,10/10/2015 {code} When I add date format options: {code} Dataset<Row> users = spark.read().format("csv").option("mode", "PERMISSIVE").option("header", "true") .option("inferSchema", "true").option("dateFormat", "dd/MM/yyyy").load("src/main/resources/user.csv"); users.printSchema(); {code} the expected schema should be {code} root |-- id: string (nullable = true) |-- project: string (nullable = true) |-- started: date (nullable = true) |-- ended: date (nullable = true) {code} but the actual result is: {code} root |-- id: string (nullable = true) |-- project: string (nullable = true) |-- started: string (nullable = true) |-- ended: string (nullable = true) {code} This means that dates are processed as strings and the "dateFormat" option is ignored. If I add the option {code} .option("timestampFormat", "dd/MM/yyyy") {code} the result is: {code} root |-- id: string (nullable = true) |-- project: string (nullable = true) |-- started: timestamp (nullable = true) |-- ended: timestamp (nullable = true) {code} I think the issue is somewhere in the object CSVInferSchema, function inferField, lines 80-97: a method "tryParseDate" needs to be added before/after "tryParseTimestamp", or the date/timestamp processing logic needs to be changed. > inferSchema function processed csv date column as string and "dateFormat" > DataSource option is ignored > -- > > Key: SPARK-19228 > URL: https://issues.apache.org/jira/browse/SPARK-19228 > Project: Spark > Issue Type: Bug > Components: Input/Output, SQL >Affects Versions: 2.1.0 >Reporter: Sergey Rubtsov >Priority: Major > Labels: easyfix > Original Estimate: 6h > Remaining Estimate: 6h > > The current FastDateFormat can't properly parse dates and timestamps and does not > conform to ISO 8601. > That is why there is no support for inferring DateType or a custom > "dateFormat" option for csv parsing. 
> For example, I need to process user.csv like this: > {code:java} > id,project,started,ended > sergey.rubtsov,project0,12/12/2012,10/10/2015 > {code} > When I add date format options: > {code:java} > Dataset<Row> users = spark.read().format("csv").option("mode", > "PERMISSIVE").option("header", "true") > .option("inferSchema", > "true").option("dateFormat", > "dd/MM/yyyy").load("src/main/resources/user.csv"); > users.printSchema(); > {code} > the expected schema should be > {code:java} > root > |-- id: string (nullable = true) > |-- project: string (nullable = true) > |-- started: date (nullable = true) > |-- ended: date (nullable = true) > {code} > but the actual result is: > {code:java} > root > |-- id: string (nullable = true) > |-- project: string (nullable = true) > |-- started: string (nullable = true) > |-- ended: string (nullable = true) > {code} > This
[jira] [Updated] (SPARK-24313) Collection functions interpreted execution doesn't work with complex types
[ https://issues.apache.org/jira/browse/SPARK-24313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Gaido updated SPARK-24313: Description: Several functions working on collections return incorrect results for complex data types in interpreted mode. In particular, we consider complex data types such as BINARY and ARRAY. The affected functions are: {{array_contains}}, {{array_position}}, {{element_at}} and {{GetMapValue}}. (was: The functions {{array_contains}} and {{array_position}} return incorrect results for complex data types in interpreted mode. In particular, for arrays, binaries, etc. they always return false.) > Collection functions interpreted execution doesn't work with complex types > -- > > Key: SPARK-24313 > URL: https://issues.apache.org/jira/browse/SPARK-24313 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Priority: Minor > > Several functions working on collections return incorrect results for complex > data types in interpreted mode. In particular, we consider complex data types > such as BINARY and ARRAY. The affected functions are: {{array_contains}}, > {{array_position}}, {{element_at}} and {{GetMapValue}}. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
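One common way interpreted paths go wrong for BINARY and ARRAY values is comparing elements by reference instead of by structural equality (in Java, the default equals on arrays compares identity). Whether or not that is the exact mechanism in Spark here, the bug class is easy to illustrate:

```python
# Illustration of the bug class (a Python analogue, not Spark's code):
# comparing complex values by reference instead of by structural equality
# makes membership checks fail for equal-but-distinct objects, which reads
# as "always returns false" for array and binary elements.
def array_contains_by_identity(arr, value):
    return any(e is value for e in arr)   # buggy: reference comparison

def array_contains_by_value(arr, value):
    return any(e == value for e in arr)   # fixed: structural comparison

data = [[1, 2], [3, 4]]
probe = [1, 2]                            # equal to data[0], but a distinct object
bad = array_contains_by_identity(data, probe)   # False
good = array_contains_by_value(data, probe)     # True
```

Codegen paths typically emit type-specific comparisons, which is why the bug surfaces only in interpreted mode.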
[jira] [Updated] (SPARK-24313) Collection functions interpreted execution doesn't work with complex types
[ https://issues.apache.org/jira/browse/SPARK-24313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Gaido updated SPARK-24313: Summary: Collection functions interpreted execution doesn't work with complex types (was: array_contains/array_position interpreted execution doesn't work with complex types) > Collection functions interpreted execution doesn't work with complex types > -- > > Key: SPARK-24313 > URL: https://issues.apache.org/jira/browse/SPARK-24313 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Priority: Minor > > The functions {{array_contains}} and {{array_position}} return incorrect > results for complex data types in interpreted mode. In particular, for arrays, > binaries, etc. they always return false. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24314) interpreted element_at or GetMapValue does not work for complex types
[ https://issues.apache.org/jira/browse/SPARK-24314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480482#comment-16480482 ] Kazuaki Ishizaki commented on SPARK-24314: -- I am working on this. > interpreted element_at or GetMapValue does not work for complex types > - > > Key: SPARK-24314 > URL: https://issues.apache.org/jira/browse/SPARK-24314 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kazuaki Ishizaki >Priority: Major > > The same reason as in SPARK-24313 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-24314) interpreted element_at or GetMapValue does not work for complex types
[ https://issues.apache.org/jira/browse/SPARK-24314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki reopened SPARK-24314: -- > interpreted element_at or GetMapValue does not work for complex types > - > > Key: SPARK-24314 > URL: https://issues.apache.org/jira/browse/SPARK-24314 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kazuaki Ishizaki >Priority: Major > > The same reason in SPARK-24313 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24314) interpreted element_at or GetMapValue does not work for complex types
[ https://issues.apache.org/jira/browse/SPARK-24314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki updated SPARK-24314: - Summary: interpreted element_at or GetMapValue does not work for complex types (was: interpreted array_position does not work for complex types) > interpreted element_at or GetMapValue does not work for complex types > - > > Key: SPARK-24314 > URL: https://issues.apache.org/jira/browse/SPARK-24314 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kazuaki Ishizaki >Priority: Major > > The same reason in SPARK-24313 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-24315) Multiple streaming jobs detected error causing job failure
Marco Gaido created SPARK-24315: --- Summary: Multiple streaming jobs detected error causing job failure Key: SPARK-24315 URL: https://issues.apache.org/jira/browse/SPARK-24315 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.3.0 Reporter: Marco Gaido We are running a simple structured streaming job. It reads data from Kafka and writes it to HDFS. Unfortunately at startup, the application fails with the following error. After some restarts the application finally starts successfully. {code} org.apache.spark.sql.streaming.StreamingQueryException: assertion failed: Concurrent update to the log. Multiple streaming jobs detected for 1 === Streaming Query === at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295) at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189) Caused by: java.lang.AssertionError: assertion failed: Concurrent update to the log. 
Multiple streaming jobs detected for 1 at scala.Predef$.assert(Predef.scala:170) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1.apply$mcV$sp(MicroBatchExecution.scala:339) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1.apply(MicroBatchExecution.scala:338) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1.apply(MicroBatchExecution.scala:338) at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch(MicroBatchExecution.scala:338) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:128) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121) at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) at org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:121) at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56) at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:117) at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279) ... 1 more {code} We have not set any value for `spark.streaming.concurrentJobs`. Our code looks like: {code} // read from kafka .withWatermark("timestamp", "30 minutes") .groupBy(window($"timestamp", "1 hour", "30 minutes"), ...).count() // simple select of some fields with casts .coalesce(1) .writeStream .trigger(Trigger.ProcessingTime("2 minutes")) .option("checkpointLocation", checkpointDir) // write to HDFS .start() .awaitTermination() {code} This may also be related to the presence of some data in the kafka queue to process, so the time for the first batch may be longer than usual (as it is quite common I think). -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24313) array_contains/array_position interpreted execution doesn't work with complex types
[ https://issues.apache.org/jira/browse/SPARK-24313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Gaido updated SPARK-24313: Description: The functions {{array_contains}} and {{array_position}} return incorrect results for complex data types in interpreted mode. In particular, for arrays, binaries, etc. they always return false. (was: The function {{array_contains}} returns incorrect results for complex data types in interpreted mode. In particular, for arrays, binaries, etc. it always returns false.) > array_contains/array_position interpreted execution doesn't work with complex > types > --- > > Key: SPARK-24313 > URL: https://issues.apache.org/jira/browse/SPARK-24313 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Priority: Minor > > The functions {{array_contains}} and {{array_position}} return incorrect > results for complex data types in interpreted mode. In particular, for arrays, > binaries, etc. they always return false. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-24313) array_contains/array_position interpreted execution doesn't work with complex types
[ https://issues.apache.org/jira/browse/SPARK-24313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marco Gaido updated SPARK-24313: Summary: array_contains/array_position interpreted execution doesn't work with complex types (was: array_contains interpreted execution doesn't work with complex types) > array_contains/array_position interpreted execution doesn't work with complex > types > --- > > Key: SPARK-24313 > URL: https://issues.apache.org/jira/browse/SPARK-24313 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Priority: Minor > > The function {{array_contains}} returns incorrect results for complex data > types in interpreted mode. In particular, for arrays, binaries, etc. it always > returns false. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24314) interpreted array_position does not work for complex types
[ https://issues.apache.org/jira/browse/SPARK-24314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuaki Ishizaki resolved SPARK-24314. -- Resolution: Duplicate > interpreted array_position does not work for complex types > -- > > Key: SPARK-24314 > URL: https://issues.apache.org/jira/browse/SPARK-24314 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 >Reporter: Kazuaki Ishizaki >Priority: Major > > The same reason in SPARK-24313 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23614) Union produces incorrect results when caching is used
[ https://issues.apache.org/jira/browse/SPARK-23614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480454#comment-16480454 ] Morten Hornbech commented on SPARK-23614: - In the example provided caching is required to produce the bug, and I'm pretty sure aggregation is required as well > Union produces incorrect results when caching is used > - > > Key: SPARK-23614 > URL: https://issues.apache.org/jira/browse/SPARK-23614 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Morten Hornbech >Assignee: Liang-Chi Hsieh >Priority: Major > Labels: correctness > Fix For: 2.3.1, 2.4.0 > > > We just upgraded from 2.2 to 2.3 and our test suite caught this error: > {code:java} > case class TestData(x: Int, y: Int, z: Int) > val frame = session.createDataset(Seq(TestData(1, 2, 3), TestData(4, 5, > 6))).cache() > val group1 = frame.groupBy("x").agg(min(col("y")) as "value") > val group2 = frame.groupBy("x").agg(min(col("z")) as "value") > group1.union(group2).show() > // +---+-----+ > // |  x|value| > // +---+-----+ > // |  1|    2| > // |  4|    5| > // |  1|    2| > // |  4|    5| > // +---+-----+ > group2.union(group1).show() > // +---+-----+ > // |  x|value| > // +---+-----+ > // |  1|    3| > // |  4|    6| > // |  1|    3| > // |  4|    6| > // +---+-----+ > {code} > The error disappears if the first data frame is not cached or if the two > group by's use separate copies. I'm not sure exactly what happens on the > insides of Spark, but errors that produce incorrect results rather than > exceptions always concern me. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
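A plausible mechanism for this symptom, sketched as a hypothetical (this is the bug class, not Spark's actual cached-plan matching), is a cache lookup that treats the two aggregations as the same plan because it ignores which column is being aggregated, so the second query is served the first query's result:

```python
# Hypothetical sketch of the bug class (not Spark's implementation): a cache
# that keys plans only on the operation and grouping key, ignoring the
# aggregated column, hands back the first cached result for both queries.
plan_cache = {}

def min_by_group(rows, group_col, value_col):
    key = ("min", group_col)        # BUG: value_col is missing from the key
    if key in plan_cache:
        return plan_cache[key]
    result = {}
    for row in rows:
        g, v = row[group_col], row[value_col]
        result[g] = min(result.get(g, v), v)
    plan_cache[key] = result
    return result

rows = [{"x": 1, "y": 2, "z": 3}, {"x": 4, "y": 5, "z": 6}]
group1 = min_by_group(rows, "x", "y")   # {1: 2, 4: 5}
group2 = min_by_group(rows, "x", "z")   # wrong: the cached y-result is returned
```

Per the issue's resolution (fixed in 2.3.1 and 2.4.0), the actual Spark fix tightened how plans over the cached dataset are matched; the sketch only shows why "same shape, different column" is a dangerous equivalence for reuse.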
[jira] [Created] (SPARK-24314) interpreted array_position does not work for complex types
Kazuaki Ishizaki created SPARK-24314: Summary: interpreted array_position does not work for complex types Key: SPARK-24314 URL: https://issues.apache.org/jira/browse/SPARK-24314 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.0 Reporter: Kazuaki Ishizaki The same reason in SPARK-24313 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23614) Union produces incorrect results when caching is used
[ https://issues.apache.org/jira/browse/SPARK-23614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480438#comment-16480438 ] Yu-Jhe Li commented on SPARK-23614: --- Does this bug happen only when 1) the dataframe is cached and 2) aggregation is used? > Union produces incorrect results when caching is used > - > > Key: SPARK-23614 > URL: https://issues.apache.org/jira/browse/SPARK-23614 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Morten Hornbech >Assignee: Liang-Chi Hsieh >Priority: Major > Labels: correctness > Fix For: 2.3.1, 2.4.0 > > > We just upgraded from 2.2 to 2.3 and our test suite caught this error: > {code:java} > case class TestData(x: Int, y: Int, z: Int) > val frame = session.createDataset(Seq(TestData(1, 2, 3), TestData(4, 5, > 6))).cache() > val group1 = frame.groupBy("x").agg(min(col("y")) as "value") > val group2 = frame.groupBy("x").agg(min(col("z")) as "value") > group1.union(group2).show() > // +---+-----+ > // | x|value| > // +---+-----+ > // | 1| 2| > // | 4| 5| > // | 1| 2| > // | 4| 5| > // +---+-----+ > group2.union(group1).show() > // +---+-----+ > // | x|value| > // +---+-----+ > // | 1| 3| > // | 4| 6| > // | 1| 3| > // | 4| 6| > // +---+-----+ > {code} > The error disappears if the first data frame is not cached or if the two > group by's use separate copies. I'm not sure exactly what happens on the > insides of Spark, but errors that produce incorrect results rather than > exceptions always concern me.
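The reuse hazard the commenters are circling can be modeled outside Spark: if a cached or deduplicated subresult is keyed on something that ignores which column the aggregation reads, the second query silently reuses the first one's answer. The following is a deliberately simplified Python sketch of that failure mode, not Spark's actual planner or cache code; all names here are invented for illustration.

```python
# Toy model of the SPARK-23614 symptom: two aggregations that differ only
# in which column they read collide in a cache whose key is too coarse.
rows = [(1, 2, 3), (4, 5, 6)]  # (x, y, z), as in the reported TestData

def min_by_key(rows, value_index):
    """Group by the first field, take the min of the field at value_index."""
    out = {}
    for row in rows:
        x, v = row[0], row[value_index]
        out[x] = min(out.get(x, v), v)
    return out

cache = {}

def cached_min(rows, value_index):
    key = "min"  # BUG: the key ignores value_index, so min(y) and min(z) collide
    if key not in cache:
        cache[key] = min_by_key(rows, value_index)
    return cache[key]

group1 = cached_min(rows, 1)  # min(y) per x: correct
group2 = cached_min(rows, 2)  # min(z) per x: silently reuses group1's result
assert group1 == {1: 2, 4: 5}
assert group2 == group1  # wrong: should have been {1: 3, 4: 6}
```

The fix, in this toy as in any plan cache, is to include every distinguishing attribute of the query in the cache key.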
[jira] [Commented] (SPARK-24313) array_contains interpreted execution doesn't work with complex types
[ https://issues.apache.org/jira/browse/SPARK-24313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480392#comment-16480392 ] Apache Spark commented on SPARK-24313: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/21361 > array_contains interpreted execution doesn't work with complex types > > > Key: SPARK-24313 > URL: https://issues.apache.org/jira/browse/SPARK-24313 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Priority: Minor > > The function {{array_contains}} returns incorrect results for complex data > types in interpreted mode. In particular, for arrays, binary values, etc. it > always returns false.
[jira] [Assigned] (SPARK-24313) array_contains interpreted execution doesn't work with complex types
[ https://issues.apache.org/jira/browse/SPARK-24313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24313: Assignee: (was: Apache Spark) > array_contains interpreted execution doesn't work with complex types > > > Key: SPARK-24313 > URL: https://issues.apache.org/jira/browse/SPARK-24313 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Priority: Minor > > The function {{array_contains}} returns incorrect results for complex data > types in interpreted mode. In particular, for arrays, binary values, etc. it > always returns false.
[jira] [Assigned] (SPARK-24313) array_contains interpreted execution doesn't work with complex types
[ https://issues.apache.org/jira/browse/SPARK-24313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24313: Assignee: Apache Spark > array_contains interpreted execution doesn't work with complex types > > > Key: SPARK-24313 > URL: https://issues.apache.org/jira/browse/SPARK-24313 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Marco Gaido >Assignee: Apache Spark >Priority: Minor > > The function {{array_contains}} returns incorrect results for complex data > types in interpreted mode. In particular, for arrays, binary values, etc. it > always returns false.
[jira] [Created] (SPARK-24313) array_contains interpreted execution doesn't work with complex types
Marco Gaido created SPARK-24313: --- Summary: array_contains interpreted execution doesn't work with complex types Key: SPARK-24313 URL: https://issues.apache.org/jira/browse/SPARK-24313 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Marco Gaido The function {{array_contains}} returns incorrect results for complex data types in interpreted mode. In particular, for arrays, binary values, etc. it always returns false.
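The always-false symptom described in this issue is what you get when complex values are compared by the wrong notion of equality (e.g. object identity instead of structural equality, which is what plain {{==}} means for Java arrays and byte arrays). As a language-neutral illustration of that distinction, not the actual Spark code, here is a Python sketch where membership is tested by identity versus structure:

```python
def array_contains_buggy(arr, value):
    # Models comparing complex values by reference identity: two arrays
    # with equal contents are never "the same", so the result is always False.
    return any(v is value for v in arr)

def array_contains_fixed(arr, value):
    # Structural equality, which is what arrays/structs/binary need.
    return any(v == value for v in arr)

data = [[1, 2], [3, 4]]
probe = [3, 4]  # equal content, but a different object than data[1]
assert array_contains_buggy(data, probe) is False  # the always-false symptom
assert array_contains_fixed(data, probe) is True
```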
[jira] [Updated] (SPARK-24302) when using spark persist(),"KryoException:IndexOutOfBoundsException" happens
[ https://issues.apache.org/jira/browse/SPARK-24302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yijukang updated SPARK-24302: - Labels: apache-spark (was: ) > when using spark persist(),"KryoException:IndexOutOfBoundsException" happens > > > Key: SPARK-24302 > URL: https://issues.apache.org/jira/browse/SPARK-24302 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 1.6.0 >Reporter: yijukang >Priority: Major > Labels: apache-spark > > My operation uses Spark to insert RDD data into HBase like this: > -- > localData.persist() > localData.saveAsNewAPIHadoopDataset(jobConf.getConfiguration) > -- > This throws an exception: > com.esotericsoftware.kryo.KryoException: > java.lang.IndexOutOfBoundsException: Index: 99, Size: 6 > Serialization trace: > familyMap (org.apache.hadoop.hbase.client.Put) > at > com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) > at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42) > at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:33) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) > > When I run it without persist(): > - > localData.saveAsNewAPIHadoopDataset(jobConf.getConfiguration) > -- > it works well. What does the persist() method do?
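To the reporter's closing question: depending on the storage level, persist() can force every RDD element through a serialize/deserialize round trip (here via Kryo), whereas a direct save streams the live objects to the sink. An object whose serializer mishandles its state, as the trace suggests for HBase's Put and its familyMap, therefore fails only on the persisted path. Below is a toy Python model of that effect, using pickle in place of Kryo and an invented stand-in for Put; it is an analogy, not the actual Spark or HBase behavior.

```python
import pickle

class Put:
    """Invented stand-in for an object whose serializer mishandles its state."""
    def __init__(self, family_map):
        self.family_map = family_map

    def __reduce__(self):
        # Models a broken field serializer (akin to Kryo's FieldSerializer
        # reading a stale collection size): it drops part of the state.
        first_entry_only = dict(list(self.family_map.items())[:1])
        return (Put, (first_entry_only,))

def save(rows):
    # A direct save sees the live, fully populated objects.
    return [len(r.family_map) for r in rows]

def persist(rows):
    # A serialized storage level round-trips every element through the
    # (here: faulty) serializer before the save runs.
    return [pickle.loads(pickle.dumps(r)) for r in rows]

rows = [Put({"cf1": b"a", "cf2": b"b"})]
assert save(rows) == [2]           # without persist(): full state reaches the sink
assert save(persist(rows)) == [1]  # after the persist round trip: state corrupted
```

The practical takeaway matches the report: the serializer registration for the element type, not the save itself, is where such failures originate.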
[jira] [Resolved] (SPARK-24277) Code clean up in SQL module: HadoopMapReduceCommitProtocol/FileFormatWriter
[ https://issues.apache.org/jira/browse/SPARK-24277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-24277. - Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21329 [https://github.com/apache/spark/pull/21329] > Code clean up in SQL module: HadoopMapReduceCommitProtocol/FileFormatWriter > --- > > Key: SPARK-24277 > URL: https://issues.apache.org/jira/browse/SPARK-24277 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Trivial > Fix For: 2.4.0 > > > In HadoopMapReduceCommitProtocol and FileFormatWriter, there are unnecessary > settings in hadoop configuration. > Also clean up some code in SQL module.
[jira] [Assigned] (SPARK-24277) Code clean up in SQL module: HadoopMapReduceCommitProtocol/FileFormatWriter
[ https://issues.apache.org/jira/browse/SPARK-24277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-24277: --- Assignee: Gengliang Wang > Code clean up in SQL module: HadoopMapReduceCommitProtocol/FileFormatWriter > --- > > Key: SPARK-24277 > URL: https://issues.apache.org/jira/browse/SPARK-24277 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.1 >Reporter: Gengliang Wang >Assignee: Gengliang Wang >Priority: Trivial > Fix For: 2.4.0 > > > In HadoopMapReduceCommitProtocol and FileFormatWriter, there are unnecessary > settings in hadoop configuration. > Also clean up some code in SQL module.
[jira] [Commented] (SPARK-24288) Enable preventing predicate pushdown
[ https://issues.apache.org/jira/browse/SPARK-24288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480237#comment-16480237 ] Maryann Xue commented on SPARK-24288: - Thank you for pointing this out, [~cloud_fan]! I made {{OptimizerBarrier}} inherit from {{UnaryNode}} as a proof of concept that could quickly pass the basic tests, but it was not the optimal solution. I've just created a PR, so you guys can all take a look. > Enable preventing predicate pushdown > > > Key: SPARK-24288 > URL: https://issues.apache.org/jira/browse/SPARK-24288 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Tomasz Gawęda >Priority: Major > Attachments: SPARK-24288.simple.patch > > > Issue discussed on Mailing List: > [http://apache-spark-developers-list.1001551.n3.nabble.com/Preventing-predicate-pushdown-td23976.html] > While working with the JDBC datasource I saw that many "or" clauses with > non-equality operators cause huge performance degradation of the SQL query > sent to the database (DB2). For example: > val df = spark.read.format("jdbc").(other options to parallelize > load).load() > df.where(s"(date1 > $param1 and (date1 < $param2 or date1 is null) or x > > 100)").show() // in the real application the predicates were pushed > many lines below, with many ANDs and ORs > If I use cache() before the where, there is no predicate pushdown of this > "where" clause. However, in a production system caching many sources is a > waste of memory (especially if the pipeline is long and I must cache many > times). There are also a few more workarounds, but it would be great if Spark > supported preventing predicate pushdown by the user. > > For example: df.withAnalysisBarrier().where(...) ? > > Note that this should not be a global configuration option. If I read 2 > DataFrames, df1 and df2, I would like to specify that df1 should not have > some predicates pushed down (though some may be), but df2 should have all > predicates pushed down, even if the target query joins df1 and df2. As far as I > understand the Spark optimizer, if we use functions like `withAnalysisBarrier` > and put AnalysisBarrier explicitly in the logical plan, then predicates won't be > pushed down on these particular DataFrames and PP will still be possible on > the second one.
[jira] [Commented] (SPARK-24288) Enable preventing predicate pushdown
[ https://issues.apache.org/jira/browse/SPARK-24288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480232#comment-16480232 ] Apache Spark commented on SPARK-24288: -- User 'maryannxue' has created a pull request for this issue: https://github.com/apache/spark/pull/21360 > Enable preventing predicate pushdown > > > Key: SPARK-24288 > URL: https://issues.apache.org/jira/browse/SPARK-24288 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Tomasz Gawęda >Priority: Major > Attachments: SPARK-24288.simple.patch > > > Issue discussed on Mailing List: > [http://apache-spark-developers-list.1001551.n3.nabble.com/Preventing-predicate-pushdown-td23976.html] > While working with the JDBC datasource I saw that many "or" clauses with > non-equality operators cause huge performance degradation of the SQL query > sent to the database (DB2). For example: > val df = spark.read.format("jdbc").(other options to parallelize > load).load() > df.where(s"(date1 > $param1 and (date1 < $param2 or date1 is null) or x > > 100)").show() // in the real application the predicates were pushed > many lines below, with many ANDs and ORs > If I use cache() before the where, there is no predicate pushdown of this > "where" clause. However, in a production system caching many sources is a > waste of memory (especially if the pipeline is long and I must cache many > times). There are also a few more workarounds, but it would be great if Spark > supported preventing predicate pushdown by the user. > > For example: df.withAnalysisBarrier().where(...) ? > > Note that this should not be a global configuration option. If I read 2 > DataFrames, df1 and df2, I would like to specify that df1 should not have > some predicates pushed down (though some may be), but df2 should have all > predicates pushed down, even if the target query joins df1 and df2. As far as I > understand the Spark optimizer, if we use functions like `withAnalysisBarrier` > and put AnalysisBarrier explicitly in the logical plan, then predicates won't be > pushed down on these particular DataFrames and PP will still be possible on > the second one.
[jira] [Assigned] (SPARK-24288) Enable preventing predicate pushdown
[ https://issues.apache.org/jira/browse/SPARK-24288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24288: Assignee: (was: Apache Spark) > Enable preventing predicate pushdown > > > Key: SPARK-24288 > URL: https://issues.apache.org/jira/browse/SPARK-24288 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Tomasz Gawęda >Priority: Major > Attachments: SPARK-24288.simple.patch > > > Issue discussed on Mailing List: > [http://apache-spark-developers-list.1001551.n3.nabble.com/Preventing-predicate-pushdown-td23976.html] > While working with the JDBC datasource I saw that many "or" clauses with > non-equality operators cause huge performance degradation of the SQL query > sent to the database (DB2). For example: > val df = spark.read.format("jdbc").(other options to parallelize > load).load() > df.where(s"(date1 > $param1 and (date1 < $param2 or date1 is null) or x > > 100)").show() // in the real application the predicates were pushed > many lines below, with many ANDs and ORs > If I use cache() before the where, there is no predicate pushdown of this > "where" clause. However, in a production system caching many sources is a > waste of memory (especially if the pipeline is long and I must cache many > times). There are also a few more workarounds, but it would be great if Spark > supported preventing predicate pushdown by the user. > > For example: df.withAnalysisBarrier().where(...) ? > > Note that this should not be a global configuration option. If I read 2 > DataFrames, df1 and df2, I would like to specify that df1 should not have > some predicates pushed down (though some may be), but df2 should have all > predicates pushed down, even if the target query joins df1 and df2. As far as I > understand the Spark optimizer, if we use functions like `withAnalysisBarrier` > and put AnalysisBarrier explicitly in the logical plan, then predicates won't be > pushed down on these particular DataFrames and PP will still be possible on > the second one.
[jira] [Assigned] (SPARK-24288) Enable preventing predicate pushdown
[ https://issues.apache.org/jira/browse/SPARK-24288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24288: Assignee: Apache Spark > Enable preventing predicate pushdown > > > Key: SPARK-24288 > URL: https://issues.apache.org/jira/browse/SPARK-24288 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.0 >Reporter: Tomasz Gawęda >Assignee: Apache Spark >Priority: Major > Attachments: SPARK-24288.simple.patch > > > Issue discussed on Mailing List: > [http://apache-spark-developers-list.1001551.n3.nabble.com/Preventing-predicate-pushdown-td23976.html] > While working with the JDBC datasource I saw that many "or" clauses with > non-equality operators cause huge performance degradation of the SQL query > sent to the database (DB2). For example: > val df = spark.read.format("jdbc").(other options to parallelize > load).load() > df.where(s"(date1 > $param1 and (date1 < $param2 or date1 is null) or x > > 100)").show() // in the real application the predicates were pushed > many lines below, with many ANDs and ORs > If I use cache() before the where, there is no predicate pushdown of this > "where" clause. However, in a production system caching many sources is a > waste of memory (especially if the pipeline is long and I must cache many > times). There are also a few more workarounds, but it would be great if Spark > supported preventing predicate pushdown by the user. > > For example: df.withAnalysisBarrier().where(...) ? > > Note that this should not be a global configuration option. If I read 2 > DataFrames, df1 and df2, I would like to specify that df1 should not have > some predicates pushed down (though some may be), but df2 should have all > predicates pushed down, even if the target query joins df1 and df2. As far as I > understand the Spark optimizer, if we use functions like `withAnalysisBarrier` > and put AnalysisBarrier explicitly in the logical plan, then predicates won't be > pushed down on these particular DataFrames and PP will still be possible on > the second one.
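The barrier idea proposed in this thread can be sketched abstractly: a rewrite that pushes a filter down toward the scan stops at any node marked as a barrier, so predicates above the barrier never reach the source. The following toy plan rewriter is a Python sketch of that intended behavior, not Spark's Catalyst optimizer; all class and node names are invented.

```python
# Minimal plan tree: each node has a name, an optional child, and a
# barrier flag that blocks pushdown through it.
class Node:
    def __init__(self, name, child=None, barrier=False):
        self.name, self.child, self.barrier = name, child, barrier

    def chain(self):
        """Top-down list of node names, for easy inspection."""
        return [self.name] + (self.child.chain() if self.child else [])

def push_filter(filter_name, plan):
    """Insert a filter at the lowest point allowed: just above the leaf,
    or just above the first barrier encountered on the way down."""
    if plan.child is None or plan.barrier:
        return Node(filter_name, plan)
    return Node(plan.name, push_filter(filter_name, plan.child))

scan = Node("Scan")

# Without a barrier, the filter is pushed all the way down to the scan.
plain = Node("Project", scan)
assert push_filter("Filter", plain).chain() == ["Project", "Filter", "Scan"]

# With a barrier, the filter stays above it and never reaches the scan.
barred = Node("Project", Node("Barrier", scan, barrier=True))
assert push_filter("Filter", barred).chain() == [
    "Project", "Filter", "Barrier", "Scan"]
```

Per the description, such a barrier would be attached per DataFrame, so predicates on df2 could still be pushed down while df1's are held back.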
[jira] [Commented] (SPARK-24312) Upgrade to 2.3.3 for Hive Metastore Client 2.3
[ https://issues.apache.org/jira/browse/SPARK-24312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480218#comment-16480218 ] Apache Spark commented on SPARK-24312: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/21359 > Upgrade to 2.3.3 for Hive Metastore Client 2.3 > -- > > Key: SPARK-24312 > URL: https://issues.apache.org/jira/browse/SPARK-24312 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Priority: Major > > Hive 2.3.3 is [released on April > 3rd|https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12342162=Text=12310843]. > This issue aims to upgrade Hive Metastore Client 2.3 from 2.3.2 to 2.3.3.
[jira] [Assigned] (SPARK-24312) Upgrade to 2.3.3 for Hive Metastore Client 2.3
[ https://issues.apache.org/jira/browse/SPARK-24312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24312: Assignee: (was: Apache Spark) > Upgrade to 2.3.3 for Hive Metastore Client 2.3 > -- > > Key: SPARK-24312 > URL: https://issues.apache.org/jira/browse/SPARK-24312 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Priority: Major > > Hive 2.3.3 is [released on April > 3rd|https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12342162=Text=12310843]. > This issue aims to upgrade Hive Metastore Client 2.3 from 2.3.2 to 2.3.3.
[jira] [Assigned] (SPARK-24312) Upgrade to 2.3.3 for Hive Metastore Client 2.3
[ https://issues.apache.org/jira/browse/SPARK-24312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-24312: Assignee: Apache Spark > Upgrade to 2.3.3 for Hive Metastore Client 2.3 > -- > > Key: SPARK-24312 > URL: https://issues.apache.org/jira/browse/SPARK-24312 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > > Hive 2.3.3 is [released on April > 3rd|https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12342162=Text=12310843]. > This issue aims to upgrade Hive Metastore Client 2.3 from 2.3.2 to 2.3.3.
[jira] [Created] (SPARK-24312) Upgrade to 2.3.3 for Hive Metastore Client 2.3
Dongjoon Hyun created SPARK-24312: - Summary: Upgrade to 2.3.3 for Hive Metastore Client 2.3 Key: SPARK-24312 URL: https://issues.apache.org/jira/browse/SPARK-24312 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 2.3.0 Reporter: Dongjoon Hyun Hive 2.3.3 is [released on April 3rd|https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12342162=Text=12310843]. This issue aims to upgrade Hive Metastore Client 2.3 from 2.3.2 to 2.3.3.