[jira] [Assigned] (SPARK-22713) OOM caused by the memory contention and memory leak in TaskMemoryManager

2018-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22713:


Assignee: (was: Apache Spark)

> OOM caused by the memory contention and memory leak in TaskMemoryManager
> 
>
> Key: SPARK-22713
> URL: https://issues.apache.org/jira/browse/SPARK-22713
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 2.1.1, 2.1.2
>Reporter: Lijie Xu
>Priority: Critical
>
> The pdf version of this issue with high-quality figures is available at 
> https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/report/OOM-TaskMemoryManager.pdf.
> *[Abstract]* 
> I recently encountered an OOM error in a PageRank application 
> (_org.apache.spark.examples.SparkPageRank_). After profiling the application, 
> I found that the OOM error is related to memory contention in the shuffle 
> spill phase. Here, memory contention means that a task tries to release some 
> old memory consumers from memory to make room for new memory consumers. After 
> analyzing the OOM heap dump, I found that the root cause is a memory leak in 
> _TaskMemoryManager_. Since memory contention is common in the shuffle phase, 
> this is a critical bug/defect. In the following sections, I will use the 
> application dataflow, execution log, heap dump, and source code to identify 
> the root cause.
> *[Application]* 
> This is a PageRank application from Spark’s example library. The following 
> figure shows the application dataflow. The source code is available at \[1\].
> !https://raw.githubusercontent.com/JerryLead/Misc/master/OOM-TasksMemoryManager/figures/PageRankDataflow.png|width=100%!
> *[Failure symptoms]*
> This application has a map stage and many iterative reduce stages. An OOM 
> error occurs in a reduce task (Task-28) as follows.
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/Stage.png?raw=true|width=100%!
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/task.png?raw=true|width=100%!
>  
> *[OOM root cause identification]*
> Each executor has 1 CPU core and 6.5GB memory, so it only runs one task at a 
> time. After analyzing the application dataflow, error log, heap dump, and 
> source code, I found the following steps lead to the OOM error. 
> => The MemoryManager finds that there is not enough memory to cache the 
> _links:ShuffledRDD_ (rdd-5-28, red circles in the dataflow figure).
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/ShuffledRDD.png?raw=true|width=100%!
> => The task needs to shuffle twice (1st shuffle and 2nd shuffle in the 
> dataflow figure).
> => The task needs to generate two _ExternalAppendOnlyMap_ instances (E1 for 
> the 1st shuffle and E2 for the 2nd shuffle) in sequence.
> => The 1st shuffle begins and ends. E1 aggregates all the shuffled data of 
> the 1st shuffle and grows to 3.3 GB.
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/FirstShuffle.png?raw=true|width=100%!
> => The 2nd shuffle begins. While E2 is aggregating the shuffled data of the 
> 2nd shuffle, it finds that there is not enough memory left. This triggers 
> memory contention.
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/SecondShuffle.png?raw=true|width=100%!
> => To handle the memory contention, the _TaskMemoryManager_ releases E1 
> (spills it onto disk) and assumes that the 3.3GB space is free now.
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/MemoryContention.png?raw=true|width=100%!
> => E2 continues to aggregate the shuffled records of the 2nd shuffle. However, 
> E2 encounters an OOM error while shuffling.
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/OOMbefore.png?raw=true|width=100%!
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/OOMError.png?raw=true|width=100%!
> *[Guess]* 
> The task memory usage below reveals that there is no drop in memory usage. So, 
> the cause may be that the 3.3GB _ExternalAppendOnlyMap_ (E1) is not actually 
> released by the _TaskMemoryManager_. 
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/GCFigure.png?raw=true|width=100%!
> *[Root cause]* 
> After analyzing the heap dump, I found that the guess is correct (the 3.3GB 
> _ExternalAppendOnlyMap_ is actually not released). The 1.6GB object is 
> _ExternalAppendOnlyMap_ (E2).
> !https://github.com/JerryLead/Misc/blob/master/OOM-TasksMemoryManager/figures/heapdump.png?raw=true|width=100%!
> *[Question]* 
> Why is the released _ExternalAppendOnlyMap_ still in memory?
> The source code of _ExternalAppendOnlyMap_ shows that the _currentMap_ 
> (_AppendOnlyMap_) has been set to _null_.
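Below is a minimal, self-contained Scala sketch of the leak pattern described above (the classes are invented for illustration and are not Spark's actual _TaskMemoryManager_ or _ExternalAppendOnlyMap_): a consumer reports its memory as released after a forced spill, but a previously created iterator still pins the old map on the heap, so the JVM cannot reclaim the bytes the bookkeeping counts as free.
{code}
// Simplified illustration of the suspected leak, not Spark's actual code.
class LeakyConsumer {
  // Stand-in for the in-memory AppendOnlyMap (~80MB of longs).
  private var currentMap: Array[Long] = new Array[Long](10 * 1000 * 1000)

  // An iterator created before the spill captures a reference to the map.
  def iterator: Iterator[Long] = {
    val captured = currentMap
    captured.iterator
  }

  // "Spill": the bookkeeping says the bytes are free, but only the field is nulled.
  def forceSpill(): Long = {
    val released = currentMap.length * 8L
    currentMap = null
    released
  }
}

object LeakDemo extends App {
  val consumer = new LeakyConsumer
  val it = consumer.iterator          // live reference to the big map
  val freed = consumer.forceSpill()
  println(s"reported as freed: $freed bytes, but `it` still keeps the map reachable")
}
{code}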

[jira] [Commented] (SPARK-22713) OOM caused by the memory contention and memory leak in TaskMemoryManager

2018-05-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481475#comment-16481475
 ] 

Apache Spark commented on SPARK-22713:
--

User 'eyalfa' has created a pull request for this issue:
https://github.com/apache/spark/pull/21369


[jira] [Assigned] (SPARK-22713) OOM caused by the memory contention and memory leak in TaskMemoryManager

2018-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22713:


Assignee: Apache Spark


[jira] [Commented] (SPARK-22713) OOM caused by the memory contention and memory leak in TaskMemoryManager

2018-05-18 Thread Eyal Farago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481468#comment-16481468
 ] 

Eyal Farago commented on SPARK-22713:
-

[~jerrylead], excellent investigation and description of the issue, I'll open a 
PR shortly.


[jira] [Resolved] (SPARK-23503) continuous execution should sequence committed epochs

2018-05-18 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-23503.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 20936
[https://github.com/apache/spark/pull/20936]

> continuous execution should sequence committed epochs
> -
>
> Key: SPARK-23503
> URL: https://issues.apache.org/jira/browse/SPARK-23503
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Jose Torres
>Priority: Major
> Fix For: 3.0.0
>
>
> Currently, the EpochCoordinator doesn't enforce a commit order. If a message 
> for epoch n gets lost in the ether, and epoch n + 1 happens to be ready for 
> commit earlier, epoch n + 1 will be committed.
>  
> This is either incorrect or needlessly confusing, because it's not safe to 
> start from the end offset of epoch n + 1 until epoch n is committed. 
> EpochCoordinator should enforce this sequencing.
>  
> Note that this is not actually a problem right now, because the commit 
> messages go through the same RPC channel from the same place. But we 
> shouldn't implicitly bake this assumption in.
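As an illustration of the proposed sequencing (a hedged sketch with hypothetical names, not the actual EpochCoordinator code), out-of-order commit-ready epochs can be buffered and drained strictly in order, so epoch n + 1 is never committed before epoch n:
{code}
import scala.collection.mutable

// Buffer commit-ready epochs and only commit each epoch once all earlier
// epochs have been committed.
class EpochSequencer(firstEpoch: Long, commit: Long => Unit) {
  private var nextToCommit = firstEpoch
  private val ready = mutable.SortedSet.empty[Long]

  def epochReady(epoch: Long): Unit = {
    ready += epoch
    // Drain in strict order: commit n, then n + 1, and so on.
    while (ready.contains(nextToCommit)) {
      commit(nextToCommit)
      ready -= nextToCommit
      nextToCommit += 1
    }
  }
}
{code}
With this shape, a commit-ready message for epoch n + 1 that arrives early is simply held until epoch n has been committed.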






[jira] [Assigned] (SPARK-23850) We should not redact username|user|url from UI by default

2018-05-18 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-23850:
--

Assignee: Marcelo Vanzin

> We should not redact username|user|url from UI by default
> -
>
> Key: SPARK-23850
> URL: https://issues.apache.org/jira/browse/SPARK-23850
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.2.1
>Reporter: Thomas Graves
>Assignee: Marcelo Vanzin
>Priority: Major
> Fix For: 2.2.2, 2.3.1, 2.4.0
>
>
> SPARK-22479 was filed to avoid logging JDBC credentials, but it also added the 
> username and url to the list of redacted values. I'm not sure why these were 
> added; by default they do not raise security concerns, and being able to see 
> them makes the UI more usable out of the box. Users with high security 
> concerns can simply add them in their configs.
> Also, on YARN, just redacting the url doesn't secure anything, because the 
> environment UI page shows all sorts of paths, and it's just confusing that 
> some of it is redacted while other parts aren't. If this was specifically for 
> JDBC, I think it needs to be applied just there and not broadly.
> If we remove these, we need to test what the JDBC driver is going to log from 
> SPARK-22479.
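For reference, a hedged example of the opt-in mentioned above (the key {{spark.redaction.regex}} exists; the exact default pattern varies by Spark version, so the value below is illustrative only):
{code}
import org.apache.spark.SparkConf

// Users with high security concerns can add username/user/url back to the
// redaction pattern themselves; the default should not redact them.
val conf = new SparkConf()
  .set("spark.redaction.regex", "(?i)secret|password|token|url|user|username")
{code}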






[jira] [Resolved] (SPARK-23850) We should not redact username|user|url from UI by default

2018-05-18 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23850?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-23850.

   Resolution: Fixed
Fix Version/s: 2.3.1
   2.4.0
   2.2.2

Issue resolved by pull request 21365
[https://github.com/apache/spark/pull/21365]







[jira] [Commented] (SPARK-16451) Spark-shell / pyspark should finish gracefully when "SaslException: GSS initiate failed" is hit

2018-05-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481307#comment-16481307
 ] 

Apache Spark commented on SPARK-16451:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/21368

> Spark-shell / pyspark should finish gracefully when "SaslException: GSS 
> initiate failed" is hit
> ---
>
> Key: SPARK-16451
> URL: https://issues.apache.org/jira/browse/SPARK-16451
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.6.1
>Reporter: Yesha Vora
>Priority: Major
>
> Steps to reproduce: (secure cluster)
> * kdestroy
> * spark-shell --master yarn-client
> If no valid keytab is set while running spark-shell/pyspark, the Spark client 
> never exits. It keeps printing the error messages below. 
> The Spark client should call the shutdown hook immediately and exit with a 
> proper error code.
> Currently, the user needs to explicitly shut down the process (using Ctrl+C).
> {code}
> 16/07/08 20:53:10 WARN Client: Exception encountered while connecting to the 
> server : 
> javax.security.sasl.SaslException: GSS initiate failed [Caused by 
> GSSException: No valid credentials provided (Mechanism level: Failed to find 
> any Kerberos tgt)]
>   at 
> com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:211)
>   at 
> org.apache.hadoop.security.SaslRpcClient.saslConnect(SaslRpcClient.java:413)
>   at 
> org.apache.hadoop.ipc.Client$Connection.setupSaslConnection(Client.java:595)
>   at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:397)
>   at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:761)
>   at org.apache.hadoop.ipc.Client$Connection$2.run(Client.java:757)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
>   at 
> org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:756)
>   at org.apache.hadoop.ipc.Client$Connection.access$3200(Client.java:397)
>   at org.apache.hadoop.ipc.Client.getConnection(Client.java:1617)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1448)
>   at org.apache.hadoop.ipc.Client.call(Client.java:1395)
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:233)
>   at com.sun.proxy.$Proxy25.getFileInfo(Unknown Source)
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.getFileInfo(ClientNamenodeProtocolTranslatorPB.java:816)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:256)
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
>   at com.sun.proxy.$Proxy26.getFileInfo(Unknown Source)
>   at org.apache.hadoop.hdfs.DFSClient.getFileInfo(DFSClient.java:2151)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1408)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem$25.doCall(DistributedFileSystem.java:1404)
>   at 
> org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>   at 
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1404)
>   at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1437)
>   at 
> org.apache.hadoop.yarn.client.api.impl.FileSystemTimelineWriter.<init>(FileSystemTimelineWriter.java:124)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.createTimelineWriter(TimelineClientImpl.java:316)
>   at 
> org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl.serviceStart(TimelineClientImpl.java:308)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.hadoop.yarn.client.api.impl.YarnClientImpl.serviceStart(YarnClientImpl.java:194)
>   at 
> org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>   at 
> org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:127)
>   at 
> org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
>   at 
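A hedged, user-side sketch of the desired fail-fast behavior (illustrative only, assuming the authentication failure surfaces as an exception to the caller; this is not the fix implemented for this issue): detect the GSS failure and exit with a non-zero code instead of retrying forever.
{code}
import org.apache.spark.{SparkConf, SparkContext}

object FailFastShell {
  def main(args: Array[String]): Unit = {
    try {
      val sc = new SparkContext(new SparkConf().setAppName("fail-fast-demo"))
      // ... interactive work would go here ...
      sc.stop()
    } catch {
      case e: Exception if Option(e.getMessage).exists(_.contains("GSS initiate failed")) =>
        System.err.println(s"Authentication failed, exiting: ${e.getMessage}")
        sys.exit(1) // exit with a proper error code instead of looping
    }
  }
}
{code}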

[jira] [Assigned] (SPARK-16451) Spark-shell / pyspark should finish gracefully when "SaslException: GSS initiate failed" is hit

2018-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16451:


Assignee: (was: Apache Spark)


[jira] [Assigned] (SPARK-16451) Spark-shell / pyspark should finish gracefully when "SaslException: GSS initiate failed" is hit

2018-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-16451:


Assignee: Apache Spark


[jira] [Assigned] (SPARK-24321) Extract common code from Divide/Remainder to a base trait

2018-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24321:


Assignee: (was: Apache Spark)

> Extract common code from Divide/Remainder to a base trait
> -
>
> Key: SPARK-24321
> URL: https://issues.apache.org/jira/browse/SPARK-24321
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kris Mok
>Priority: Minor
>
> There's a lot of code duplication between the {{Divide}} and {{Remainder}} 
> expression types, mostly in the codegen template (which is exactly the same, 
> with just cosmetic differences), the eval function structure, etc.
> It is tedious to have to update multiple places in case we make improvements 
> to the codegen templates in the future. This ticket proposes to refactor the 
> duplicate code into a common base trait for these two classes.
> Non-goal: There is another class, {{Pmod}}, that is also similar to {{Divide}} 
> and {{Remainder}}, so in theory we could make a deeper refactoring to 
> accommodate this class as well. But the "operation" part of its codegen 
> template is harder to factor into the base trait, so this ticket only handles 
> {{Divide}} and {{Remainder}} for now.
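A hedged Scala sketch of the proposed shape (trait and method names are assumptions for illustration, not the actual Spark expression classes): the shared null / divide-by-zero handling lives in one base trait, and each operator supplies only its own operation.
{code}
// Hypothetical names; the real expressions use codegen templates and internal types.
trait DivModLike {
  protected def evalOperation(left: Long, right: Long): Long

  // Shared eval structure: null input or zero divisor handled once, in the base trait.
  final def eval(left: Option[Long], right: Option[Long]): Option[Long] =
    (left, right) match {
      case (Some(l), Some(r)) if r != 0L => Some(evalOperation(l, r))
      case _                             => None
    }
}

object Divide extends DivModLike {
  protected def evalOperation(left: Long, right: Long): Long = left / right
}

object Remainder extends DivModLike {
  protected def evalOperation(left: Long, right: Long): Long = left % right
}
{code}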






[jira] [Commented] (SPARK-24321) Extract common code from Divide/Remainder to a base trait

2018-05-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481247#comment-16481247
 ] 

Apache Spark commented on SPARK-24321:
--

User 'rednaxelafx' has created a pull request for this issue:
https://github.com/apache/spark/pull/21367







[jira] [Assigned] (SPARK-24321) Extract common code from Divide/Remainder to a base trait

2018-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24321:


Assignee: Apache Spark







[jira] [Assigned] (SPARK-18778) Fix the Scala classpath in the spark-shell

2018-05-18 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-18778:
--

Assignee: (was: Marcelo Vanzin)

> Fix the Scala classpath in the spark-shell
> --
>
> Key: SPARK-18778
> URL: https://issues.apache.org/jira/browse/SPARK-18778
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.1, 2.0.2
>Reporter: DjvuLee
>Priority: Major
>
> Failed to initialize compiler: object scala.runtime in compiler mirror not 
> found.
> ** Note that as of 2.8 scala does not assume use of the java classpath.
> ** For the old behavior pass -usejavacp to scala, or if using a Settings
> ** object programatically, settings.usejavacp.value = true.
> Exception in thread "main" java.lang.AssertionError: assertion failed: null
> at scala.Predef$.assert(Predef.scala:179)
> at 
> org.apache.spark.repl.SparkIMain.initializeSynchronous(SparkIMain.scala:247)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:990)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
> at 
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
> at 
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
> at 
> org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
> at org.apache.spark.repl.Main$.main(Main.scala:31)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at 
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
> at 
> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)






[jira] [Assigned] (SPARK-18778) Fix the Scala classpath in the spark-shell

2018-05-18 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-18778:
--

Assignee: Marcelo Vanzin







[jira] [Created] (SPARK-24321) Extract common code from Divide/Remainder to a base trait

2018-05-18 Thread Kris Mok (JIRA)
Kris Mok created SPARK-24321:


 Summary: Extract common code from Divide/Remainder to a base trait
 Key: SPARK-24321
 URL: https://issues.apache.org/jira/browse/SPARK-24321
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: Kris Mok








[jira] [Resolved] (SPARK-17723) "log4j:WARN No appenders could be found for logger" for spark-shell --proxy-user user

2018-05-18 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-17723.

Resolution: Cannot Reproduce

This should have been fixed by SPARK-21728 (which initializes the logging 
system before this code runs). I tried locally and don't see that message. 
Please re-open if it's still an issue.

> "log4j:WARN No appenders could be found for logger" for spark-shell 
> --proxy-user user
> -
>
> Key: SPARK-17723
> URL: https://issues.apache.org/jira/browse/SPARK-17723
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Submit
>Affects Versions: 2.1.0
>Reporter: Jacek Laskowski
>Priority: Minor
>
> WARN messages are printed out when {{spark-shell}} starts with the 
> {{--proxy-user}} command-line option.
> {code}
> $ ./bin/spark-shell --proxy-user user
> log4j:WARN No appenders could be found for logger 
> (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more 
> info.
> Using Spark's default log4j profile: 
> org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
> setLogLevel(newLevel).
> Spark context Web UI available at http://192.168.65.1:4040
> Spark context available as 'sc' (master = local[*], app id = 
> local-1475152321458).
> Spark session available as 'spark'.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.1.0-SNAPSHOT
>   /_/
> Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_102)
> Type in expressions to have them evaluated.
> Type :help for more information.
> scala> :quit
> $ ./bin/spark-shell --version
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 2.1.0-SNAPSHOT
>   /_/
> Branch master
> Compiled by user jacek on 2016-09-29T07:33:19Z
> Revision 37eb9184f1e9f1c07142c66936671f4711ef407d
> Url https://github.com/apache/spark.git
> Type --help for more information.
> {code}






[jira] [Assigned] (SPARK-24308) Handle DataReaderFactory to InputPartition renames in left over classes

2018-05-18 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-24308:
---

Assignee: Arun Mahadevan

> Handle DataReaderFactory to InputPartition renames in left over classes
> ---
>
> Key: SPARK-24308
> URL: https://issues.apache.org/jira/browse/SPARK-24308
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Arun Mahadevan
>Assignee: Arun Mahadevan
>Priority: Major
> Fix For: 2.4.0
>
>
> SPARK-24073 renames DataReaderFactory -> InputPartition and DataReader -> 
> InputPartitionReader. Some classes still reflect the old names and cause 
> confusion.






[jira] [Resolved] (SPARK-24308) Handle DataReaderFactory to InputPartition renames in left over classes

2018-05-18 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24308.
-
   Resolution: Fixed
Fix Version/s: 2.4.0







[jira] [Assigned] (SPARK-24248) [K8S] Use the Kubernetes cluster as the backing store for the state of pods

2018-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24248:


Assignee: Apache Spark

> [K8S] Use the Kubernetes cluster as the backing store for the state of pods
> ---
>
> Key: SPARK-24248
> URL: https://issues.apache.org/jira/browse/SPARK-24248
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Assignee: Apache Spark
>Priority: Major
>
> We have a number of places in KubernetesClusterSchedulerBackend right now 
> that maintain the state of pods in memory. However, the Kubernetes API can 
> always give us the most up-to-date and correct view of what our executors are 
> doing. We should consider moving away from in-memory state as much as we can 
> in favor of using the Kubernetes cluster as the source of truth for pod 
> status. Maintaining less state in memory lowers the chance that we 
> accidentally miss updating one of these data structures and break the 
> lifecycle of executors.
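A hedged sketch of the idea using the fabric8 Kubernetes client (namespace and label values are illustrative, not the scheduler backend's actual code): query the cluster for executor pod status instead of consulting an in-memory map.
{code}
import io.fabric8.kubernetes.client.DefaultKubernetesClient
import scala.collection.JavaConverters._

// Ask the cluster (the source of truth) which executor pods exist and what
// phase each is in.
val client = new DefaultKubernetesClient()
val executorPhases: Map[String, String] = client
  .pods()
  .inNamespace("spark")
  .withLabel("spark-role", "executor")
  .list()
  .getItems
  .asScala
  .map(pod => pod.getMetadata.getName -> pod.getStatus.getPhase)
  .toMap
{code}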






[jira] [Commented] (SPARK-24248) [K8S] Use the Kubernetes cluster as the backing store for the state of pods

2018-05-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481217#comment-16481217
 ] 

Apache Spark commented on SPARK-24248:
--

User 'mccheah' has created a pull request for this issue:
https://github.com/apache/spark/pull/21366

> [K8S] Use the Kubernetes cluster as the backing store for the state of pods
> ---
>
> Key: SPARK-24248
> URL: https://issues.apache.org/jira/browse/SPARK-24248
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Priority: Major
>
> We have a number of places in KubernetesClusterSchedulerBackend right now 
> that maintain the state of pods in memory. However, the Kubernetes API can 
> always give us the most up-to-date and correct view of what our executors are 
> doing. We should consider moving away from in-memory state as much as we can 
> in favor of using the Kubernetes cluster as the source of truth for pod 
> status. Maintaining less state in memory lowers the chance that we 
> accidentally miss updating one of these data structures and break the 
> lifecycle of executors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24248) [K8S] Use the Kubernetes cluster as the backing store for the state of pods

2018-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24248:


Assignee: (was: Apache Spark)

> [K8S] Use the Kubernetes cluster as the backing store for the state of pods
> ---
>
> Key: SPARK-24248
> URL: https://issues.apache.org/jira/browse/SPARK-24248
> Project: Spark
>  Issue Type: Improvement
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Priority: Major
>
> We have a number of places in KubernetesClusterSchedulerBackend right now 
> that maintain the state of pods in memory. However, the Kubernetes API can 
> always give us the most up-to-date and correct view of what our executors are 
> doing. We should consider moving away from in-memory state as much as we can 
> in favor of using the Kubernetes cluster as the source of truth for pod 
> status. Maintaining less state in memory lowers the chance that we 
> accidentally miss updating one of these data structures and break the 
> lifecycle of executors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23856) Spark jdbc setQueryTimeout option

2018-05-18 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-23856.
-
   Resolution: Fixed
 Assignee: Takeshi Yamamuro
Fix Version/s: 2.4.0

> Spark jdbc setQueryTimeout option
> -
>
> Key: SPARK-23856
> URL: https://issues.apache.org/jira/browse/SPARK-23856
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dmitry Mikhailov
>Assignee: Takeshi Yamamuro
>Priority: Minor
> Fix For: 2.4.0
>
>
> It would be nice if a user could set the JDBC setQueryTimeout option when 
> running JDBC queries in Spark. I think it can be easily implemented by adding 
> an option field to the _JDBCOptions_ class and applying this option when 
> initializing JDBC statements in the _JDBCRDD_ class. However, only some DB 
> vendors support this JDBC feature. Is it worth starting work on this option?
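
A hedged sketch of how the proposed option could look from the user side; the 
option name "queryTimeout" is an assumption here, mirroring 
java.sql.Statement#setQueryTimeout, not something this issue has settled on:
{code:scala}
// Hypothetical usage sketch: "queryTimeout" is the proposed option key, to be
// forwarded to Statement.setQueryTimeout(seconds) where the vendor supports it.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/mydb")  // example URL
  .option("dbtable", "public.events")                   // example table
  .option("queryTimeout", "30")                         // seconds; proposed option
  .load()
{code}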



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24149) Automatic namespaces discovery in HDFS federation

2018-05-18 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-24149:
--

Assignee: Marco Gaido

> Automatic namespaces discovery in HDFS federation
> -
>
> Key: SPARK-24149
> URL: https://issues.apache.org/jira/browse/SPARK-24149
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.4.0
>
>
> Hadoop 3 introduced HDFS federation.
> Spark fails to write to different namespaces when Hadoop federation is turned 
> on and the cluster is secure. This happens because Spark looks for the 
> delegation token only for the configured defaultFS and not for all the 
> available namespaces. A workaround is to use the property 
> {{spark.yarn.access.hadoopFileSystems}}.
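
A minimal sketch of that workaround, assuming two federated namespaces (the 
namespace URIs below are hypothetical; in practice the property is usually 
passed with --conf at submit time so tokens are obtained for every namespace):
{code:scala}
import org.apache.spark.SparkConf

// List every federated namespace explicitly so delegation tokens are fetched
// for all of them, not only for the configured defaultFS.
val conf = new SparkConf()
  .set("spark.yarn.access.hadoopFileSystems",
       "hdfs://nameservice1,hdfs://nameservice2")
{code}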



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24149) Automatic namespaces discovery in HDFS federation

2018-05-18 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-24149.

   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21216
[https://github.com/apache/spark/pull/21216]

> Automatic namespaces discovery in HDFS federation
> -
>
> Key: SPARK-24149
> URL: https://issues.apache.org/jira/browse/SPARK-24149
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.4.0
>Reporter: Marco Gaido
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.4.0
>
>
> Hadoop 3 introduced HDFS federation.
> Spark fails to write on different namespaces when Hadoop federation is turned 
> on and the cluster is secure. This happens because Spark looks for the 
> delegation token only for the defaultFS configured and not for all the 
> available namespaces. A workaround is the usage of the property 
> {{spark.yarn.access.hadoopFileSystems}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24312) Upgrade to 2.3.3 for Hive Metastore Client 2.3

2018-05-18 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-24312.
-
   Resolution: Fixed
 Assignee: Dongjoon Hyun
Fix Version/s: 2.4.0

> Upgrade to 2.3.3 for Hive Metastore Client 2.3
> --
>
> Key: SPARK-24312
> URL: https://issues.apache.org/jira/browse/SPARK-24312
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
>Priority: Major
> Fix For: 2.4.0
>
>
> Hive 2.3.3 was [released on April 
> 3rd|https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12342162&styleName=Text&projectId=12310843].
>  This issue aims to upgrade Hive Metastore Client 2.3 from 2.3.2 to 2.3.3.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24320) Cannot read file names with spaces

2018-05-18 Thread Zachary Radtka (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zachary Radtka updated SPARK-24320:
---
Description: 
I am trying to read from a file on HDFS that has a space in the file name, e.g. 
"file 1.csv", and I get a `java.io.FileNotFoundException: File does not exist` 
error.

The versions of software I am using are:
 * Spark: 2.2.0.2.6.3.0-235
 * Scala: version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112)

As a reproducible example I have the same file in HDFS named "file.csv" and 
"file 1.csv":
{code:none}
$ hdfs dfs -ls /tmp
-rw-r--r--   3 hdfs hdfs 441646 2018-05-18 18:45 /tmp/file 1.csv
-rw-r--r--   3 hdfs hdfs 441646 2018-05-18 18:45 /tmp/file.csv{code}
 

The following script was used to successfully read from the file that does not 
have a space in the name:
{code:java}
scala> val if1 = "/tmp/file.csv"
if1: String = /tmp/file.csv

scala> val origTable = spark.read.format("csv").option("header", "true").option("delimiter", ",").option("multiLine", true).option("escape", "\"").load(if1);
origTable: org.apache.spark.sql.DataFrame = [DATA REDACTED]

scala> origTable.take(2)
res3: Array[org.apache.spark.sql.Row] = Array([DATA REDACTED])
{code}
 

The same script was used to try and read from the file that has a space in the 
name:
{code:java}
 scala> val if2 = "/tmp/file 1.csv"
 if2: String = /tmp/file 1.csv

scala> val origTable = spark.read.format("csv").option("header", 
"true").option("delimiter", ",").option("multiLine", true).option("escape", 
"\"").load(if2);
 origTable: org.apache.spark.sql.DataFrame = [DATA REDACTED]

scala> origTable.take(2)
 18/05/18 18:58:40 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 8)
 java.io.FileNotFoundException: File does not exist: /tmp/file%201.csv
 at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
 at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
 at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2025)
 at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1996)
 at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1909)
 at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:700)
 at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:377)
 at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
 at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)

It is possible the underlying files have been updated. You can explicitly 
invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in 
SQL or by recreating the Dataset/DataFrame involved.
 at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
 at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:174)
 at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
 at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
 at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
 at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
 at org.apache.spark.scheduler.Task.run(Task.scala:108)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
 at 

[jira] [Created] (SPARK-24320) Cannot read file names with spaces

2018-05-18 Thread Zachary Radtka (JIRA)
Zachary Radtka created SPARK-24320:
--

 Summary:  Cannot read file names with spaces
 Key: SPARK-24320
 URL: https://issues.apache.org/jira/browse/SPARK-24320
 Project: Spark
  Issue Type: Bug
  Components: Spark Core, SQL
Affects Versions: 2.2.0
Reporter: Zachary Radtka


I am trying to read from a file on HDFS that has a space in the file name, e.g. 
"file 1.csv", and I get a `java.io.FileNotFoundException: File does not exist` 
error.

The versions of software I am using are:
 * Spark: 2.2.0.2.6.3.0-235
 * Scala: version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_112)

As a reproducible example I have the same file in HDFS named "file.csv" and 
"file 1.csv":
{code:none}
$ hdfs dfs -ls /tmp
-rw-r--r--   3 hdfs hdfs 441646 2018-05-18 18:45 /tmp/file 1.csv
-rw-r--r--   3 hdfs hdfs 441646 2018-05-18 18:45 /tmp/file.csv{code}
 

The following script was used to successfully read from the file that does not 
have a space in the name:
{code}
scala> val if1 = "/tmp/file.csv"
if1: String = /tmp/file.csv

scala> val origTable = spark.read.format("csv").option("header", "true").option("delimiter", ",").option("multiLine", true).option("escape", "\"").load(if1);
origTable: org.apache.spark.sql.DataFrame = [DATA REDACTED]

scala> origTable.take(2)
res3: Array[org.apache.spark.sql.Row] = Array([DATA REDACTED])
{code}
 

The same script was used to try and read from the file that has a space in the 
name:
{code}
 scala> val if2 = "/tmp/file 1.csv"
 if2: String = /tmp/file 1.csv

scala> val origTable = spark.read.format("csv").option("header", 
"true").option("delimiter", ",").option("multiLine", true).option("escape", 
"\"").load(if2);
 origTable: org.apache.spark.sql.DataFrame = [DATA REDACTED]

scala> origTable.take(2)
 18/05/18 18:58:40 ERROR Executor: Exception in task 0.0 in stage 8.0 (TID 8)
 java.io.FileNotFoundException: File does not exist: /tmp/file%201.csv
 at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
 at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
 at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:2025)
 at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1996)
 at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1909)
 at 
org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:700)
 at 
org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:377)
 at 
org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
 at 
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2351)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2347)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:422)
 at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2347)

It is possible the underlying files have been updated. You can explicitly 
invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in 
SQL or by recreating the Dataset/DataFrame involved.
 at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
 at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:174)
 at 
org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:105)
 at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
 at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
 at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
 at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
 at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
 at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
 at 

[jira] [Commented] (SPARK-23850) We should not redact username|user|url from UI by default

2018-05-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481065#comment-16481065
 ] 

Apache Spark commented on SPARK-23850:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/21365

> We should not redact username|user|url from UI by default
> -
>
> Key: SPARK-23850
> URL: https://issues.apache.org/jira/browse/SPARK-23850
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.2.1
>Reporter: Thomas Graves
>Priority: Major
>
> SPARK-22479 was filed to not print the jdbc credentials in the log, but in it 
> they also added the username and url to be redacted. I'm not sure why these 
> were added; to me, by default these do not pose security concerns. It makes 
> it more usable by default to be able to see these things. Users with high 
> security concerns can simply add them in their configs.
> Also, on yarn just redacting the url doesn't secure anything, because if you 
> go to the environment UI page you see all sorts of paths, and really it's 
> just confusing that some of it is redacted and other parts aren't. If this 
> was specifically for jdbc, I think it needs to be applied just there and not 
> broadly.
> If we remove these, we need to test what the jdbc driver is going to log from 
> SPARK-22479.
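
A hedged sketch of the opt-in mentioned above, assuming the standard 
spark.redaction.regex property; the pattern shown is illustrative only, not the 
current default:
{code:scala}
import org.apache.spark.SparkConf

// Sites that do want user names and URLs hidden in the UI can extend the
// redaction pattern themselves instead of relying on the default.
val conf = new SparkConf()
  .set("spark.redaction.regex", "(?i)secret|password|token|user|username|url")
{code}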



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23928) High-order function: shuffle(x) → array

2018-05-18 Thread H Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481062#comment-16481062
 ] 

H Lu commented on SPARK-23928:
--

OK. I finally got it. I read the code and found classOf[Random].getName is what 
I want!! 
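
A minimal sketch of that idea (the helper names mirror the snippet discussed in 
the comments below and are placeholders, not the final patch):
{code:scala}
// Reference java.util.Random by its fully-qualified name so the generated Java
// compiles; an unqualified "Random" is unknown to the codegen compiler.
val randomClass = classOf[java.util.Random].getName  // "java.util.Random"

def shuffleLoop(length: String, swapAssigments: String): String =
  s"""
     |$randomClass rand = new $randomClass();
     |for (int k = $length - 1; k >= 1; k--) {
     |  int l = rand.nextInt(k + 1);
     |  $swapAssigments
     |}
   """.stripMargin

println(shuffleLoop("numElements", "// swap elements k and l here"))
{code}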

> High-order function: shuffle(x) → array
> ---
>
> Key: SPARK-23928
> URL: https://issues.apache.org/jira/browse/SPARK-23928
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Generate a random permutation of the given array x.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23928) High-order function: shuffle(x) → array

2018-05-18 Thread H Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481024#comment-16481024
 ] 

H Lu edited comment on SPARK-23928 at 5/18/18 6:43 PM:
---

Dear Watchers ([~mn-mikke]  [~kiszk] [~viirya]),

Can someone help me with the code here? I would like to use Random, but when 
running tests I got the error:

Caused by: org.codehaus.commons.compiler.CompileException: File 
'generated.java', Line 45, Column 15: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 45, 
Column 15: Unknown variable or type "Random"

I already imported scala.util.Random. I am not yet an expert on CodeGen, so any 
comments would be appreciated!
{code:java}
s"""
|for (int k = $length - 1; k >= 1; k--) {
| int l = Random.nextInt(k + 1); 
| $swapAssigments 
|}
""".stripMargin
{code}


was (Author: hzlu):
Dear Watchers ([~mn-mikke]  [~kiszk]),

Can someone help me with the code here? I would like to use Random, but when 
running tests I got the error:

Caused by: org.codehaus.commons.compiler.CompileException: File 
'generated.java', Line 45, Column 15: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 45, 
Column 15: Unknown variable or type "Random"

I already imported scala.util.Random. I am not yet an expert on CodeGen, so any 
comments would be appreciated!
{code:java}
s"""
|for (int k = $length - 1; k >= 1; k--) {
| int l = Random.nextInt(k + 1); 
| $swapAssigments 
|}
""".stripMargin
{code}

> High-order function: shuffle(x) → array
> ---
>
> Key: SPARK-23928
> URL: https://issues.apache.org/jira/browse/SPARK-23928
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Generate a random permutation of the given array x.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23928) High-order function: shuffle(x) → array

2018-05-18 Thread H Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481024#comment-16481024
 ] 

H Lu edited comment on SPARK-23928 at 5/18/18 6:40 PM:
---

Dear Watchers ([~mn-mikke]  [~kiszk]),

Can someone help me with the code here? I would like to use Random, but when 
running tests I got the error:

Caused by: org.codehaus.commons.compiler.CompileException: File 
'generated.java', Line 45, Column 15: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 45, 
Column 15: Unknown variable or type "Random"

I already imported scala.util.Random. I am not yet an expert on CodeGen, so any 
comments would be appreciated!
{code:java}
s"""
|for (int k = $length - 1; k >= 1; k--) {
| int l = Random.nextInt(k + 1); 
| $swapAssigments 
|}
""".stripMargin
{code}


was (Author: hzlu):
Dear Watchers ([~mn-mikke]  [~kiszk]),

Can someone help me with the code here? I would like to use Random, but when 
running tests I got the error:

Caused by: org.codehaus.commons.compiler.CompileException: File 
'generated.java', Line 45, Column 15: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 45, 
Column 15: Unknown variable or type "Random"

I am not yet an expert on CodeGen. So any comments would be appreciated! 
{code:java}
s"""
|for (int k = $length - 1; k >= 1; k--) {
| int l = Random.nextInt(k + 1); 
| $swapAssigments 
|}
""".stripMargin
{code}

> High-order function: shuffle(x) → array
> ---
>
> Key: SPARK-23928
> URL: https://issues.apache.org/jira/browse/SPARK-23928
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Generate a random permutation of the given array x.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23928) High-order function: shuffle(x) → array

2018-05-18 Thread H Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481024#comment-16481024
 ] 

H Lu commented on SPARK-23928:
--

Dear Watchers ([~mn-mikke]  [~kiszk]),

Can someone help me with the code here? I would like to use Random, but when 
running tests I got the error:

Caused by: org.codehaus.commons.compiler.CompileException: File 
'generated.java', Line 45, Column 15: failed to compile: 
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 45, 
Column 15: Unknown variable or type "Random"

I am not yet an expert on CodeGen. So any comments would be appreciated! 
{code:java}
s"""
|for (int k = $length - 1; k >= 1; k--) {
| int l = Random.nextInt(k + 1); 
| $swapAssigments 
|}
""".stripMargin
{code}

> High-order function: shuffle(x) → array
> ---
>
> Key: SPARK-23928
> URL: https://issues.apache.org/jira/browse/SPARK-23928
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> Ref: https://prestodb.io/docs/current/functions/array.html
> Generate a random permutation of the given array x.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14220) Build and test Spark against Scala 2.12

2018-05-18 Thread Anthony Cros (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16481016#comment-16481016
 ] 

Anthony Cros commented on SPARK-14220:
--

>> Lack of support for Scala 2.12 is holding back our adoption of Spark at 
>> Stripe

Likewise for at least my team here at the Children's Hospital of 
Philadelphia (CHOP).

I was able to build Spark for 2.12 from source but it was a painful 
experience... I'd love regular updates on the progress of this as well!

> Build and test Spark against Scala 2.12
> ---
>
> Key: SPARK-14220
> URL: https://issues.apache.org/jira/browse/SPARK-14220
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.12 milestone.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24319) run-example can not print usage

2018-05-18 Thread Bryan Cutler (JIRA)
Bryan Cutler created SPARK-24319:


 Summary: run-example can not print usage
 Key: SPARK-24319
 URL: https://issues.apache.org/jira/browse/SPARK-24319
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 2.4.0
Reporter: Bryan Cutler


Running "bin/run-example" with no args or with "–help" will not print usage and 
just gives the error
{noformat}
$ bin/run-example
Exception in thread "main" java.lang.IllegalArgumentException: Missing 
application resource.
    at 
org.apache.spark.launcher.CommandBuilderUtils.checkArgument(CommandBuilderUtils.java:241)
    at 
org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitArgs(SparkSubmitCommandBuilder.java:181)
    at 
org.apache.spark.launcher.SparkSubmitCommandBuilder.buildSparkSubmitCommand(SparkSubmitCommandBuilder.java:296)
    at 
org.apache.spark.launcher.SparkSubmitCommandBuilder.buildCommand(SparkSubmitCommandBuilder.java:162)
    at org.apache.spark.launcher.Main.main(Main.java:86){noformat}

it looks like there is an env var in the script that shows usage, but it's 
getting preempted by something else



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14220) Build and test Spark against Scala 2.12

2018-05-18 Thread Andy Scott (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480981#comment-16480981
 ] 

Andy Scott commented on SPARK-14220:


Lack of support for Scala 2.12 is holding back our adoption of Spark at Stripe. 
Is there a good place to go to get an update on the progress and see a timeline 
for when a release might be available?

Additionally, Scala 2.13 will be released soon. What should we expect in terms 
of timeline for Spark support for Scala 2.13?

> Build and test Spark against Scala 2.12
> ---
>
> Key: SPARK-14220
> URL: https://issues.apache.org/jira/browse/SPARK-14220
> Project: Spark
>  Issue Type: Umbrella
>  Components: Build, Project Infra
>Affects Versions: 2.1.0
>Reporter: Josh Rosen
>Priority: Blocker
>
> This umbrella JIRA tracks the requirements for building and testing Spark 
> against the current Scala 2.12 milestone.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24159) Enable no-data micro batches for streaming mapGroupswithState

2018-05-18 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-24159.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21345
[https://github.com/apache/spark/pull/21345]

> Enable no-data micro batches for streaming mapGroupswithState
> -
>
> Key: SPARK-24159
> URL: https://issues.apache.org/jira/browse/SPARK-24159
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
> Fix For: 2.4.0
>
>
> When event-time timeout is enabled, use watermark updates to decide whether 
> to run another batch.
> When processing-time timeout is enabled, use the processing time to decide 
> when to run more batches.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24159) Enable no-data micro batches for streaming mapGroupswithState

2018-05-18 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reassigned SPARK-24159:


Assignee: Tathagata Das

> Enable no-data micro batches for streaming mapGroupswithState
> -
>
> Key: SPARK-24159
> URL: https://issues.apache.org/jira/browse/SPARK-24159
> Project: Spark
>  Issue Type: Sub-task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Major
> Fix For: 2.4.0
>
>
> When event-time timeout is enabled, use watermark updates to decide whether 
> to run another batch.
> When processing-time timeout is enabled, use the processing time to decide 
> when to run more batches.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-20538) Dataset.reduce operator should use withNewExecutionId (as foreach or foreachPartition)

2018-05-18 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu reassigned SPARK-20538:


Assignee: Soham Aurangabadkar

> Dataset.reduce operator should use withNewExecutionId (as foreach or 
> foreachPartition)
> --
>
> Key: SPARK-20538
> URL: https://issues.apache.org/jira/browse/SPARK-20538
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>Assignee: Soham Aurangabadkar
>Priority: Trivial
> Fix For: 2.4.0
>
>
> {{Dataset.reduce}} is not tracked using an {{executionId}}, so it's not 
> displayed in the SQL tab (the way {{foreach}} or {{foreachPartition}} are).
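
A hedged sketch of the kind of change implied, assuming reduce can reuse the 
same execution-id wrapper that foreach and foreachPartition already go through 
(method names inside Dataset may differ; this is not the merged patch):
{code:scala}
// Fragment sketch inside Dataset[T]: wrap the RDD action in a tracked execution
// so the resulting job shows up in the SQL tab, as foreach already does.
def reduce(func: (T, T) => T): T = withNewRDDExecutionId {
  rdd.reduce(func)
}
{code}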



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20538) Dataset.reduce operator should use withNewExecutionId (as foreach or foreachPartition)

2018-05-18 Thread Shixiong Zhu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20538?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shixiong Zhu resolved SPARK-20538.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21316
[https://github.com/apache/spark/pull/21316]

> Dataset.reduce operator should use withNewExecutionId (as foreach or 
> foreachPartition)
> --
>
> Key: SPARK-20538
> URL: https://issues.apache.org/jira/browse/SPARK-20538
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Jacek Laskowski
>Priority: Trivial
> Fix For: 2.4.0
>
>
> {{Dataset.reduce}} is not tracked using an {{executionId}}, so it's not 
> displayed in the SQL tab (the way {{foreach}} or {{foreachPartition}} are).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24303) Update cloudpickle to v0.4.4

2018-05-18 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler resolved SPARK-24303.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

> Update cloudpickle to v0.4.4
> 
>
> Key: SPARK-24303
> URL: https://issues.apache.org/jira/browse/SPARK-24303
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.4.0
>
>
> cloudpickle 0.4.4 is released - 
> https://github.com/cloudpipe/cloudpickle/releases/tag/v0.4.4
> The main difference is that we are now able to pickle the root logger.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24303) Update cloudpickle to v0.4.4

2018-05-18 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480912#comment-16480912
 ] 

Bryan Cutler commented on SPARK-24303:
--

Issue resolved by pull request 21350

https://github.com/apache/spark/pull/21350

> Update cloudpickle to v0.4.4
> 
>
> Key: SPARK-24303
> URL: https://issues.apache.org/jira/browse/SPARK-24303
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.4.0
>
>
> cloudpickle 0.4.4 is released - 
> https://github.com/cloudpipe/cloudpickle/releases/tag/v0.4.4
> The main difference is that we are now able to pickle the root logger.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24303) Update cloudpickle to v0.4.4

2018-05-18 Thread Bryan Cutler (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Cutler reassigned SPARK-24303:


Assignee: Hyukjin Kwon

> Update cloudpickle to v0.4.4
> 
>
> Key: SPARK-24303
> URL: https://issues.apache.org/jira/browse/SPARK-24303
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
>
> cloudpickle 0.4.4 is released - 
> https://github.com/cloudpipe/cloudpickle/releases/tag/v0.4.4
> The main difference is that we are now able to pickle the root logger.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24318) Flaky test: SortShuffleSuite

2018-05-18 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-24318:
-

 Summary: Flaky test: SortShuffleSuite
 Key: SPARK-24318
 URL: https://issues.apache.org/jira/browse/SPARK-24318
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Dongjoon Hyun


- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/346/
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/336/

{code}
Error Message

java.io.IOException: Failed to delete: 
/home/jenkins/workspace/spark-branch-2.3-test-sbt-hadoop-2.7/target/tmp/spark-14031101-7989-4fe2-81eb-a394311ab905

Stacktrace

sbt.ForkMain$ForkError: java.io.IOException: Failed to delete: 
/home/jenkins/workspace/spark-branch-2.3-test-sbt-hadoop-2.7/target/tmp/spark-14031101-7989-4fe2-81eb-a394311ab905
at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1073)
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24211) Flaky test: StreamingOuterJoinSuite

2018-05-18 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-24211:
--
Description: 
*windowed left outer join*
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/330/
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/317/

*windowed right outer join*
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/334/
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/328/
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/371/
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/345/

*left outer join with non-key condition violated*
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/337/

*left outer early state exclusion on left*
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/375

  was:
*windowed left outer join*
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/330/
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/317/

*windowed right outer join*
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/334/
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/328/
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/371/
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/345/

*left outer join with non-key condition violated*
- 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/337/


> Flaky test: StreamingOuterJoinSuite
> ---
>
> Key: SPARK-24211
> URL: https://issues.apache.org/jira/browse/SPARK-24211
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> *windowed left outer join*
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/330/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/317/
> *windowed right outer join*
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/334/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/328/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/371/
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.6/345/
> *left outer join with non-key condition violated*
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-sbt-hadoop-2.7/337/
> *left outer early state exclusion on left*
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.3-test-maven-hadoop-2.7/375



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24317) Float-point numbers are displayed with different precision in ThriftServer2

2018-05-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480836#comment-16480836
 ] 

Apache Spark commented on SPARK-24317:
--

User 'cxzl25' has created a pull request for this issue:
https://github.com/apache/spark/pull/21364

> Float-point numbers are displayed with different precision in ThriftServer2
> ---
>
> Key: SPARK-24317
> URL: https://issues.apache.org/jira/browse/SPARK-24317
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
>Reporter: dzcxzl
>Priority: Minor
>
> When querying floating-point numbers, the values displayed on beeline or JDBC 
> have different precision.
> {code:java}
> SELECT CAST(1.23 AS FLOAT)
> Result:
> 1.2300000190734863
> {code}
> According to these two JIRAs:
> [HIVE-11802|https://issues.apache.org/jira/browse/HIVE-11802]
> [HIVE-11832|https://issues.apache.org/jira/browse/HIVE-11832]
> Make a slight modification to the Spark Hive Thrift server.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24317) Float-point numbers are displayed with different precision in ThriftServer2

2018-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24317:


Assignee: Apache Spark

> Float-point numbers are displayed with different precision in ThriftServer2
> ---
>
> Key: SPARK-24317
> URL: https://issues.apache.org/jira/browse/SPARK-24317
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
>Reporter: dzcxzl
>Assignee: Apache Spark
>Priority: Minor
>
> When querying floating-point numbers, the values displayed on beeline or JDBC 
> have different precision.
> {code:java}
> SELECT CAST(1.23 AS FLOAT)
> Result:
> 1.2300000190734863
> {code}
> According to these two JIRAs:
> [HIVE-11802|https://issues.apache.org/jira/browse/HIVE-11802]
> [HIVE-11832|https://issues.apache.org/jira/browse/HIVE-11832]
> Make a slight modification to the Spark Hive Thrift server.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24317) Float-point numbers are displayed with different precision in ThriftServer2

2018-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24317:


Assignee: (was: Apache Spark)

> Float-point numbers are displayed with different precision in ThriftServer2
> ---
>
> Key: SPARK-24317
> URL: https://issues.apache.org/jira/browse/SPARK-24317
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.1.0, 2.2.0, 2.3.0
>Reporter: dzcxzl
>Priority: Minor
>
> When querying floating-point numbers, the values displayed on beeline or JDBC 
> have different precision.
> {code:java}
> SELECT CAST(1.23 AS FLOAT)
> Result:
> 1.2300000190734863
> {code}
> According to these two JIRAs:
> [HIVE-11802|https://issues.apache.org/jira/browse/HIVE-11802]
> [HIVE-11832|https://issues.apache.org/jira/browse/HIVE-11832]
> Make a slight modification to the Spark Hive Thrift server.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24317) Float-point numbers are displayed with different precision in ThriftServer2

2018-05-18 Thread dzcxzl (JIRA)
dzcxzl created SPARK-24317:
--

 Summary: Float-point numbers are displayed with different 
precision in ThriftServer2
 Key: SPARK-24317
 URL: https://issues.apache.org/jira/browse/SPARK-24317
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0, 2.2.0, 2.1.0, 2.0.0
Reporter: dzcxzl


When querying floating-point numbers, the values displayed on beeline or JDBC 
have different precision.
{code:java}
SELECT CAST(1.23 AS FLOAT)
Result:
1.2300000190734863
{code}
According to these two JIRAs:

[HIVE-11802|https://issues.apache.org/jira/browse/HIVE-11802]
[HIVE-11832|https://issues.apache.org/jira/browse/HIVE-11832]

Make a slight modification to the Spark Hive Thrift server.
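
A minimal plain-Scala illustration of where the extra digits come from: 1.23 
cannot be represented exactly as a binary float, and widening it to a double 
before formatting exposes the extra digits (this models the display path only, 
not the Thrift server code itself):
{code:scala}
val f: Float = 1.23f
println(f.toString)           // 1.23
println(f.toDouble.toString)  // 1.2300000190734863
{code}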



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24316) Spark sql queries stall for column width more 6k for parquet based table

2018-05-18 Thread Bimalendu Choudhary (JIRA)
Bimalendu Choudhary created SPARK-24316:
---

 Summary: Spark sql queries stall for  column width more 6k for 
parquet based table
 Key: SPARK-24316
 URL: https://issues.apache.org/jira/browse/SPARK-24316
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.2.1, 2.2.0
Reporter: Bimalendu Choudhary


When we create a table from a data frame using Spark SQL with around 6k columns 
or more, even a simple query fetching 70k rows takes 20 minutes, while for the 
same table created through Hive with the same data, the same query takes just 
5 minutes.

 

Instrumenting the code, we see that the executors are looping inside the 
function initializeInternal(). The majority of the time is spent here and the 
executor seems to be stalled for a long time.

[VectorizedParquetRecordReader.java|http://opengrok.sjc.cloudera.com/source/xref/spark-2.2.0-cloudera1/sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedParquetRecordReader.java]

{code:java}
private void initializeInternal() {
  ...
  for (int i = 0; i < requestedSchema.getFieldCount(); ++i) {
    ...
  }
}
{code}

When Spark SQL creates the table, it also stores the metadata in TBLPROPERTIES 
in JSON format. We see that if we remove this metadata from the table the 
queries become fast, which is the case when we create the same table through 
Hive. The exact same table takes 5 times more time with the JSON metadata than 
without it.

 

So it looks like, as the number of columns grows beyond 5 to 6k, processing and 
comparing the metadata becomes more and more expensive and performance degrades 
drastically.

To recreate the problem, simply run the following query:
{code:java}
import org.apache.spark.sql.SparkSession

val resp_data = spark.sql("SELECT * FROM duplicatefgv limit 7")
resp_data.write.format("csv").save("/tmp/filename")
{code}

 

The table should be created by Spark SQL from a dataframe so that the JSON 
metadata is stored. For example:
{code:java}
val dff = spark.read.format("csv").load("hdfs:///tmp/test.csv")
dff.createOrReplaceTempView("my_temp_table")
val tmp = spark.sql("Create table tableName stored as parquet as select * from my_temp_table")
{code}

 

 

{code:python}
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
resp_data = spark.sql("select * from test").limit(2000)
print resp_data_fgv_1k.count()
(resp_data_fgv_1k.write.option('header', False).mode('overwrite').csv('/tmp/2.csv'))
{code}

 

 

The performance seems to be even slower in the loop if the schema does not 
match or the fields are empty and the code goes into the if condition where the 
missing column is marked true:
{code:java}
missingColumns[i] = true;
{code}

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17557) SQL query on parquet table java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary

2018-05-18 Thread Padma (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480723#comment-16480723
 ] 

Padma commented on SPARK-17557:
---

I encountered this issue when data resides in Hive in parquet format and I try 
to read it from Spark (2.2.1). I notice that in my case there is a date field 
(containing values such as 2018, 2017) which is written as an integer. But when 
reading it in Spark as -

val df = spark.sql("SELECT * FROM db.table") 

df.show(3, false)
java.lang.UnsupportedOperationException: 
org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary
at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
 
To my surprise, when reading the same data from the s3 location as -
val df = spark.read.parquet("s3://path/file")
df.show(3, false) // this displays the results.
 
- Padma
 

> SQL query on parquet table java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary
> -
>
> Key: SPARK-17557
> URL: https://issues.apache.org/jira/browse/SPARK-17557
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Egor Pahomov
>Priority: Major
>
> Working on 1.6.2, broken on 2.0
> {code}
> select * from logs.a where year=2016 and month=9 and day=14 limit 100
> {code}
> {code}
> java.lang.UnsupportedOperationException: 
> org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary
>   at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
>   at 
> org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:86)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23667) Scala version check will fail due to launcher directory doesn't exist

2018-05-18 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23667.
---
Resolution: Not A Problem

> Scala version check will fail due to launcher directory doesn't exist
> -
>
> Key: SPARK-23667
> URL: https://issues.apache.org/jira/browse/SPARK-23667
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Chenzhao Guo
>Priority: Major
>
> In some cases, when an outer project uses pre-built Spark as a dependency, 
> {{getScalaVersion}} will fail because the {{launcher}} directory doesn't 
> exist. This PR also checks the {{jars}} directory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19228) inferSchema function processed csv date column as string and "dateFormat" DataSource option is ignored

2018-05-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480582#comment-16480582
 ] 

Apache Spark commented on SPARK-19228:
--

User 'sergey-rubtsov' has created a pull request for this issue:
https://github.com/apache/spark/pull/21363

> inferSchema function processed csv date column as string and "dateFormat" 
> DataSource option is ignored
> --
>
> Key: SPARK-19228
> URL: https://issues.apache.org/jira/browse/SPARK-19228
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, SQL
>Affects Versions: 2.1.0
>Reporter: Sergey Rubtsov
>Priority: Major
>  Labels: easyfix
>   Original Estimate: 6h
>  Remaining Estimate: 6h
>
> The current FastDateFormat can't properly parse dates and timestamps and does 
> not meet ISO 8601.
> That is why there is no support for inferring DateType and a custom 
> "dateFormat" option for CSV parsing.
> For example, I need to process user.csv like this:
> {code:java}
> id,project,started,ended
> sergey.rubtsov,project0,12/12/2012,10/10/2015
> {code}
> When I add date format options:
> {code:java}
> Dataset users = spark.read().format("csv").option("mode", 
> "PERMISSIVE").option("header", "true")
> .option("inferSchema", 
> "true").option("dateFormat", 
> "dd/MM/").load("src/main/resources/user.csv");
>   users.printSchema();
> {code}
> the expected schema should be
> {code:java}
> root
>  |-- id: string (nullable = true)
>  |-- project: string (nullable = true)
>  |-- started: date (nullable = true)
>  |-- ended: date (nullable = true)
> {code}
> but the actual result is:
> {code:java}
> root
>  |-- id: string (nullable = true)
>  |-- project: string (nullable = true)
>  |-- started: string (nullable = true)
>  |-- ended: string (nullable = true)
> {code}
> This means that dates are processed as strings and the "dateFormat" option is 
> ignored. If I add the option
> {code:java}
> .option("timestampFormat", "dd/MM/")
> {code}
> result is:
> {code:java}
> root
>  |-- id: string (nullable = true)
>  |-- project: string (nullable = true)
>  |-- started: timestamp (nullable = true)
>  |-- ended: timestamp (nullable = true)
> {code}
> I think the issue is somewhere in the object CSVInferSchema, function 
> inferField (lines 80-97): a "tryParseDate" method needs to be added 
> before/after "tryParseTimestamp", or the date/timestamp processing logic 
> needs to be changed.
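
A hedged workaround sketch (Scala, assuming Spark 2.2+ where to_date accepts a 
format string; the path and column names follow the example above): read the 
columns as strings and convert them explicitly after loading.

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.to_date

val spark = SparkSession.builder().appName("csv-date-workaround").getOrCreate()

// "started"/"ended" come back as strings because DateType is not inferred.
val raw = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("src/main/resources/user.csv")

// Convert the string columns explicitly with the expected pattern.
val users = raw
  .withColumn("started", to_date(raw("started"), "dd/MM/yyyy"))
  .withColumn("ended", to_date(raw("ended"), "dd/MM/yyyy"))

users.printSchema()   // started/ended are now of date type
{code}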



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24197) add array_sort function

2018-05-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24197?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480514#comment-16480514
 ] 

Apache Spark commented on SPARK-24197:
--

User 'mn-mikke' has created a pull request for this issue:
https://github.com/apache/spark/pull/21362

> add array_sort function
> ---
>
> Key: SPARK-24197
> URL: https://issues.apache.org/jira/browse/SPARK-24197
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Marek Novotny
>Assignee: Marek Novotny
>Priority: Major
> Fix For: 2.4.0
>
>
> Add a SparkR equivalent for the function added in 
> [SPARK-23921|https://issues.apache.org/jira/browse/SPARK-23921].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19228) inferSchema function processed csv date column as string and "dateFormat" DataSource option is ignored

2018-05-18 Thread Sergey Rubtsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sergey Rubtsov updated SPARK-19228:
---
Description: 
The current FastDateFormat cannot properly parse dates and timestamps and does 
not conform to ISO 8601.

That is why inferring DateType and the custom "dateFormat" option are 
currently not supported for CSV parsing.
For example, I need to process user.csv like this:
{code:java}
id,project,started,ended
sergey.rubtsov,project0,12/12/2012,10/10/2015
{code}
When I add date format options:
{code:java}
Dataset<Row> users = spark.read().format("csv")
    .option("mode", "PERMISSIVE")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("dateFormat", "dd/MM/yyyy")
    .load("src/main/resources/user.csv");
users.printSchema();
{code}
the expected schema should be
{code:java}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: date (nullable = true)
 |-- ended: date (nullable = true)
{code}
but the actual result is:
{code:java}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: string (nullable = true)
 |-- ended: string (nullable = true)
{code}
This means that dates are processed as strings and the "dateFormat" option is 
ignored. If I add the option
{code:java}
.option("timestampFormat", "dd/MM/")
{code}
result is:
{code:java}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: timestamp (nullable = true)
 |-- ended: timestamp (nullable = true)
{code}
I think the issue is somewhere in the object CSVInferSchema, function 
inferField (lines 80-97): a "tryParseDate" method needs to be added 
before/after "tryParseTimestamp", or the date/timestamp processing logic 
needs to be changed.

  was:
I need to process user.csv like this:
{code}
id,project,started,ended
sergey.rubtsov,project0,12/12/2012,10/10/2015
{code}
When I add date format options:
{code}
Dataset<Row> users = spark.read().format("csv")
    .option("mode", "PERMISSIVE")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("dateFormat", "dd/MM/yyyy")
    .load("src/main/resources/user.csv");
users.printSchema();
{code}
the expected schema should be 
{code}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: date (nullable = true)
 |-- ended: date (nullable = true)
{code}
but the actual result is: 
{code}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: string (nullable = true)
 |-- ended: string (nullable = true)
{code}
This means that dates are processed as strings and the "dateFormat" option is 
ignored. If I add the option 
{code}
.option("timestampFormat", "dd/MM/")
{code}
result is: 
{code}
root
 |-- id: string (nullable = true)
 |-- project: string (nullable = true)
 |-- started: timestamp (nullable = true)
 |-- ended: timestamp (nullable = true)
{code}

I think the issue is somewhere in the object CSVInferSchema, function 
inferField (lines 80-97): a "tryParseDate" method needs to be added 
before/after "tryParseTimestamp", or the date/timestamp processing logic 
needs to be changed.


> inferSchema function processed csv date column as string and "dateFormat" 
> DataSource option is ignored
> --
>
> Key: SPARK-19228
> URL: https://issues.apache.org/jira/browse/SPARK-19228
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, SQL
>Affects Versions: 2.1.0
>Reporter: Sergey Rubtsov
>Priority: Major
>  Labels: easyfix
>   Original Estimate: 6h
>  Remaining Estimate: 6h
>
> The current FastDateFormat cannot properly parse dates and timestamps and does 
> not conform to ISO 8601.
> That is why inferring DateType and the custom "dateFormat" option are 
> currently not supported for CSV parsing.
> For example, I need to process user.csv like this:
> {code:java}
> id,project,started,ended
> sergey.rubtsov,project0,12/12/2012,10/10/2015
> {code}
> When I add date format options:
> {code:java}
> Dataset<Row> users = spark.read().format("csv")
>     .option("mode", "PERMISSIVE")
>     .option("header", "true")
>     .option("inferSchema", "true")
>     .option("dateFormat", "dd/MM/yyyy")
>     .load("src/main/resources/user.csv");
> users.printSchema();
> {code}
> the expected schema should be
> {code:java}
> root
>  |-- id: string (nullable = true)
>  |-- project: string (nullable = true)
>  |-- started: date (nullable = true)
>  |-- ended: date (nullable = true)
> {code}
> but the actual result is:
> {code:java}
> root
>  |-- id: string (nullable = true)
>  |-- project: string (nullable = true)
>  |-- started: string (nullable = true)
>  |-- ended: string (nullable = true)
> {code}
> This 

[jira] [Updated] (SPARK-24313) Collection functions interpreted execution doesn't work with complex types

2018-05-18 Thread Marco Gaido (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-24313:

Description: Several functions working on collections return incorrect 
results for complex data types in interpreted mode; in particular, complex 
data types such as BINARY and ARRAY are affected. The affected functions are: 
{{array_contains}}, {{array_position}}, {{element_at}} and {{GetMapValue}}.  
(was: The functions {{array_contains}} and {{array_position}} return incorrect 
results for complex data types in interpreted mode. In particular, for arrays, 
binaries, etc. they always return false.)
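
A minimal reproduction sketch (an assumption about how to exercise the 
interpreted path, not taken from the ticket): calling eval() on the Catalyst 
expression bypasses codegen, so the interpreted implementation runs directly.

{code}
import org.apache.spark.sql.catalyst.expressions.{ArrayContains, Literal}
import org.apache.spark.sql.types.{ArrayType, BinaryType}

// Build an ARRAY<BINARY> literal and an element that is present in it.
val arr  = Literal.create(Seq(Array[Byte](1, 2, 3)), ArrayType(BinaryType))
val elem = Literal.create(Array[Byte](1, 2, 3), BinaryType)

// eval() runs the interpreted (non-codegen) implementation; per this ticket
// it reports false/null here even though the element is present.
println(ArrayContains(arr, elem).eval())
{code}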

> Collection functions interpreted execution doesn't work with complex types
> --
>
> Key: SPARK-24313
> URL: https://issues.apache.org/jira/browse/SPARK-24313
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Priority: Minor
>
> Several functions working on collections return incorrect results for complex 
> data types in interpreted mode; in particular, complex data types such as 
> BINARY and ARRAY are affected. The affected functions are: {{array_contains}}, 
> {{array_position}}, {{element_at}} and {{GetMapValue}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24313) Collection functions interpreted execution doesn't work with complex types

2018-05-18 Thread Marco Gaido (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-24313:

Summary: Collection functions interpreted execution doesn't work with 
complex types  (was: array_contains/array_position interpreted execution 
doesn't work with complex types)

> Collection functions interpreted execution doesn't work with complex types
> --
>
> Key: SPARK-24313
> URL: https://issues.apache.org/jira/browse/SPARK-24313
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Priority: Minor
>
> The functions {{array_contains}} and {{array_position}} return incorrect 
> results for complex data types in interpreted mode. In particular, for arrays, 
> binaries, etc. they always return false.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24314) interpreted element_at or GetMapValue does not work for complex types

2018-05-18 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480482#comment-16480482
 ] 

Kazuaki Ishizaki commented on SPARK-24314:
--

I am working on this.

> interpreted element_at or GetMapValue does not work for complex types
> -
>
> Key: SPARK-24314
> URL: https://issues.apache.org/jira/browse/SPARK-24314
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> Same root cause as in SPARK-24313.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-24314) interpreted element_at or GetMapValue does not work for complex types

2018-05-18 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki reopened SPARK-24314:
--

> interpreted element_at or GetMapValue does not work for complex types
> -
>
> Key: SPARK-24314
> URL: https://issues.apache.org/jira/browse/SPARK-24314
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> Same root cause as in SPARK-24313.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24314) interpreted element_at or GetMapValue does not work for complex types

2018-05-18 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki updated SPARK-24314:
-
Summary: interpreted element_at or GetMapValue does not work for complex 
types  (was: interpreted array_position does not work for complex types)

> interpreted element_at or GetMapValue does not work for complex types
> -
>
> Key: SPARK-24314
> URL: https://issues.apache.org/jira/browse/SPARK-24314
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> Same root cause as in SPARK-24313.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24315) Multiple streaming jobs detected error causing job failure

2018-05-18 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-24315:
---

 Summary: Multiple streaming jobs detected error causing job failure
 Key: SPARK-24315
 URL: https://issues.apache.org/jira/browse/SPARK-24315
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.3.0
Reporter: Marco Gaido


We are running a simple structured streaming job. It reads data from Kafka and 
writes it to HDFS. Unfortunately, at startup the application fails with the 
following error; after some restarts the application finally starts 
successfully.

{code}
org.apache.spark.sql.streaming.StreamingQueryException: assertion failed: 
Concurrent update to the log. Multiple streaming jobs detected for 1
=== Streaming Query ===

at 
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:295)
at 
org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
Caused by: java.lang.AssertionError: assertion failed: Concurrent update to the 
log. Multiple streaming jobs detected for 1
at scala.Predef$.assert(Predef.scala:170)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1.apply$mcV$sp(MicroBatchExecution.scala:339)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1.apply(MicroBatchExecution.scala:338)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch$1.apply(MicroBatchExecution.scala:338)
at 
org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
at 
org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$constructNextBatch(MicroBatchExecution.scala:338)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:128)
at
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:121)
at 
org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
at 
org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:121)
at 
org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56)
at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:117)
at 
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
... 1 more
{code}

We have not set any value for `spark.streaming.concurrentJobs`. Our code looks 
like:

{code}
  // read from kafka
  .withWatermark("timestamp", "30 minutes")
  .groupBy(window($"timestamp", "1 hour", "30 minutes"), ...).count()
  // simple select of some fields with casts
  .coalesce(1)
  .writeStream
  .trigger(Trigger.ProcessingTime("2 minutes"))
  .option("checkpointLocation", checkpointDir)
  // write to HDFS
  .start()
  .awaitTermination()
{code}

This may also be related to the presence of some data in the Kafka queue to 
process, so the first batch may take longer than usual (which I think is 
quite common).
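
For reference, a self-contained sketch of the kind of job described above 
(broker address, topic, and output paths are placeholders, not the actual 
application code):

{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("kafka-to-hdfs").getOrCreate()
import spark.implicits._

val query = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")   // placeholder broker
  .option("subscribe", "events")                      // placeholder topic
  .load()
  .select($"timestamp", $"value".cast("string").as("value"))
  .withWatermark("timestamp", "30 minutes")
  .groupBy(window($"timestamp", "1 hour", "30 minutes"), $"value")
  .count()
  .coalesce(1)
  .writeStream
  .format("parquet")                                  // placeholder sink
  .option("path", "hdfs:///data/out")
  .option("checkpointLocation", "hdfs:///data/checkpoint")
  .trigger(Trigger.ProcessingTime("2 minutes"))
  .start()

query.awaitTermination()
{code}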




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24313) array_contains/array_position interpreted execution doesn't work with complex types

2018-05-18 Thread Marco Gaido (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-24313:

Description: The functions {{array_contains}} and {{array_position}} return 
incorrect results for complex data types in interpreted mode. In particular, 
for arrays, binaries, etc. they always return false.  (was: The function 
{{array_contains}} returns incorrect results for complex data types in 
interpreted mode. In particular, for arrays, binaries, etc. it always returns 
false.)

> array_contains/array_position interpreted execution doesn't work with complex 
> types
> ---
>
> Key: SPARK-24313
> URL: https://issues.apache.org/jira/browse/SPARK-24313
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Priority: Minor
>
> The functions {{array_contains}} and {{array_position}} return incorrect 
> results for complex data types in interpreted mode. In particular, for arrays, 
> binaries, etc. they always return false.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24313) array_contains/array_position interpreted execution doesn't work with complex types

2018-05-18 Thread Marco Gaido (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marco Gaido updated SPARK-24313:

Summary: array_contains/array_position interpreted execution doesn't work 
with complex types  (was: array_contains interpreted execution doesn't work 
with complex types)

> array_contains/array_position interpreted execution doesn't work with complex 
> types
> ---
>
> Key: SPARK-24313
> URL: https://issues.apache.org/jira/browse/SPARK-24313
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Priority: Minor
>
> The function {{array_contains}} returns incorrect results for complex data 
> types in interpreted mode. In particular, for arrays, binaries, etc. it 
> always returns false.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24314) interpreted array_position does not work for complex types

2018-05-18 Thread Kazuaki Ishizaki (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kazuaki Ishizaki resolved SPARK-24314.
--
Resolution: Duplicate

> interpreted array_position does not work for complex types
> --
>
> Key: SPARK-24314
> URL: https://issues.apache.org/jira/browse/SPARK-24314
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Kazuaki Ishizaki
>Priority: Major
>
> Same root cause as in SPARK-24313.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23614) Union produces incorrect results when caching is used

2018-05-18 Thread Morten Hornbech (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480454#comment-16480454
 ] 

Morten Hornbech commented on SPARK-23614:
-

In the example provided, caching is required to produce the bug, and I'm pretty 
sure aggregation is required as well.

> Union produces incorrect results when caching is used
> -
>
> Key: SPARK-23614
> URL: https://issues.apache.org/jira/browse/SPARK-23614
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Morten Hornbech
>Assignee: Liang-Chi Hsieh
>Priority: Major
>  Labels: correctness
> Fix For: 2.3.1, 2.4.0
>
>
> We just upgraded from 2.2 to 2.3 and our test suite caught this error:
> {code:java}
> case class TestData(x: Int, y: Int, z: Int)
> val frame = session.createDataset(Seq(TestData(1, 2, 3), TestData(4, 5, 
> 6))).cache()
> val group1 = frame.groupBy("x").agg(min(col("y")) as "value")
> val group2 = frame.groupBy("x").agg(min(col("z")) as "value")
> group1.union(group2).show()
> // +---+-----+
> // |  x|value|
> // +---+-----+
> // |  1|    2|
> // |  4|    5|
> // |  1|    2|
> // |  4|    5|
> // +---+-----+
> group2.union(group1).show()
> // +---+-----+
> // |  x|value|
> // +---+-----+
> // |  1|    3|
> // |  4|    6|
> // |  1|    3|
> // |  4|    6|
> // +---+-----+
> {code}
> The error disappears if the first data frame is not cached or if the two 
> group-bys use separate copies. I'm not sure exactly what happens inside 
> Spark, but errors that produce incorrect results rather than exceptions 
> always concern me.
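
A hedged workaround sketch based on the reporter's own observation: giving each 
aggregation its own cached copy (or skipping the cache) avoids the wrong 
results. The {{session}} SparkSession is assumed from the example above.

{code}
import org.apache.spark.sql.functions.{col, min}
import session.implicits._

case class TestData(x: Int, y: Int, z: Int)

val frame1 = session.createDataset(Seq(TestData(1, 2, 3), TestData(4, 5, 6))).cache()
val frame2 = session.createDataset(Seq(TestData(1, 2, 3), TestData(4, 5, 6))).cache()

val group1 = frame1.groupBy("x").agg(min(col("y")) as "value")
val group2 = frame2.groupBy("x").agg(min(col("z")) as "value")

group1.union(group2).show()   // min(y) rows (2, 5) followed by min(z) rows (3, 6)
{code}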



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24314) interpreted array_position does not work for complex types

2018-05-18 Thread Kazuaki Ishizaki (JIRA)
Kazuaki Ishizaki created SPARK-24314:


 Summary: interpreted array_position does not work for complex types
 Key: SPARK-24314
 URL: https://issues.apache.org/jira/browse/SPARK-24314
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Kazuaki Ishizaki


Same root cause as in SPARK-24313.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23614) Union produces incorrect results when caching is used

2018-05-18 Thread Yu-Jhe Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480438#comment-16480438
 ] 

Yu-Jhe Li commented on SPARK-23614:
---

Does this bug happen only when 1) the DataFrame is cached and 2) aggregation is used?

> Union produces incorrect results when caching is used
> -
>
> Key: SPARK-23614
> URL: https://issues.apache.org/jira/browse/SPARK-23614
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Morten Hornbech
>Assignee: Liang-Chi Hsieh
>Priority: Major
>  Labels: correctness
> Fix For: 2.3.1, 2.4.0
>
>
> We just upgraded from 2.2 to 2.3 and our test suite caught this error:
> {code:java}
> case class TestData(x: Int, y: Int, z: Int)
> val frame = session.createDataset(Seq(TestData(1, 2, 3), TestData(4, 5, 
> 6))).cache()
> val group1 = frame.groupBy("x").agg(min(col("y")) as "value")
> val group2 = frame.groupBy("x").agg(min(col("z")) as "value")
> group1.union(group2).show()
> // +---+-----+
> // |  x|value|
> // +---+-----+
> // |  1|    2|
> // |  4|    5|
> // |  1|    2|
> // |  4|    5|
> // +---+-----+
> group2.union(group1).show()
> // +---+-----+
> // |  x|value|
> // +---+-----+
> // |  1|    3|
> // |  4|    6|
> // |  1|    3|
> // |  4|    6|
> // +---+-----+
> {code}
> The error disappears if the first data frame is not cached or if the two 
> group-bys use separate copies. I'm not sure exactly what happens inside 
> Spark, but errors that produce incorrect results rather than exceptions 
> always concern me.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24313) array_contains interpreted execution doesn't work with complex types

2018-05-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480392#comment-16480392
 ] 

Apache Spark commented on SPARK-24313:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/21361

> array_contains interpreted execution doesn't work with complex types
> 
>
> Key: SPARK-24313
> URL: https://issues.apache.org/jira/browse/SPARK-24313
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Priority: Minor
>
> The function {{array_contains}} returns incorrect results for complex data 
> types in interpreted mode. In particular, for arrays, binaries, etc. it 
> always returns false.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24313) array_contains interpreted execution doesn't work with complex types

2018-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24313:


Assignee: (was: Apache Spark)

> array_contains interpreted execution doesn't work with complex types
> 
>
> Key: SPARK-24313
> URL: https://issues.apache.org/jira/browse/SPARK-24313
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Priority: Minor
>
> The function {{array_contains}} returns incorrect results for complex data 
> types in interpreted mode. In particular, for arrays, binaries, etc. it 
> always returns false.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24313) array_contains interpreted execution doesn't work with complex types

2018-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24313:


Assignee: Apache Spark

> array_contains interpreted execution doesn't work with complex types
> 
>
> Key: SPARK-24313
> URL: https://issues.apache.org/jira/browse/SPARK-24313
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Marco Gaido
>Assignee: Apache Spark
>Priority: Minor
>
> The function {{array_contains}} returns incorrect results for complex data 
> types in interpreted mode. In particular, for arrays, binaries, etc. it 
> always returns false.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24313) array_contains interpreted execution doesn't work with complex types

2018-05-18 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-24313:
---

 Summary: array_contains interpreted execution doesn't work with 
complex types
 Key: SPARK-24313
 URL: https://issues.apache.org/jira/browse/SPARK-24313
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Marco Gaido


The function {{array_contains}} returns incorrect results for complex data types 
in interpreted mode. In particular, for arrays, binaries, etc. it always returns 
false.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24302) when using spark persist(),"KryoException:IndexOutOfBoundsException" happens

2018-05-18 Thread yijukang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

yijukang updated SPARK-24302:
-
Labels: apache-spark  (was: )

> when using spark persist(),"KryoException:IndexOutOfBoundsException" happens
> 
>
> Key: SPARK-24302
> URL: https://issues.apache.org/jira/browse/SPARK-24302
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 1.6.0
>Reporter: yijukang
>Priority: Major
>  Labels: apache-spark
>
> My operation uses Spark to insert RDD data into HBase like this:
> --
> localData.persist()
>  localData.saveAsNewAPIHadoopDataset(jobConf.getConfiguration)
> --
> This way throws an exception:
>    com.esotericsoftware.kryo.KryoException: 
> java.lang.IndexOutOfBoundsException:index:99, Size:6
> Serialization trace:
>     familyMap (org.apache.hadoop.hbase.client.Put)
>   at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
>    at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
>    at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:42)
>    at com.twitter.chill.Tuple2Serializer.read(TupleSerializers.scala:33)
>    at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
>  
> When I run it without persist():
> -
>  localData.saveAsNewAPIHadoopDataset(jobConf.getConfiguration)
> --
> it works well. What does the persist() method do here?
>  
>  
>  
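
A hedged workaround sketch, not the reporter's code: persist the plain source 
records and build the mutable HBase Put objects only at write time, so cached 
partitions never go through Kryo with a Put inside. localRows, rowKeyOf, 
colValueOf, and jobConf are assumed names for illustration.

{code}
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.util.Bytes

val cachedRows = localRows.persist()            // simple, serializable records

val puts = cachedRows.map { row =>
  val put = new Put(Bytes.toBytes(rowKeyOf(row)))
  put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(colValueOf(row)))
  (new ImmutableBytesWritable(put.getRow), put)  // (key, Put) pairs for TableOutputFormat
}

puts.saveAsNewAPIHadoopDataset(jobConf.getConfiguration)
{code}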



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24277) Code clean up in SQL module: HadoopMapReduceCommitProtocol/FileFormatWriter

2018-05-18 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-24277.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21329
[https://github.com/apache/spark/pull/21329]

> Code clean up in SQL module: HadoopMapReduceCommitProtocol/FileFormatWriter
> ---
>
> Key: SPARK-24277
> URL: https://issues.apache.org/jira/browse/SPARK-24277
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Trivial
> Fix For: 2.4.0
>
>
> In HadoopMapReduceCommitProtocol and FileFormatWriter, there are unnecessary 
> settings in the Hadoop configuration.
> This change also cleans up some code in the SQL module.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24277) Code clean up in SQL module: HadoopMapReduceCommitProtocol/FileFormatWriter

2018-05-18 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-24277:
---

Assignee: Gengliang Wang

> Code clean up in SQL module: HadoopMapReduceCommitProtocol/FileFormatWriter
> ---
>
> Key: SPARK-24277
> URL: https://issues.apache.org/jira/browse/SPARK-24277
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Gengliang Wang
>Assignee: Gengliang Wang
>Priority: Trivial
> Fix For: 2.4.0
>
>
> In HadoopMapReduceCommitProtocol and FileFormatWriter, there are unnecessary 
> settings in the Hadoop configuration.
> This change also cleans up some code in the SQL module.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24288) Enable preventing predicate pushdown

2018-05-18 Thread Maryann Xue (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480237#comment-16480237
 ] 

Maryann Xue commented on SPARK-24288:
-

Thank you for pointing this out, [~cloud_fan]! I made {{OptimizerBarrier}} 
inherit from {{UnaryNode}} as a proof of concept that can quickly pass the 
basic tests, but it was not the optimal solution. I've just created a PR, so 
you guys can all take a look.

> Enable preventing predicate pushdown
> 
>
> Key: SPARK-24288
> URL: https://issues.apache.org/jira/browse/SPARK-24288
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Tomasz Gawęda
>Priority: Major
> Attachments: SPARK-24288.simple.patch
>
>
> Issue discussed on Mailing List: 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/Preventing-predicate-pushdown-td23976.html]
> While working with the JDBC data source I saw that many "or" clauses with 
> non-equality operators cause huge performance degradation of the SQL query 
> sent to the database (DB2). For example: 
> val df = spark.read.format("jdbc").(other options to parallelize 
> load).load() 
> df.where(s"(date1 > $param1 and (date1 < $param2 or date1 is null) or x 
>  > 100)").show() // in the real application the predicates appear many 
> lines below, with many ANDs and ORs 
> If I use cache() before the where, this "where" clause is not pushed down. 
> However, in a production system caching many sources is a waste of memory 
> (especially if the pipeline is long and I must cache many times). There are 
> also a few more workarounds, but it would be great if Spark supported 
> user-controlled prevention of predicate pushdown.
>  
> For example: df.withAnalysisBarrier().where(...) ?
>  
> Note that this should not be a global configuration option. If I read two 
> DataFrames, df1 and df2, I would like to specify that df1 should have only 
> some of its predicates pushed down, while df2 should have all predicates 
> pushed down, even if the target query joins df1 and df2. As far as I 
> understand the Spark optimizer, if we use functions like `withAnalysisBarrier` 
> and put AnalysisBarrier explicitly into the logical plan, then predicates 
> won't be pushed down for that particular DataFrame, while pushdown will still 
> be possible for the other one.
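
A hedged workaround sketch (not part of the proposed API): re-creating the 
DataFrame from its RDD puts a barrier into the logical plan, so predicates 
applied afterwards are not pushed down into the JDBC source. The connection 
options, param1, and param2 are assumed placeholders following the example 
above.

{code}
val raw = spark.read.format("jdbc")
  .option("url", jdbcUrl)          // placeholder connection options
  .option("dbtable", "mytable")
  .load()

// Rebuilding the DataFrame from its RDD stops pushdown past this point,
// at the cost of an extra (de)serialization step.
val barrier = spark.createDataFrame(raw.rdd, raw.schema)

barrier
  .where(s"(date1 > $param1 and (date1 < $param2 or date1 is null) or x > 100)")
  .show()
{code}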



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24288) Enable preventing predicate pushdown

2018-05-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480232#comment-16480232
 ] 

Apache Spark commented on SPARK-24288:
--

User 'maryannxue' has created a pull request for this issue:
https://github.com/apache/spark/pull/21360

> Enable preventing predicate pushdown
> 
>
> Key: SPARK-24288
> URL: https://issues.apache.org/jira/browse/SPARK-24288
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Tomasz Gawęda
>Priority: Major
> Attachments: SPARK-24288.simple.patch
>
>
> Issue discussed on Mailing List: 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/Preventing-predicate-pushdown-td23976.html]
> While working with the JDBC data source I saw that many "or" clauses with 
> non-equality operators cause huge performance degradation of the SQL query 
> sent to the database (DB2). For example: 
> val df = spark.read.format("jdbc").(other options to parallelize 
> load).load() 
> df.where(s"(date1 > $param1 and (date1 < $param2 or date1 is null) or x 
>  > 100)").show() // in the real application the predicates appear many 
> lines below, with many ANDs and ORs 
> If I use cache() before the where, this "where" clause is not pushed down. 
> However, in a production system caching many sources is a waste of memory 
> (especially if the pipeline is long and I must cache many times). There are 
> also a few more workarounds, but it would be great if Spark supported 
> user-controlled prevention of predicate pushdown.
>  
> For example: df.withAnalysisBarrier().where(...) ?
>  
> Note that this should not be a global configuration option. If I read two 
> DataFrames, df1 and df2, I would like to specify that df1 should have only 
> some of its predicates pushed down, while df2 should have all predicates 
> pushed down, even if the target query joins df1 and df2. As far as I 
> understand the Spark optimizer, if we use functions like `withAnalysisBarrier` 
> and put AnalysisBarrier explicitly into the logical plan, then predicates 
> won't be pushed down for that particular DataFrame, while pushdown will still 
> be possible for the other one.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24288) Enable preventing predicate pushdown

2018-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24288:


Assignee: (was: Apache Spark)

> Enable preventing predicate pushdown
> 
>
> Key: SPARK-24288
> URL: https://issues.apache.org/jira/browse/SPARK-24288
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Tomasz Gawęda
>Priority: Major
> Attachments: SPARK-24288.simple.patch
>
>
> Issue discussed on Mailing List: 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/Preventing-predicate-pushdown-td23976.html]
> While working with the JDBC data source I saw that many "or" clauses with 
> non-equality operators cause huge performance degradation of the SQL query 
> sent to the database (DB2). For example: 
> val df = spark.read.format("jdbc").(other options to parallelize 
> load).load() 
> df.where(s"(date1 > $param1 and (date1 < $param2 or date1 is null) or x 
>  > 100)").show() // in the real application the predicates appear many 
> lines below, with many ANDs and ORs 
> If I use cache() before the where, this "where" clause is not pushed down. 
> However, in a production system caching many sources is a waste of memory 
> (especially if the pipeline is long and I must cache many times). There are 
> also a few more workarounds, but it would be great if Spark supported 
> user-controlled prevention of predicate pushdown.
>  
> For example: df.withAnalysisBarrier().where(...) ?
>  
> Note that this should not be a global configuration option. If I read two 
> DataFrames, df1 and df2, I would like to specify that df1 should have only 
> some of its predicates pushed down, while df2 should have all predicates 
> pushed down, even if the target query joins df1 and df2. As far as I 
> understand the Spark optimizer, if we use functions like `withAnalysisBarrier` 
> and put AnalysisBarrier explicitly into the logical plan, then predicates 
> won't be pushed down for that particular DataFrame, while pushdown will still 
> be possible for the other one.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24288) Enable preventing predicate pushdown

2018-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24288:


Assignee: Apache Spark

> Enable preventing predicate pushdown
> 
>
> Key: SPARK-24288
> URL: https://issues.apache.org/jira/browse/SPARK-24288
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Tomasz Gawęda
>Assignee: Apache Spark
>Priority: Major
> Attachments: SPARK-24288.simple.patch
>
>
> Issue discussed on Mailing List: 
> [http://apache-spark-developers-list.1001551.n3.nabble.com/Preventing-predicate-pushdown-td23976.html]
> While working with the JDBC data source I saw that many "or" clauses with 
> non-equality operators cause huge performance degradation of the SQL query 
> sent to the database (DB2). For example: 
> val df = spark.read.format("jdbc").(other options to parallelize 
> load).load() 
> df.where(s"(date1 > $param1 and (date1 < $param2 or date1 is null) or x 
>  > 100)").show() // in the real application the predicates appear many 
> lines below, with many ANDs and ORs 
> If I use cache() before the where, this "where" clause is not pushed down. 
> However, in a production system caching many sources is a waste of memory 
> (especially if the pipeline is long and I must cache many times). There are 
> also a few more workarounds, but it would be great if Spark supported 
> user-controlled prevention of predicate pushdown.
>  
> For example: df.withAnalysisBarrier().where(...) ?
>  
> Note that this should not be a global configuration option. If I read two 
> DataFrames, df1 and df2, I would like to specify that df1 should have only 
> some of its predicates pushed down, while df2 should have all predicates 
> pushed down, even if the target query joins df1 and df2. As far as I 
> understand the Spark optimizer, if we use functions like `withAnalysisBarrier` 
> and put AnalysisBarrier explicitly into the logical plan, then predicates 
> won't be pushed down for that particular DataFrame, while pushdown will still 
> be possible for the other one.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24312) Upgrade to 2.3.3 for Hive Metastore Client 2.3

2018-05-18 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16480218#comment-16480218
 ] 

Apache Spark commented on SPARK-24312:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/21359

> Upgrade to 2.3.3 for Hive Metastore Client 2.3
> --
>
> Key: SPARK-24312
> URL: https://issues.apache.org/jira/browse/SPARK-24312
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Hive 2.3.3 is [released on April 
> 3rd|https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12342162=Text=12310843].
>  This issue aims to upgrade Hive Metastore Client 2.3 from 2.3.2 to 2.3.3.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24312) Upgrade to 2.3.3 for Hive Metastore Client 2.3

2018-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24312:


Assignee: (was: Apache Spark)

> Upgrade to 2.3.3 for Hive Metastore Client 2.3
> --
>
> Key: SPARK-24312
> URL: https://issues.apache.org/jira/browse/SPARK-24312
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Hive 2.3.3 is [released on April 
> 3rd|https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12342162=Text=12310843].
>  This issue aims to upgrade Hive Metastore Client 2.3 from 2.3.2 to 2.3.3.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24312) Upgrade to 2.3.3 for Hive Metastore Client 2.3

2018-05-18 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24312:


Assignee: Apache Spark

> Upgrade to 2.3.3 for Hive Metastore Client 2.3
> --
>
> Key: SPARK-24312
> URL: https://issues.apache.org/jira/browse/SPARK-24312
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>
> Hive 2.3.3 is [released on April 
> 3rd|https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12342162=Text=12310843].
>  This issue aims to upgrade Hive Metastore Client 2.3 from 2.3.2 to 2.3.3.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24312) Upgrade to 2.3.3 for Hive Metastore Client 2.3

2018-05-18 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-24312:
-

 Summary: Upgrade to 2.3.3 for Hive Metastore Client 2.3
 Key: SPARK-24312
 URL: https://issues.apache.org/jira/browse/SPARK-24312
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.0
Reporter: Dongjoon Hyun


Hive 2.3.3 is [released on April 
3rd|https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12342162=Text=12310843].
 This issue aims to upgrade Hive Metastore Client 2.3 from 2.3.2 to 2.3.3.
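
A hedged usage sketch of how an application would select the Hive 2.3.x 
metastore client once this upgrade lands; the jar resolution mode and version 
value are illustrative, not a tested configuration.

{code}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-metastore-2.3.3")
  .config("spark.sql.hive.metastore.version", "2.3.3")  // select the 2.3 client
  .config("spark.sql.hive.metastore.jars", "maven")     // resolve matching Hive jars
  .enableHiveSupport()
  .getOrCreate()

spark.sql("show databases").show()
{code}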





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org