[jira] [Commented] (SPARK-21902) BlockManager.doPut will hide the actual exception when an exception is thrown in the finally block

2017-09-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16159623#comment-16159623
 ] 

Apache Spark commented on SPARK-21902:
--

User 'caneGuy' has created a pull request for this issue:
https://github.com/apache/spark/pull/19171

> BlockManager.doPut will hide the actual exception when an exception is thrown 
> in the finally block
> --
>
> Key: SPARK-21902
> URL: https://issues.apache.org/jira/browse/SPARK-21902
> Project: Spark
>  Issue Type: Wish
>  Components: Block Manager
>Affects Versions: 2.1.0
>Reporter: zhoukang
>
> As the log below shows, the actual exception will be hidden when 
> removeBlockInternal throws an exception.
> {code:java}
> 2017-08-31,10:26:57,733 WARN org.apache.spark.storage.BlockManager: Putting 
> block broadcast_110 failed due to an exception
> 2017-08-31,10:26:57,734 WARN org.apache.spark.broadcast.BroadcastManager: 
> Failed to create a new broadcast in 1 attempts
> java.io.IOException: Failed to create local dir in 
> /tmp/blockmgr-5bb5ac1e-c494-434a-ab89-bd1808c6b9ed/2e.
> at 
> org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:70)
> at org.apache.spark.storage.DiskStore.remove(DiskStore.scala:115)
> at 
> org.apache.spark.storage.BlockManager.removeBlockInternal(BlockManager.scala:1339)
> at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:910)
> at 
> org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
> at 
> org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:726)
> at 
> org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:1233)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:122)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:88)
> at 
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
> at 
> org.apache.spark.broadcast.BroadcastManager$$anonfun$newBroadcast$1.apply$mcVI$sp(BroadcastManager.scala:60)
> at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
> at 
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:58)
> at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1415)
> at 
> org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1002)
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:924)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:771)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:770)
> at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
> at 
> org.apache.spark.scheduler.DAGScheduler.submitWaitingChildStages(DAGScheduler.scala:770)
> at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1235)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1662)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1620)
> at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1609)
> at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {code}
> I want to print the actual exception first for troubleshooting. Or maybe we 
> should not throw an exception when removing blocks.
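
For context, a minimal Scala sketch of the masking pattern described above and one possible mitigation: if cleanup in a finally block throws while an earlier exception is already in flight, log and suppress the cleanup failure so the original exception reaches the caller. This is illustrative only, not the actual BlockManager code or the change in the pull request above.
{code:scala}
object ExceptionMasking {
  // Runs `body`; always runs `cleanup`. If both throw, the failure from
  // `cleanup` is logged and suppressed so the original exception surfaces.
  def runWithCleanup[T](body: => T)(cleanup: => Unit): T = {
    var bodyFailed = false
    try {
      body
    } catch {
      case e: Throwable =>
        bodyFailed = true
        throw e
    } finally {
      try cleanup
      catch {
        case t: Throwable if bodyFailed =>
          System.err.println(s"Suppressing cleanup failure after earlier error: $t")
        // If the body succeeded, a cleanup failure still propagates normally.
      }
    }
  }

  def main(args: Array[String]): Unit = {
    try {
      runWithCleanup(throw new RuntimeException("actual failure")) {
        throw new java.io.IOException("Failed to create local dir")
      }
    } catch {
      // Prints the RuntimeException, not the IOException from cleanup.
      case e: Throwable => println(s"Caller sees: $e")
    }
  }
}
{code}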



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21962) Distributed Tracing in Spark

2017-09-08 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-21962:
--

 Summary: Distributed Tracing in Spark
 Key: SPARK-21962
 URL: https://issues.apache.org/jira/browse/SPARK-21962
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 2.2.0
Reporter: Andrew Ash


Spark should support distributed tracing, the mechanism, widely popularized by 
Google's [Dapper 
paper|https://research.google.com/pubs/pub36356.html], by which network requests 
carry additional metadata used to trace requests between services.

This would be useful for me since I have OpenZipkin style tracing in my 
distributed application up to the Spark driver, and from the executors out to 
my other services, but the link is broken in Spark between driver and executor 
since the Span IDs aren't propagated across that link.

An initial implementation could instrument the most important network calls 
with trace ids (like launching and finishing tasks), and incrementally add more 
tracing to other calls (torrent block distribution, external shuffle service, 
etc) as the feature matures.

Search keywords: Dapper, Brave, OpenZipkin, HTrace
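
As a point of reference for what is already possible without new instrumentation, below is a rough Scala sketch of carrying a trace ID from the driver to tasks via job-local properties. The property key and the Zipkin-style ID are made up, and this only covers task execution, not the internal RPCs this ticket proposes to instrument.
{code:scala}
import org.apache.spark.{SparkConf, SparkContext, TaskContext}

object TraceIdPropagation {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("trace-demo").setMaster("local[2]"))

    // Pretend this came from an incoming Zipkin/Brave span on the driver side.
    val traceId = "463ac35c9f6413ad48485a3953bb6124"

    // Local properties set on the driver thread are shipped with every task of
    // jobs submitted from that thread, so executors can read them back.
    sc.setLocalProperty("custom.trace.id", traceId)

    val lines = sc.parallelize(1 to 4, 2).map { x =>
      val tid = TaskContext.get().getLocalProperty("custom.trace.id")
      s"partition=${TaskContext.getPartitionId()} traceId=$tid value=$x"
    }.collect()

    lines.foreach(println)
    sc.stop()
  }
}
{code}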



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21961) Filter out BlockStatuses Accumulators during replaying history logs in Spark History Server

2017-09-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21961:


Assignee: Apache Spark

> Filter out BlockStatuses Accumulators during replaying history logs in Spark 
> History Server
> ---
>
> Key: SPARK-21961
> URL: https://issues.apache.org/jira/browse/SPARK-21961
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Ye Zhou
>Assignee: Apache Spark
> Attachments: Objects_Count_in_Heap.png, One_Thread_Took_24GB.png
>
>
> As described in SPARK-20923, TaskMetrics._updatedBlockStatuses uses a lot of 
> memory in the driver. Recently we also noticed the same issue in the Spark 
> History Server. Even though SPARK-20084 removes those entries from the event 
> logs, multiple versions of Spark, including 1.6.x and 2.1.0, are deployed in 
> our production cluster, and none of them include these two patches.
> In this case, those entries still show up in the logs and the Spark History 
> Server will replay them. The Spark History Server continuously hits severe 
> full GCs even though we tried to limit the cache size and enlarge the heap 
> to 40GB. We also tried different GC tuning parameters, such as CMS and G1GC. 
> None of them helped.
> We took a heap dump and found that the top memory consumers are BlockStatus 
> objects. There was even one thread that took 23GB of heap while replaying a 
> single log file.
> Since the former two tickets resolved the related issues in both the driver 
> and the writing of history logs, we should also consider adding this filter 
> to the Spark History Server to decrease the memory consumption of replaying 
> a single history log. For use cases like ours, where multiple older versions 
> of Spark are deployed, this filter should be pretty useful.
> We have deployed our Spark History Server with this filter, and it works 
> fine in our production cluster: it has processed thousands of logs with only 
> a few full GCs in total.
> !https://issues.apache.org/jira/secure/attachment/12886191/Objects_Count_in_Heap.png!
> !https://issues.apache.org/jira/secure/attachment/12886190/One_Thread_Took_24GB.png!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21961) Filter out BlockStatuses Accumulators during replaying history logs in Spark History Server

2017-09-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21961:


Assignee: (was: Apache Spark)

> Filter out BlockStatuses Accumulators during replaying history logs in Spark 
> History Server
> ---
>
> Key: SPARK-21961
> URL: https://issues.apache.org/jira/browse/SPARK-21961
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Ye Zhou
> Attachments: Objects_Count_in_Heap.png, One_Thread_Took_24GB.png
>
>
> As described in SPARK-20923, TaskMetrics._updatedBlockStatuses uses a lot of 
> memory in the driver. Recently we also noticed the same issue in the Spark 
> History Server. Even though SPARK-20084 removes those entries from the event 
> logs, multiple versions of Spark, including 1.6.x and 2.1.0, are deployed in 
> our production cluster, and none of them include these two patches.
> In this case, those entries still show up in the logs and the Spark History 
> Server will replay them. The Spark History Server continuously hits severe 
> full GCs even though we tried to limit the cache size and enlarge the heap 
> to 40GB. We also tried different GC tuning parameters, such as CMS and G1GC. 
> None of them helped.
> We took a heap dump and found that the top memory consumers are BlockStatus 
> objects. There was even one thread that took 23GB of heap while replaying a 
> single log file.
> Since the former two tickets resolved the related issues in both the driver 
> and the writing of history logs, we should also consider adding this filter 
> to the Spark History Server to decrease the memory consumption of replaying 
> a single history log. For use cases like ours, where multiple older versions 
> of Spark are deployed, this filter should be pretty useful.
> We have deployed our Spark History Server with this filter, and it works 
> fine in our production cluster: it has processed thousands of logs with only 
> a few full GCs in total.
> !https://issues.apache.org/jira/secure/attachment/12886191/Objects_Count_in_Heap.png!
> !https://issues.apache.org/jira/secure/attachment/12886190/One_Thread_Took_24GB.png!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21961) Filter out BlockStatuses Accumulators during replaying history logs in Spark History Server

2017-09-08 Thread Ye Zhou (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16159514#comment-16159514
 ] 

Ye Zhou commented on SPARK-21961:
-

Pull Request Added: https://github.com/apache/spark/pull/19170

> Filter out BlockStatuses Accumulators during replaying history logs in Spark 
> History Server
> ---
>
> Key: SPARK-21961
> URL: https://issues.apache.org/jira/browse/SPARK-21961
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Ye Zhou
> Attachments: Objects_Count_in_Heap.png, One_Thread_Took_24GB.png
>
>
> As described in SPARK-20923, TaskMetrics._updatedBlockStatuses uses a lot of 
> memory in the driver. Recently we also noticed the same issue in the Spark 
> History Server. Even though SPARK-20084 removes those entries from the event 
> logs, multiple versions of Spark, including 1.6.x and 2.1.0, are deployed in 
> our production cluster, and none of them include these two patches.
> In this case, those entries still show up in the logs and the Spark History 
> Server will replay them. The Spark History Server continuously hits severe 
> full GCs even though we tried to limit the cache size and enlarge the heap 
> to 40GB. We also tried different GC tuning parameters, such as CMS and G1GC. 
> None of them helped.
> We took a heap dump and found that the top memory consumers are BlockStatus 
> objects. There was even one thread that took 23GB of heap while replaying a 
> single log file.
> Since the former two tickets resolved the related issues in both the driver 
> and the writing of history logs, we should also consider adding this filter 
> to the Spark History Server to decrease the memory consumption of replaying 
> a single history log. For use cases like ours, where multiple older versions 
> of Spark are deployed, this filter should be pretty useful.
> We have deployed our Spark History Server with this filter, and it works 
> fine in our production cluster: it has processed thousands of logs with only 
> a few full GCs in total.
> !https://issues.apache.org/jira/secure/attachment/12886191/Objects_Count_in_Heap.png!
> !https://issues.apache.org/jira/secure/attachment/12886190/One_Thread_Took_24GB.png!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21961) Filter out BlockStatuses Accumulators during replaying history logs in Spark History Server

2017-09-08 Thread Ye Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ye Zhou updated SPARK-21961:

Description: 
As described in SPARK-20923, TaskMetrics._updatedBlockStatuses uses a lot of 
memory in the driver. Recently we also noticed the same issue in the Spark 
History Server. Even though SPARK-20084 removes those entries from the event 
logs, multiple versions of Spark, including 1.6.x and 2.1.0, are deployed in 
our production cluster, and none of them include these two patches.
In this case, those entries still show up in the logs and the Spark History 
Server will replay them. The Spark History Server continuously hits severe 
full GCs even though we tried to limit the cache size and enlarge the heap 
to 40GB. We also tried different GC tuning parameters, such as CMS and G1GC. 
None of them helped.
We took a heap dump and found that the top memory consumers are BlockStatus 
objects. There was even one thread that took 23GB of heap while replaying a 
single log file.
Since the former two tickets resolved the related issues in both the driver 
and the writing of history logs, we should also consider adding this filter 
to the Spark History Server to decrease the memory consumption of replaying 
a single history log. For use cases like ours, where multiple older versions 
of Spark are deployed, this filter should be pretty useful.
We have deployed our Spark History Server with this filter, and it works 
fine in our production cluster: it has processed thousands of logs with only 
a few full GCs in total.
!https://issues.apache.org/jira/secure/attachment/12886191/Objects_Count_in_Heap.png!
!https://issues.apache.org/jira/secure/attachment/12886190/One_Thread_Took_24GB.png!


  was:
As described in SPARK-20923, TaskMetrics._updatedBlockStatuses uses a lot of 
memory in the driver. Recently we also noticed the same issue in the Spark 
History Server. Even though SPARK-20084 removes those entries from the event 
logs, multiple versions of Spark, including 1.6.x and 2.1.0, are deployed in 
our production cluster, and none of them include these two patches.
In this case, those entries still show up in the logs and the Spark History 
Server will replay them. The Spark History Server continuously hits severe 
full GCs even though we tried to limit the cache size and enlarge the heap 
to 40GB. We also tried different GC tuning parameters, such as CMS and G1GC. 
None of them helped.
We took a heap dump and found that the top memory consumers are BlockStatus 
objects. There was even one thread that took 23GB of heap while replaying a 
single log file.
Since the former two tickets resolved the related issues in both the driver 
and the writing of history logs, we should also consider adding this filter 
to the Spark History Server to decrease the memory consumption of replaying 
a single history log. For use cases like ours, where multiple older versions 
of Spark are deployed, this filter should be pretty useful.
!https://issues.apache.org/jira/secure/attachment/12886191/Objects_Count_in_Heap.png!
!https://issues.apache.org/jira/secure/attachment/12886190/One_Thread_Took_24GB.png!



> Filter out BlockStatuses Accumulators during replaying history logs in Spark 
> History Server
> ---
>
> Key: SPARK-21961
> URL: https://issues.apache.org/jira/browse/SPARK-21961
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Ye Zhou
> Attachments: Objects_Count_in_Heap.png, One_Thread_Took_24GB.png
>
>
> As described in SPARK-20923, TaskMetrics._updatedBlockStatuses uses a lot of 
> memory in the driver. Recently we also noticed the same issue in the Spark 
> History Server. Even though SPARK-20084 removes those entries from the event 
> logs, multiple versions of Spark, including 1.6.x and 2.1.0, are deployed in 
> our production cluster, and none of them include these two patches.
> In this case, those entries still show up in the logs and the Spark History 
> Server will replay them. The Spark History Server continuously hits severe 
> full GCs even though we tried to limit the cache size and enlarge the heap 
> to 40GB. We also tried different GC tuning parameters, such as CMS and G1GC. 
> None of them helped.
> We took a heap dump and found that the top memory consumers are BlockStatus 
> objects. There was even one thread that took 23GB of heap while replaying a 
> single log file.
> Since the former two tickets resolved the related issues in both the driver 
> and the writing of history logs, we should also consider adding this filter 
> to the Spark History Server to decrease the memory consumption of replaying 

[jira] [Updated] (SPARK-21961) Filter out BlockStatuses Accumulators during replaying history logs in Spark History Server

2017-09-08 Thread Ye Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ye Zhou updated SPARK-21961:

Description: 
As described in SPARK-20923, TaskMetrics._updatedBlockStatuses uses a lot of 
memory in the driver. Recently we also noticed the same issue in the Spark 
History Server. Even though SPARK-20084 removes those entries from the event 
logs, multiple versions of Spark, including 1.6.x and 2.1.0, are deployed in 
our production cluster, and none of them include these two patches.
In this case, those entries still show up in the logs and the Spark History 
Server will replay them. The Spark History Server continuously hits severe 
full GCs even though we tried to limit the cache size and enlarge the heap 
to 40GB. We also tried different GC tuning parameters, such as CMS and G1GC. 
None of them helped.
We took a heap dump and found that the top memory consumers are BlockStatus 
objects. There was even one thread that took 23GB of heap while replaying a 
single log file.
Since the former two tickets resolved the related issues in both the driver 
and the writing of history logs, we should also consider adding this filter 
to the Spark History Server to decrease the memory consumption of replaying 
a single history log. For use cases like ours, where multiple older versions 
of Spark are deployed, this filter should be pretty useful.
!https://issues.apache.org/jira/secure/attachment/12886191/Objects_Count_in_Heap.png!
!https://issues.apache.org/jira/secure/attachment/12886190/One_Thread_Took_24GB.png!


  was:
As described in SPARK-20923, TaskMetrics._updatedBlockStatuses uses a lot of 
memory in the driver. Recently we also noticed the same issue in the Spark 
History Server. Even though SPARK-20084 removes those entries from the event 
logs, multiple versions of Spark, including 1.6.x and 2.1.0, are deployed in 
our production cluster, and none of them include these two patches.
In this case, those entries still show up in the logs and the Spark History 
Server will replay them. The Spark History Server continuously hits severe 
full GCs even though we tried to limit the cache size and enlarge the heap 
to 40GB. We also tried different GC tuning parameters, such as CMS and G1GC. 
None of them helped.
We took a heap dump and found that the top memory consumers are BlockStatus 
objects. There was even one thread that took 24GB of heap while replaying a 
single log file.
Since the former two tickets resolved the related issues in both the driver 
and the writing of history logs, we should also consider adding this filter 
to the Spark History Server to decrease the memory consumption of replaying 
a single history log. For use cases like ours, where multiple older versions 
of Spark are deployed, this filter should be pretty useful.
!https://issues.apache.org/jira/secure/attachment/12886191/Objects_Count_in_Heap.png!
!https://issues.apache.org/jira/secure/attachment/12886190/One_Thread_Took_24GB.png!



> Filter out BlockStatuses Accumulators during replaying history logs in Spark 
> History Server
> ---
>
> Key: SPARK-21961
> URL: https://issues.apache.org/jira/browse/SPARK-21961
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Ye Zhou
> Attachments: Objects_Count_in_Heap.png, One_Thread_Took_24GB.png
>
>
> As described in SPARK-20923, TaskMetrics._updatedBlockStatuses uses a lot of 
> memory in the driver. Recently we also noticed the same issue in the Spark 
> History Server. Even though SPARK-20084 removes those entries from the event 
> logs, multiple versions of Spark, including 1.6.x and 2.1.0, are deployed in 
> our production cluster, and none of them include these two patches.
> In this case, those entries still show up in the logs and the Spark History 
> Server will replay them. The Spark History Server continuously hits severe 
> full GCs even though we tried to limit the cache size and enlarge the heap 
> to 40GB. We also tried different GC tuning parameters, such as CMS and G1GC. 
> None of them helped.
> We took a heap dump and found that the top memory consumers are BlockStatus 
> objects. There was even one thread that took 23GB of heap while replaying a 
> single log file.
> Since the former two tickets resolved the related issues in both the driver 
> and the writing of history logs, we should also consider adding this filter 
> to the Spark History Server to decrease the memory consumption of replaying 
> a single history log. For use cases like ours, where multiple older versions 
> of Spark are deployed, this filter should be pretty useful.
> 

[jira] [Updated] (SPARK-21961) Filter out BlockStatuses Accumulators during replaying history logs in Spark History Server

2017-09-08 Thread Ye Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ye Zhou updated SPARK-21961:

Description: 
As described in SPARK-20923, TaskMetrics._updatedBlockStatuses uses a lot of 
memory in the driver. Recently we also noticed the same issue in the Spark 
History Server. Even though SPARK-20084 removes those entries from the event 
logs, multiple versions of Spark, including 1.6.x and 2.1.0, are deployed in 
our production cluster, and none of them include these two patches.
In this case, those entries still show up in the logs and the Spark History 
Server will replay them. The Spark History Server continuously hits severe 
full GCs even though we tried to limit the cache size and enlarge the heap 
to 40GB. We also tried different GC tuning parameters, such as CMS and G1GC. 
None of them helped.
We took a heap dump and found that the top memory consumers are BlockStatus 
objects. There was even one thread that took 24GB of heap while replaying a 
single log file.
Since the former two tickets resolved the related issues in both the driver 
and the writing of history logs, we should also consider adding this filter 
to the Spark History Server to decrease the memory consumption of replaying 
a single history log. For use cases like ours, where multiple older versions 
of Spark are deployed, this filter should be pretty useful.
!https://issues.apache.org/jira/secure/attachment/12886191/Objects_Count_in_Heap.png!
!https://issues.apache.org/jira/secure/attachment/12886190/One_Thread_Took_24GB.png!


  was:
As described in SPARK-20923, TaskMetrics._updatedBlockStatuses uses a lot of 
memory in the driver. Recently we also noticed the same issue in the Spark 
History Server. Even though SPARK-20084 removes those entries from the event 
logs, multiple versions of Spark, including 1.6.x and 2.1.0, are deployed in 
our production cluster, and none of them include these two patches.
In this case, those entries still show up in the logs and the Spark History 
Server will replay them. The Spark History Server continuously hits severe 
full GCs even though we tried to limit the cache size and enlarge the heap 
to 40GB. We also tried different GC tuning parameters, such as CMS and G1GC. 
None of them helped.
We took a heap dump and found that the top memory consumers are BlockStatus 
objects. There was even one thread that took 24GB of heap while replaying a 
single log file.
Since the former two tickets resolved the related issues in both the driver 
and the writing of history logs, we should also consider adding this filter 
to the Spark History Server to decrease the memory consumption of replaying 
a single history log. For use cases like ours, where multiple older versions 
of Spark are deployed, this filter should be pretty useful.
!https://issues.apache.org/jira/secure/attachment/12886191/Objects_Count_in_Heap.png!



> Filter out BlockStatuses Accumulators during replaying history logs in Spark 
> History Server
> ---
>
> Key: SPARK-21961
> URL: https://issues.apache.org/jira/browse/SPARK-21961
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Ye Zhou
> Attachments: Objects_Count_in_Heap.png, One_Thread_Took_24GB.png
>
>
> As described in SPARK-20923, TaskMetrics._updatedBlockStatuses uses a lot of 
> memory in the driver. Recently we also noticed the same issue in the Spark 
> History Server. Even though SPARK-20084 removes those entries from the event 
> logs, multiple versions of Spark, including 1.6.x and 2.1.0, are deployed in 
> our production cluster, and none of them include these two patches.
> In this case, those entries still show up in the logs and the Spark History 
> Server will replay them. The Spark History Server continuously hits severe 
> full GCs even though we tried to limit the cache size and enlarge the heap 
> to 40GB. We also tried different GC tuning parameters, such as CMS and G1GC. 
> None of them helped.
> We took a heap dump and found that the top memory consumers are BlockStatus 
> objects. There was even one thread that took 24GB of heap while replaying a 
> single log file.
> Since the former two tickets resolved the related issues in both the driver 
> and the writing of history logs, we should also consider adding this filter 
> to the Spark History Server to decrease the memory consumption of replaying 
> a single history log. For use cases like ours, where multiple older versions 
> of Spark are deployed, this filter should be pretty useful.
> !https://issues.apache.org/jira/secure/attachment/12886191/Objects_Count_in_Heap.png!
> 

[jira] [Updated] (SPARK-21961) Filter out BlockStatuses Accumulators during replaying history logs in Spark History Server

2017-09-08 Thread Ye Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ye Zhou updated SPARK-21961:

Description: 
As described in SPARK-20923, TaskMetrics._updatedBlockStatuses uses a lot of 
memory in the driver. Recently we also noticed the same issue in the Spark 
History Server. Even though SPARK-20084 removes those entries from the event 
logs, multiple versions of Spark, including 1.6.x and 2.1.0, are deployed in 
our production cluster, and none of them include these two patches.
In this case, those entries still show up in the logs and the Spark History 
Server will replay them. The Spark History Server continuously hits severe 
full GCs even though we tried to limit the cache size and enlarge the heap 
to 40GB. We also tried different GC tuning parameters, such as CMS and G1GC. 
None of them helped.
We took a heap dump and found that the top memory consumers are BlockStatus 
objects. There was even one thread that took 24GB of heap while replaying a 
single log file.
Since the former two tickets resolved the related issues in both the driver 
and the writing of history logs, we should also consider adding this filter 
to the Spark History Server to decrease the memory consumption of replaying 
a single history log. For use cases like ours, where multiple older versions 
of Spark are deployed, this filter should be pretty useful.
!https://issues.apache.org/jira/secure/attachment/12886191/Objects_Count_in_Heap.png!


  was:
As described in SPARK-20923, TaskMetrics._updatedBlockStatuses uses a lot of 
memory in the driver. Recently we also noticed the same issue in the Spark 
History Server. Even though SPARK-20084 removes those entries from the event 
logs, multiple versions of Spark, including 1.6.x and 2.1.0, are deployed in 
our production cluster, and none of them include these two patches.
In this case, those entries still show up in the logs and the Spark History 
Server will replay them. The Spark History Server continuously hits severe 
full GCs even though we tried to limit the cache size and enlarge the heap 
to 40GB. We also tried different GC tuning parameters, such as CMS and G1GC. 
None of them helped.
We took a heap dump and found that the top memory consumers are BlockStatus 
objects. There was even one thread that took 24GB of heap while replaying a 
single log file.
Since the former two tickets resolved the related issues in both the driver 
and the writing of history logs, we should also consider adding this filter 
to the Spark History Server to decrease the memory consumption of replaying 
a single history log. For use cases like ours, where multiple older versions 
of Spark are deployed, this filter should be pretty useful.



> Filter out BlockStatuses Accumulators during replaying history logs in Spark 
> History Server
> ---
>
> Key: SPARK-21961
> URL: https://issues.apache.org/jira/browse/SPARK-21961
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Ye Zhou
> Attachments: Objects_Count_in_Heap.png, One_Thread_Took_24GB.png
>
>
> As described in SPARK-20923, TaskMetrics._updatedBlockStatuses uses a lot of 
> memory in the driver. Recently we also noticed the same issue in the Spark 
> History Server. Even though SPARK-20084 removes those entries from the event 
> logs, multiple versions of Spark, including 1.6.x and 2.1.0, are deployed in 
> our production cluster, and none of them include these two patches.
> In this case, those entries still show up in the logs and the Spark History 
> Server will replay them. The Spark History Server continuously hits severe 
> full GCs even though we tried to limit the cache size and enlarge the heap 
> to 40GB. We also tried different GC tuning parameters, such as CMS and G1GC. 
> None of them helped.
> We took a heap dump and found that the top memory consumers are BlockStatus 
> objects. There was even one thread that took 24GB of heap while replaying a 
> single log file.
> Since the former two tickets resolved the related issues in both the driver 
> and the writing of history logs, we should also consider adding this filter 
> to the Spark History Server to decrease the memory consumption of replaying 
> a single history log. For use cases like ours, where multiple older versions 
> of Spark are deployed, this filter should be pretty useful.
> !https://issues.apache.org/jira/secure/attachment/12886191/Objects_Count_in_Heap.png!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional 

[jira] [Updated] (SPARK-21961) Filter out BlockStatuses Accumulators during replaying history logs in Spark History Server

2017-09-08 Thread Ye Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ye Zhou updated SPARK-21961:

Description: 
As described in SPARK-20923, TaskMetrics._updatedBlockStatuses uses a lot of 
memory in the driver. Recently we also noticed the same issue in the Spark 
History Server. Even though SPARK-20084 removes those entries from the event 
logs, multiple versions of Spark, including 1.6.x and 2.1.0, are deployed in 
our production cluster, and none of them include these two patches.
In this case, those entries still show up in the logs and the Spark History 
Server will replay them. The Spark History Server continuously hits severe 
full GCs even though we tried to limit the cache size and enlarge the heap 
to 40GB. We also tried different GC tuning parameters, such as CMS and G1GC. 
None of them helped.
We took a heap dump and found that the top memory consumers are BlockStatus 
objects. There was even one thread that took 24GB of heap while replaying a 
single log file.
Since the former two tickets resolved the related issues in both the driver 
and the writing of history logs, we should also consider adding this filter 
to the Spark History Server to decrease the memory consumption of replaying 
a single history log. For use cases like ours, where multiple older versions 
of Spark are deployed, this filter should be pretty useful.


  was:
As described in SPARK-20923, TaskMetrics._updatedBlockStatuses uses a lot of 
memory in the driver. Recently we also noticed the same issue in the Spark 
History Server. Even though SPARK-20084 removes those entries from the event 
logs, multiple versions of Spark, including 1.6.x and 2.1.0, are deployed in 
our production cluster, and none of them include these two patches.
In this case, those entries still show up in the logs and the Spark History 
Server will replay them. The Spark History Server continuously hits severe 
full GCs even though we tried to limit the cache size and enlarge the heap 
to 40GB. We also tried different GC tuning parameters, such as CMS and G1GC. 
None of them helped.
We took a heap dump and found that the top memory consumers are BlockStatus 
objects. There was even one thread that took 24GB of heap while replaying a 
single log file.
Since the former two tickets resolved the related issues in both the driver 
and the writing of history logs, we should also consider adding this filter 
to the Spark History Server to decrease the memory consumption of replaying 
a single history log. For use cases like ours, where multiple older versions 
of Spark are deployed, this filter should be pretty useful.


> Filter out BlockStatuses Accumulators during replaying history logs in Spark 
> History Server
> ---
>
> Key: SPARK-21961
> URL: https://issues.apache.org/jira/browse/SPARK-21961
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Ye Zhou
> Attachments: Objects_Count_in_Heap.png, One_Thread_Took_24GB.png
>
>
> As described in SPARK-20923, TaskMetrics._updatedBlockStatuses uses a lot of 
> memory in the driver. Recently we also noticed the same issue in the Spark 
> History Server. Even though SPARK-20084 removes those entries from the event 
> logs, multiple versions of Spark, including 1.6.x and 2.1.0, are deployed in 
> our production cluster, and none of them include these two patches.
> In this case, those entries still show up in the logs and the Spark History 
> Server will replay them. The Spark History Server continuously hits severe 
> full GCs even though we tried to limit the cache size and enlarge the heap 
> to 40GB. We also tried different GC tuning parameters, such as CMS and G1GC. 
> None of them helped.
> We took a heap dump and found that the top memory consumers are BlockStatus 
> objects. There was even one thread that took 24GB of heap while replaying a 
> single log file.
> Since the former two tickets resolved the related issues in both the driver 
> and the writing of history logs, we should also consider adding this filter 
> to the Spark History Server to decrease the memory consumption of replaying 
> a single history log. For use cases like ours, where multiple older versions 
> of Spark are deployed, this filter should be pretty useful.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21961) Filter out BlockStatuses Accumulators during replaying history logs in Spark History Server

2017-09-08 Thread Ye Zhou (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ye Zhou updated SPARK-21961:

Attachment: Objects_Count_in_Heap.png
One_Thread_Took_24GB.png

> Filter out BlockStatuses Accumulators during replaying history logs in Spark 
> History Server
> ---
>
> Key: SPARK-21961
> URL: https://issues.apache.org/jira/browse/SPARK-21961
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Ye Zhou
> Attachments: Objects_Count_in_Heap.png, One_Thread_Took_24GB.png
>
>
> As described in SPARK-20923, TaskMetrics._updatedBlockStatuses uses a lot of 
> memory in the driver. Recently we also noticed the same issue in the Spark 
> History Server. Even though SPARK-20084 removes those entries from the event 
> logs, multiple versions of Spark, including 1.6.x and 2.1.0, are deployed in 
> our production cluster, and none of them include these two patches.
> In this case, those entries still show up in the logs and the Spark History 
> Server will replay them. The Spark History Server continuously hits severe 
> full GCs even though we tried to limit the cache size and enlarge the heap 
> to 40GB. We also tried different GC tuning parameters, such as CMS and G1GC. 
> None of them helped.
> We took a heap dump and found that the top memory consumers are BlockStatus 
> objects. There was even one thread that took 24GB of heap while replaying a 
> single log file.
> Since the former two tickets resolved the related issues in both the driver 
> and the writing of history logs, we should also consider adding this filter 
> to the Spark History Server to decrease the memory consumption of replaying 
> a single history log. For use cases like ours, where multiple older versions 
> of Spark are deployed, this filter should be pretty useful.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21961) Filter out BlockStatuses Accumulators during replaying history logs in Spark History Server

2017-09-08 Thread Ye Zhou (JIRA)
Ye Zhou created SPARK-21961:
---

 Summary: Filter out BlockStatuses Accumulators during replaying 
history logs in Spark History Server
 Key: SPARK-21961
 URL: https://issues.apache.org/jira/browse/SPARK-21961
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.2.0, 2.1.0
Reporter: Ye Zhou


As described in SPARK-20923, TaskMetrics._updatedBlockStatuses uses a lot of 
memory in the driver. Recently we also noticed the same issue in the Spark 
History Server. Even though SPARK-20084 removes those entries from the event 
logs, multiple versions of Spark, including 1.6.x and 2.1.0, are deployed in 
our production cluster, and none of them include these two patches.
In this case, those entries still show up in the logs and the Spark History 
Server will replay them. The Spark History Server continuously hits severe 
full GCs even though we tried to limit the cache size and enlarge the heap 
to 40GB. We also tried different GC tuning parameters, such as CMS and G1GC. 
None of them helped.
We took a heap dump and found that the top memory consumers are BlockStatus 
objects. There was even one thread that took 24GB of heap while replaying a 
single log file.
Since the former two tickets resolved the related issues in both the driver 
and the writing of history logs, we should also consider adding this filter 
to the Spark History Server to decrease the memory consumption of replaying 
a single history log. For use cases like ours, where multiple older versions 
of Spark are deployed, this filter should be pretty useful.
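
For illustration only, below is a rough Scala sketch of the kind of filtering proposed here, applied to one event-log JSON line before it is replayed. The accumulator name and the "Accumulables"/"Name" field names are assumptions about the event-log format and are not taken from the linked pull request.
{code:scala}
import org.json4s._
import org.json4s.jackson.JsonMethods._

object BlockStatusFilterSketch {
  // Assumed internal accumulator name that carries updated block statuses.
  private val BlockStatusAccumName = "internal.metrics.updatedBlockStatuses"

  // Drops the block-status accumulables from a single event-log line so the
  // replay path never materializes the huge BlockStatus collections.
  def stripBlockStatuses(eventLogLine: String): String = {
    val cleaned = parse(eventLogLine) transformField {
      case ("Accumulables", JArray(items)) =>
        ("Accumulables", JArray(items.filterNot { item =>
          (item \ "Name") == JString(BlockStatusAccumName)
        }))
    }
    compact(render(cleaned))
  }
}
{code}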



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21960) Spark Streaming Dynamic Allocation should respect spark.executor.instances

2017-09-08 Thread Karthik Palaniappan (JIRA)
Karthik Palaniappan created SPARK-21960:
---

 Summary: Spark Streaming Dynamic Allocation should respect 
spark.executor.instances
 Key: SPARK-21960
 URL: https://issues.apache.org/jira/browse/SPARK-21960
 Project: Spark
  Issue Type: Improvement
  Components: DStreams
Affects Versions: 2.2.0
Reporter: Karthik Palaniappan
Priority: Minor


This check enforces that spark.executor.instances (aka --num-executors) is 
either unset or explicitly set to 0. 
https://github.com/apache/spark/blob/v2.2.0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ExecutorAllocationManager.scala#L207

If spark.executor.instances is unset, the check is fine, and the property 
defaults to 2. Spark requests the cluster manager for 2 executors to start 
with, then adds/removes executors appropriately.

However, if you explicitly set it to 0, the check also succeeds, but Spark 
never asks the cluster manager for any executors. When running on YARN, I 
repeatedly saw:

{code:java}
17/08/22 19:35:21 WARN org.apache.spark.scheduler.cluster.YarnScheduler: 
Initial job has not accepted any resources; check your cluster UI to ensure 
that workers are registered and have sufficient resources
17/08/22 19:35:36 WARN org.apache.spark.scheduler.cluster.YarnScheduler: 
Initial job has not accepted any resources; check your cluster UI to ensure 
that workers are registered and have sufficient resources
17/08/22 19:35:51 WARN org.apache.spark.scheduler.cluster.YarnScheduler: 
Initial job has not accepted any resources; check your cluster UI to ensure 
that workers are registered and have sufficient resources
{code}
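
For illustration, a minimal sketch of the two configurations described above, expressed as SparkConf settings rather than spark-defaults.conf entries (the keys are the standard spark.streaming.dynamicAllocation.enabled and spark.executor.instances properties; this is not code from the linked check):

{code:scala}
import org.apache.spark.SparkConf

object StreamingDraConfSketch {
  // Passes the check and works: spark.executor.instances is left unset, so the
  // property falls back to its default of 2 and Spark requests initial
  // executors before streaming DRA starts adjusting the count.
  val working: SparkConf = new SparkConf()
    .set("spark.streaming.dynamicAllocation.enabled", "true")

  // Also passes the check, but hangs as described above: with an explicit 0,
  // no executors are ever requested, so the initial job never gets resources
  // and the YarnScheduler warning repeats indefinitely.
  val hanging: SparkConf = new SparkConf()
    .set("spark.streaming.dynamicAllocation.enabled", "true")
    .set("spark.executor.instances", "0")
}
{code}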

I noticed that at least Google Dataproc and Ambari explicitly set 
spark.executor.instances to a positive number, meaning that to use dynamic 
allocation, you would have to edit spark-defaults.conf to remove the property. 
That's obnoxious.

In addition, in Spark 2.3, spark-submit will refuse to accept "0" as a value 
for --num-executors or --conf spark.executor.instances: 
https://github.com/apache/spark/commit/0fd84b05dc9ac3de240791e2d4200d8bdffbb01a#diff-63a5d817d2d45ae24de577f6a1bd80f9

It is much more reasonable for Streaming DRA to use spark.executor.instances, 
just like Core DRA. I'll open a pull request to remove the check if there are 
no objections.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19866) Add local version of Word2Vec findSynonyms for spark.ml: Python API

2017-09-08 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-19866.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Add local version of Word2Vec findSynonyms for spark.ml: Python API
> ---
>
> Key: SPARK-19866
> URL: https://issues.apache.org/jira/browse/SPARK-19866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Assignee: Xin Ren
>Priority: Minor
> Fix For: 2.3.0
>
>
> Add Python API for findSynonymsArray matching Scala API in linked JIRA.
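
For reference, a minimal sketch of the Scala-side method the new Python API mirrors, assuming Spark 2.2+ where Word2VecModel.findSynonymsArray is available (the toy corpus and column names are made up):
{code:scala}
import org.apache.spark.ml.feature.Word2Vec
import org.apache.spark.sql.SparkSession

object FindSynonymsArrayExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("w2v-demo").master("local[2]").getOrCreate()

    // Tiny toy corpus: each row is a pre-tokenized sentence.
    val docs = spark.createDataFrame(Seq(
      "spark history server replay".split(" "),
      "spark driver executor task".split(" "),
      "history server event log".split(" ")
    ).map(Tuple1.apply)).toDF("text")

    val model = new Word2Vec()
      .setInputCol("text")
      .setOutputCol("features")
      .setVectorSize(8)
      .setMinCount(1)
      .fit(docs)

    // Local (driver-side) lookup: returns Array[(String, Double)] with the
    // closest words, without launching a distributed job.
    model.findSynonymsArray("spark", 2).foreach { case (word, similarity) =>
      println(s"$word -> $similarity")
    }

    spark.stop()
  }
}
{code}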



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15243) Binarizer.explainParam(u"...") raises ValueError

2017-09-08 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk resolved SPARK-15243.
-
Resolution: Fixed

> Binarizer.explainParam(u"...") raises ValueError
> 
>
> Key: SPARK-15243
> URL: https://issues.apache.org/jira/browse/SPARK-15243
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
> Environment: CentOS 7, Spark 1.6.0
>Reporter: Kazuki Yokoishi
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.3.0
>
>
> When unicode is passed to Binarizer.explainParam(), ValueError occurs.
> To reproduce:
> {noformat}
> >>> binarizer = Binarizer(threshold=1.0, inputCol="values", outputCol="features")
> >>> binarizer.explainParam("threshold") # str can be passed
> 'threshold: threshold in binary classification prediction, in range [0, 1] 
> (default: 0.0, current: 1.0)'
> >>> binarizer.explainParam(u"threshold") # unicode cannot be passed
> ---------------------------------------------------------------------------
> ValueError                                Traceback (most recent call last)
> <ipython-input-...> in <module>()
> ----> 1 binarizer.explainParam(u"threshold")
> /usr/spark/current/python/pyspark/ml/param/__init__.py in explainParam(self, 
> param)
>  96 default value and user-supplied value in a string.
>  97 """
> ---> 98 param = self._resolveParam(param)
>  99 values = []
> 100 if self.isDefined(param):
> /usr/spark/current/python/pyspark/ml/param/__init__.py in _resolveParam(self, 
> param)
> 231 return self.getParam(param)
> 232 else:
> --> 233 raise ValueError("Cannot resolve %r as a param." % param)
> 234 
> 235 @staticmethod
> ValueError: Cannot resolve u'threshold' as a param.
> {noformat}
> The same errors occur in other methods:
> * Binarizer.hasDefault()
> * Binarizer.getOrDefault()
> * Binarizer.isSet()
> These errors are caused by the *isinstance(obj, str)* checks in 
> pyspark.ml.param.Params._resolveParam().
> basestring should be used instead of str in isinstance() for backward 
> compatibility, as below.
> {noformat}
> import sys
>
> if sys.version >= '3':
>     basestring = str
>
> if isinstance(obj, basestring):
>     ...  # TODO: resolve the param by name as before
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15243) Binarizer.explainParam(u"...") raises ValueError

2017-09-08 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk updated SPARK-15243:

Fix Version/s: 2.3.0

> Binarizer.explainParam(u"...") raises ValueError
> 
>
> Key: SPARK-15243
> URL: https://issues.apache.org/jira/browse/SPARK-15243
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
> Environment: CentOS 7, Spark 1.6.0
>Reporter: Kazuki Yokoishi
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.3.0
>
>
> When unicode is passed to Binarizer.explainParam(), ValueError occurs.
> To reproduce:
> {noformat}
> >>> binarizer = Binarizer(threshold=1.0, inputCol="values", outputCol="features")
> >>> binarizer.explainParam("threshold") # str can be passed
> 'threshold: threshold in binary classification prediction, in range [0, 1] 
> (default: 0.0, current: 1.0)'
> >>> binarizer.explainParam(u"threshold") # unicode cannot be passed
> ---------------------------------------------------------------------------
> ValueError                                Traceback (most recent call last)
> <ipython-input-...> in <module>()
> ----> 1 binarizer.explainParam(u"threshold")
> /usr/spark/current/python/pyspark/ml/param/__init__.py in explainParam(self, 
> param)
>  96 default value and user-supplied value in a string.
>  97 """
> ---> 98 param = self._resolveParam(param)
>  99 values = []
> 100 if self.isDefined(param):
> /usr/spark/current/python/pyspark/ml/param/__init__.py in _resolveParam(self, 
> param)
> 231 return self.getParam(param)
> 232 else:
> --> 233 raise ValueError("Cannot resolve %r as a param." % param)
> 234 
> 235 @staticmethod
> ValueError: Cannot resolve u'threshold' as a param.
> {noformat}
> The same errors occur in other methods:
> * Binarizer.hasDefault()
> * Binarizer.getOrDefault()
> * Binarizer.isSet()
> These errors are caused by the *isinstance(obj, str)* checks in 
> pyspark.ml.param.Params._resolveParam().
> basestring should be used instead of str in isinstance() for backward 
> compatibility, as below.
> {noformat}
> import sys
>
> if sys.version >= '3':
>     basestring = str
>
> if isinstance(obj, basestring):
>     ...  # TODO: resolve the param by name as before
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15243) Binarizer.explainParam(u"...") raises ValueError

2017-09-08 Thread holdenk (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

holdenk reassigned SPARK-15243:
---

Assignee: Hyukjin Kwon  (was: Seth Hendrickson)

> Binarizer.explainParam(u"...") raises ValueError
> 
>
> Key: SPARK-15243
> URL: https://issues.apache.org/jira/browse/SPARK-15243
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.0
> Environment: CentOS 7, Spark 1.6.0
>Reporter: Kazuki Yokoishi
>Assignee: Hyukjin Kwon
>Priority: Minor
>
> When unicode is passed to Binarizer.explainParam(), ValueError occurs.
> To reproduce:
> {noformat}
> >>> binarizer = Binarizer(threshold=1.0, inputCol="values", outputCol="features")
> >>> binarizer.explainParam("threshold") # str can be passed
> 'threshold: threshold in binary classification prediction, in range [0, 1] 
> (default: 0.0, current: 1.0)'
> >>> binarizer.explainParam(u"threshold") # unicode cannot be passed
> ---------------------------------------------------------------------------
> ValueError                                Traceback (most recent call last)
> <ipython-input-...> in <module>()
> ----> 1 binarizer.explainParam(u"threshold")
> /usr/spark/current/python/pyspark/ml/param/__init__.py in explainParam(self, 
> param)
>  96 default value and user-supplied value in a string.
>  97 """
> ---> 98 param = self._resolveParam(param)
>  99 values = []
> 100 if self.isDefined(param):
> /usr/spark/current/python/pyspark/ml/param/__init__.py in _resolveParam(self, 
> param)
> 231 return self.getParam(param)
> 232 else:
> --> 233 raise ValueError("Cannot resolve %r as a param." % param)
> 234 
> 235 @staticmethod
> ValueError: Cannot resolve u'threshold' as a param.
> {noformat}
> The same errors occur in other methods:
> * Binarizer.hasDefault()
> * Binarizer.getOrDefault()
> * Binarizer.isSet()
> These errors are caused by the *isinstance(obj, str)* checks in 
> pyspark.ml.param.Params._resolveParam().
> basestring should be used instead of str in isinstance() for backward 
> compatibility, as below.
> {noformat}
> import sys
>
> if sys.version >= '3':
>     basestring = str
>
> if isinstance(obj, basestring):
>     ...  # TODO: resolve the param by name as before
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18085) SPIP: Better History Server scalability for many / large applications

2017-09-08 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158997#comment-16158997
 ] 

Marcelo Vanzin commented on SPARK-18085:


[~jincheng] that is caused by SPARK-17701. The bug is still open but the patch 
has actually been committed, and it removes a property of {{SparkPlanInfo}} 
that makes Spark 2.3 unable to read event logs from earlier versions. Can you 
file a new bug with that information? Thanks.

> SPIP: Better History Server scalability for many / large applications
> -
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>  Labels: SPIP
> Attachments: screenshot-1.png, screenshot-2.png, spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to how to solve them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21907) NullPointerException in UnsafeExternalSorter.spill()

2017-09-08 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158995#comment-16158995
 ] 

Kazuaki Ishizaki commented on SPARK-21907:
--

If you cannot provide a repro, could you please run your program with the 
latest master branch?
SPARK-21319 may alleviate this issue.

> NullPointerException in UnsafeExternalSorter.spill()
> 
>
> Key: SPARK-21907
> URL: https://issues.apache.org/jira/browse/SPARK-21907
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Juliusz Sompolski
>
> I see NPE during sorting with the following stacktrace:
> {code}
> java.lang.NullPointerException
>   at 
> org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:383)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:63)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:43)
>   at 
> org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270)
>   at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142)
>   at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:345)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:206)
>   at 
> org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203)
>   at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281)
>   at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset(UnsafeInMemorySorter.java:173)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:221)
>   at 
> org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203)
>   at 
> org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281)
>   at 
> org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:349)
>   at 
> org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:400)
>   at 
> org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>   at 
> org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:778)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextInnerJoinRows(SortMergeJoinExec.scala:685)
>   at 
> org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1$$anon$2.advanceNext(SortMergeJoinExec.scala:259)
>   at 
> org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
>   at org.apache.spark.scheduler.Task.run(Task.scala:108)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:346)
>   at 
> 

[jira] [Resolved] (SPARK-21959) Python RDD goes into never ending garbage collection service when spark submit is triggered in oozie

2017-09-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21959.
---
Resolution: Invalid

There's no detail on the job, and no indication that this is a problem in 
Spark. Your app is just running out of memory; you optimized it and it worked. 
That's not something to report as a JIRA.

> Python RDD goes into never ending garbage collection service when spark 
> submit is triggered in oozie
> 
>
> Key: SPARK-21959
> URL: https://issues.apache.org/jira/browse/SPARK-21959
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Submit
>Affects Versions: 2.1.0
> Environment: Head Node - 2 - 8 cores -55 GB/Node
> Worker Node - 5 - 4 cores - 28 GB/Node 
>Reporter: VP
>   Original Estimate: 30h
>  Remaining Estimate: 30h
>
> When the job is submitted through spark-submit, the code executes fine.
> But when called through Oozie, whenever a PythonRDD is triggered, it gets 
> into a garbage collection cycle which never ends.
> When the RDD is replaced by a DataFrame, the code executes fine.
> We need to understand the proper root cause of why garbage collection only 
> runs away when the job is called through Oozie.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-21959) Python RDD goes into never ending garbage collection service when spark submit is triggered in oozie

2017-09-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen closed SPARK-21959.
-

> Python RDD goes into never ending garbage collection service when spark 
> submit is triggered in oozie
> 
>
> Key: SPARK-21959
> URL: https://issues.apache.org/jira/browse/SPARK-21959
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Submit
>Affects Versions: 2.1.0
> Environment: Head Node - 2 - 8 cores -55 GB/Node
> Worker Node - 5 - 4 cores - 28 GB/Node 
>Reporter: VP
>   Original Estimate: 30h
>  Remaining Estimate: 30h
>
> When the job is submitted through spark-submit, the code executes fine.
> But when called through Oozie, whenever a PythonRDD is triggered, it gets 
> into a garbage collection cycle which never ends.
> When the RDD is replaced by a DataFrame, the code executes fine.
> We need to understand the proper root cause of why garbage collection only 
> runs away when the job is called through Oozie.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21959) Python RDD goes into never ending garbage collection service when spark submit is triggered in oozie

2017-09-08 Thread Vega Paleri (JIRA)
Vega Paleri created SPARK-21959:
---

 Summary: Python RDD goes into never ending garbage collection 
service when spark submit is triggered in oozie
 Key: SPARK-21959
 URL: https://issues.apache.org/jira/browse/SPARK-21959
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Submit
Affects Versions: 2.1.0
 Environment: Head Node - 2 - 8 cores -55 GB/Node
Worker Node - 5 - 4 cores - 28 GB/Node 
Reporter: Vega Paleri


When the job is submitted through spark-submit, the code executes fine.
But when called through Oozie, whenever a PythonRDD is triggered, it gets into 
a garbage collection cycle which never ends.

When the RDD is replaced by a DataFrame, the code executes fine.

We need to understand the proper root cause of why garbage collection only runs 
away when the job is called through Oozie.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21893) Put Kafka 0.8 behind a profile

2017-09-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21893:
--
Description: 
Kafka does not support 0.8.x for Scala 2.12. This code will have to, at least, 
be optionally enabled by a profile, which could be enabled by default for 2.11. 
Or outright removed.

Update: it'll also require removing 0.8.x examples, because otherwise the 
example module has to be split.

While not necessarily connected, it's probably a decent point to declare 0.8 
deprecated. And that means declaring 0.10 (the other API left) as stable.

  was:Kafka does not support 0.8.x for Scala 2.12. This code will have to, at 
least, be optionally enabled by a profile, which could be enabled by default 
for 2.11. Or outright removed.


> Put Kafka 0.8 behind a profile
> --
>
> Key: SPARK-21893
> URL: https://issues.apache.org/jira/browse/SPARK-21893
> Project: Spark
>  Issue Type: Sub-task
>  Components: DStreams
>Affects Versions: 2.2.0
>Reporter: Sean Owen
>Priority: Minor
>
> Kafka does not support 0.8.x for Scala 2.12. This code will have to, at 
> least, be optionally enabled by a profile, which could be enabled by default 
> for 2.11. Or outright removed.
> Update: it'll also require removing 0.8.x examples, because otherwise the 
> example module has to be split.
> While not necessarily connected, it's probably a decent point to declare 0.8 
> deprecated. And that means declaring 0.10 (the other API left) as stable.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21944) Watermark on window column is wrong

2017-09-08 Thread Kevin Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158948#comment-16158948
 ] 

Kevin Zhang commented on SPARK-21944:
-

[~mgaido] Do you mean the following way by saying "define the watermark on the 
column 'time' "?

{code:java}
val counts = events.select(window($"time", "5 seconds"), $"time", $"id")
  .withWatermark("time", "10 seconds")
  .dropDuplicates("id", "window")
  .groupBy("window")
  .count
{code}
I don't know whether this is right, because the documentation indicates we 
should use the same column as is used in the watermark, that is, the "time" 
column (which is not what I want). I tried this way and the application doesn't 
throw any exception, but it didn't drop events older than the watermark as 
expected. In the following example, after the batch containing an event with 
time=1504774540 (2017/9/7 16:55:40 CST) is processed (the watermark should be 
adjusted to 2017/9/7 16:55:30 CST), I send an event with 
time=1504745724 (2017/9/7 8:55:24 CST), and this event is processed instead of 
being dropped as expected.

{code:java}
+-+-+   
|window   |count|
+-+-+
|[2017-09-07 16:55:40.0,2017-09-07 16:55:45.0]|1|
|[2017-09-07 08:55:20.0,2017-09-07 08:55:25.0]|1|
|[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|3|
+-+-+

{min=2017-09-07T00:55:24.000Z, avg=2017-09-07T00:55:24.000Z, 
watermark=2017-09-07T08:55:30.000Z, max=2017-09-07T00:55:24.000Z}
{code}

One important thing I have to mention is that my time zone is CST, not UTC. 
The start and end times in the window are right, but the watermark is reported 
in UTC. I don't know whether this has any influence.

If I didn't make everything clear, please point it out and I will explain. Thanks.
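
A minimal sketch of the kind of listener mentioned above (a sketch only; it 
assumes the listener does nothing beyond printing the per-batch event-time 
stats, which may differ from the actual listener used here):

{code}
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener._

// Prints the event time stats (min/avg/max/watermark) reported for each batch,
// as shown in the output above.
spark.streams.addListener(new StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit = {}
  override def onQueryProgress(event: QueryProgressEvent): Unit =
    println(event.progress.eventTime)
  override def onQueryTerminated(event: QueryTerminatedEvent): Unit = {}
})
{code}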




> Watermark on window column is wrong
> ---
>
> Key: SPARK-21944
> URL: https://issues.apache.org/jira/browse/SPARK-21944
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Kevin Zhang
>
> When I use a watermark with dropDuplicates in the following way, the 
> watermark is calculated wrong
> {code:java}
> val counts = events.select(window($"time", "5 seconds"), $"time", $"id")
>   .withWatermark("window", "10 seconds")
>   .dropDuplicates("id", "window")
>   .groupBy("window")
>   .count
> {code}
> where events is a dataframe with a timestamp column "time" and long column 
> "id".
> I registered a listener to print the event time stats in each batch, and the 
> results is like the following
> {code:shell}
> ---
> Batch: 0
> ---
> +-+-+ 
>   
> |window   |count|
> +-+-+
> |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|3|
> +-+-+
> {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, 
> watermark=1970-01-01T00:00:00.000Z, max=1970-01-01T19:05:19.476Z}
> {watermark=1970-01-01T00:00:00.000Z}
> {watermark=1970-01-01T00:00:00.000Z}
> {watermark=1970-01-01T00:00:00.000Z}
> {watermark=1970-01-01T00:00:00.000Z}
> ---
> Batch: 1
> ---
> +-+-+ 
>   
> |window   |count|
> +-+-+
> |[2017-09-07 16:55:40.0,2017-09-07 16:55:45.0]|1|
> |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|3|
> +-+-+
> {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, 
> watermark=1970-01-01T19:05:09.476Z, max=1970-01-01T19:05:19.476Z}
> {watermark=1970-01-01T19:05:09.476Z}
> ---
> Batch: 2
> ---
> +-+-+ 
>   
> |window   |count|
> +-+-+
> |[2017-09-07 16:55:40.0,2017-09-07 16:55:45.0]|1|
> |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|4|
> +-+-+
> {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, 
> watermark=1970-01-01T19:05:09.476Z, max=1970-01-01T19:05:19.476Z}
> {watermark=1970-01-01T19:05:09.476Z}
> {code}
> As can be seen, the event time stats are wrong (they are always in 
> 1970-01-01), so the watermark 

[jira] [Commented] (SPARK-20958) Roll back parquet-mr 1.8.2 to parquet-1.8.1

2017-09-08 Thread Ryan Blue (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158924#comment-16158924
 ] 

Ryan Blue commented on SPARK-20958:
---

[~spiricalsalsaz], you only need to pin parquet-avro, not the other Parquet 
libs. This is caused by a bug in Parquet that has been fixed in 1.8.2, so you 
want the 1.8.2 version of parquet-hadoop, but the 1.8.1 version of 
parquet-avro. Alternatively, you can shade and relocate the version of Avro you 
want and use parquet-avro 1.8.2. That's what I'd recommend.
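
A minimal build.sbt sketch of that pinning approach (this assumes an sbt 
project already depending on Spark 2.2.0; only the parquet-avro pin is the 
relevant part):

{code}
// Pin only parquet-avro back to 1.8.1; leave parquet-hadoop and the other
// Parquet modules at the 1.8.2 that Spark 2.2.0 brings in transitively.
libraryDependencies ++= Seq(
  "org.apache.spark"   %% "spark-sql"    % "2.2.0" % "provided",
  "org.apache.parquet"  % "parquet-avro" % "1.8.1"
)
{code}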

> Roll back parquet-mr 1.8.2 to parquet-1.8.1
> ---
>
> Key: SPARK-20958
> URL: https://issues.apache.org/jira/browse/SPARK-20958
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Cheng Lian
>Assignee: Cheng Lian
>  Labels: release-notes, release_notes, releasenotes
>
> We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on 
> avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 
> and avro 1.7.7 used by spark-core 2.2.0-rc2.
> Basically, Spark 2.2.0-rc2 introduced two incompatible versions of avro 
> (1.7.7 and 1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the 
> reasons mentioned in [PR 
> #17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. 
> Therefore, we don't really have many choices here and have to roll back 
> parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21128) Running R tests multiple times failed due to pre-existing "spark-warehouse" / "metastore_db"

2017-09-08 Thread Felix Cheung (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Felix Cheung updated SPARK-21128:
-
Target Version/s: 2.2.1, 2.3.0  (was: 2.3.0)
   Fix Version/s: 2.2.1

> Running R tests multiple times failed due to pre-existing "spark-warehouse" / 
> "metastore_db"
> ---
>
> Key: SPARK-21128
> URL: https://issues.apache.org/jira/browse/SPARK-21128
> Project: Spark
>  Issue Type: Test
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.2.1, 2.3.0
>
>
> Currently, running R tests multiple times fails due to pre-existing 
> "spark-warehouse" / "metastore_db" directories, as below:
> {code}
> SparkSQL functions: Spark package found in SPARK_HOME: .../spark
> ...1234...
> {code}
> {code}
> Failed 
> -
> 1. Failure: No extra files are created in SPARK_HOME by starting session and 
> making calls (@test_sparkSQL.R#3384)
> length(list1) not equal to length(list2).
> 1/1 mismatches
> [1] 25 - 23 == 2
> 2. Failure: No extra files are created in SPARK_HOME by starting session and 
> making calls (@test_sparkSQL.R#3384)
> sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE).
> 10/25 mismatches
> x[16]: "metastore_db"
> y[16]: "pkg"
> x[17]: "pkg"
> y[17]: "R"
> x[18]: "R"
> y[18]: "README.md"
> x[19]: "README.md"
> y[19]: "run-tests.sh"
> x[20]: "run-tests.sh"
> y[20]: "SparkR_2.2.0.tar.gz"
> x[21]: "metastore_db"
> y[21]: "pkg"
> x[22]: "pkg"
> y[22]: "R"
> x[23]: "R"
> y[23]: "README.md"
> x[24]: "README.md"
> y[24]: "run-tests.sh"
> x[25]: "run-tests.sh"
> y[25]: "SparkR_2.2.0.tar.gz"
> 3. Failure: No extra files are created in SPARK_HOME by starting session and 
> making calls (@test_sparkSQL.R#3388)
> length(list1) not equal to length(list2).
> 1/1 mismatches
> [1] 25 - 23 == 2
> 4. Failure: No extra files are created in SPARK_HOME by starting session and 
> making calls (@test_sparkSQL.R#3388)
> sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE).
> 10/25 mismatches
> x[16]: "metastore_db"
> y[16]: "pkg"
> x[17]: "pkg"
> y[17]: "R"
> x[18]: "R"
> y[18]: "README.md"
> x[19]: "README.md"
> y[19]: "run-tests.sh"
> x[20]: "run-tests.sh"
> y[20]: "SparkR_2.2.0.tar.gz"
> x[21]: "metastore_db"
> y[21]: "pkg"
> x[22]: "pkg"
> y[22]: "R"
> x[23]: "R"
> y[23]: "README.md"
> x[24]: "README.md"
> y[24]: "run-tests.sh"
> x[25]: "run-tests.sh"
> y[25]: "SparkR_2.2.0.tar.gz"
> DONE 
> ===
> {code}
> It looks like we should remove both "spark-warehouse" and "metastore_db" 
> _before_ listing files into {{sparkRFilesBefore}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21946) Flaky test: InMemoryCatalogedDDLSuite.`alter table: rename cached table`

2017-09-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-21946.
-
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.1

> Flaky test: InMemoryCatalogedDDLSuite.`alter table: rename cached table`
> 
>
> Key: SPARK-21946
> URL: https://issues.apache.org/jira/browse/SPARK-21946
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: Dongjoon Hyun
>Assignee: Kazuaki Ishizaki
>Priority: Minor
> Fix For: 2.2.1, 2.3.0
>
>
> According to the [Apache Spark Jenkins 
> History|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/lastCompletedBuild/testReport/org.apache.spark.sql.execution.command/InMemoryCatalogedDDLSuite/alter_table__rename_cached_table/history/]
> InMemoryCatalogedDDLSuite.`alter table: rename cached table` is very flaky. 
> We had better stabilize it.
> {code}
> - alter table: rename cached table !!! CANCELED !!!
>   Array([2,2], [1,1]) did not equal Array([1,1], [2,2]) bad test: wrong data 
> (DDLSuite.scala:786)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21946) Flaky test: InMemoryCatalogedDDLSuite.`alter table: rename cached table`

2017-09-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-21946:
---

Assignee: Kazuaki Ishizaki

> Flaky test: InMemoryCatalogedDDLSuite.`alter table: rename cached table`
> 
>
> Key: SPARK-21946
> URL: https://issues.apache.org/jira/browse/SPARK-21946
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: Dongjoon Hyun
>Assignee: Kazuaki Ishizaki
>Priority: Minor
> Fix For: 2.2.1, 2.3.0
>
>
> According to the [Apache Spark Jenkins 
> History|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/lastCompletedBuild/testReport/org.apache.spark.sql.execution.command/InMemoryCatalogedDDLSuite/alter_table__rename_cached_table/history/]
> InMemoryCatalogedDDLSuite.`alter table: rename cached table` is very flaky. 
> We had better stabilize it.
> {code}
> - alter table: rename cached table !!! CANCELED !!!
>   Array([2,2], [1,1]) did not equal Array([1,1], [2,2]) bad test: wrong data 
> (DDLSuite.scala:786)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21936) backward compatibility test framework for HiveExternalCatalog

2017-09-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-21936:

Fix Version/s: 2.2.1

> backward compatibility test framework for HiveExternalCatalog
> -
>
> Key: SPARK-21936
> URL: https://issues.apache.org/jira/browse/SPARK-21936
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.2.1, 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18085) SPIP: Better History Server scalability for many / large applications

2017-09-08 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158865#comment-16158865
 ] 

Marcelo Vanzin commented on SPARK-18085:


Do you really mean fixed, as in you're not seeing it anymore, or introduced?

Anyway, I'll take a look; might be something that changed since I last rebased 
my branch.

> SPIP: Better History Server scalability for many / large applications
> -
>
> Key: SPARK-18085
> URL: https://issues.apache.org/jira/browse/SPARK-18085
> Project: Spark
>  Issue Type: Umbrella
>  Components: Spark Core, Web UI
>Affects Versions: 2.0.0
>Reporter: Marcelo Vanzin
>  Labels: SPIP
> Attachments: screenshot-1.png, screenshot-2.png, spark_hs_next_gen.pdf
>
>
> It's a known fact that the History Server currently has some annoying issues 
> when serving lots of applications, and when serving large applications.
> I'm filing this umbrella to track work related to addressing those issues. 
> I'll be attaching a document shortly describing the issues and suggesting a 
> path to solving them.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21190) SPIP: Vectorized UDFs in Python

2017-09-08 Thread Li Jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158845#comment-16158845
 ] 

Li Jin commented on SPARK-21190:


To [~bryanc]'s point, PR [#18659|https://github.com/apache/spark/pull/18659] 
and PR [#19147|https://github.com/apache/spark/pull/19147] are largely similar, 
and it makes sense not to keep two PRs for the same thing.

Also, what I am curious about is what kind of guidelines to follow to avoid 
such duplicate work in the future. IMHO, [~bryanc] linked his PR to 
this JIRA a while back and has been actively engaging in all discussions, so I 
am not sure why we need a similar second PR in this case. (Of course, 
if people think #18659 and #19147 are very different, that's another 
story.)

I have worked together with [~bryanc] in the past on SPARK-13534, and 
collaborating on the same branch worked quite well for us. Maybe that's 
something we should encourage?



> SPIP: Vectorized UDFs in Python
> ---
>
> Key: SPARK-21190
> URL: https://issues.apache.org/jira/browse/SPARK-21190
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>  Labels: SPIP
> Attachments: SPIPVectorizedUDFsforPython (1).pdf
>
>
> *Background and Motivation*
> Python is one of the most popular programming languages among Spark users. 
> Spark currently exposes a row-at-a-time interface for defining and executing 
> user-defined functions (UDFs). This introduces high overhead in serialization 
> and deserialization, and also makes it difficult to leverage Python libraries 
> (e.g. numpy, Pandas) that are written in native code.
>  
> This proposal advocates introducing new APIs to support vectorized UDFs in 
> Python, in which a block of data is transferred over to Python in some 
> columnar format for execution.
>  
>  
> *Target Personas*
> Data scientists, data engineers, library developers.
>  
> *Goals*
> - Support vectorized UDFs that apply on chunks of the data frame
> - Low system overhead: Substantially reduce serialization and deserialization 
> overhead when compared with row-at-a-time interface
> - UDF performance: Enable users to leverage native libraries in Python (e.g. 
> numpy, Pandas) for data manipulation in these UDFs
>  
> *Non-Goals*
> The following are explicitly out of scope for the current SPIP, and should be 
> done in future SPIPs. Nonetheless, it would be good to consider these future 
> use cases during API design, so we can achieve some consistency when rolling 
> out new APIs.
>  
> - Define block oriented UDFs in other languages (that are not Python).
> - Define aggregate UDFs
> - Tight integration with machine learning frameworks
>  
> *Proposed API Changes*
> The following sketches some possibilities. I haven’t spent a lot of time 
> thinking about the API (wrote it down in 5 mins) and I am not attached to 
> this design at all. The main purpose of the SPIP is to get feedback on use 
> cases and see how they can impact API design.
>  
> A few things to consider are:
>  
> 1. Python is dynamically typed, whereas DataFrames/SQL requires static, 
> analysis time typing. This means users would need to specify the return type 
> of their UDFs.
>  
> 2. Ratio of input rows to output rows. We propose initially we require number 
> of output rows to be the same as the number of input rows. In the future, we 
> can consider relaxing this constraint with support for vectorized aggregate 
> UDFs.
> 3. How do we handle null values, since Pandas doesn't have the concept of 
> nulls?
>  
> Proposed API sketch (using examples):
>  
> Use case 1. A function that defines all the columns of a DataFrame (similar 
> to a “map” function):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_on_entire_df(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A Pandas data frame.
>   """
>   input[c] = input[a] + input[b]
>   input[d] = input[a] - input[b]
>   return input
>  
> spark.range(1000).selectExpr("id a", "id / 2 b")
>   .mapBatches(my_func_on_entire_df)
> {code}
>  
> Use case 2. A function that defines only one column (similar to existing 
> UDFs):
>  
> {code}
> @spark_udf(some way to describe the return schema)
> def my_func_that_returns_one_column(input):
>   """ Some user-defined function.
>  
>   :param input: A Pandas DataFrame with two columns, a and b.
>   :return: :class: A numpy array
>   """
>   return input[a] + input[b]
>  
> my_func = udf(my_func_that_returns_one_column)
>  
> df = spark.range(1000).selectExpr("id a", "id / 2 b")
> df.withColumn("c", my_func(df.a, df.b))
> {code}
>  
>  
>  
> *Optional Design Sketch*
> I’m more 

[jira] [Created] (SPARK-21958) Attempting to save large Word2Vec model hangs driver in constant GC.

2017-09-08 Thread Travis Hegner (JIRA)
Travis Hegner created SPARK-21958:
-

 Summary: Attempting to save large Word2Vec model hangs driver in 
constant GC.
 Key: SPARK-21958
 URL: https://issues.apache.org/jira/browse/SPARK-21958
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.2.0
 Environment: Running spark on yarn, hadoop 2.7.2 provided by the 
cluster
Reporter: Travis Hegner


In the new version of Word2Vec, the model saving was modified to estimate an 
appropriate number of partitions based on the kryo buffer size. This is a great 
improvement, but there is a caveat for very large models.

Each {{(word, vector)}} tuple goes through a transformation to a local case 
class of {{Data(word, vector)}}... I can only assume this is for the kryo 
serialization process. The new version of the code iterates over the entire 
vocabulary to do this transformation (the old version wrapped the entire datum) 
in the driver's heap, only to have the result then distributed to the cluster 
to be written into its parquet files.

With extremely large vocabularies (~2 million docs, with uni-grams, bi-grams, 
and tri-grams), that local driver transformation causes the driver to hang 
indefinitely in GC; I can only assume it's generating millions of short-lived 
objects which can't be collected fast enough.

Perhaps I'm overlooking something, but it seems to me that since the result is 
distributed over the cluster to be saved _after_ the transformation anyway, we 
may as well distribute it _first_, allowing the cluster resources to do the 
transformation more efficiently, and then write the parquet file from there.

I have a patch implemented, and am in the process of testing it at scale. I 
will open a pull request when I feel that the patch is successfully resolving 
the issue, and after making sure that it passes unit tests.
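
A minimal sketch of the distribute-first idea (the {{Data}} case class, method 
name, and partition count below are only illustrative; they are not the actual 
Word2Vec code or the patch mentioned above):

{code}
import org.apache.spark.sql.SparkSession

case class Data(word: String, vector: Array[Float])

// Parallelize the raw (word, vector) pairs first, so the wrapping into Data
// objects happens on the executors rather than in the driver's heap, and then
// write the resulting DataFrame out as parquet.
def saveVectors(spark: SparkSession,
                wordVectors: Map[String, Array[Float]],
                numPartitions: Int,
                path: String): Unit = {
  import spark.implicits._
  spark.sparkContext
    .parallelize(wordVectors.toSeq, numPartitions)
    .map { case (word, vector) => Data(word, vector) }
    .toDF()
    .write
    .parquet(path)
}
{code}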



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source

2017-09-08 Thread Dominic Ricard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158802#comment-16158802
 ] 

Dominic Ricard commented on SPARK-21067:


[~zhangxin0112zx] Our solution was to migrate the CTAS code to use Parquet... 
CTAS for Hive tables is broken when using the Thrift server.

Still looking forward to a fix for this issue...
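
A sketch of what that Parquet-based workaround can look like (shown through 
{{spark.sql()}}; whether this matches the exact migration done here is an 
assumption):

{code}
// A datasource (Parquet) table takes Spark's own write path instead of the
// Hive "move source to destination" step that fails above.
spark.sql("CREATE TABLE dricard.test USING parquet AS SELECT 1 AS col1")
spark.sql("SELECT * FROM dricard.test").show()
{code}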

> Thrift Server - CTAS fail with Unable to move source
> 
>
> Key: SPARK-21067
> URL: https://issues.apache.org/jira/browse/SPARK-21067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
> Environment: Yarn
> Hive MetaStore
> HDFS (HA)
>Reporter: Dominic Ricard
>
> After upgrading our Thrift cluster to 2.1.1, we ran into an issue where CTAS 
> would fail, sometimes...
> Most of the time, the CTAS would work only once, after starting the thrift 
> server. After that, dropping the table and re-issuing the same CTAS would 
> fail with the following message (sometimes it fails right away, sometimes it 
> works for a long period of time):
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1//tmp/hive-staging/thrift_hive_2017-06-12_16-56-18_464_7598877199323198104-31/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> We have already found the following Jira 
> (https://issues.apache.org/jira/browse/SPARK-11021) which state that the 
> {{hive.exec.stagingdir}} had to be added in order for Spark to be able to 
> handle CREATE TABLE properly as of 2.0. As you can see in the error, we have 
> ours set to "/tmp/hive-staging/\{user.name\}"
> Same issue with INSERT statements:
> {noformat}
> CREATE TABLE IF NOT EXISTS dricard.test (col1 int); INSERT INTO TABLE 
> dricard.test SELECT 1;
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-12_20-41-12_964_3086448130033637241-16/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> This worked fine in 1.6.2, which we currently run in our Production 
> Environment but since 2.0+, we haven't been able to CREATE TABLE consistently 
> on the cluster.
> SQL to reproduce issue:
> {noformat}
> DROP SCHEMA IF EXISTS dricard CASCADE; 
> CREATE SCHEMA dricard; 
> CREATE TABLE dricard.test (col1 int); 
> INSERT INTO TABLE dricard.test SELECT 1; 
> SELECT * from dricard.test; 
> DROP TABLE dricard.test; 
> CREATE TABLE dricard.test AS select 1 as `col1`;
> SELECT * from dricard.test
> {noformat}
> Thrift server usually fails at INSERT...
> Tried the same procedure in a spark context using spark.sql() and didn't 
> encounter the same issue.
> Full stack Trace:
> {noformat}
> 17/06/14 14:52:18 ERROR thriftserver.SparkExecuteStatementOperation: Error 
> executing query, currentState RUNNING,
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-14_14-52-18_521_5906917519254880890-5/-ext-1/part-0
>  to desti
> nation hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0;
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:374)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
> at 

[jira] [Assigned] (SPARK-21957) Add current_user function

2017-09-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21957:


Assignee: Apache Spark

> Add current_user function
> -
>
> Key: SPARK-21957
> URL: https://issues.apache.org/jira/browse/SPARK-21957
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Marco Gaido
>Assignee: Apache Spark
>Priority: Minor
>
> Spark doesn't support the {{current_user}} function.
> Although the user can be retrieved in other ways, the function would make it 
> easier to migrate existing Hive queries to Spark, and it can also be 
> convenient for people who only use SQL to interact with Spark.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21957) Add current_user function

2017-09-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158745#comment-16158745
 ] 

Apache Spark commented on SPARK-21957:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/19169

> Add current_user function
> -
>
> Key: SPARK-21957
> URL: https://issues.apache.org/jira/browse/SPARK-21957
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Marco Gaido
>Priority: Minor
>
> Spark doesn't support the {{current_user}} function.
> Although the user can be retrieved in other ways, the function would make it 
> easier to migrate existing Hive queries to Spark, and it can also be 
> convenient for people who only use SQL to interact with Spark.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21957) Add current_user function

2017-09-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21957:


Assignee: (was: Apache Spark)

> Add current_user function
> -
>
> Key: SPARK-21957
> URL: https://issues.apache.org/jira/browse/SPARK-21957
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Marco Gaido
>Priority: Minor
>
> Spark doesn't support the {{current_user}} function.
> Although the user can be retrieved in other ways, the function would make it 
> easier to migrate existing Hive queries to Spark, and it can also be 
> convenient for people who only use SQL to interact with Spark.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21957) Add current_user function

2017-09-08 Thread Marco Gaido (JIRA)
Marco Gaido created SPARK-21957:
---

 Summary: Add current_user function
 Key: SPARK-21957
 URL: https://issues.apache.org/jira/browse/SPARK-21957
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.2.0
Reporter: Marco Gaido
Priority: Minor


Spark doesn't support the {{current_user}} function.

Although the user can be retrieved in other ways, the function would make it 
easier to migrate existing Hive queries to Spark, and it can also be convenient 
for people who only use SQL to interact with Spark.
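
A small sketch of one of the "other ways" mentioned above (whether it matches 
the intended semantics of a SQL {{current_user}} function is an assumption):

{code}
// The user the Spark application is running as, available on the driver.
val user: String = spark.sparkContext.sparkUser
{code}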



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-20958) Roll back parquet-mr 1.8.2 to parquet-1.8.1

2017-09-08 Thread Anthony Dotterer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155387#comment-16155387
 ] 

Anthony Dotterer edited comment on SPARK-20958 at 9/8/17 2:27 PM:
--

As a user of Spark 2.2.0 that mixes usage of parquet-avro and avro, here are the 
exceptions that I hit. This will hopefully help search engines surface this 
library conflict more quickly for others.

{code}
java.lang.NoClassDefFoundError: org/apache/avro/LogicalType
at 
org.apache.parquet.avro.AvroParquetWriter.writeSupport(AvroParquetWriter.java:144)
at 
org.apache.parquet.avro.AvroParquetWriter.access$100(AvroParquetWriter.java:35)
at 
org.apache.parquet.avro.AvroParquetWriter$Builder.getWriteSupport(AvroParquetWriter.java:173)
...
Caused by: java.lang.ClassNotFoundException: org.apache.avro.LogicalType
{code}

Also, when attempting to pin parquet-avro to 1.8.1 with SBT, I get the 
following exception when attempting to write output:

{code}

java.lang.ExceptionInInitializerError
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:446)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:446)
at scala.collection.immutable.List.foreach(List.scala:381)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$.setSchema(ParquetWriteSupport.scala:446)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.prepareWrite(ParquetFileFormat.scala:112)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:142)
at 
org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:145)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at 
org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:438)
at 
org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:474)
at 
org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at 
org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
at 
org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:509)
...
Caused by: org.apache.parquet.schema.InvalidSchemaException: A group type can 
not be empty. Parquet does not support empty group without leaves. Empty group: 
spark_schema
at org.apache.parquet.schema.GroupType.(GroupType.java:92)

[jira] [Assigned] (SPARK-21956) Fetch up to max bytes when buf really released

2017-09-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21956:


Assignee: Apache Spark

> Fetch up to max bytes when buf really released
> --
>
> Key: SPARK-21956
> URL: https://issues.apache.org/jira/browse/SPARK-21956
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 2.1.0
>Reporter: zhoukang
>Assignee: Apache Spark
>
> Right now, ShuffleBlockFetcherIterator decreases bytesInFlight immediately 
> when a result is taken from the results queue, but the current result's 
> ByteBuf has not been released at that time, so direct memory may get a little 
> out of control.
> We should decrease bytesInFlight only when the current result's ByteBuf has 
> really been released.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21956) Fetch up to max bytes when buf really released

2017-09-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21956:


Assignee: (was: Apache Spark)

> Fetch up to max bytes when buf really released
> --
>
> Key: SPARK-21956
> URL: https://issues.apache.org/jira/browse/SPARK-21956
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 2.1.0
>Reporter: zhoukang
>
> Right now, ShuffleBlockFetcherIterator decreases bytesInFlight immediately 
> when a result is taken from the results queue, but the current result's 
> ByteBuf has not been released at that time, so direct memory may get a little 
> out of control.
> We should decrease bytesInFlight only when the current result's ByteBuf has 
> really been released.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21956) Fetch up to max bytes when buf really released

2017-09-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158548#comment-16158548
 ] 

Apache Spark commented on SPARK-21956:
--

User 'caneGuy' has created a pull request for this issue:
https://github.com/apache/spark/pull/19168

> Fetch up to max bytes when buf really released
> --
>
> Key: SPARK-21956
> URL: https://issues.apache.org/jira/browse/SPARK-21956
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Affects Versions: 2.1.0
>Reporter: zhoukang
>
> Right now, ShuffleBlockFetcherIterator decreases bytesInFlight immediately 
> when a result is taken from the results queue, but the current result's 
> ByteBuf has not been released at that time, so direct memory may get a little 
> out of control.
> We should decrease bytesInFlight only when the current result's ByteBuf has 
> really been released.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21956) Fetch up to max bytes when buf really released

2017-09-08 Thread zhoukang (JIRA)
zhoukang created SPARK-21956:


 Summary: Fetch up to max bytes when buf really released
 Key: SPARK-21956
 URL: https://issues.apache.org/jira/browse/SPARK-21956
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, Spark Core
Affects Versions: 2.1.0
Reporter: zhoukang


Right now, ShuffleBlockFetcherIterator decreases bytesInFlight immediately when 
a result is taken from the results queue, but the current result's ByteBuf has 
not been released at that time, so direct memory may get a little out of control.
We should decrease bytesInFlight only when the current result's ByteBuf has 
really been released.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21942) DiskBlockManager crashing when a root local folder has been externally deleted by OS

2017-09-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-21942:
--
Affects Version/s: (was: 2.2.1)
   (was: 2.3.0)
   (was: 3.0.0)
   (was: 2.0.2)
   (was: 1.6.3)
 Target Version/s:   (was: 2.3.0)
Fix Version/s: (was: 2.3.0)

> DiskBlockManager crashing when a root local folder has been externally 
> deleted by OS
> 
>
> Key: SPARK-21942
> URL: https://issues.apache.org/jira/browse/SPARK-21942
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 1.6.2, 2.0.0, 2.0.1, 2.1.0, 2.1.1, 2.2.0
>Reporter: Ruslan Shestopalyuk
>Priority: Minor
>  Labels: storage
>
> _DiskBlockManager_ has a notion of a "scratch" local folder(s), which can be 
> configured via _spark.local.dir_ option, and which defaults to the system's 
> _/tmp_. The hierarchy is two-level, e.g. _/blockmgr-XXX.../YY_, where the 
> _YY_ part is a hash bit, to spread files evenly.
> Function _DiskBlockManager.getFile_ expects the top level directories 
> (_blockmgr-XXX..._) to always exist (they get created once, when the spark 
> context is first created), otherwise it would fail with a message like:
> {code}
> ... java.io.IOException: Failed to create local dir in /tmp/blockmgr-XXX.../YY
> {code}
> However, this may not always be the case.
> In particular, *if it's the default _/tmp_ folder*, there can be different 
> strategies of automatically removing files from it, depending on the OS:
> * at boot time
> * on a regular basis (e.g. once per day via a system cron job)
> * based on the file age
> The symptom is that after the process (in our case, a service) using Spark has 
> been running for a while (a few days), it may not be able to load files anymore, 
> since the top-level scratch directories are not there and 
> _DiskBlockManager.getFile_ crashes.
> Please note that this is different from people arbitrarily removing files 
> manually.
> The facts are that _/tmp_ is the default in the Spark config and that the 
> system has the right to tamper with its contents, and will do so with high 
> probability after some period of time.
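
A common mitigation, as a hedged sketch (the path below is only an example; it 
works around the OS cleanup rather than fixing {{DiskBlockManager.getFile}} 
itself):

{code}
import org.apache.spark.SparkConf

// Point the scratch directories away from /tmp so that boot-time, cron-based,
// or age-based cleanup does not remove the blockmgr-* folders underneath a
// long-running application.
val conf = new SparkConf()
  .setAppName("long-running-service")
  .set("spark.local.dir", "/data/spark-scratch")
{code}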



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21905) ClassCastException when call sqlContext.sql on temp table

2017-09-08 Thread Kazuaki Ishizaki (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158496#comment-16158496
 ] 

Kazuaki Ishizaki commented on SPARK-21905:
--

When I ran the following code (I do not have the PointUDT and Point classes), I 
could not reproduce the exception using the master branch or branch-2.2.

{code}
...
import org.apache.spark.sql.catalyst.encoders._
...
import org.apache.spark.sql.types._

  test("SPARK-21905") {
val schema = StructType(List(
  StructField("name", DataTypes.StringType, true),
  StructField("location", new ExamplePointUDT, true)))

val rowRdd = sqlContext.sparkContext.parallelize(Seq("bluejoe", "alex"), 4)
  .map({ x: String => Row.fromSeq(Seq(x, new ExamplePoint(100, 100))) })
val dataFrame = sqlContext.createDataFrame(rowRdd, schema)
dataFrame.createOrReplaceTempView("person")
sqlContext.sql("SELECT * FROM person").foreach(println(_))
  }
{code}

> ClassCastException when call sqlContext.sql on temp table
> -
>
> Key: SPARK-21905
> URL: https://issues.apache.org/jira/browse/SPARK-21905
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: bluejoe
>
> {code:java}
> val schema = StructType(List(
>   StructField("name", DataTypes.StringType, true),
>   StructField("location", new PointUDT, true)))
> val rowRdd = sqlContext.sparkContext.parallelize(Seq("bluejoe", "alex"), 
> 4).map({ x: String ⇒ Row.fromSeq(Seq(x, Point(100, 100))) });
> val dataFrame = sqlContext.createDataFrame(rowRdd, schema)
> dataFrame.createOrReplaceTempView("person");
> sqlContext.sql("SELECT * FROM person").foreach(println(_));
> {code}
> the last statement throws exception:
> {code:java}
> Caused by: java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericRow cannot be cast to 
> org.apache.spark.sql.catalyst.InternalRow
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalIfFalseExpr1$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287)
>   ... 18 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-21951) Unable to add the new column and writing into the Hive using spark

2017-09-08 Thread jalendhar Baddam (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jalendhar Baddam reopened SPARK-21951:
--

This still exists, and it's throwing an AnalysisException.

> Unable to add the new column and writing into the Hive using spark
> --
>
> Key: SPARK-21951
> URL: https://issues.apache.org/jira/browse/SPARK-21951
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.1.1
>Reporter: jalendhar Baddam
>
> I am adding one new column to an existing Dataset and am unable to write it 
> into Hive using Spark.
> Ex: Dataset<Row> ds = spark.sql("select * from Table");
> ds = ds.withColumn("newColumn", newColumnvalues);
> ds.write().mode("overwrite").format("parquet").saveAsTable("Table"); 
> // Here I am getting the exception
> I am loading the table from Hive using Spark, adding the new column to 
> that Dataset, and writing the same table back into Hive with the "overwrite" 
> option.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-21952) Unable to load the csv file into Dataset using Spark with java

2017-09-08 Thread jalendhar Baddam (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

jalendhar Baddam reopened SPARK-21952:
--

This still exists. Please re-check.

> Unable to load the csv file into Dataset  using Spark with java
> ---
>
> Key: SPARK-21952
> URL: https://issues.apache.org/jira/browse/SPARK-21952
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.1.1
>Reporter: jalendhar Baddam
>
> Hi,
> I am trying to load a CSV file using Spark with Java. The CSV file 
> contains one row with two end-of-line characters. I am attaching the CSV 
> file and placing the sample CSV file content.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4502) Spark SQL reads unneccesary nested fields from Parquet

2017-09-08 Thread Andriy Kushnir (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158444#comment-16158444
 ] 

Andriy Kushnir commented on SPARK-4502:
---

Just tried this patch on Spark 2.2.0.
There is a *really huge* performance boost, roughly 5× to 40×.
[~michael], thanks!

> Spark SQL reads unneccesary nested fields from Parquet
> --
>
> Key: SPARK-4502
> URL: https://issues.apache.org/jira/browse/SPARK-4502
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Liwen Sun
>Priority: Critical
>
> When reading a field of a nested column from Parquet, SparkSQL reads and 
> assembles all the fields of that nested column. This is unnecessary, as 
> Parquet supports fine-grained field reads out of a nested column. This may 
> degrade the performance significantly when a nested column has many fields. 
> For example, I loaded json tweets data into SparkSQL and ran the following 
> query:
> {{SELECT User.contributors_enabled from Tweets;}}
> User is a nested structure that has 38 primitive fields (for Tweets schema, 
> see: https://dev.twitter.com/overview/api/tweets), here is the log message:
> {{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed 
> 385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 
> cell/ms}}
> For comparison, I also ran:
> {{SELECT User FROM Tweets;}}
> And here is the log message:
> {{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed 
> 385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}}
> So both queries load 38 columns from Parquet, while the first query only 
> needs 1 column. I also measured the bytes read within Parquet. In these two 
> cases, the same number of bytes (99365194 bytes) were read. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21944) Watermark on window column is wrong

2017-09-08 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158406#comment-16158406
 ] 

Marco Gaido edited comment on SPARK-21944 at 9/8/17 10:31 AM:
--

[~KevinZwx] you should define the watermark on the column {{"time"}}, not the 
column {{"window"}}


was (Author: mgaido):
[~KevinZwx] you should define the watermark on the column `"time"`, not the 
column `"window"`

> Watermark on window column is wrong
> ---
>
> Key: SPARK-21944
> URL: https://issues.apache.org/jira/browse/SPARK-21944
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Kevin Zhang
>
> When I use a watermark with dropDuplicates in the following way, the 
> watermark is calculated wrong
> {code:java}
> val counts = events.select(window($"time", "5 seconds"), $"time", $"id")
>   .withWatermark("window", "10 seconds")
>   .dropDuplicates("id", "window")
>   .groupBy("window")
>   .count
> {code}
> where events is a dataframe with a timestamp column "time" and a long column 
> "id".
> I registered a listener to print the event time stats in each batch, and the 
> results are like the following
> {code:shell}
> ---
> Batch: 0
> ---
> +-+-+ 
>   
> |window   |count|
> +-+-+
> |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|3|
> +-+-+
> {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, 
> watermark=1970-01-01T00:00:00.000Z, max=1970-01-01T19:05:19.476Z}
> {watermark=1970-01-01T00:00:00.000Z}
> {watermark=1970-01-01T00:00:00.000Z}
> {watermark=1970-01-01T00:00:00.000Z}
> {watermark=1970-01-01T00:00:00.000Z}
> ---
> Batch: 1
> ---
> +-+-+ 
>   
> |window   |count|
> +-+-+
> |[2017-09-07 16:55:40.0,2017-09-07 16:55:45.0]|1|
> |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|3|
> +-+-+
> {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, 
> watermark=1970-01-01T19:05:09.476Z, max=1970-01-01T19:05:19.476Z}
> {watermark=1970-01-01T19:05:09.476Z}
> ---
> Batch: 2
> ---
> +-+-+ 
>   
> |window   |count|
> +-+-+
> |[2017-09-07 16:55:40.0,2017-09-07 16:55:45.0]|1|
> |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|4|
> +-+-+
> {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, 
> watermark=1970-01-01T19:05:09.476Z, max=1970-01-01T19:05:19.476Z}
> {watermark=1970-01-01T19:05:09.476Z}
> {code}
> As can be seen, the event time stats are wrong (they are always in 
> 1970-01-01), so the watermark is calculated incorrectly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19090) Dynamic Resource Allocation not respecting spark.executor.cores

2017-09-08 Thread Carlos Vicenti (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16157932#comment-16157932
 ] 

Carlos Vicenti edited comment on SPARK-19090 at 9/8/17 10:19 AM:
-

I have found the same issue while using Hive On Spark (on Yarn) and 
spark.dynamicAllocation.enabled set to true
{noformat}
SET spark.executor.cores=4;
SET spark.executor.memory=21G;
SET spark.yarn.executor.memoryOverhead=3813;
{noformat}

From the application logs:
{noformat}
17/09/08 00:30:34 INFO yarn.YarnAllocator: Will request 1 executor containers, 
each with 6 cores and 25317 MB memory including 3813 MB overhead
{noformat}

As mentioned above, this does not happen if I set 
spark.dynamicAllocation.enabled to false.
I'm using v1.6.
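
For what it's worth, a quick way to double-check what the driver actually received (illustrative sketch, run inside the application; {{sc}} is the SparkContext):

{code:scala}
// Illustrative sketch: print the executor sizing as seen by the driver.
println(sc.getConf.get("spark.executor.cores", "not set"))
println(sc.getConf.get("spark.executor.memory", "not set"))
println(sc.getConf.get("spark.dynamicAllocation.enabled", "false"))
{code}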


was (Author: cvicenti):
I have found the same issue while using Hive On Spark (on Yarn) and 
spark.dynamicAllocation.enabled set to true
{noformat}
SET spark.executor.cores=4;
SET spark.executor.memory=21G;
SET spark.yarn.executor.memoryOverhead=3813;
{noformat}

From the application logs:
{noformat}
17/09/08 00:30:34 INFO yarn.YarnAllocator: Will request 1 executor containers, 
each with 6 cores and 25317 MB memory including 3813 MB overhead
{noformat}

As mentioned above, this does not happen if I set 
spark.dynamicAllocation.enabled to false.

> Dynamic Resource Allocation not respecting spark.executor.cores
> ---
>
> Key: SPARK-19090
> URL: https://issues.apache.org/jira/browse/SPARK-19090
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.2, 1.6.1, 2.0.1
>Reporter: nirav patel
>
> When enabling dynamic scheduling with YARN, I see that all executors use 
> only 1 core even if I set "spark.executor.cores" to 6. If dynamic scheduling 
> is disabled, each executor has 6 cores, i.e. it respects 
> "spark.executor.cores". I have tested this against Spark 1.5; I think the 
> behavior will be the same with 2.x as well.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21944) Watermark on window column is wrong

2017-09-08 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158406#comment-16158406
 ] 

Marco Gaido edited comment on SPARK-21944 at 9/8/17 9:57 AM:
-

[~KevinZwx] you should define the watermark on the column `"time"`, not the 
column `"window"`


was (Author: mgaido):
[~kevinzhang] you should define the watermark on the column `"time"`, not the 
column `"window"`

> Watermark on window column is wrong
> ---
>
> Key: SPARK-21944
> URL: https://issues.apache.org/jira/browse/SPARK-21944
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Kevin Zhang
>
> When I use a watermark with dropDuplicates in the following way, the 
> watermark is calculated incorrectly
> {code:java}
> val counts = events.select(window($"time", "5 seconds"), $"time", $"id")
>   .withWatermark("window", "10 seconds")
>   .dropDuplicates("id", "window")
>   .groupBy("window")
>   .count
> {code}
> where events is a dataframe with a timestamp column "time" and a long column 
> "id".
> I registered a listener to print the event time stats in each batch, and the 
> results are like the following
> {code:shell}
> ---
> Batch: 0
> ---
> +-+-+ 
>   
> |window   |count|
> +-+-+
> |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|3|
> +-+-+
> {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, 
> watermark=1970-01-01T00:00:00.000Z, max=1970-01-01T19:05:19.476Z}
> {watermark=1970-01-01T00:00:00.000Z}
> {watermark=1970-01-01T00:00:00.000Z}
> {watermark=1970-01-01T00:00:00.000Z}
> {watermark=1970-01-01T00:00:00.000Z}
> ---
> Batch: 1
> ---
> +-+-+ 
>   
> |window   |count|
> +-+-+
> |[2017-09-07 16:55:40.0,2017-09-07 16:55:45.0]|1|
> |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|3|
> +-+-+
> {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, 
> watermark=1970-01-01T19:05:09.476Z, max=1970-01-01T19:05:19.476Z}
> {watermark=1970-01-01T19:05:09.476Z}
> ---
> Batch: 2
> ---
> +-+-+ 
>   
> |window   |count|
> +-+-+
> |[2017-09-07 16:55:40.0,2017-09-07 16:55:45.0]|1|
> |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|4|
> +-+-+
> {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, 
> watermark=1970-01-01T19:05:09.476Z, max=1970-01-01T19:05:19.476Z}
> {watermark=1970-01-01T19:05:09.476Z}
> {code}
> As can be seen, the event time stats are wrong (they are always in 
> 1970-01-01), so the watermark is calculated incorrectly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21944) Watermark on window column is wrong

2017-09-08 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158406#comment-16158406
 ] 

Marco Gaido commented on SPARK-21944:
-

[~kevinzhang] you should define the watermark on the column `"time"`, not the 
column `"window"`

> Watermark on window column is wrong
> ---
>
> Key: SPARK-21944
> URL: https://issues.apache.org/jira/browse/SPARK-21944
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Kevin Zhang
>
> When I use a watermark with dropDuplicates in the following way, the 
> watermark is calculated incorrectly
> {code:java}
> val counts = events.select(window($"time", "5 seconds"), $"time", $"id")
>   .withWatermark("window", "10 seconds")
>   .dropDuplicates("id", "window")
>   .groupBy("window")
>   .count
> {code}
> where events is a dataframe with a timestamp column "time" and a long column 
> "id".
> I registered a listener to print the event time stats in each batch, and the 
> results are like the following
> {code:shell}
> ---
> Batch: 0
> ---
> +-+-+ 
>   
> |window   |count|
> +-+-+
> |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|3|
> +-+-+
> {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, 
> watermark=1970-01-01T00:00:00.000Z, max=1970-01-01T19:05:19.476Z}
> {watermark=1970-01-01T00:00:00.000Z}
> {watermark=1970-01-01T00:00:00.000Z}
> {watermark=1970-01-01T00:00:00.000Z}
> {watermark=1970-01-01T00:00:00.000Z}
> ---
> Batch: 1
> ---
> +-+-+ 
>   
> |window   |count|
> +-+-+
> |[2017-09-07 16:55:40.0,2017-09-07 16:55:45.0]|1|
> |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|3|
> +-+-+
> {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, 
> watermark=1970-01-01T19:05:09.476Z, max=1970-01-01T19:05:19.476Z}
> {watermark=1970-01-01T19:05:09.476Z}
> ---
> Batch: 2
> ---
> +-+-+ 
>   
> |window   |count|
> +-+-+
> |[2017-09-07 16:55:40.0,2017-09-07 16:55:45.0]|1|
> |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|4|
> +-+-+
> {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, 
> watermark=1970-01-01T19:05:09.476Z, max=1970-01-01T19:05:19.476Z}
> {watermark=1970-01-01T19:05:09.476Z}
> {code}
> As can be seen, the event time stats are wrong (they are always in 
> 1970-01-01), so the watermark is calculated incorrectly.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21955) OneForOneStreamManager may leak memory when network is poor

2017-09-08 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158402#comment-16158402
 ] 

Sean Owen commented on SPARK-21955:
---

You might be on to something, but this is poorly described. Can you revise the 
description, attach the image, and specify the change you are suggesting?

> OneForOneStreamManager may leak memory when network is poor
> ---
>
> Key: SPARK-21955
> URL: https://issues.apache.org/jira/browse/SPARK-21955
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager
>Affects Versions: 1.6.1
> Environment: hdp 2.4.2.0-258 
> spark 1.6 
>Reporter: poseidon
>
> While studying how streams, chunks and blocks work in Netty, I found a nasty case.
> Processing an OpenBlocks message registers the stream in OneForOneStreamManager:
> org.apache.spark.network.server.OneForOneStreamManager#registerStream
> fills the StreamState with the app id and the buffers.
> Processing a ChunkFetchRequest registers the channel:
> org.apache.spark.network.server.OneForOneStreamManager#registerChannel
> fills the StreamState with the channel.
> In 
> org.apache.spark.network.shuffle.OneForOneBlockFetcher#start, 
> OpenBlocks -> ChunkFetchRequest arrive in sequence. 
> If the network goes down during the OpenBlocks step, no ChunkFetchRequest 
> message ever arrives. 
> So we can see some leaked buffers in OneForOneStreamManager:
> !attachment-name.jpg|thumbnail!
> If 
> org.apache.spark.network.server.OneForOneStreamManager.StreamState#associatedChannel
> is never set then, from reading the code, the state will remain in memory forever, 
> because the only ways to release it are when the channel closes or when someone 
> reads the last piece of the block. 
> We could set the channel in OneForOneStreamManager#registerStream to cover 
> this case.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21955) OneForOneStreamManager may leak memory when network is poor

2017-09-08 Thread poseidon (JIRA)
poseidon created SPARK-21955:


 Summary: OneForOneStreamManager may leak memory when network is 
poor
 Key: SPARK-21955
 URL: https://issues.apache.org/jira/browse/SPARK-21955
 Project: Spark
  Issue Type: Bug
  Components: Block Manager
Affects Versions: 1.6.1
 Environment: hdp 2.4.2.0-258 
spark 1.6 
Reporter: poseidon


While studying how streams, chunks and blocks work in Netty, I found a nasty case.

Processing an OpenBlocks message registers the stream in OneForOneStreamManager:
org.apache.spark.network.server.OneForOneStreamManager#registerStream
fills the StreamState with the app id and the buffers.

Processing a ChunkFetchRequest registers the channel:
org.apache.spark.network.server.OneForOneStreamManager#registerChannel
fills the StreamState with the channel.

In 
org.apache.spark.network.shuffle.OneForOneBlockFetcher#start, 
OpenBlocks -> ChunkFetchRequest arrive in sequence. 

If the network goes down during the OpenBlocks step, no ChunkFetchRequest message 
ever arrives. 

So we can see some leaked buffers in OneForOneStreamManager:

!attachment-name.jpg|thumbnail!

If 
org.apache.spark.network.server.OneForOneStreamManager.StreamState#associatedChannel
is never set then, from reading the code, the state will remain in memory forever, 
because the only ways to release it are when the channel closes or when someone reads 
the last piece of the block. 

We could set the channel in OneForOneStreamManager#registerStream to cover this 
case.
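
Roughly, the idea is the following (an illustrative Scala sketch only; the real OneForOneStreamManager is Java and its fields and signatures differ):

{code:scala}
// Illustrative sketch: bind the channel to the stream at registration time, so
// that connection termination can release the state even if the network drops
// before any ChunkFetchRequest arrives.
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicLong

final case class StreamState(appId: String, buffers: Iterator[AnyRef], channel: AnyRef)

class StreamRegistry {
  private val nextStreamId = new AtomicLong(0)
  private val streams = new ConcurrentHashMap[Long, StreamState]()

  def registerStream(appId: String, buffers: Iterator[AnyRef], channel: AnyRef): Long = {
    val id = nextStreamId.getAndIncrement()
    streams.put(id, StreamState(appId, buffers, channel))  // channel bound up front
    id
  }

  def connectionTerminated(channel: AnyRef): Unit = {
    // drop every stream bound to this channel (buffers would be released here too)
    val it = streams.entrySet().iterator()
    while (it.hasNext) {
      if (it.next().getValue.channel eq channel) it.remove()
    }
  }
}
{code}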



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21942) DiskBlockManager crashing when a root local folder has been externally deleted by OS

2017-09-08 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158320#comment-16158320
 ] 

Saisai Shao commented on SPARK-21942:
-

Personally I would like to fail fast if such things happen. Here it happened to 
be the root folder that was cleaned, and using {{mkdirs}} can handle this issue, 
but if some persistent block or shuffle index file is removed (because it is 
already closed), I think there's no way to handle it. So instead of trying to 
work around it, exposing an exception to the user might be more useful, and will 
let the user know about the issue earlier.
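
For context, the {{mkdirs}}-based variant being discussed is roughly this (a sketch, not the exact DiskBlockManager code; the parameter names are illustrative):

{code:scala}
import java.io.{File, IOException}

// Sketch of the mkdirs-based fallback: re-create the sub-directory if e.g. a
// tmp cleaner removed it, and fail with a clear error if that is impossible.
def getOrCreateSubDir(localDir: File, subDirId: Int): File = {
  val subDir = new File(localDir, "%02x".format(subDirId))
  if (!subDir.exists() && !subDir.mkdirs()) {
    throw new IOException(s"Failed to create local dir in $subDir")
  }
  subDir
}
{code}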

> DiskBlockManager crashing when a root local folder has been externally 
> deleted by OS
> 
>
> Key: SPARK-21942
> URL: https://issues.apache.org/jira/browse/SPARK-21942
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 
> 2.2.0, 2.2.1, 2.3.0, 3.0.0
>Reporter: Ruslan Shestopalyuk
>Priority: Minor
>  Labels: storage
> Fix For: 2.3.0
>
>
> _DiskBlockManager_ has a notion of a "scratch" local folder(s), which can be 
> configured via _spark.local.dir_ option, and which defaults to the system's 
> _/tmp_. The hierarchy is two-level, e.g. _/blockmgr-XXX.../YY_, where the 
> _YY_ part is a hash bit, to spread files evenly.
> Function _DiskBlockManager.getFile_ expects the top level directories 
> (_blockmgr-XXX..._) to always exist (they get created once, when the spark 
> context is first created), otherwise it would fail with a message like:
> {code}
> ... java.io.IOException: Failed to create local dir in /tmp/blockmgr-XXX.../YY
> {code}
> However, this may not always be the case.
> In particular, *if it's the default _/tmp_ folder*, there can be different 
> strategies of automatically removing files from it, depending on the OS:
> * on the boot time
> * on a regular basis (e.g. once per day via a system cron job)
> * based on the file age
> The symptom is that after the process (in our case, a service) using spark is 
> running for a while (a few days), it may not be able to load files anymore, 
> since the top-level scratch directories are not there and 
> _DiskBlockManager.getFile_ crashes.
> Please note that this is different from people arbitrarily removing files 
> manually.
> We have both the facts that _/tmp_ is the default in the spark config and 
> that the system has the right to tamper with its contents, and will do it 
> with a high probability, after some period of time.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21942) DiskBlockManager crashing when a root local folder has been externally deleted by OS

2017-09-08 Thread Ruslan Shestopalyuk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158304#comment-16158304
 ] 

Ruslan Shestopalyuk commented on SPARK-21942:
-

[~jerryshao] I believe the only objective reason here would be to make the 
Spark code more robust. 

Regarding the rest - I agree it's not a valid issue, since if a problem like this 
happens, one can always spend some time debugging the Spark code and work out 
what a workaround could be.

Also, hopefully this very page gets indexed in the search engines, so maybe 
even that won't be needed :) 


> DiskBlockManager crashing when a root local folder has been externally 
> deleted by OS
> 
>
> Key: SPARK-21942
> URL: https://issues.apache.org/jira/browse/SPARK-21942
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 
> 2.2.0, 2.2.1, 2.3.0, 3.0.0
>Reporter: Ruslan Shestopalyuk
>Priority: Minor
>  Labels: storage
> Fix For: 2.3.0
>
>
> _DiskBlockManager_ has a notion of a "scratch" local folder(s), which can be 
> configured via _spark.local.dir_ option, and which defaults to the system's 
> _/tmp_. The hierarchy is two-level, e.g. _/blockmgr-XXX.../YY_, where the 
> _YY_ part is a hash bit, to spread files evenly.
> Function _DiskBlockManager.getFile_ expects the top level directories 
> (_blockmgr-XXX..._) to always exist (they get created once, when the spark 
> context is first created), otherwise it would fail with a message like:
> {code}
> ... java.io.IOException: Failed to create local dir in /tmp/blockmgr-XXX.../YY
> {code}
> However, this may not always be the case.
> In particular, *if it's the default _/tmp_ folder*, there can be different 
> strategies of automatically removing files from it, depending on the OS:
> * on the boot time
> * on a regular basis (e.g. once per day via a system cron job)
> * based on the file age
> The symptom is that after the process (in our case, a service) using spark is 
> running for a while (a few days), it may not be able to load files anymore, 
> since the top-level scratch directories are not there and 
> _DiskBlockManager.getFile_ crashes.
> Please note that this is different from people arbitrarily removing files 
> manually.
> We have both the facts that _/tmp_ is the default in the spark config and 
> that the system has the right to tamper with its contents, and will do it 
> with a high probability, after some period of time.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21954) JacksonUtils should verify MapType's value type instead of key type

2017-09-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158300#comment-16158300
 ] 

Apache Spark commented on SPARK-21954:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/19167

> JacksonUtils should verify MapType's value type instead of key type
> ---
>
> Key: SPARK-21954
> URL: https://issues.apache.org/jira/browse/SPARK-21954
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>
> {{JacksonUtils.verifySchema}} verifies if a data type can be converted to 
> JSON. For {{MapType}}, it now verifies the key type. However, in 
> {{JacksonGenerator}}, when converting a map to JSON, we only care about its 
> values and create a writer for the values. The keys in a map are treated as 
> strings by calling {{toString}} on the keys.
> Thus, we should change {{JacksonUtils.verifySchema}} to verify the value type 
> of {{MapType}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21954) JacksonUtils should verify MapType's value type instead of key type

2017-09-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21954:


Assignee: (was: Apache Spark)

> JacksonUtils should verify MapType's value type instead of key type
> ---
>
> Key: SPARK-21954
> URL: https://issues.apache.org/jira/browse/SPARK-21954
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>
> {{JacksonUtils.verifySchema}} verifies if a data type can be converted to 
> JSON. For {{MapType}}, it now verifies the key type. However, in 
> {{JacksonGenerator}}, when converting a map to JSON, we only care about its 
> values and create a writer for the values. The keys in a map are treated as 
> strings by calling {{toString}} on the keys.
> Thus, we should change {{JacksonUtils.verifySchema}} to verify the value type 
> of {{MapType}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21954) JacksonUtils should verify MapType's value type instead of key type

2017-09-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21954:


Assignee: Apache Spark

> JacksonUtils should verify MapType's value type instead of key type
> ---
>
> Key: SPARK-21954
> URL: https://issues.apache.org/jira/browse/SPARK-21954
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>
> {{JacksonUtils.verifySchema}} verifies if a data type can be converted to 
> JSON. For {{MapType}}, it now verifies the key type. However, in 
> {{JacksonGenerator}}, when converting a map to JSON, we only care about its 
> values and create a writer for the values. The keys in a map are treated as 
> strings by calling {{toString}} on the keys.
> Thus, we should change {{JacksonUtils.verifySchema}} to verify the value type 
> of {{MapType}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21954) JacksonUtils should verify MapType's value type instead of key type

2017-09-08 Thread Liang-Chi Hsieh (JIRA)
Liang-Chi Hsieh created SPARK-21954:
---

 Summary: JacksonUtils should verify MapType's value type instead 
of key type
 Key: SPARK-21954
 URL: https://issues.apache.org/jira/browse/SPARK-21954
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.2.0
Reporter: Liang-Chi Hsieh


{{JacksonUtils.verifySchema}} verifies if a data type can be converted to JSON. 
For {{MapType}}, it now verifies the key type. However, in 
{{JacksonGenerator}}, when converting a map to JSON, we only care about its 
values and create a writer for the values. The keys in a map are treated as 
strings by calling {{toString}} on the keys.

Thus, we should change {{JacksonUtils.verifySchema}} to verify the value type 
of {{MapType}}.
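
A simplified, self-contained sketch of the intent (illustrative only, not the actual JacksonUtils code):

{code:scala}
import org.apache.spark.sql.types._

// Illustrative sketch: when walking a schema to decide whether it can be written
// as JSON, a MapType should contribute its value type, because keys are written
// via toString and never need a dedicated writer.
def jsonWritableLeaves(dt: DataType): Seq[DataType] = dt match {
  case StructType(fields)        => fields.flatMap(f => jsonWritableLeaves(f.dataType))
  case ArrayType(elementType, _) => jsonWritableLeaves(elementType)
  case MapType(_, valueType, _)  => jsonWritableLeaves(valueType)  // values, not keys
  case other                     => Seq(other)
}
{code}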




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21128) Running R tests multiple times failed due to pre-existing "spark-warehouse" / "metastore_db"

2017-09-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158276#comment-16158276
 ] 

Apache Spark commented on SPARK-21128:
--

User 'felixcheung' has created a pull request for this issue:
https://github.com/apache/spark/pull/19166

> Running R tests multiple times failed due to pre-existing "spark-warehouse" / 
> "metastore_db"
> ---
>
> Key: SPARK-21128
> URL: https://issues.apache.org/jira/browse/SPARK-21128
> Project: Spark
>  Issue Type: Test
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Minor
> Fix For: 2.3.0
>
>
> Currently, running R tests multiple times fails due to pre-existing 
> "spark-warehouse" / "metastore_db" as below:
> {code}
> SparkSQL functions: Spark package found in SPARK_HOME: .../spark
> ...1234...
> {code}
> {code}
> Failed 
> -
> 1. Failure: No extra files are created in SPARK_HOME by starting session and 
> making calls (@test_sparkSQL.R#3384)
> length(list1) not equal to length(list2).
> 1/1 mismatches
> [1] 25 - 23 == 2
> 2. Failure: No extra files are created in SPARK_HOME by starting session and 
> making calls (@test_sparkSQL.R#3384)
> sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE).
> 10/25 mismatches
> x[16]: "metastore_db"
> y[16]: "pkg"
> x[17]: "pkg"
> y[17]: "R"
> x[18]: "R"
> y[18]: "README.md"
> x[19]: "README.md"
> y[19]: "run-tests.sh"
> x[20]: "run-tests.sh"
> y[20]: "SparkR_2.2.0.tar.gz"
> x[21]: "metastore_db"
> y[21]: "pkg"
> x[22]: "pkg"
> y[22]: "R"
> x[23]: "R"
> y[23]: "README.md"
> x[24]: "README.md"
> y[24]: "run-tests.sh"
> x[25]: "run-tests.sh"
> y[25]: "SparkR_2.2.0.tar.gz"
> 3. Failure: No extra files are created in SPARK_HOME by starting session and 
> making calls (@test_sparkSQL.R#3388)
> length(list1) not equal to length(list2).
> 1/1 mismatches
> [1] 25 - 23 == 2
> 4. Failure: No extra files are created in SPARK_HOME by starting session and 
> making calls (@test_sparkSQL.R#3388)
> sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE).
> 10/25 mismatches
> x[16]: "metastore_db"
> y[16]: "pkg"
> x[17]: "pkg"
> y[17]: "R"
> x[18]: "R"
> y[18]: "README.md"
> x[19]: "README.md"
> y[19]: "run-tests.sh"
> x[20]: "run-tests.sh"
> y[20]: "SparkR_2.2.0.tar.gz"
> x[21]: "metastore_db"
> y[21]: "pkg"
> x[22]: "pkg"
> y[22]: "R"
> x[23]: "R"
> y[23]: "README.md"
> x[24]: "README.md"
> y[24]: "run-tests.sh"
> x[25]: "run-tests.sh"
> y[25]: "SparkR_2.2.0.tar.gz"
> DONE 
> ===
> {code}
> It looks like we should remove both "spark-warehouse" and "metastore_db" _before_ 
> listing files into {{sparkRFilesBefore}}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21942) DiskBlockManager crashing when a root local folder has been externally deleted by OS

2017-09-08 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158271#comment-16158271
 ] 

Saisai Shao commented on SPARK-21942:
-

{quote}
https://github.com/search?utf8=%E2%9C%93=filename%3Aspark-defaults.conf++NOT+spark.local.dir=Code

shows 2000+ repos that omit the `spark.local.dir` setting altogether, which 
means they are using `/tmp`, even though it's not a good default choice.
Which of course does not prove anything, since those are not necessarily 
"production environments".
{quote}

[~rshest] you can always find out reasons, but I don't think this is a valid 
issue.

> DiskBlockManager crashing when a root local folder has been externally 
> deleted by OS
> 
>
> Key: SPARK-21942
> URL: https://issues.apache.org/jira/browse/SPARK-21942
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 
> 2.2.0, 2.2.1, 2.3.0, 3.0.0
>Reporter: Ruslan Shestopalyuk
>Priority: Minor
>  Labels: storage
> Fix For: 2.3.0
>
>
> _DiskBlockManager_ has a notion of a "scratch" local folder(s), which can be 
> configured via _spark.local.dir_ option, and which defaults to the system's 
> _/tmp_. The hierarchy is two-level, e.g. _/blockmgr-XXX.../YY_, where the 
> _YY_ part is a hash bit, to spread files evenly.
> Function _DiskBlockManager.getFile_ expects the top level directories 
> (_blockmgr-XXX..._) to always exist (they get created once, when the spark 
> context is first created), otherwise it would fail with a message like:
> {code}
> ... java.io.IOException: Failed to create local dir in /tmp/blockmgr-XXX.../YY
> {code}
> However, this may not always be the case.
> In particular, *if it's the default _/tmp_ folder*, there can be different 
> strategies of automatically removing files from it, depending on the OS:
> * on the boot time
> * on a regular basis (e.g. once per day via a system cron job)
> * based on the file age
> The symptom is that after the process (in our case, a service) using spark is 
> running for a while (a few days), it may not be able to load files anymore, 
> since the top-level scratch directories are not there and 
> _DiskBlockManager.getFile_ crashes.
> Please note that this is different from people arbitrarily removing files 
> manually.
> We have both the facts that _/tmp_ is the default in the spark config and 
> that the system has the right to tamper with its contents, and will do it 
> with a high probability, after some period of time.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21953) Show both memory and disk bytes spilled if either is present

2017-09-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158263#comment-16158263
 ] 

Apache Spark commented on SPARK-21953:
--

User 'ash211' has created a pull request for this issue:
https://github.com/apache/spark/pull/19164

> Show both memory and disk bytes spilled if either is present
> 
>
> Key: SPARK-21953
> URL: https://issues.apache.org/jira/browse/SPARK-21953
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.2.0
>Reporter: Andrew Ash
>Priority: Minor
>
> https://github.com/apache/spark/commit/a1f0992faefbe042a9cb7a11842a817c958e4797#diff-fa4cfb2cce1b925f55f41f2dfa8c8501R61
>  should be {{||}} not {{&&}}
> As written now, there must be both memory and disk bytes spilled to show 
> either of them.  If there is only one of those types of spill recorded, it 
> will be hidden.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21953) Show both memory and disk bytes spilled if either is present

2017-09-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21953:


Assignee: Apache Spark

> Show both memory and disk bytes spilled if either is present
> 
>
> Key: SPARK-21953
> URL: https://issues.apache.org/jira/browse/SPARK-21953
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.2.0
>Reporter: Andrew Ash
>Assignee: Apache Spark
>Priority: Minor
>
> https://github.com/apache/spark/commit/a1f0992faefbe042a9cb7a11842a817c958e4797#diff-fa4cfb2cce1b925f55f41f2dfa8c8501R61
>  should be {{||}} not {{&&}}
> As written now, there must be both memory and disk bytes spilled to show 
> either of them.  If there is only one of those types of spill recorded, it 
> will be hidden.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21953) Show both memory and disk bytes spilled if either is present

2017-09-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21953:


Assignee: (was: Apache Spark)

> Show both memory and disk bytes spilled if either is present
> 
>
> Key: SPARK-21953
> URL: https://issues.apache.org/jira/browse/SPARK-21953
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.2.0
>Reporter: Andrew Ash
>Priority: Minor
>
> https://github.com/apache/spark/commit/a1f0992faefbe042a9cb7a11842a817c958e4797#diff-fa4cfb2cce1b925f55f41f2dfa8c8501R61
>  should be {{||}} not {{&&}}
> As written now, there must be both memory and disk bytes spilled to show 
> either of them.  If there is only one of those types of spill recorded, it 
> will be hidden.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21953) Show both memory and disk bytes spilled if either is present

2017-09-08 Thread Andrew Ash (JIRA)
Andrew Ash created SPARK-21953:
--

 Summary: Show both memory and disk bytes spilled if either is 
present
 Key: SPARK-21953
 URL: https://issues.apache.org/jira/browse/SPARK-21953
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 2.2.0
Reporter: Andrew Ash
Priority: Minor


https://github.com/apache/spark/commit/a1f0992faefbe042a9cb7a11842a817c958e4797#diff-fa4cfb2cce1b925f55f41f2dfa8c8501R61
 should be {{||}} not {{&&}}

As written now, there must be both memory and disk bytes spilled to show either 
of them.  If there is only one of those types of spill recorded, it will be 
hidden.
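
In other words, the intended condition is roughly (sketch; the names are made up, not the actual UI code):

{code:scala}
// Show the spill columns if either kind of spill was recorded.
def hasBytesSpilled(memoryBytesSpilled: Long, diskBytesSpilled: Long): Boolean =
  memoryBytesSpilled > 0 || diskBytesSpilled > 0   // was effectively &&, hiding one-sided spills
{code}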



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18085) SPIP: Better History Server scalability for many / large applications

2017-09-08 Thread jincheng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158247#comment-16158247
 ] 

jincheng commented on SPARK-18085:
--


{code:java}
com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized 
field "metadata" (class org.apache.spark.sql.execution.SparkPlanInfo), not 
marked as ignorable (4 known properties: "simpleString", "nodeName", 
"children", "metrics"])
 at [Source: 
{"Event":"org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart","executionId":0,"description":"json
 at 
NativeMethodAccessorImpl.java:0","details":"org.apache.spark.sql.DataFrameWriter.json(DataFrameWriter.scala:487)\nsun.reflect.NativeMethodAccessorImpl.invoke0(Native
 
Method)\nsun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\nsun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\njava.lang.reflect.Method.invoke(Method.java:498)\npy4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\npy4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\npy4j.Gateway.invoke(Gateway.java:280)\npy4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\npy4j.commands.CallCommand.execute(CallCommand.java:79)\npy4j.GatewayConnection.run(GatewayConnection.java:214)\njava.lang.Thread.run(Thread.java:748)","physicalPlanDescription":"==
 Parsed Logical Plan ==\nRepartition 200, true\n+- LogicalRDD [uid#327L, 
gids#328]\n\n== Analyzed Logical Plan ==\nuid: bigint, gids: 
array\nRepartition 200, true\n+- LogicalRDD [uid#327L, gids#328]\n\n== 
Optimized Logical Plan ==\nRepartition 200, true\n+- LogicalRDD [uid#327L, 
gids#328]\n\n== Physical Plan ==\nExchange RoundRobinPartitioning(200)\n+- Scan 
ExistingRDD[uid#327L,gids#328]","sparkPlanInfo":{"nodeName":"Exchange","simpleString":"Exchange
 
RoundRobinPartitioning(200)","children":[{"nodeName":"ExistingRDD","simpleString":"Scan
 
ExistingRDD[uid#327L,gids#328]","children":[],"metadata":{},"metrics":[{"name":"number
 of output 
rows","accumulatorId":140,"metricType":"sum"}]}],"metadata":{},"metrics":[{"name":"data
 size total (min, med, 
max)","accumulatorId":139,"metricType":"size"}]},"time":1504837052948}; line: 
1, column: 1622] (through reference chain: 
org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart["sparkPlanInfo"]->org.apache.spark.sql.execution.SparkPlanInfo["children"]->com.fasterxml.jackson.module.scala.deser.BuilderWrapper[0]->org.apache.spark.sql.execution.SparkPlanInfo["metadata"])
at 
com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException.from(UnrecognizedPropertyException.java:51)
at 
com.fasterxml.jackson.databind.DeserializationContext.reportUnknownProperty(DeserializationContext.java:839)
at 
com.fasterxml.jackson.databind.deser.std.StdDeserializer.handleUnknownProperty(StdDeserializer.java:1045)
at 
com.fasterxml.jackson.databind.deser.BeanDeserializerBase.handleUnknownProperty(BeanDeserializerBase.java:1352)
at 
com.fasterxml.jackson.databind.deser.BeanDeserializerBase.handleUnknownProperties(BeanDeserializerBase.java:1306)
at 
com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeUsingPropertyBased(BeanDeserializer.java:399)
at 
com.fasterxml.jackson.databind.deser.BeanDeserializerBase.deserializeFromObjectUsingNonDefault(BeanDeserializerBase.java:1099)
at 
com.fasterxml.jackson.databind.deser.BeanDeserializer.deserializeFromObject(BeanDeserializer.java:296)
at 
com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:133)
at 
com.fasterxml.jackson.databind.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:245)
at 
com.fasterxml.jackson.databind.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:217)
at 
com.fasterxml.jackson.module.scala.deser.SeqDeserializer.deserialize(SeqDeserializerModule.scala:76)
at 
com.fasterxml.jackson.module.scala.deser.SeqDeserializer.deserialize(SeqDeserializerModule.scala:59)
at 
com.fasterxml.jackson.databind.deser.SettableBeanProperty.deserialize(SettableBeanProperty.java:520)
at 
com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeWithErrorWrapping(BeanDeserializer.java:463)
at 
com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeUsingPropertyBased(BeanDeserializer.java:378)
at 
com.fasterxml.jackson.databind.deser.BeanDeserializerBase.deserializeFromObjectUsingNonDefault(BeanDeserializerBase.java:1099)
at 
com.fasterxml.jackson.databind.deser.BeanDeserializer.deserializeFromObject(BeanDeserializer.java:296)
at 
com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:133)
at 
com.fasterxml.jackson.databind.deser.SettableBeanProperty.deserialize(SettableBeanProperty.java:520)
at 

[jira] [Commented] (SPARK-21936) backward compatibility test framework for HiveExternalCatalog

2017-09-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158241#comment-16158241
 ] 

Apache Spark commented on SPARK-21936:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/19163

> backward compatibility test framework for HiveExternalCatalog
> -
>
> Key: SPARK-21936
> URL: https://issues.apache.org/jira/browse/SPARK-21936
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21726) Check for structural integrity of the plan in QO in test mode

2017-09-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158224#comment-16158224
 ] 

Apache Spark commented on SPARK-21726:
--

User 'viirya' has created a pull request for this issue:
https://github.com/apache/spark/pull/19161

> Check for structural integrity of the plan in QO in test mode
> -
>
> Key: SPARK-21726
> URL: https://issues.apache.org/jira/browse/SPARK-21726
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Liang-Chi Hsieh
> Fix For: 2.3.0
>
>
> Right now we don't have any checks in the optimizer to check for the 
> structural integrity of the plan (e.g. resolved). It would be great if in 
> test mode, we can check whether a plan is still resolved after the execution 
> of each rule, so we can catch rules that return invalid plans.
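
A minimal sketch of the idea (illustrative names, not the final implementation):

{code:scala}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Illustrative only: apply a rule and, in test mode, verify the plan still
// satisfies a basic structural invariant (here: it is still resolved).
def applyWithCheck(rule: Rule[LogicalPlan], plan: LogicalPlan, isTestMode: Boolean): LogicalPlan = {
  val result = rule(plan)
  if (isTestMode && !result.resolved) {
    throw new IllegalStateException(
      s"Rule ${rule.ruleName} produced an unresolved plan:\n$result")
  }
  result
}
{code}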



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21931) add LNNVL function

2017-09-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21931.
---
Resolution: Won't Fix

> add LNNVL function
> --
>
> Key: SPARK-21931
> URL: https://issues.apache.org/jira/browse/SPARK-21931
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Ruslan Dautkhanov
>Priority: Minor
> Attachments: Capture1.JPG
>
>
> Purpose
> LNNVL provides a concise way to evaluate a condition when one or both 
> operands of the condition may be null. The function can be used only in the 
> WHERE clause of a query. It takes as an argument a condition and returns TRUE 
> if the condition is FALSE or UNKNOWN and FALSE if the condition is TRUE. 
> LNNVL can be used anywhere a scalar expression can appear, even in contexts 
> where the IS (NOT) NULL, AND, or OR conditions are not valid but would 
> otherwise be required to account for potential nulls. Oracle Database 
> sometimes uses the LNNVL function internally in this way to rewrite NOT IN 
> conditions as NOT EXISTS conditions. In such cases, output from EXPLAIN PLAN 
> shows this operation in the plan table output. The condition can evaluate any 
> scalar values but cannot be a compound condition containing AND, OR, or 
> BETWEEN.
> The table that follows shows what LNNVL returns given that a = 2 and b is 
> null.
> !Capture1.JPG!
> https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions078.htm 
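
For reference, the same behaviour is already expressible with existing functions, e.g. (sketch; {{df}} and the column {{a}} are illustrative):

{code:scala}
// LNNVL(cond) is TRUE when cond is FALSE or NULL, which can be phrased as:
import org.apache.spark.sql.functions._
df.where(!coalesce($"a" > 2, lit(false)))
// or in SQL:  WHERE NOT coalesce(a > 2, false)
{code}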



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21915) Model 1 and Model 2 ParamMaps Missing

2017-09-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21915.
---
   Resolution: Fixed
Fix Version/s: 2.2.1

Issue resolved by pull request 19152
[https://github.com/apache/spark/pull/19152]

> Model 1 and Model 2 ParamMaps Missing
> -
>
> Key: SPARK-21915
> URL: https://issues.apache.org/jira/browse/SPARK-21915
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, 
> 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.2.0
>Reporter: Mark Tabladillo
>Priority: Minor
>  Labels: easyfix
> Fix For: 2.2.1
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> Error in PySpark example code
> [https://github.com/apache/spark/blob/master/examples/src/main/python/ml/estimator_transformer_param_example.py]
> The original Scala code says
> println("Model 2 was fit using parameters: " + model2.parent.extractParamMap)
> The parent is lr
> There is no method for accessing parent as is done in Scala.
> 
> This code has been tested in Python, and returns values consistent with Scala
> Proposing to call the lr variable instead of model1 or model2
> 
> This patch was tested with Spark 2.1.0 comparing the Scala and PySpark 
> results. Pyspark returns nothing at present for those two print lines.
> The output for model2 in PySpark should be
> {Param(parent='LogisticRegression_4187be538f744d5a9090', name='tol', doc='the 
> convergence tolerance for iterative algorithms (>= 0).'): 1e-06,
> Param(parent='LogisticRegression_4187be538f744d5a9090', 
> name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 
> 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 
> penalty.'): 0.0,
> Param(parent='LogisticRegression_4187be538f744d5a9090', name='predictionCol', 
> doc='prediction column name.'): 'prediction',
> Param(parent='LogisticRegression_4187be538f744d5a9090', name='featuresCol', 
> doc='features column name.'): 'features',
> Param(parent='LogisticRegression_4187be538f744d5a9090', name='labelCol', 
> doc='label column name.'): 'label',
> Param(parent='LogisticRegression_4187be538f744d5a9090', 
> name='probabilityCol', doc='Column name for predicted class conditional 
> probabilities. Note: Not all models output well-calibrated probability 
> estimates! These probabilities should be treated as confidences, not precise 
> probabilities.'): 'myProbability',
> Param(parent='LogisticRegression_4187be538f744d5a9090', 
> name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column 
> name.'): 'rawPrediction',
> Param(parent='LogisticRegression_4187be538f744d5a9090', name='family', 
> doc='The name of family which is a description of the label distribution to 
> be used in the model. Supported options: auto, binomial, multinomial'): 
> 'auto',
> Param(parent='LogisticRegression_4187be538f744d5a9090', name='fitIntercept', 
> doc='whether to fit an intercept term.'): True,
> Param(parent='LogisticRegression_4187be538f744d5a9090', name='threshold', 
> doc='Threshold in binary classification prediction, in range [0, 1]. If 
> threshold and thresholds are both set, they must match.e.g. if threshold is 
> p, then thresholds must be equal to [1-p, p].'): 0.55,
> Param(parent='LogisticRegression_4187be538f744d5a9090', 
> name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2).'): 2,
> Param(parent='LogisticRegression_4187be538f744d5a9090', name='maxIter', 
> doc='max number of iterations (>= 0).'): 30,
> Param(parent='LogisticRegression_4187be538f744d5a9090', name='regParam', 
> doc='regularization parameter (>= 0).'): 0.1,
> Param(parent='LogisticRegression_4187be538f744d5a9090', 
> name='standardization', doc='whether to standardize the training features 
> before fitting the model.'): True}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21951) Unable to add the new column and writing into the Hive using spark

2017-09-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21951.
---
Resolution: Invalid

This doesn't express a problem, and questions should go to the mailing list

> Unable to add the new column and writing into the Hive using spark
> --
>
> Key: SPARK-21951
> URL: https://issues.apache.org/jira/browse/SPARK-21951
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.1.1
>Reporter: jalendhar Baddam
>
> I am adding one new column to an existing Dataset and am unable to write it 
> into Hive using Spark.
> Ex: Dataset<Row> ds = spark.sql("select * from Table");
> ds = ds.withColumn("newColumn", newColumnValues);
> ds.write().mode("overwrite").format("parquet").saveAsTable("Table"); 
> // Here I am getting the exception
> I am loading the table from Hive using Spark, adding the new column to that 
> Dataset, and writing the same table back into Hive with the "overwrite" 
> option.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21952) Unable to load the csv file into Dataset using Spark with java

2017-09-08 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-21952.
---
Resolution: Invalid

Spam

> Unable to load the csv file into Dataset  using Spark with java
> ---
>
> Key: SPARK-21952
> URL: https://issues.apache.org/jira/browse/SPARK-21952
> Project: Spark
>  Issue Type: Bug
>  Components: Java API
>Affects Versions: 2.1.1
>Reporter: jalendhar Baddam
>
> Hi,
> I am trying to load a CSV file using Spark with Java. The CSV file contains 
> one row with two end-of-line characters. I am attaching the CSV file and 
> placing the sample CSV file content below.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21952) Unable to load the csv file into Dataset using Spark with java

2017-09-08 Thread jalendhar Baddam (JIRA)
jalendhar Baddam created SPARK-21952:


 Summary: Unable to load the csv file into Dataset  using Spark 
with java
 Key: SPARK-21952
 URL: https://issues.apache.org/jira/browse/SPARK-21952
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 2.1.1
Reporter: jalendhar Baddam


Hi,

I am trying to load a CSV file using Spark with Java. The CSV file contains 
one row with two end-of-line characters. I am attaching the CSV file and 
placing the sample CSV file content below.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21951) Unable to add the new column and writing into the Hive using spark

2017-09-08 Thread jalendhar Baddam (JIRA)
jalendhar Baddam created SPARK-21951:


 Summary: Unable to add the new column and writing into the Hive 
using spark
 Key: SPARK-21951
 URL: https://issues.apache.org/jira/browse/SPARK-21951
 Project: Spark
  Issue Type: Bug
  Components: Java API
Affects Versions: 2.1.1
Reporter: jalendhar Baddam


I am adding one new column to an existing Dataset and am unable to write it 
into Hive using Spark.
Ex: Dataset<Row> ds = spark.sql("select * from Table");
ds = ds.withColumn("newColumn", newColumnValues);
ds.write().mode("overwrite").format("parquet").saveAsTable("Table"); 
// Here I am getting the exception

I am loading the table from Hive using Spark, adding the new column to that 
Dataset, and writing the same table back into Hive with the "overwrite" 
option.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor

2017-09-08 Thread yiming.xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158202#comment-16158202
 ] 

yiming.xu commented on SPARK-650:
-

I need a hook too. In some cases, we need to initialize something like a Spring init bean :(
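
A common workaround today is a lazily initialized singleton that each executor JVM touches once, e.g. (sketch; the object name and {{rdd}} are illustrative):

{code:scala}
// Sketch of a per-executor init hook: the lazy val body runs at most once per
// JVM, the first time any task on that executor touches it.
object ExecutorSetup {
  lazy val ready: Boolean = {
    // e.g. configure a reporting or metrics library here
    true
  }
}

rdd.foreachPartition { _ =>
  ExecutorSetup.ready   // forces the one-time initialization on this executor
}
{code}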

> Add a "setup hook" API for running initialization code on each executor
> ---
>
> Key: SPARK-650
> URL: https://issues.apache.org/jira/browse/SPARK-650
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Matei Zaharia
>Priority: Minor
>
> Would be useful to configure things like reporting libraries



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source

2017-09-08 Thread xinzhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158192#comment-16158192
 ] 

xinzhang commented on SPARK-21067:
--

Hi [~dricard],
do you have any solution now?
Any suggestions would be helpful.

> Thrift Server - CTAS fail with Unable to move source
> 
>
> Key: SPARK-21067
> URL: https://issues.apache.org/jira/browse/SPARK-21067
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1, 2.2.0
> Environment: Yarn
> Hive MetaStore
> HDFS (HA)
>Reporter: Dominic Ricard
>
> After upgrading our Thrift cluster to 2.1.1, we ran into an issue where CTAS 
> would fail, sometimes...
> Most of the time, the CTAS would work only once, after starting the thrift 
> server. After that, dropping the table and re-issuing the same CTAS would 
> fail with the following message (sometimes it fails right away, sometimes it 
> works for a long period of time):
> {noformat}
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1//tmp/hive-staging/thrift_hive_2017-06-12_16-56-18_464_7598877199323198104-31/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> We have already found the following Jira 
> (https://issues.apache.org/jira/browse/SPARK-11021) which state that the 
> {{hive.exec.stagingdir}} had to be added in order for Spark to be able to 
> handle CREATE TABLE properly as of 2.0. As you can see in the error, we have 
> ours set to "/tmp/hive-staging/\{user.name\}"
> Same issue with INSERT statements:
> {noformat}
> CREATE TABLE IF NOT EXISTS dricard.test (col1 int); INSERT INTO TABLE 
> dricard.test SELECT 1;
> Error: org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-12_20-41-12_964_3086448130033637241-16/-ext-1/part-0
>  to destination 
> hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; 
> (state=,code=0)
> {noformat}
> This worked fine in 1.6.2, which we currently run in our Production 
> Environment but since 2.0+, we haven't been able to CREATE TABLE consistently 
> on the cluster.
> SQL to reproduce issue:
> {noformat}
> DROP SCHEMA IF EXISTS dricard CASCADE; 
> CREATE SCHEMA dricard; 
> CREATE TABLE dricard.test (col1 int); 
> INSERT INTO TABLE dricard.test SELECT 1; 
> SELECT * from dricard.test; 
> DROP TABLE dricard.test; 
> CREATE TABLE dricard.test AS select 1 as `col1`;
> SELECT * from dricard.test
> {noformat}
> Thrift server usually fails at INSERT...
> Tried the same procedure in a spark context using spark.sql() and didn't 
> encounter the same issue.
> Full stack Trace:
> {noformat}
> 17/06/14 14:52:18 ERROR thriftserver.SparkExecuteStatementOperation: Error 
> executing query, currentState RUNNING,
> org.apache.spark.sql.AnalysisException: 
> org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source 
> hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-14_14-52-18_521_5906917519254880890-5/-ext-1/part-0
>  to desti
> nation hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0;
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106)
> at 
> org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:374)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221)
> at 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
> at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
> at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
> at 
> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
> at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
> at org.apache.spark.sql.Dataset.(Dataset.scala:185)
> at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64)
> at 

[jira] [Commented] (SPARK-18492) GeneratedIterator grows beyond 64 KB

2017-09-08 Thread cen yuhai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158189#comment-16158189
 ] 

cen yuhai commented on SPARK-18492:
---

Spark 2.1.1 also has this problem.

> GeneratedIterator grows beyond 64 KB
> 
>
> Key: SPARK-18492
> URL: https://issues.apache.org/jira/browse/SPARK-18492
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.1
> Environment: CentOS release 6.7 (Final)
>Reporter: Norris Merritt
>
> spark-submit fails with ERROR CodeGenerator: failed to compile: 
> org.codehaus.janino.JaninoRuntimeException: Code of method 
> "(I[Lscala/collection/Iterator;)V" of class 
> "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" 
> grows beyond 64 KB
> Error message is followed by a huge dump of generated source code.
> The generated code declares 1,454 field sequences like the following:
> /* 036 */   private org.apache.spark.sql.catalyst.expressions.ScalaUDF 
> project_scalaUDF1;
> /* 037 */   private scala.Function1 project_catalystConverter1;
> /* 038 */   private scala.Function1 project_converter1;
> /* 039 */   private scala.Function1 project_converter2;
> /* 040 */   private scala.Function2 project_udf1;
>   (many omitted lines) ...
> /* 6089 */   private org.apache.spark.sql.catalyst.expressions.ScalaUDF 
> project_scalaUDF1454;
> /* 6090 */   private scala.Function1 project_catalystConverter1454;
> /* 6091 */   private scala.Function1 project_converter1695;
> /* 6092 */   private scala.Function1 project_udf1454;
> It then proceeds to emit code for several methods (init, processNext), each of 
> which contains the same repetitive sequences of statements, one per group of 
> variables declared in the class.  For example:
> /* 6101 */   public void init(int index, scala.collection.Iterator inputs[]) {
> The 64 KB JVM limit on a method's bytecode is exceeded because the code 
> generator uses an incredibly naive strategy.  It emits a sequence like the 
> one shown below for each of the 1,454 groups of variables shown above, in 
> /* 6132 */ this.project_udf = 
> (scala.Function1)project_scalaUDF.userDefinedFunc();
> /* 6133 */ this.project_scalaUDF1 = 
> (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[10];
> /* 6134 */ this.project_catalystConverter1 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF1.dataType());
> /* 6135 */ this.project_converter1 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(0))).dataType());
> /* 6136 */ this.project_converter2 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(1))).dataType());
> It blows up after emitting 230 such sequences, while trying to emit the 231st:
> /* 7282 */ this.project_udf230 = 
> (scala.Function2)project_scalaUDF230.userDefinedFunc();
> /* 7283 */ this.project_scalaUDF231 = 
> (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[240];
> /* 7284 */ this.project_catalystConverter231 = 
> (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF231.dataType());
>   many omitted lines ...
>  Example of repetitive code sequences emitted for processNext method:
> /* 12253 */   boolean project_isNull247 = project_result244 == null;
> /* 12254 */   MapData project_value247 = null;
> /* 12255 */   if (!project_isNull247) {
> /* 12256 */ project_value247 = project_result244;
> /* 12257 */   }
> /* 12258 */   Object project_arg = sort_isNull5 ? null : 
> project_converter489.apply(sort_value5);
> /* 12259 */
> /* 12260 */   ArrayData project_result249 = null;
> /* 12261 */   try {
> /* 12262 */ project_result249 = 
> (ArrayData)project_catalystConverter248.apply(project_udf248.apply(project_arg));
> /* 12263 */   } catch (Exception e) {
> /* 12264 */ throw new 
> org.apache.spark.SparkException(project_scalaUDF248.udfErrorMessage(), e);
> /* 12265 */   }
> /* 12266 */
> /* 12267 */   boolean project_isNull252 = project_result249 == null;
> /* 12268 */   ArrayData project_value252 = null;
> /* 12269 */   if (!project_isNull252) {
> /* 12270 */ project_value252 = project_result249;
> /* 12271 */   }
> /* 12272 */   Object project_arg1 = project_isNull252 ? null : 
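A rough way to reproduce this class of failure is to pack many UDF-backed columns into a single projection. The sketch below is an assumption-laden illustration: the column count, the UDF body, and the eventual failure are not taken from this report.
{code}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Illustrative sketch: hundreds of ScalaUDF columns in one projection force the
// code generator to emit hundreds of near-identical field/init/eval sequences,
// which can push a single generated method past the JVM's 64 KB limit.
object CodegenLimitSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("codegen-64kb-sketch").getOrCreate()
    val plusOne = udf((i: Long) => i + 1)
    val base = spark.range(10).toDF("id")
    // 500 is an arbitrary choice; the report above had 1,454 UDF groups.
    val wide = (1 to 500).foldLeft(base) { (df, i) => df.withColumn(s"c$i", plusOne(col("id"))) }
    wide.collect()  // may fail with "... grows beyond 64 KB" on affected versions
    spark.stop()
  }
}
{code}
Whether disabling whole-stage code generation (spark.sql.codegen.wholeStage=false) avoids the error depends on the plan, so treat it as a mitigation to test rather than a fix.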

[jira] [Resolved] (SPARK-21936) backward compatibility test framework for HiveExternalCatalog

2017-09-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-21936.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> backward compatibility test framework for HiveExternalCatalog
> -
>
> Key: SPARK-21936
> URL: https://issues.apache.org/jira/browse/SPARK-21936
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Wenchen Fan
> Fix For: 2.3.0
>
>







[jira] [Assigned] (SPARK-21934) Expose Netty memory usage via Metrics System

2017-09-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21934:


Assignee: (was: Apache Spark)

> Expose Netty memory usage via Metrics System
> 
>
> Key: SPARK-21934
> URL: https://issues.apache.org/jira/browse/SPARK-21934
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>
> This is follow-up work for SPARK-9104 to expose Netty memory usage to the 
> MetricsSystem. My initial thought is to only expose shuffle memory usage, 
> since shuffle accounts for most of the memory used in network communication 
> compared to RPC, the file server, and block transfer. 
> If users want to also expose Netty memory usage for other modules, we could 
> add more metrics later.
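A minimal sketch of what such a source could look like, assuming the Dropwizard-based Source trait and Netty's PooledByteBufAllocator metrics; the class name and the registered gauges are illustrative and are not the contents of the linked pull request.
{code}
import com.codahale.metrics.{Gauge, MetricRegistry}
import io.netty.buffer.PooledByteBufAllocator
import org.apache.spark.metrics.source.Source

// Illustrative sketch: register gauges for the shuffle allocator's heap and
// direct usage so the MetricsSystem can report them through its sinks.
class NettyMemorySource(allocator: PooledByteBufAllocator) extends Source {
  override val sourceName: String = "NettyMemory"
  override val metricRegistry: MetricRegistry = new MetricRegistry

  metricRegistry.register(MetricRegistry.name("usedDirectMemory"), new Gauge[Long] {
    override def getValue: Long = allocator.metric().usedDirectMemory()
  })
  metricRegistry.register(MetricRegistry.name("usedHeapMemory"), new Gauge[Long] {
    override def getValue: Long = allocator.metric().usedHeapMemory()
  })
}
{code}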






[jira] [Commented] (SPARK-21934) Expose Netty memory usage via Metrics System

2017-09-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158184#comment-16158184
 ] 

Apache Spark commented on SPARK-21934:
--

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/19160

> Expose Netty memory usage via Metrics System
> 
>
> Key: SPARK-21934
> URL: https://issues.apache.org/jira/browse/SPARK-21934
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>
> This is follow-up work for SPARK-9104 to expose Netty memory usage to the 
> MetricsSystem. My initial thought is to only expose shuffle memory usage, 
> since shuffle accounts for most of the memory used in network communication 
> compared to RPC, the file server, and block transfer. 
> If users want to also expose Netty memory usage for other modules, we could 
> add more metrics later.






[jira] [Assigned] (SPARK-21934) Expose Netty memory usage via Metrics System

2017-09-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21934:


Assignee: Apache Spark

> Expose Netty memory usage via Metrics System
> 
>
> Key: SPARK-21934
> URL: https://issues.apache.org/jira/browse/SPARK-21934
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Saisai Shao
>Assignee: Apache Spark
>
> This is follow-up work for SPARK-9104 to expose Netty memory usage to the 
> MetricsSystem. My initial thought is to only expose shuffle memory usage, 
> since shuffle accounts for most of the memory used in network communication 
> compared to RPC, the file server, and block transfer. 
> If users want to also expose Netty memory usage for other modules, we could 
> add more metrics later.






[jira] [Assigned] (SPARK-21946) Flaky test: InMemoryCatalogedDDLSuite.`alter table: rename cached table`

2017-09-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21946:


Assignee: (was: Apache Spark)

> Flaky test: InMemoryCatalogedDDLSuite.`alter table: rename cached table`
> 
>
> Key: SPARK-21946
> URL: https://issues.apache.org/jira/browse/SPARK-21946
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> According to the [Apache Spark Jenkins 
> History|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/lastCompletedBuild/testReport/org.apache.spark.sql.execution.command/InMemoryCatalogedDDLSuite/alter_table__rename_cached_table/history/]
> InMemoryCatalogedDDLSuite.`alter table: rename cached table` is very flaky. 
> We had better stabilize this.
> {code}
> - alter table: rename cached table !!! CANCELED !!!
>   Array([2,2], [1,1]) did not equal Array([1,1], [2,2]) bad test: wrong data 
> (DDLSuite.scala:786)
> {code}
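One common way to make this kind of assertion deterministic is to compare the rows order-insensitively. The sketch below shows the general idea only; it is not the fix proposed in the linked pull request, and the helper name is hypothetical.
{code}
import org.apache.spark.sql.{DataFrame, Row}

// Illustrative sketch: sort both the actual and expected rows before comparing,
// so a scan that returns [2,2] before [1,1] no longer fails the assertion.
def assertSameRows(df: DataFrame, expected: Seq[Row]): Unit = {
  val ordering = Ordering.by[Row, String](_.toString)
  val actual = df.collect().sorted(ordering)
  assert(actual.sameElements(expected.sorted(ordering)),
    s"wrong data: got ${actual.mkString(", ")}, expected ${expected.mkString(", ")}")
}
{code}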






[jira] [Assigned] (SPARK-21946) Flaky test: InMemoryCatalogedDDLSuite.`alter table: rename cached table`

2017-09-08 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-21946:


Assignee: Apache Spark

> Flaky test: InMemoryCatalogedDDLSuite.`alter table: rename cached table`
> 
>
> Key: SPARK-21946
> URL: https://issues.apache.org/jira/browse/SPARK-21946
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Minor
>
> According to the [Apache Spark Jenkins 
> History|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/lastCompletedBuild/testReport/org.apache.spark.sql.execution.command/InMemoryCatalogedDDLSuite/alter_table__rename_cached_table/history/]
> InMemoryCatalogedDDLSuite.`alter table: rename cached table` is very flaky. 
> We had better stabilize this.
> {code}
> - alter table: rename cached table !!! CANCELED !!!
>   Array([2,2], [1,1]) did not equal Array([1,1], [2,2]) bad test: wrong data 
> (DDLSuite.scala:786)
> {code}






[jira] [Commented] (SPARK-21946) Flaky test: InMemoryCatalogedDDLSuite.`alter table: rename cached table`

2017-09-08 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158176#comment-16158176
 ] 

Apache Spark commented on SPARK-21946:
--

User 'kiszk' has created a pull request for this issue:
https://github.com/apache/spark/pull/19159

> Flaky test: InMemoryCatalogedDDLSuite.`alter table: rename cached table`
> 
>
> Key: SPARK-21946
> URL: https://issues.apache.org/jira/browse/SPARK-21946
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.2.0
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> According to the [Apache Spark Jenkins 
> History|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/lastCompletedBuild/testReport/org.apache.spark.sql.execution.command/InMemoryCatalogedDDLSuite/alter_table__rename_cached_table/history/]
> InMemoryCatalogedDDLSuite.`alter table: rename cached table` is very flaky. 
> We had better stabilize this.
> {code}
> - alter table: rename cached table !!! CANCELED !!!
>   Array([2,2], [1,1]) did not equal Array([1,1], [2,2]) bad test: wrong data 
> (DDLSuite.scala:786)
> {code}






[jira] [Resolved] (SPARK-21726) Check for structural integrity of the plan in QO in test mode

2017-09-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-21726.
-
   Resolution: Fixed
 Assignee: Liang-Chi Hsieh
Fix Version/s: 2.3.0

> Check for structural integrity of the plan in QO in test mode
> -
>
> Key: SPARK-21726
> URL: https://issues.apache.org/jira/browse/SPARK-21726
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Reynold Xin
>Assignee: Liang-Chi Hsieh
> Fix For: 2.3.0
>
>
> Right now we don't have any checks in the optimizer for the structural 
> integrity of the plan (e.g. that it is still resolved). It would be great if, 
> in test mode, we could check whether a plan is still resolved after the 
> execution of each rule, so we can catch rules that return invalid plans.
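The idea can be sketched roughly as below; this is not the merged implementation, and the helper name, the use of Utils.isTesting, and the resolved-only check are assumptions for illustration.
{code}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.util.Utils

// Illustrative sketch: after every rule application, verify in test mode that
// the plan is still resolved, and name the offending rule if it is not.
def applyWithIntegrityCheck(plan: LogicalPlan, rules: Seq[Rule[LogicalPlan]]): LogicalPlan = {
  rules.foldLeft(plan) { (current, rule) =>
    val next = rule(current)
    if (Utils.isTesting) {
      assert(next.resolved, s"Rule ${rule.ruleName} produced an unresolved plan")
    }
    next
  }
}
{code}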






[jira] [Resolved] (SPARK-21949) Tables created in unit tests should be dropped after use

2017-09-08 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-21949.
-
   Resolution: Fixed
 Assignee: liuxian
Fix Version/s: 2.3.0

> Tables created in unit tests should be dropped after use
> 
>
> Key: SPARK-21949
> URL: https://issues.apache.org/jira/browse/SPARK-21949
> Project: Spark
>  Issue Type: Test
>  Components: Tests
>Affects Versions: 2.3.0
>Reporter: liuxian
>Assignee: liuxian
>Priority: Trivial
> Fix For: 2.3.0
>
>
> Tables should be dropped after use in unit tests.
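A small helper in the spirit of SQLTestUtils.withTable illustrates the pattern; the exact signature here is an assumption for illustration, not the project's test utility itself.
{code}
import org.apache.spark.sql.SparkSession

// Illustrative sketch: run a test body and always drop the named tables
// afterwards, even when the body throws.
def withTable(spark: SparkSession)(tableNames: String*)(body: => Unit): Unit = {
  try body finally {
    tableNames.foreach(name => spark.sql(s"DROP TABLE IF EXISTS $name"))
  }
}

// Usage sketch:
// withTable(spark)("t1") {
//   spark.sql("CREATE TABLE t1(i INT) USING parquet")
//   assert(spark.table("t1").count() == 0)
// }
{code}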


