[jira] [Commented] (SPARK-21902) BlockManager.doPut will hide actually exception when exception thrown in finally block
[ https://issues.apache.org/jira/browse/SPARK-21902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16159623#comment-16159623 ]

Apache Spark commented on SPARK-21902:
--------------------------------------

User 'caneGuy' has created a pull request for this issue:
https://github.com/apache/spark/pull/19171

> BlockManager.doPut will hide actually exception when exception thrown in
> finally block
> ------------------------------------------------------------------------
>
>                 Key: SPARK-21902
>                 URL: https://issues.apache.org/jira/browse/SPARK-21902
>             Project: Spark
>          Issue Type: Wish
>          Components: Block Manager
>    Affects Versions: 2.1.0
>            Reporter: zhoukang
>
> As the log below shows, the actual exception is hidden when removeBlockInternal throws an exception.
> {code:java}
> 2017-08-31,10:26:57,733 WARN org.apache.spark.storage.BlockManager: Putting block broadcast_110 failed due to an exception
> 2017-08-31,10:26:57,734 WARN org.apache.spark.broadcast.BroadcastManager: Failed to create a new broadcast in 1 attempts
> java.io.IOException: Failed to create local dir in /tmp/blockmgr-5bb5ac1e-c494-434a-ab89-bd1808c6b9ed/2e.
>         at org.apache.spark.storage.DiskBlockManager.getFile(DiskBlockManager.scala:70)
>         at org.apache.spark.storage.DiskStore.remove(DiskStore.scala:115)
>         at org.apache.spark.storage.BlockManager.removeBlockInternal(BlockManager.scala:1339)
>         at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:910)
>         at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:948)
>         at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:726)
>         at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:1233)
>         at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:122)
>         at org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:88)
>         at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
>         at org.apache.spark.broadcast.BroadcastManager$$anonfun$newBroadcast$1.apply$mcVI$sp(BroadcastManager.scala:60)
>         at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
>         at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:58)
>         at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1415)
>         at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1002)
>         at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:924)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:771)
>         at org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:770)
>         at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>         at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>         at org.apache.spark.scheduler.DAGScheduler.submitWaitingChildStages(DAGScheduler.scala:770)
>         at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1235)
>         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1662)
>         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1620)
>         at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1609)
>         at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {code}
> I want to print the original exception first for troubleshooting. Or maybe we should not throw an exception when removing blocks.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
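[Editor's note] The masking behavior reported above is general JVM semantics, not something specific to Spark: an exception thrown from a finally block silently discards the exception already in flight. A minimal Java sketch (illustrative only, not Spark's actual doPut/removeBlockInternal code) demonstrates the problem and one common fix, attaching or logging the cleanup failure so the original exception still propagates:

```java
import java.io.IOException;
import java.io.UncheckedIOException;

public class FinallyMasking {

    // Analogue of doPut: the write fails, then cleanup in finally also fails.
    // The JVM propagates only the cleanup exception; the original is lost.
    static void maskingPut() {
        try {
            throw new RuntimeException("real failure while writing the block");
        } finally {
            // analogue of removeBlockInternal failing
            throw new UncheckedIOException(
                new IOException("Failed to create local dir"));
        }
    }

    // One fix: catch the cleanup failure inside the finally block and attach
    // it as a suppressed exception (or log it first), so the original
    // failure is the one callers see.
    static void nonMaskingPut() {
        RuntimeException original =
            new RuntimeException("real failure while writing the block");
        try {
            throw original;
        } finally {
            try {
                // cleanup analogue of removeBlockInternal
                throw new UncheckedIOException(
                    new IOException("Failed to create local dir"));
            } catch (RuntimeException cleanupFailure) {
                original.addSuppressed(cleanupFailure);
            }
        }
    }

    public static void main(String[] args) {
        try {
            maskingPut();
        } catch (Exception e) {
            System.out.println("masking: only see -> " + e.getMessage());
        }
        try {
            nonMaskingPut();
        } catch (Exception e) {
            System.out.println("fixed: see -> " + e.getMessage()
                + " with " + e.getSuppressed().length + " suppressed");
        }
    }
}
```

The PR linked above takes the logging route; the suppressed-exception route shown here is the standard JDK 7+ alternative.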
[jira] [Created] (SPARK-21962) Distributed Tracing in Spark
Andrew Ash created SPARK-21962:
-------------------------------

             Summary: Distributed Tracing in Spark
                 Key: SPARK-21962
                 URL: https://issues.apache.org/jira/browse/SPARK-21962
             Project: Spark
          Issue Type: New Feature
          Components: Spark Core
    Affects Versions: 2.2.0
            Reporter: Andrew Ash

Spark should support distributed tracing, the mechanism widely popularized by Google in the [Dapper Paper|https://research.google.com/pubs/pub36356.html], where network requests carry additional metadata used for tracing requests between services.

This would be useful for me since I have OpenZipkin-style tracing in my distributed application up to the Spark driver, and from the executors out to my other services, but the link is broken in Spark between driver and executor since the Span IDs aren't propagated across that link.

An initial implementation could instrument the most important network calls with trace IDs (like launching and finishing tasks), and incrementally add more tracing to other calls (torrent block distribution, external shuffle service, etc.) as the feature matures.

Search keywords: Dapper, Brave, OpenZipkin, HTrace
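[Editor's note] As a sketch of what the requested instrumentation could look like, the snippet below piggybacks OpenZipkin/B3-style trace and span IDs on a task-launch message and restores them on the executor side. Every name here (TraceContext, TracedLaunchTask, the header keys on the message) is hypothetical; Spark's RPC layer has no such API today, which is exactly what this ticket asks for:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical Dapper/B3-style context propagation across the
// driver -> executor boundary; none of these types exist in Spark.
final class TraceContext {
    final String traceId;
    final String spanId;

    TraceContext(String traceId, String spanId) {
        this.traceId = traceId;
        this.spanId = spanId;
    }
}

final class TracedLaunchTask {
    final byte[] taskPayload;
    final Map<String, String> traceHeaders = new HashMap<>();

    // Driver side: attach the current span's IDs to the launch message,
    // alongside the serialized task itself.
    TracedLaunchTask(byte[] taskPayload, TraceContext ctx) {
        this.taskPayload = taskPayload;
        traceHeaders.put("X-B3-TraceId", ctx.traceId);
        traceHeaders.put("X-B3-SpanId", ctx.spanId);
    }

    // Executor side: rebuild the context so spans created while running the
    // task become children of the driver's span, closing the broken link
    // the reporter describes.
    TraceContext restoreContext() {
        return new TraceContext(
            traceHeaders.get("X-B3-TraceId"),
            traceHeaders.get("X-B3-SpanId"));
    }
}
```

The same pattern would apply to the other calls mentioned (task completion, torrent block distribution, external shuffle service), each carrying the headers of whichever span initiated it.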
[jira] [Assigned] (SPARK-21961) Filter out BlockStatuses Accumulators during replaying history logs in Spark History Server
[ https://issues.apache.org/jira/browse/SPARK-21961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-21961:
------------------------------------

    Assignee: Apache Spark

> Filter out BlockStatuses Accumulators during replaying history logs in Spark
> History Server
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-21961
>                 URL: https://issues.apache.org/jira/browse/SPARK-21961
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 2.1.0, 2.2.0
>            Reporter: Ye Zhou
>            Assignee: Apache Spark
>         Attachments: Objects_Count_in_Heap.png, One_Thread_Took_24GB.png
>
> As described in SPARK-20923, TaskMetrics._updatedBlockStatuses uses a lot of memory in the Driver. Recently we also noticed the same issue in the Spark History Server. Even though SPARK-20084 removes these events from the event logs, multiple versions of Spark, including 1.6.x and 2.1.0, are deployed in our production cluster, and none of them have these two patches included.
> In this case, those events will still show up in the logs and the Spark History Server will replay them. The Spark History Server continuously hits severe full GCs even though we tried to limit the cache size as well as enlarge the heap size to 40GB. We also tried different GC tuning parameters, like using CMS or G1GC. None of them worked.
> We made a heap dump and found that the top memory consumers are BlockStatus objects. There was even one thread that took 24GB of heap while replaying one log file.
> Since the former two tickets have resolved the related issues in both the driver and the writing of history logs, we should also consider adding this filter to the Spark History Server in order to decrease the memory consumption of replaying a history log. For use cases like ours, where multiple older versions of Spark are deployed, this filter should be pretty useful.
> We have deployed our Spark History Server with this filter, and it works fine in our production cluster: it has processed thousands of logs with only a few full GCs in total.
> !https://issues.apache.org/jira/secure/attachment/12886191/Objects_Count_in_Heap.png!
> !https://issues.apache.org/jira/secure/attachment/12886190/One_Thread_Took_24GB.png!
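[Editor's note] The proposed filter amounts to dropping the updated-block-statuses accumulator entries from each replayed task event before it is cached. A rough Java sketch of the idea follows; the event representation is simplified to maps (the real events are JSON handled by Spark's JsonProtocol), and the accumulator name used here is an assumption based on how internal task metrics are labeled:

```java
import java.util.*;
import java.util.stream.Collectors;

public class BlockStatusFilter {

    // Assumed accumulator name; block-status updates travel under an
    // internal metric of roughly this name in recent Spark versions.
    static final String BLOCK_STATUS_ACCUM =
        "internal.metrics.updatedBlockStatuses";

    // Drop the heavyweight block-status accumulator entries from a task's
    // accumulator updates; keep all other metrics untouched.
    static List<Map<String, Object>> filterAccumulators(
            List<Map<String, Object>> accumUpdates) {
        return accumUpdates.stream()
            .filter(a -> !BLOCK_STATUS_ACCUM.equals(a.get("Name")))
            .collect(Collectors.toList());
    }
}
```

Applied to each replayed task-end event, a filter of this shape would keep the per-task BlockStatus lists out of the History Server's heap entirely, which is what the heap dumps above show dominating memory.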
[jira] [Assigned] (SPARK-21961) Filter out BlockStatuses Accumulators during replaying history logs in Spark History Server
[ https://issues.apache.org/jira/browse/SPARK-21961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-21961:
------------------------------------

    Assignee: (was: Apache Spark)
[jira] [Commented] (SPARK-21961) Filter out BlockStatuses Accumulators during replaying history logs in Spark History Server
[ https://issues.apache.org/jira/browse/SPARK-21961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16159514#comment-16159514 ]

Ye Zhou commented on SPARK-21961:
---------------------------------

Pull Request Added: https://github.com/apache/spark/pull/19170
[jira] [Updated] (SPARK-21961) Filter out BlockStatuses Accumulators during replaying history logs in Spark History Server
[ https://issues.apache.org/jira/browse/SPARK-21961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ye Zhou updated SPARK-21961:
----------------------------

    Description:
As described in SPARK-20923, TaskMetrics._updatedBlockStatuses uses a lot of memory in the Driver. Recently we also noticed the same issue in the Spark History Server. Even though SPARK-20084 removes these events from the event logs, multiple versions of Spark, including 1.6.x and 2.1.0, are deployed in our production cluster, and none of them have these two patches included.

In this case, those events will still show up in the logs and the Spark History Server will replay them. The Spark History Server continuously hits severe full GCs even though we tried to limit the cache size as well as enlarge the heap size to 40GB. We also tried different GC tuning parameters, like using CMS or G1GC. None of them worked.

We made a heap dump and found that the top memory consumers are BlockStatus objects. There was even one thread that took 24GB of heap while replaying one log file.

Since the former two tickets have resolved the related issues in both the driver and the writing of history logs, we should also consider adding this filter to the Spark History Server in order to decrease the memory consumption of replaying a history log. For use cases like ours, where multiple older versions of Spark are deployed, this filter should be pretty useful.

We have deployed our Spark History Server with this filter, and it works fine in our production cluster: it has processed thousands of logs with only a few full GCs in total.

!https://issues.apache.org/jira/secure/attachment/12886191/Objects_Count_in_Heap.png!
!https://issues.apache.org/jira/secure/attachment/12886190/One_Thread_Took_24GB.png!
[jira] [Updated] (SPARK-21961) Filter out BlockStatuses Accumulators during replaying history logs in Spark History Server
[ https://issues.apache.org/jira/browse/SPARK-21961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ye Zhou updated SPARK-21961: Description: As described in SPARK-20923, TaskMetrics._updatedBlockStatuses uses a lot of memory in Driver. Recently we also noticed the same issue in Spark History Server. Even though in SPARK-20084, those event logs are getting removed from history log. But multiple versions of Spark including 1.6.x and 2.1.0 versions are deployed in our production cluster, none of them have these two patches included. In this case, those event logs will still be in shown up in logs and Spark History Server will replay them. Spark History Server continuously get severe Full GCs even though we tried to limit cache size as well as enlarge the heapsize to 40GB. We also tried with different GC tuning parameters, like using CMS or G1GC. None of them works. We made a heap dump, and found that the top memory consumer objects is BlockStatus. There was even one thread that took 23GB heap which was replaying one log file. Since the former two tickets has resolved related issues in both driver and writing to history logs, we should also consider add this filter to Spark History Server in order to decrease the memory consumption for replaying one history log. For use cases like us, where we have multiple older versions of Spark deployed, this filter should be pretty useful. !https://issues.apache.org/jira/secure/attachment/12886191/Objects_Count_in_Heap.png! !https://issues.apache.org/jira/secure/attachment/12886190/One_Thread_Took_24GB.png! was: As described in SPARK-20923, TaskMetrics._updatedBlockStatuses uses a lot of memory in Driver. Recently we also noticed the same issue in Spark History Server. Even though in SPARK-20084, those event logs are getting removed from history log. But multiple versions of Spark including 1.6.x and 2.1.0 versions are deployed in our production cluster, none of them have these two patches included. 
In this case, those event logs will still be in shown up in logs and Spark History Server will replay them. Spark History Server continuously get severe Full GCs even though we tried to limit cache size as well as enlarge the heapsize to 40GB. We also tried with different GC tuning parameters, like using CMS or G1GC. None of them works. We made a heap dump, and found that the top memory consumer objects is BlockStatus. There was even one thread that took 24GB heap which was replaying one log file. Since the former two tickets has resolved related issues in both driver and writing to history logs, we should also consider add this filter to Spark History Server in order to decrease the memory consumption for replaying one history log. For use cases like us, where we have multiple older versions of Spark deployed, this filter should be pretty useful. !https://issues.apache.org/jira/secure/attachment/12886191/Objects_Count_in_Heap.png! !https://issues.apache.org/jira/secure/attachment/12886190/One_Thread_Took_24GB.png! > Filter out BlockStatuses Accumulators during replaying history logs in Spark > History Server > --- > > Key: SPARK-21961 > URL: https://issues.apache.org/jira/browse/SPARK-21961 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Ye Zhou > Attachments: Objects_Count_in_Heap.png, One_Thread_Took_24GB.png > > > As described in SPARK-20923, TaskMetrics._updatedBlockStatuses uses a lot of > memory in Driver. Recently we also noticed the same issue in Spark History > Server. Even though in SPARK-20084, those event logs are getting removed from > history log. But multiple versions of Spark including 1.6.x and 2.1.0 > versions are deployed in our production cluster, none of them have these two > patches included. > In this case, those event logs will still be in shown up in logs and Spark > History Server will replay them. 
Spark History Server continuously get severe > Full GCs even though we tried to limit cache size as well as enlarge the > heapsize to 40GB. We also tried with different GC tuning parameters, like > using CMS or G1GC. None of them works. > We made a heap dump, and found that the top memory consumer objects is > BlockStatus. There was even one thread that took 23GB heap which was > replaying one log file. > Since the former two tickets has resolved related issues in both driver and > writing to history logs, we should also consider add this filter to Spark > History Server in order to decrease the memory consumption for replaying one > history log. For use cases like us, where we have multiple older versions of > Spark deployed, this filter should be pretty useful. >
[jira] [Updated] (SPARK-21961) Filter out BlockStatuses Accumulators during replaying history logs in Spark History Server
[ https://issues.apache.org/jira/browse/SPARK-21961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ye Zhou updated SPARK-21961: Description: As described in SPARK-20923, TaskMetrics._updatedBlockStatuses uses a lot of memory in Driver. Recently we also noticed the same issue in Spark History Server. Even though in SPARK-20084, those event logs are getting removed from history log. But multiple versions of Spark including 1.6.x and 2.1.0 versions are deployed in our production cluster, none of them have these two patches included. In this case, those event logs will still be in shown up in logs and Spark History Server will replay them. Spark History Server continuously get severe Full GCs even though we tried to limit cache size as well as enlarge the heapsize to 40GB. We also tried with different GC tuning parameters, like using CMS or G1GC. None of them works. We made a heap dump, and found that the top memory consumer objects is BlockStatus. There was even one thread that took 24GB heap which was replaying one log file. Since the former two tickets has resolved related issues in both driver and writing to history logs, we should also consider add this filter to Spark History Server in order to decrease the memory consumption for replaying one history log. For use cases like us, where we have multiple older versions of Spark deployed, this filter should be pretty useful. !https://issues.apache.org/jira/secure/attachment/12886191/Objects_Count_in_Heap.png! !https://issues.apache.org/jira/secure/attachment/12886190/One_Thread_Took_24GB.png! was: As described in SPARK-20923, TaskMetrics._updatedBlockStatuses uses a lot of memory in Driver. Recently we also noticed the same issue in Spark History Server. Even though in SPARK-20084, those event logs are getting removed from history log. But multiple versions of Spark including 1.6.x and 2.1.0 versions are deployed in our production cluster, none of them have these two patches included. 
In this case, those event logs will still be in shown up in logs and Spark History Server will replay them. Spark History Server continuously get severe Full GCs even though we tried to limit cache size as well as enlarge the heapsize to 40GB. We also tried with different GC tuning parameters, like using CMS or G1GC. None of them works. We made a heap dump, and found that the top memory consumer objects is BlockStatus. There was even one thread that took 24GB heap which was replaying one log file. Since the former two tickets has resolved related issues in both driver and writing to history logs, we should also consider add this filter to Spark History Server in order to decrease the memory consumption for replaying one history log. For use cases like us, where we have multiple older versions of Spark deployed, this filter should be pretty useful. !https://issues.apache.org/jira/secure/attachment/12886191/Objects_Count_in_Heap.png! > Filter out BlockStatuses Accumulators during replaying history logs in Spark > History Server > --- > > Key: SPARK-21961 > URL: https://issues.apache.org/jira/browse/SPARK-21961 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Ye Zhou > Attachments: Objects_Count_in_Heap.png, One_Thread_Took_24GB.png > > > As described in SPARK-20923, TaskMetrics._updatedBlockStatuses uses a lot of > memory in Driver. Recently we also noticed the same issue in Spark History > Server. Even though in SPARK-20084, those event logs are getting removed from > history log. But multiple versions of Spark including 1.6.x and 2.1.0 > versions are deployed in our production cluster, none of them have these two > patches included. > In this case, those event logs will still be in shown up in logs and Spark > History Server will replay them. Spark History Server continuously get severe > Full GCs even though we tried to limit cache size as well as enlarge the > heapsize to 40GB. 
We also tried with different GC tuning parameters, like > using CMS or G1GC. None of them works. > We made a heap dump, and found that the top memory consumer objects is > BlockStatus. There was even one thread that took 24GB heap which was > replaying one log file. > Since the former two tickets has resolved related issues in both driver and > writing to history logs, we should also consider add this filter to Spark > History Server in order to decrease the memory consumption for replaying one > history log. For use cases like us, where we have multiple older versions of > Spark deployed, this filter should be pretty useful. > !https://issues.apache.org/jira/secure/attachment/12886191/Objects_Count_in_Heap.png! >
[jira] [Updated] (SPARK-21961) Filter out BlockStatuses Accumulators during replaying history logs in Spark History Server
[ https://issues.apache.org/jira/browse/SPARK-21961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ye Zhou updated SPARK-21961: Description: As described in SPARK-20923, TaskMetrics._updatedBlockStatuses uses a lot of memory in Driver. Recently we also noticed the same issue in Spark History Server. Even though in SPARK-20084, those event logs are getting removed from history log. But multiple versions of Spark including 1.6.x and 2.1.0 versions are deployed in our production cluster, none of them have these two patches included. In this case, those event logs will still be in shown up in logs and Spark History Server will replay them. Spark History Server continuously get severe Full GCs even though we tried to limit cache size as well as enlarge the heapsize to 40GB. We also tried with different GC tuning parameters, like using CMS or G1GC. None of them works. We made a heap dump, and found that the top memory consumer objects is BlockStatus. There was even one thread that took 24GB heap which was replaying one log file. Since the former two tickets has resolved related issues in both driver and writing to history logs, we should also consider add this filter to Spark History Server in order to decrease the memory consumption for replaying one history log. For use cases like us, where we have multiple older versions of Spark deployed, this filter should be pretty useful. !https://issues.apache.org/jira/secure/attachment/12886191/Objects_Count_in_Heap.png! was: As described in SPARK-20923, TaskMetrics._updatedBlockStatuses uses a lot of memory in Driver. Recently we also noticed the same issue in Spark History Server. Even though in SPARK-20084, those event logs are getting removed from history log. But multiple versions of Spark including 1.6.x and 2.1.0 versions are deployed in our production cluster, none of them have these two patches included. 
In this case, those event logs will still be in shown up in logs and Spark History Server will replay them. Spark History Server continuously get severe Full GCs even though we tried to limit cache size as well as enlarge the heapsize to 40GB. We also tried with different GC tuning parameters, like using CMS or G1GC. None of them works. We made a heap dump, and found that the top memory consumer objects is BlockStatus. There was even one thread that took 24GB heap which was replaying one log file. Since the former two tickets has resolved related issues in both driver and writing to history logs, we should also consider add this filter to Spark History Server in order to decrease the memory consumption for replaying one history log. For use cases like us, where we have multiple older versions of Spark deployed, this filter should be pretty useful. > Filter out BlockStatuses Accumulators during replaying history logs in Spark > History Server > --- > > Key: SPARK-21961 > URL: https://issues.apache.org/jira/browse/SPARK-21961 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Ye Zhou > Attachments: Objects_Count_in_Heap.png, One_Thread_Took_24GB.png > > > As described in SPARK-20923, TaskMetrics._updatedBlockStatuses uses a lot of > memory in Driver. Recently we also noticed the same issue in Spark History > Server. Even though in SPARK-20084, those event logs are getting removed from > history log. But multiple versions of Spark including 1.6.x and 2.1.0 > versions are deployed in our production cluster, none of them have these two > patches included. > In this case, those event logs will still be in shown up in logs and Spark > History Server will replay them. Spark History Server continuously get severe > Full GCs even though we tried to limit cache size as well as enlarge the > heapsize to 40GB. We also tried with different GC tuning parameters, like > using CMS or G1GC. None of them works. 
> We made a heap dump, and found that the top memory consumer objects is > BlockStatus. There was even one thread that took 24GB heap which was > replaying one log file. > Since the former two tickets has resolved related issues in both driver and > writing to history logs, we should also consider add this filter to Spark > History Server in order to decrease the memory consumption for replaying one > history log. For use cases like us, where we have multiple older versions of > Spark deployed, this filter should be pretty useful. > !https://issues.apache.org/jira/secure/attachment/12886191/Objects_Count_in_Heap.png! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional
[jira] [Updated] (SPARK-21961) Filter out BlockStatuses Accumulators during replaying history logs in Spark History Server
[ https://issues.apache.org/jira/browse/SPARK-21961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ye Zhou updated SPARK-21961: Description: As described in SPARK-20923, TaskMetrics._updatedBlockStatuses uses a lot of memory in the Driver. Recently we also noticed the same issue in the Spark History Server. Even though SPARK-20084 removes those event logs from the history log, multiple versions of Spark, including 1.6.x and 2.1.0, are deployed in our production cluster, and none of them have these two patches included. In this case, those event logs will still show up in the logs and the Spark History Server will replay them. The Spark History Server continuously hits severe Full GCs even though we tried to limit the cache size and to enlarge the heap size to 40GB. We also tried different GC tuning parameters, such as CMS and G1GC; none of them worked. We made a heap dump and found that the top memory-consuming objects are BlockStatus instances. There was even one thread that took 24GB of heap while replaying a single log file. Since the former two tickets resolved the related issues in both the driver and the writing of history logs, we should also consider adding this filter to the Spark History Server to decrease the memory consumption of replaying a history log. For use cases like ours, where multiple older versions of Spark are deployed, this filter should be pretty useful.
> Filter out BlockStatuses Accumulators during replaying history logs in Spark
> History Server
> ---
>
> Key: SPARK-21961
> URL: https://issues.apache.org/jira/browse/SPARK-21961
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.1.0, 2.2.0
> Reporter: Ye Zhou
> Attachments: Objects_Count_in_Heap.png, One_Thread_Took_24GB.png
[jira] [Updated] (SPARK-21961) Filter out BlockStatuses Accumulators during replaying history logs in Spark History Server
[ https://issues.apache.org/jira/browse/SPARK-21961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ye Zhou updated SPARK-21961: Attachment: Objects_Count_in_Heap.png One_Thread_Took_24GB.png
> Filter out BlockStatuses Accumulators during replaying history logs in Spark
> History Server
> ---
>
> Key: SPARK-21961
> URL: https://issues.apache.org/jira/browse/SPARK-21961
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 2.1.0, 2.2.0
> Reporter: Ye Zhou
> Attachments: Objects_Count_in_Heap.png, One_Thread_Took_24GB.png
[jira] [Created] (SPARK-21961) Filter out BlockStatuses Accumulators during replaying history logs in Spark History Server
Ye Zhou created SPARK-21961: --- Summary: Filter out BlockStatuses Accumulators during replaying history logs in Spark History Server Key: SPARK-21961 URL: https://issues.apache.org/jira/browse/SPARK-21961 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.2.0, 2.1.0 Reporter: Ye Zhou As described in SPARK-20923, TaskMetrics._updatedBlockStatuses uses a lot of memory in the Driver. Recently we also noticed the same issue in the Spark History Server. Even though SPARK-20084 removes those event logs from the history log, multiple versions of Spark, including 1.6.x and 2.1.0, are deployed in our production cluster, and none of them have these two patches included. In this case, those event logs will still show up in the logs and the Spark History Server will replay them. The Spark History Server continuously hits severe Full GCs even though we tried to limit the cache size and to enlarge the heap size to 40GB. We also tried different GC tuning parameters, such as CMS and G1GC; none of them worked. We made a heap dump and found that the top memory-consuming objects are BlockStatus instances. There was even one thread that took 24GB of heap while replaying a single log file. Since the former two tickets resolved the related issues in both the driver and the writing of history logs, we should also consider adding this filter to the Spark History Server to decrease the memory consumption of replaying a history log. For use cases like ours, where multiple older versions of Spark are deployed, this filter should be pretty useful.
[jira] [Created] (SPARK-21960) Spark Streaming Dynamic Allocation should respect spark.executor.instances
Karthik Palaniappan created SPARK-21960: --- Summary: Spark Streaming Dynamic Allocation should respect spark.executor.instances Key: SPARK-21960 URL: https://issues.apache.org/jira/browse/SPARK-21960 Project: Spark Issue Type: Improvement Components: DStreams Affects Versions: 2.2.0 Reporter: Karthik Palaniappan Priority: Minor This check enforces that spark.executor.instances (aka --num-executors) is either unset or explicitly set to 0. https://github.com/apache/spark/blob/v2.2.0/streaming/src/main/scala/org/apache/spark/streaming/scheduler/ExecutorAllocationManager.scala#L207 If spark.executor.instances is unset, the check is fine, and the property defaults to 2. Spark requests the cluster manager for 2 executors to start with, then adds/removes executors appropriately. However, if you explicitly set it to 0, the check also succeeds, but Spark never asks the cluster manager for any executors. When running on YARN, I repeatedly saw: {code:java} 17/08/22 19:35:21 WARN org.apache.spark.scheduler.cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources 17/08/22 19:35:36 WARN org.apache.spark.scheduler.cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources 17/08/22 19:35:51 WARN org.apache.spark.scheduler.cluster.YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources {code} I noticed that at least Google Dataproc and Ambari explicitly set spark.executor.instances to a positive number, meaning that to use dynamic allocation, you would have to edit spark-defaults.conf to remove the property. That's obnoxious. 
In addition, in Spark 2.3, spark-submit will refuse to accept "0" as a value for --num-executors or --conf spark.executor.instances: https://github.com/apache/spark/commit/0fd84b05dc9ac3de240791e2d4200d8bdffbb01a#diff-63a5d817d2d45ae24de577f6a1bd80f9 It is much more reasonable for Streaming DRA to use spark.executor.instances, just like Core DRA. I'll open a pull request to remove the check if there are no objections. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
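The proposed change can be modeled in a few lines: seed the streaming allocation target from spark.executor.instances when it is set, and fall back to a default otherwise, just as core dynamic allocation does. A hedged sketch — the helper name and the default of 2 are illustrative assumptions, not the actual ExecutorAllocationManager code:

```python
# Model of the proposed behavior: streaming dynamic allocation should use
# spark.executor.instances as its initial executor target, like core DRA,
# instead of requiring the property to be unset or 0.
DEFAULT_INITIAL_EXECUTORS = 2  # assumption: default used when the property is unset

def initial_executor_target(conf: dict) -> int:
    """Return the number of executors to request from the cluster manager."""
    explicit = conf.get("spark.executor.instances")
    if explicit is not None:
        # An explicitly configured value (e.g. set by a cluster's defaults)
        # is honored rather than rejected.
        return int(explicit)
    return DEFAULT_INITIAL_EXECUTORS

print(initial_executor_target({"spark.executor.instances": "10"}))
print(initial_executor_target({}))
```

Under this model, clusters whose spark-defaults.conf sets spark.executor.instances to a positive number could enable streaming dynamic allocation without editing the defaults file.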
[jira] [Resolved] (SPARK-19866) Add local version of Word2Vec findSynonyms for spark.ml: Python API
[ https://issues.apache.org/jira/browse/SPARK-19866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk resolved SPARK-19866. - Resolution: Fixed Fix Version/s: 2.3.0 > Add local version of Word2Vec findSynonyms for spark.ml: Python API > --- > > Key: SPARK-19866 > URL: https://issues.apache.org/jira/browse/SPARK-19866 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Joseph K. Bradley >Assignee: Xin Ren >Priority: Minor > Fix For: 2.3.0 > > > Add Python API for findSynonymsArray matching Scala API in linked JIRA. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15243) Binarizer.explainParam(u"...") raises ValueError
[ https://issues.apache.org/jira/browse/SPARK-15243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk resolved SPARK-15243. - Resolution: Fixed
> Binarizer.explainParam(u"...") raises ValueError
>
> Key: SPARK-15243
> URL: https://issues.apache.org/jira/browse/SPARK-15243
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.0.0
> Environment: CentOS 7, Spark 1.6.0
> Reporter: Kazuki Yokoishi
> Assignee: Hyukjin Kwon
> Priority: Minor
> Fix For: 2.3.0
>
> When unicode is passed to Binarizer.explainParam(), a ValueError occurs.
> To reproduce:
> {noformat}
> >>> binarizer = Binarizer(threshold=1.0, inputCol="values",
> >>> outputCol="features")
> >>> binarizer.explainParam("threshold") # str can be passed
> 'threshold: threshold in binary classification prediction, in range [0, 1]
> (default: 0.0, current: 1.0)'
> >>> binarizer.explainParam(u"threshold") # unicode cannot be passed
> ---
> ValueErrorTraceback (most recent call last)
> in ()
> > 1 binarizer.explainParam(u"threshold")
> /usr/spark/current/python/pyspark/ml/param/__init__.py in explainParam(self,
> param)
> 96 default value and user-supplied value in a string.
> 97 """
> ---> 98 param = self._resolveParam(param)
> 99 values = []
> 100 if self.isDefined(param):
> /usr/spark/current/python/pyspark/ml/param/__init__.py in _resolveParam(self,
> param)
> 231 return self.getParam(param)
> 232 else:
> --> 233 raise ValueError("Cannot resolve %r as a param." % param)
> 234
> 235 @staticmethod
> ValueError: Cannot resolve u'threshold' as a param.
> {noformat}
> The same errors occur in other methods:
> * Binarizer.hasDefault()
> * Binarizer.getOrDefault()
> * Binarizer.isSet()
> These errors are caused by the *isinstance(obj, str)* checks in
> pyspark.ml.param.Params._resolveParam().
> basestring should be used instead of str in isinstance() for backward
> compatibility, as below.
> {noformat}
> if sys.version >= '3':
> basestring = str
> if isinstance(obj, basestring):
> # TODO
> {noformat}
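The backward-compatible check suggested above can be written as a standalone, runnable sketch. Hedged: the actual pyspark `_resolveParam` does more than this, and the helper name here is illustrative only:

```python
import sys

if sys.version_info[0] >= 3:
    # Python 3 has no basestring; alias it to str so the isinstance
    # check below works on both Python 2 and Python 3.
    basestring = str

def is_param_name(obj):
    """Return True when obj could be resolved as a param name string.

    Mirrors the suggested fix: test against basestring so that both
    str and unicode are accepted on Python 2.
    """
    return isinstance(obj, basestring)

print(is_param_name(u"threshold"), is_param_name("threshold"), is_param_name(3))
```

With the original `isinstance(obj, str)` check, `u"threshold"` would fail on Python 2, which is exactly the reported ValueError.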
[jira] [Updated] (SPARK-15243) Binarizer.explainParam(u"...") raises ValueError
[ https://issues.apache.org/jira/browse/SPARK-15243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk updated SPARK-15243: Fix Version/s: 2.3.0
> Binarizer.explainParam(u"...") raises ValueError
>
> Key: SPARK-15243
> URL: https://issues.apache.org/jira/browse/SPARK-15243
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.0.0
> Environment: CentOS 7, Spark 1.6.0
> Reporter: Kazuki Yokoishi
> Assignee: Hyukjin Kwon
> Priority: Minor
> Fix For: 2.3.0
[jira] [Assigned] (SPARK-15243) Binarizer.explainParam(u"...") raises ValueError
[ https://issues.apache.org/jira/browse/SPARK-15243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] holdenk reassigned SPARK-15243: --- Assignee: Hyukjin Kwon (was: Seth Hendrickson)
> Binarizer.explainParam(u"...") raises ValueError
>
> Key: SPARK-15243
> URL: https://issues.apache.org/jira/browse/SPARK-15243
> Project: Spark
> Issue Type: Bug
> Components: PySpark
> Affects Versions: 2.0.0
> Environment: CentOS 7, Spark 1.6.0
> Reporter: Kazuki Yokoishi
> Assignee: Hyukjin Kwon
> Priority: Minor
[jira] [Commented] (SPARK-18085) SPIP: Better History Server scalability for many / large applications
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158997#comment-16158997 ] Marcelo Vanzin commented on SPARK-18085: [~jincheng] that is caused by SPARK-17701. The bug is still open but the patch has actually been committed, and it removes a property of {{SparkPlanInfo}} that makes Spark 2.3 unable to read event logs from earlier versions. Can you file a new bug with that information? Thanks. > SPIP: Better History Server scalability for many / large applications > - > > Key: SPARK-18085 > URL: https://issues.apache.org/jira/browse/SPARK-18085 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin > Labels: SPIP > Attachments: screenshot-1.png, screenshot-2.png, spark_hs_next_gen.pdf > > > It's a known fact that the History Server currently has some annoying issues > when serving lots of applications, and when serving large applications. > I'm filing this umbrella to track work related to addressing those issues. > I'll be attaching a document shortly describing the issues and suggesting a > path to how to solve them. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21907) NullPointerException in UnsafeExternalSorter.spill()
[ https://issues.apache.org/jira/browse/SPARK-21907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158995#comment-16158995 ] Kazuaki Ishizaki commented on SPARK-21907: -- If you cannot provide a repro, could you please run your program with the latest master branch? SPARK-21319 may alleviate this issue. > NullPointerException in UnsafeExternalSorter.spill() > > > Key: SPARK-21907 > URL: https://issues.apache.org/jira/browse/SPARK-21907 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: Juliusz Sompolski > > I see NPE during sorting with the following stacktrace: > {code} > java.lang.NullPointerException > at > org.apache.spark.memory.TaskMemoryManager.getPage(TaskMemoryManager.java:383) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:63) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter$SortComparator.compare(UnsafeInMemorySorter.java:43) > at > org.apache.spark.util.collection.TimSort.countRunAndMakeAscending(TimSort.java:270) > at org.apache.spark.util.collection.TimSort.sort(TimSort.java:142) > at org.apache.spark.util.collection.Sorter.sort(Sorter.scala:37) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.getSortedIterator(UnsafeInMemorySorter.java:345) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:206) > at > org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203) > at > org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281) > at > org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.reset(UnsafeInMemorySorter.java:173) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.spill(UnsafeExternalSorter.java:221) > at > 
org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:203) > at > org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:281) > at > org.apache.spark.memory.MemoryConsumer.allocateArray(MemoryConsumer.java:90) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.growPointerArrayIfNecessary(UnsafeExternalSorter.java:349) > at > org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:400) > at > org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:109) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) > at > org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83) > at > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.advancedStreamed(SortMergeJoinExec.scala:778) > at > org.apache.spark.sql.execution.joins.SortMergeJoinScanner.findNextInnerJoinRows(SortMergeJoinExec.scala:685) > at > org.apache.spark.sql.execution.joins.SortMergeJoinExec$$anonfun$doExecute$1$$anon$2.advanceNext(SortMergeJoinExec.scala:259) > at > org.apache.spark.sql.execution.RowIteratorToScala.hasNext(RowIterator.scala:68) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395) > at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) > at > org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:108) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:346) > at >
[jira] [Resolved] (SPARK-21959) Python RDD goes into never ending garbage collection service when spark submit is triggered in oozie
[ https://issues.apache.org/jira/browse/SPARK-21959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21959. --- Resolution: Invalid There's no detail on the job, and no indication that this is a problem in Spark. Your app is just running out of memory. You optimized your app and it worked. Not something you report as a JIRA.
> Python RDD goes into never ending garbage collection service when spark
> submit is triggered in oozie
>
> Key: SPARK-21959
> URL: https://issues.apache.org/jira/browse/SPARK-21959
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Spark Submit
> Affects Versions: 2.1.0
> Environment: Head Node - 2 - 8 cores - 55 GB/Node
> Worker Node - 5 - 4 cores - 28 GB/Node
> Reporter: VP
> Original Estimate: 30h
> Remaining Estimate: 30h
>
> When the job is submitted through spark-submit, the code executes fine.
> But when called through Oozie, whenever a PythonRDD is triggered, it
> gets into a never-ending garbage collection cycle.
> When the RDD is replaced by a DataFrame, the code executes fine.
> We need to understand the root cause of why garbage collection
> is invoked only when the job is called through Oozie.
[jira] [Closed] (SPARK-21959) Python RDD goes into never ending garbage collection service when spark submit is triggered in oozie
[ https://issues.apache.org/jira/browse/SPARK-21959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen closed SPARK-21959. -
> Python RDD goes into never ending garbage collection service when spark
> submit is triggered in oozie
>
> Key: SPARK-21959
> URL: https://issues.apache.org/jira/browse/SPARK-21959
> Project: Spark
> Issue Type: Bug
> Components: PySpark, Spark Submit
> Affects Versions: 2.1.0
> Environment: Head Node - 2 - 8 cores - 55 GB/Node
> Worker Node - 5 - 4 cores - 28 GB/Node
> Reporter: VP
> Original Estimate: 30h
> Remaining Estimate: 30h
[jira] [Created] (SPARK-21959) Python RDD goes into never ending garbage collection service when spark submit is triggered in oozie
Vega Paleri created SPARK-21959: --- Summary: Python RDD goes into never ending garbage collection service when spark submit is triggered in oozie Key: SPARK-21959 URL: https://issues.apache.org/jira/browse/SPARK-21959 Project: Spark Issue Type: Bug Components: PySpark, Spark Submit Affects Versions: 2.1.0 Environment: Head Node - 2 - 8 cores - 55 GB/Node Worker Node - 5 - 4 cores - 28 GB/Node Reporter: Vega Paleri When the job is submitted through spark-submit, the code executes fine. But when called through Oozie, whenever a PythonRDD is triggered, it gets into a never-ending garbage collection cycle. When the RDD is replaced by a DataFrame, the code executes fine. We need to understand the root cause of why garbage collection is invoked only when the job is called through Oozie.
[jira] [Updated] (SPARK-21893) Put Kafka 0.8 behind a profile
[ https://issues.apache.org/jira/browse/SPARK-21893?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-21893: -- Description: Kafka does not support 0.8.x for Scala 2.12. This code will have to, at least, be optionally enabled by a profile, which could be enabled by default for 2.11. Or outright removed. Update: it'll also require removing 0.8.x examples, because otherwise the example module has to be split. While not necessarily connected, it's probably a decent point to declare 0.8 deprecated. And that means declaring 0.10 (the other API left) as stable. was:Kafka does not support 0.8.x for Scala 2.12. This code will have to, at least, be optionally enabled by a profile, which could be enabled by default for 2.11. Or outright removed. > Put Kafka 0.8 behind a profile > -- > > Key: SPARK-21893 > URL: https://issues.apache.org/jira/browse/SPARK-21893 > Project: Spark > Issue Type: Sub-task > Components: DStreams >Affects Versions: 2.2.0 >Reporter: Sean Owen >Priority: Minor > > Kafka does not support 0.8.x for Scala 2.12. This code will have to, at > least, be optionally enabled by a profile, which could be enabled by default > for 2.11. Or outright removed. > Update: it'll also require removing 0.8.x examples, because otherwise the > example module has to be split. > While not necessarily connected, it's probably a decent point to declare 0.8 > deprecated. And that means declaring 0.10 (the other API left) as stable. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21944) Watermark on window column is wrong
[ https://issues.apache.org/jira/browse/SPARK-21944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158948#comment-16158948 ] Kevin Zhang commented on SPARK-21944: - [~mgaido] Do you mean the following way by saying "define the watermark on the column 'time'"?
{code:java}
val counts = events.select(window($"time", "5 seconds"), $"time", $"id")
.withWatermark("time", "10 seconds")
.dropDuplicates("id", "window")
.groupBy("window")
.count
{code}
I don't know whether this is right, because the documentation indicates we should use the same column in the aggregation as is used in the watermark, that is, the "time" column (which is not what I want). I tried this way and the application doesn't throw any exception, but it also didn't drop events older than the watermark as expected. In the following example, after the batch containing an event with time=1504774540 (2017/9/7 16:55:40 CST) is processed (the watermark should be advanced to 2017/9/7 16:55:30 CST), I send an event with time=1504745724 (2017/9/7 8:55:24 CST); this event is processed instead of being dropped as expected.
{code:java}
+-+-+
|window |count|
+-+-+
|[2017-09-07 16:55:40.0,2017-09-07 16:55:45.0]|1|
|[2017-09-07 08:55:20.0,2017-09-07 08:55:25.0]|1|
|[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|3|
+-+-+
{min=2017-09-07T00:55:24.000Z, avg=2017-09-07T00:55:24.000Z, watermark=2017-09-07T08:55:30.000Z, max=2017-09-07T00:55:24.000Z}
{code}
One important thing to note: my time zone is CST, not UTC. The start and end times in the window are right, but the watermark is reported in UTC; I don't know whether this has an influence. If anything is unclear, please point it out and I will explain.
Thanks > Watermark on window column is wrong > --- > > Key: SPARK-21944 > URL: https://issues.apache.org/jira/browse/SPARK-21944 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Kevin Zhang > > When I use a watermark with dropDuplicates in the following way, the > watermark is calculated wrong > {code:java} > val counts = events.select(window($"time", "5 seconds"), $"time", $"id") > .withWatermark("window", "10 seconds") > .dropDuplicates("id", "window") > .groupBy("window") > .count > {code} > where events is a dataframe with a timestamp column "time" and long column > "id". > I registered a listener to print the event time stats in each batch, and the > results is like the following > {code:shell} > --- > Batch: 0 > --- > +-+-+ > > |window |count| > +-+-+ > |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|3| > +-+-+ > {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, > watermark=1970-01-01T00:00:00.000Z, max=1970-01-01T19:05:19.476Z} > {watermark=1970-01-01T00:00:00.000Z} > {watermark=1970-01-01T00:00:00.000Z} > {watermark=1970-01-01T00:00:00.000Z} > {watermark=1970-01-01T00:00:00.000Z} > --- > Batch: 1 > --- > +-+-+ > > |window |count| > +-+-+ > |[2017-09-07 16:55:40.0,2017-09-07 16:55:45.0]|1| > |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|3| > +-+-+ > {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, > watermark=1970-01-01T19:05:09.476Z, max=1970-01-01T19:05:19.476Z} > {watermark=1970-01-01T19:05:09.476Z} > --- > Batch: 2 > --- > +-+-+ > > |window |count| > +-+-+ > |[2017-09-07 16:55:40.0,2017-09-07 16:55:45.0]|1| > |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|4| > +-+-+ > {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, > watermark=1970-01-01T19:05:09.476Z, max=1970-01-01T19:05:19.476Z} > {watermark=1970-01-01T19:05:09.476Z} > {code} > As can be seen, the event time stats are wrong which are always in > 1970-01-01, so the watermark
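For readers following the thread, the dropping behaviour being debated can be illustrated outside Spark. The sketch below is a deliberately simplified, hypothetical model of Structured Streaming's watermark rule (the watermark trails the maximum observed event time by the configured delay, and events that arrive below it are dropped); it is not Spark's actual implementation, and all names in it are made up:

```python
# Simplified, illustrative model of event-time watermarking (NOT Spark's
# actual implementation): the watermark trails the maximum observed event
# time by a fixed delay, and events older than the watermark are dropped.

DELAY = 10  # watermark delay in seconds, as in withWatermark("time", "10 seconds")

def process_batches(batches, delay=DELAY):
    """Process batches of event timestamps; return the kept events per batch."""
    watermark = 0  # starts at the epoch, like the 1970-01-01 watermark in batch 0
    kept_per_batch = []
    for batch in batches:
        # Drop events that arrive below the current watermark.
        kept = [t for t in batch if t >= watermark]
        kept_per_batch.append(kept)
        # Advance the watermark after the batch, from the max event time seen.
        if kept:
            watermark = max(watermark, max(kept) - delay)
    return kept_per_batch

# An event at t=100 advances the watermark to 90, so a later event at t=50
# (analogous to the 8:55:24 event arriving after the 16:55:40 one) is dropped.
result = process_batches([[100], [50]])
```

Under this model the second batch in the report above should indeed come back empty, which is why the 8:55:24 event being processed looks like a bug.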
[jira] [Commented] (SPARK-20958) Roll back parquet-mr 1.8.2 to parquet-1.8.1
[ https://issues.apache.org/jira/browse/SPARK-20958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158924#comment-16158924 ] Ryan Blue commented on SPARK-20958: --- [~spiricalsalsaz], you need to only pin parquet-avro, not the other Parquet libs. This is caused by a bug in Parquet that has been fixed in 1.8.2, so you want the 1.8.2 version of parquet-hadoop, but the 1.8.1 version of parquet-avro. Alternatively, you can shade and relocate the version of Avro you want and use parquet-avro 1.8.2. That's what I'd recommend. > Roll back parquet-mr 1.8.2 to parquet-1.8.1 > --- > > Key: SPARK-20958 > URL: https://issues.apache.org/jira/browse/SPARK-20958 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Cheng Lian >Assignee: Cheng Lian > Labels: release-notes, release_notes, releasenotes > > We recently realized that parquet-mr 1.8.2 used by Spark 2.2.0-rc2 depends on > avro 1.8.1, which is incompatible with avro 1.7.6 used by parquet-mr 1.8.1 > and avro 1.7.7 used by spark-core 2.2.0-rc2. > Basically, Spark 2.2.0-rc2 introduced two incompatible versions of avro > (1.7.7 and 1.8.1). Upgrading avro 1.7.7 to 1.8.1 is not preferable due to the > reasons mentioned in [PR > #17163|https://github.com/apache/spark/pull/17163#issuecomment-286563131]. > Therefore, we don't really have many choices here and have to roll back > parquet-mr 1.8.2 to 1.8.1 to resolve this dependency conflict.
[jira] [Updated] (SPARK-21128) Running R tests multiple times failed due to pre-exiting "spark-warehouse" / "metastore_db"
[ https://issues.apache.org/jira/browse/SPARK-21128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Felix Cheung updated SPARK-21128: - Target Version/s: 2.2.1, 2.3.0 (was: 2.3.0) Fix Version/s: 2.2.1 > Running R tests multiple times failed due to pre-exiting "spark-warehouse" / > "metastore_db" > --- > > Key: SPARK-21128 > URL: https://issues.apache.org/jira/browse/SPARK-21128 > Project: Spark > Issue Type: Test > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 2.2.1, 2.3.0 > > > Currently, running R tests multiple times fails due to pre-exiting > "spark-warehouse" / "metastore_db" as below: > {code} > SparkSQL functions: Spark package found in SPARK_HOME: .../spark > ...1234... > {code} > {code} > Failed > - > 1. Failure: No extra files are created in SPARK_HOME by starting session and > making calls (@test_sparkSQL.R#3384) > length(list1) not equal to length(list2). > 1/1 mismatches > [1] 25 - 23 == 2 > 2. Failure: No extra files are created in SPARK_HOME by starting session and > making calls (@test_sparkSQL.R#3384) > sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE). > 10/25 mismatches > x[16]: "metastore_db" > y[16]: "pkg" > x[17]: "pkg" > y[17]: "R" > x[18]: "R" > y[18]: "README.md" > x[19]: "README.md" > y[19]: "run-tests.sh" > x[20]: "run-tests.sh" > y[20]: "SparkR_2.2.0.tar.gz" > x[21]: "metastore_db" > y[21]: "pkg" > x[22]: "pkg" > y[22]: "R" > x[23]: "R" > y[23]: "README.md" > x[24]: "README.md" > y[24]: "run-tests.sh" > x[25]: "run-tests.sh" > y[25]: "SparkR_2.2.0.tar.gz" > 3. Failure: No extra files are created in SPARK_HOME by starting session and > making calls (@test_sparkSQL.R#3388) > length(list1) not equal to length(list2). > 1/1 mismatches > [1] 25 - 23 == 2 > 4. 
Failure: No extra files are created in SPARK_HOME by starting session and > making calls (@test_sparkSQL.R#3388) > sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE). > 10/25 mismatches > x[16]: "metastore_db" > y[16]: "pkg" > x[17]: "pkg" > y[17]: "R" > x[18]: "R" > y[18]: "README.md" > x[19]: "README.md" > y[19]: "run-tests.sh" > x[20]: "run-tests.sh" > y[20]: "SparkR_2.2.0.tar.gz" > x[21]: "metastore_db" > y[21]: "pkg" > x[22]: "pkg" > y[22]: "R" > x[23]: "R" > y[23]: "README.md" > x[24]: "README.md" > y[24]: "run-tests.sh" > x[25]: "run-tests.sh" > y[25]: "SparkR_2.2.0.tar.gz" > DONE > === > {code} > It looks like we should remove both "spark-warehouse" and "metastore_db" _before_ > listing files into {{sparkRFilesBefore}}.
[jira] [Resolved] (SPARK-21946) Flaky test: InMemoryCatalogedDDLSuite.`alter table: rename cached table`
[ https://issues.apache.org/jira/browse/SPARK-21946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-21946. - Resolution: Fixed Fix Version/s: 2.3.0 2.2.1 > Flaky test: InMemoryCatalogedDDLSuite.`alter table: rename cached table` > > > Key: SPARK-21946 > URL: https://issues.apache.org/jira/browse/SPARK-21946 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.2.0 >Reporter: Dongjoon Hyun >Assignee: Kazuaki Ishizaki >Priority: Minor > Fix For: 2.2.1, 2.3.0 > > > According to the [Apache Spark Jenkins > History|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/lastCompletedBuild/testReport/org.apache.spark.sql.execution.command/InMemoryCatalogedDDLSuite/alter_table__rename_cached_table/history/] > InMemoryCatalogedDDLSuite.`alter table: rename cached table` is very flaky. > We had better stabilize this. > {code} > - alter table: rename cached table !!! CANCELED !!! > Array([2,2], [1,1]) did not equal Array([1,1], [2,2]) bad test: wrong data > (DDLSuite.scala:786) > {code}
[jira] [Assigned] (SPARK-21946) Flaky test: InMemoryCatalogedDDLSuite.`alter table: rename cached table`
[ https://issues.apache.org/jira/browse/SPARK-21946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-21946: --- Assignee: Kazuaki Ishizaki > Flaky test: InMemoryCatalogedDDLSuite.`alter table: rename cached table` > > > Key: SPARK-21946 > URL: https://issues.apache.org/jira/browse/SPARK-21946 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.2.0 >Reporter: Dongjoon Hyun >Assignee: Kazuaki Ishizaki >Priority: Minor > Fix For: 2.2.1, 2.3.0 > > > According to the [Apache Spark Jenkins > History|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/lastCompletedBuild/testReport/org.apache.spark.sql.execution.command/InMemoryCatalogedDDLSuite/alter_table__rename_cached_table/history/] > InMemoryCatalogedDDLSuite.`alter table: rename cached table` is very flaky. > We had better stabilize this. > {code} > - alter table: rename cached table !!! CANCELED !!! > Array([2,2], [1,1]) did not equal Array([1,1], [2,2]) bad test: wrong data > (DDLSuite.scala:786) > {code}
[jira] [Updated] (SPARK-21936) backward compatibility test framework for HiveExternalCatalog
[ https://issues.apache.org/jira/browse/SPARK-21936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li updated SPARK-21936: Fix Version/s: 2.2.1 > backward compatibility test framework for HiveExternalCatalog > - > > Key: SPARK-21936 > URL: https://issues.apache.org/jira/browse/SPARK-21936 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.2.1, 2.3.0 > >
[jira] [Commented] (SPARK-18085) SPIP: Better History Server scalability for many / large applications
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158865#comment-16158865 ] Marcelo Vanzin commented on SPARK-18085: Do you really mean fixed, as in you're not seeing it anymore, or introduced? Anyway, I'll take a look; might be something that changed since I last rebased my branch. > SPIP: Better History Server scalability for many / large applications > - > > Key: SPARK-18085 > URL: https://issues.apache.org/jira/browse/SPARK-18085 > Project: Spark > Issue Type: Umbrella > Components: Spark Core, Web UI >Affects Versions: 2.0.0 >Reporter: Marcelo Vanzin > Labels: SPIP > Attachments: screenshot-1.png, screenshot-2.png, spark_hs_next_gen.pdf > > > It's a known fact that the History Server currently has some annoying issues > when serving lots of applications, and when serving large applications. > I'm filing this umbrella to track work related to addressing those issues. > I'll be attaching a document shortly describing the issues and suggesting a > path to how to solve them.
[jira] [Commented] (SPARK-21190) SPIP: Vectorized UDFs in Python
[ https://issues.apache.org/jira/browse/SPARK-21190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158845#comment-16158845 ] Li Jin commented on SPARK-21190: To [~bryanc]'s point, PR [#18659|https://github.com/apache/spark/pull/18659] and PR [#19147|https://github.com/apache/spark/pull/19147] are largely similar and it makes sense not to keep two PRs for the same thing. Also, what I am curious about is what kind of guidelines to follow to avoid such duplicate work in the future. IMHO, [~bryanc] linked his PR to this Jira a while back and has been actively engaging in all discussions, so I am not sure why we need a similar second PR in this case. (Of course, if people think #18659 and #19147 are very different, that's another story.) I have worked together with [~bryanc] in the past on SPARK-13534 and collaborating on the same branch worked quite well for us. Maybe that's something we should encourage? > SPIP: Vectorized UDFs in Python > --- > > Key: SPARK-21190 > URL: https://issues.apache.org/jira/browse/SPARK-21190 > Project: Spark > Issue Type: New Feature > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > Labels: SPIP > Attachments: SPIPVectorizedUDFsforPython (1).pdf > > > *Background and Motivation* > Python is one of the most popular programming languages among Spark users. > Spark currently exposes a row-at-a-time interface for defining and executing > user-defined functions (UDFs). This introduces high overhead in serialization > and deserialization, and also makes it difficult to leverage Python libraries > (e.g. numpy, Pandas) that are written in native code. > > This proposal advocates introducing new APIs to support vectorized UDFs in > Python, in which a block of data is transferred over to Python in some > columnar format for execution. > > > *Target Personas* > Data scientists, data engineers, library developers. 
> > *Goals* > - Support vectorized UDFs that apply on chunks of the data frame > - Low system overhead: Substantially reduce serialization and deserialization > overhead when compared with row-at-a-time interface > - UDF performance: Enable users to leverage native libraries in Python (e.g. > numpy, Pandas) for data manipulation in these UDFs > > *Non-Goals* > The following are explicitly out of scope for the current SPIP, and should be > done in future SPIPs. Nonetheless, it would be good to consider these future > use cases during API design, so we can achieve some consistency when rolling > out new APIs. > > - Define block oriented UDFs in other languages (that are not Python). > - Define aggregate UDFs > - Tight integration with machine learning frameworks > > *Proposed API Changes* > The following sketches some possibilities. I haven’t spent a lot of time > thinking about the API (wrote it down in 5 mins) and I am not attached to > this design at all. The main purpose of the SPIP is to get feedback on use > cases and see how they can impact API design. > > A few things to consider are: > > 1. Python is dynamically typed, whereas DataFrames/SQL requires static, > analysis time typing. This means users would need to specify the return type > of their UDFs. > > 2. Ratio of input rows to output rows. We propose initially we require number > of output rows to be the same as the number of input rows. In the future, we > can consider relaxing this constraint with support for vectorized aggregate > UDFs. > 3. How do we handle null values, since Pandas doesn't have the concept of > nulls? > > Proposed API sketch (using examples): > > Use case 1. A function that defines all the columns of a DataFrame (similar > to a “map” function): > > {code} > @spark_udf(some way to describe the return schema) > def my_func_on_entire_df(input): > """ Some user-defined function. > > :param input: A Pandas DataFrame with two columns, a and b. > :return: :class: A Pandas data frame. 
> """ > input[c] = input[a] + input[b] > Input[d] = input[a] - input[b] > return input > > spark.range(1000).selectExpr("id a", "id / 2 b") > .mapBatches(my_func_on_entire_df) > {code} > > Use case 2. A function that defines only one column (similar to existing > UDFs): > > {code} > @spark_udf(some way to describe the return schema) > def my_func_that_returns_one_column(input): > """ Some user-defined function. > > :param input: A Pandas DataFrame with two columns, a and b. > :return: :class: A numpy array > """ > return input[a] + input[b] > > my_func = udf(my_func_that_returns_one_column) > > df = spark.range(1000).selectExpr("id a", "id / 2 b") > df.withColumn("c", my_func(df.a, df.b)) > {code} > > > > *Optional Design Sketch* > I’m more
[jira] [Created] (SPARK-21958) Attempting to save large Word2Vec model hangs driver in constant GC.
Travis Hegner created SPARK-21958: - Summary: Attempting to save large Word2Vec model hangs driver in constant GC. Key: SPARK-21958 URL: https://issues.apache.org/jira/browse/SPARK-21958 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.2.0 Environment: Running spark on yarn, hadoop 2.7.2 provided by the cluster Reporter: Travis Hegner In the new version of Word2Vec, the model saving was modified to estimate an appropriate number of partitions based on the kryo buffer size. This is a great improvement, but there is a caveat for very large models. The {{(word, vector)}} tuple goes through a transformation to a local case class of {{Data(word, vector)}}... I can only assume this is for the kryo serialization process. The new version of the code iterates over the entire vocabulary to do this transformation (the old version wrapped the entire datum) in the driver's heap, only to have the result then distributed to the cluster to be written into its parquet files. With extremely large vocabularies (~2 million docs, with uni-grams, bi-grams, and tri-grams), that local driver transformation is causing the driver to hang indefinitely in GC, as I can only assume that it's generating millions of short-lived objects which can't be evicted fast enough. Perhaps I'm overlooking something, but it seems to me that since the result is distributed over the cluster to be saved _after_ the transformation anyway, we may as well distribute it _first_, allowing the cluster resources to do the transformation more efficiently, and then write the parquet file from there. I have a patch implemented, and am in the process of testing it at scale. I will open a pull request when I feel that the patch is successfully resolving the issue, and after making sure that it passes unit tests.
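The reordering being proposed above — distribute the raw {{(word, vector)}} pairs first and do the {{Data(word, vector)}} wrapping per partition on the executors — can be sketched schematically. The snippet below is a plain-Python stand-in, not Spark code: `partitioned` fakes `parallelize`, the per-partition loop stands in for `mapPartitions` running executor-side, and `Data` is a hypothetical record type, all assumptions for illustration only:

```python
from collections import namedtuple

# Hypothetical stand-in for the serialization-friendly case class.
Data = namedtuple("Data", ["word", "vector"])

def partitioned(pairs, num_partitions):
    """Fake 'parallelize': split the vocabulary into partitions."""
    return [pairs[i::num_partitions] for i in range(num_partitions)]

def save_distribute_first(vocab, num_partitions=4):
    """Sketch of the proposed order: distribute the raw tuples, then wrap
    them per partition (executor-side in real Spark), instead of building
    millions of short-lived Data objects in the driver's heap up front."""
    out = []
    for part in partitioned(vocab, num_partitions):
        # In real Spark this generator would run inside mapPartitions,
        # so each executor only materializes its own slice of objects.
        out.extend(Data(word, vec) for word, vec in part)
    return out
```

The point of the reordering is that no single JVM (least of all the driver) ever holds the whole wrapped vocabulary at once.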
[jira] [Commented] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source
[ https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158802#comment-16158802 ] Dominic Ricard commented on SPARK-21067: [~zhangxin0112zx] Our solution was to migrate the CTAS code to use Parquet... CTAS for Hive tables is broken when using the Thrift server. Still looking forward to a fix for this issue... > Thrift Server - CTAS fail with Unable to move source > > > Key: SPARK-21067 > URL: https://issues.apache.org/jira/browse/SPARK-21067 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 > Environment: Yarn > Hive MetaStore > HDFS (HA) >Reporter: Dominic Ricard > > After upgrading our Thrift cluster to 2.1.1, we ran into an issue where CTAS > would fail, sometimes... > Most of the time, the CTAS would work only once, after starting the thrift > server. After that, dropping the table and re-issuing the same CTAS would > fail with the following message (sometimes it fails right away, sometimes it > works for a long period of time): > {noformat} > Error: org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source > hdfs://nameservice1//tmp/hive-staging/thrift_hive_2017-06-12_16-56-18_464_7598877199323198104-31/-ext-1/part-0 > to destination > hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; > (state=,code=0) > {noformat} > We have already found the following Jira > (https://issues.apache.org/jira/browse/SPARK-11021) which states that the > {{hive.exec.stagingdir}} had to be added in order for Spark to be able to > handle CREATE TABLE properly as of 2.0. 
As you can see in the error, we have > ours set to "/tmp/hive-staging/\{user.name\}" > Same issue with INSERT statements: > {noformat} > CREATE TABLE IF NOT EXISTS dricard.test (col1 int); INSERT INTO TABLE > dricard.test SELECT 1; > Error: org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source > hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-12_20-41-12_964_3086448130033637241-16/-ext-1/part-0 > to destination > hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; > (state=,code=0) > {noformat} > This worked fine in 1.6.2, which we currently run in our Production > Environment but since 2.0+, we haven't been able to CREATE TABLE consistently > on the cluster. > SQL to reproduce issue: > {noformat} > DROP SCHEMA IF EXISTS dricard CASCADE; > CREATE SCHEMA dricard; > CREATE TABLE dricard.test (col1 int); > INSERT INTO TABLE dricard.test SELECT 1; > SELECT * from dricard.test; > DROP TABLE dricard.test; > CREATE TABLE dricard.test AS select 1 as `col1`; > SELECT * from dricard.test > {noformat} > Thrift server usually fails at INSERT... > Tried the same procedure in a spark context using spark.sql() and didn't > encounter the same issue. 
> Full stack Trace: > {noformat} > 17/06/14 14:52:18 ERROR thriftserver.SparkExecuteStatementOperation: Error > executing query, currentState RUNNING, > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source > hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-14_14-52-18_521_5906917519254880890-5/-ext-1/part-0 > to desti > nation hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106) > at > org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:374) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92) > at
[jira] [Assigned] (SPARK-21957) Add current_user function
[ https://issues.apache.org/jira/browse/SPARK-21957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21957: Assignee: Apache Spark > Add current_user function > - > > Key: SPARK-21957 > URL: https://issues.apache.org/jira/browse/SPARK-21957 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.2.0 >Reporter: Marco Gaido >Assignee: Apache Spark >Priority: Minor > > Spark doesn't support the {{current_user}} function. > Although the user can be retrieved in other ways, the function would help > make it easier to migrate existing Hive queries to Spark and it can also be > convenient for people who are just using SQL to interact with Spark.
[jira] [Commented] (SPARK-21957) Add current_user function
[ https://issues.apache.org/jira/browse/SPARK-21957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158745#comment-16158745 ] Apache Spark commented on SPARK-21957: -- User 'mgaido91' has created a pull request for this issue: https://github.com/apache/spark/pull/19169 > Add current_user function > - > > Key: SPARK-21957 > URL: https://issues.apache.org/jira/browse/SPARK-21957 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.2.0 >Reporter: Marco Gaido >Priority: Minor > > Spark doesn't support the {{current_user}} function. > Although the user can be retrieved in other ways, the function would help > make it easier to migrate existing Hive queries to Spark and it can also be > convenient for people who are just using SQL to interact with Spark.
[jira] [Assigned] (SPARK-21957) Add current_user function
[ https://issues.apache.org/jira/browse/SPARK-21957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21957: Assignee: (was: Apache Spark) > Add current_user function > - > > Key: SPARK-21957 > URL: https://issues.apache.org/jira/browse/SPARK-21957 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.2.0 >Reporter: Marco Gaido >Priority: Minor > > Spark doesn't support the {{current_user}} function. > Although the user can be retrieved in other ways, the function would help > make it easier to migrate existing Hive queries to Spark and it can also be > convenient for people who are just using SQL to interact with Spark.
[jira] [Created] (SPARK-21957) Add current_user function
Marco Gaido created SPARK-21957: --- Summary: Add current_user function Key: SPARK-21957 URL: https://issues.apache.org/jira/browse/SPARK-21957 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.2.0 Reporter: Marco Gaido Priority: Minor Spark doesn't support the {{current_user}} function. Although the user can be retrieved in other ways, the function would help make it easier to migrate existing Hive queries to Spark and it can also be convenient for people who are just using SQL to interact with Spark.
[jira] [Comment Edited] (SPARK-20958) Roll back parquet-mr 1.8.2 to parquet-1.8.1
[ https://issues.apache.org/jira/browse/SPARK-20958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16155387#comment-16155387 ] Anthony Dotterer edited comment on SPARK-20958 at 9/8/17 2:27 PM: -- As a user of Spark 2.2.0 that mixes usage of parquet-avro and avro, here are the exceptions that I had. Hopefully this will help search engines surface this library conflict more quickly for others. {code} java.lang.NoClassDefFoundError: org/apache/avro/LogicalType at org.apache.parquet.avro.AvroParquetWriter.writeSupport(AvroParquetWriter.java:144) at org.apache.parquet.avro.AvroParquetWriter.access$100(AvroParquetWriter.java:35) at org.apache.parquet.avro.AvroParquetWriter$Builder.getWriteSupport(AvroParquetWriter.java:173) ... Caused by: java.lang.ClassNotFoundException: org.apache.avro.LogicalType {code} Also, when attempting to pin parquet-avro to 1.8.1 with SBT, I get the following exception when attempting to write output: {code} java.lang.ExceptionInInitializerError at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:446) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$$anonfun$setSchema$2.apply(ParquetWriteSupport.scala:446) at scala.collection.immutable.List.foreach(List.scala:381) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport$.setSchema(ParquetWriteSupport.scala:446) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.prepareWrite(ParquetFileFormat.scala:112) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:142) at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:145) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) at 
org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92) at org.apache.spark.sql.execution.datasources.DataSource.writeInFileFormat(DataSource.scala:438) at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:474) at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56) at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92) at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92) at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217) at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:509) ... Caused by: org.apache.parquet.schema.InvalidSchemaException: A group type can not be empty. Parquet does not support empty group without leaves. Empty group: spark_schema at org.apache.parquet.schema.GroupType.<init>(GroupType.java:92)
[jira] [Assigned] (SPARK-21956) Fetch up to max bytes when buf really released
[ https://issues.apache.org/jira/browse/SPARK-21956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21956: Assignee: Apache Spark > Fetch up to max bytes when buf really released > -- > > Key: SPARK-21956 > URL: https://issues.apache.org/jira/browse/SPARK-21956 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Affects Versions: 2.1.0 >Reporter: zhoukang >Assignee: Apache Spark > > Right now, ShuffleBlockFetcherIterator decreases bytesInFlight immediately > when a result is taken from the results queue, but the current result's ByteBuf has not been > released at that point, so direct memory may get a little out of control. > We should decrease bytesInFlight only when the current result's ByteBuf has really been > released.
[jira] [Assigned] (SPARK-21956) Fetch up to max bytes when buf really released
[ https://issues.apache.org/jira/browse/SPARK-21956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21956: Assignee: (was: Apache Spark) > Fetch up to max bytes when buf really released > -- > > Key: SPARK-21956 > URL: https://issues.apache.org/jira/browse/SPARK-21956 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core > Affects Versions: 2.1.0 > Reporter: zhoukang > > Right now, ShuffleBlockFetcherIterator decreases bytesInFlight immediately when a result is taken from the results queue, but the current result's ByteBuf has not been released at that point, so direct memory usage can get a little out of control. > We should decrease bytesInFlight only when the current result's ByteBuf has really been released.
[jira] [Commented] (SPARK-21956) Fetch up to max bytes when buf really released
[ https://issues.apache.org/jira/browse/SPARK-21956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158548#comment-16158548 ] Apache Spark commented on SPARK-21956: -- User 'caneGuy' has created a pull request for this issue: https://github.com/apache/spark/pull/19168 > Fetch up to max bytes when buf really released > -- > > Key: SPARK-21956 > URL: https://issues.apache.org/jira/browse/SPARK-21956 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core > Affects Versions: 2.1.0 > Reporter: zhoukang > > Right now, ShuffleBlockFetcherIterator decreases bytesInFlight immediately when a result is taken from the results queue, but the current result's ByteBuf has not been released at that point, so direct memory usage can get a little out of control. > We should decrease bytesInFlight only when the current result's ByteBuf has really been released.
[jira] [Created] (SPARK-21956) Fetch up to max bytes when buf really released
zhoukang created SPARK-21956: Summary: Fetch up to max bytes when buf really released Key: SPARK-21956 URL: https://issues.apache.org/jira/browse/SPARK-21956 Project: Spark Issue Type: Improvement Components: Shuffle, Spark Core Affects Versions: 2.1.0 Reporter: zhoukang Right now, ShuffleBlockFetcherIterator decreases bytesInFlight immediately when a result is taken from the results queue, but the current result's ByteBuf has not been released at that point, so direct memory usage can get a little out of control. We should decrease bytesInFlight only when the current result's ByteBuf has really been released.
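The proposed accounting change can be sketched with a toy model (illustrative names, not Spark's actual classes): the in-flight byte counter goes down inside the buffer's own `release()` callback rather than when the result is dequeued, so the fetcher only requests more bytes once memory is really free.

```java
// Toy model of the proposal: bytesInFlight is decremented by the buffer's
// release() callback, not when the result is taken from the queue.
class InFlightAccounting {
    static class Fetcher {
        final long maxBytesInFlight;
        long bytesInFlight = 0L;

        Fetcher(long maxBytesInFlight) { this.maxBytesInFlight = maxBytesInFlight; }

        // A new fetch is allowed only while in-flight bytes stay under the cap.
        boolean canFetch(long size) { return bytesInFlight + size <= maxBytesInFlight; }

        ManagedBuffer fetch(long size) {
            bytesInFlight += size;
            return new ManagedBuffer(this, size);
        }
    }

    static class ManagedBuffer {
        private final Fetcher owner;
        private final long size;
        private boolean released = false;

        ManagedBuffer(Fetcher owner, long size) { this.owner = owner; this.size = size; }

        // Only here does the accounting go down, mirroring the proposed fix.
        void release() {
            if (!released) { released = true; owner.bytesInFlight -= size; }
        }
    }
}
```

With the old behavior the counter would already be zero while the buffer still held direct memory; in this model the cap holds until `release()`.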
[jira] [Updated] (SPARK-21942) DiskBlockManager crashing when a root local folder has been externally deleted by OS
[ https://issues.apache.org/jira/browse/SPARK-21942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-21942: -- Affects Version/s: (was: 2.2.1) (was: 2.3.0) (was: 3.0.0) (was: 2.0.2) (was: 1.6.3) Target Version/s: (was: 2.3.0) Fix Version/s: (was: 2.3.0) > DiskBlockManager crashing when a root local folder has been externally > deleted by OS > > > Key: SPARK-21942 > URL: https://issues.apache.org/jira/browse/SPARK-21942 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 1.6.2, 2.0.0, 2.0.1, 2.1.0, 2.1.1, 2.2.0 >Reporter: Ruslan Shestopalyuk >Priority: Minor > Labels: storage > > _DiskBlockManager_ has a notion of a "scratch" local folder(s), which can be > configured via _spark.local.dir_ option, and which defaults to the system's > _/tmp_. The hierarchy is two-level, e.g. _/blockmgr-XXX.../YY_, where the > _YY_ part is a hash bit, to spread files evenly. > Function _DiskBlockManager.getFile_ expects the top level directories > (_blockmgr-XXX..._) to always exist (they get created once, when the spark > context is first created), otherwise it would fail with a message like: > {code} > ... java.io.IOException: Failed to create local dir in /tmp/blockmgr-XXX.../YY > {code} > However, this may not always be the case. > In particular, *if it's the default _/tmp_ folder*, there can be different > strategies of automatically removing files from it, depending on the OS: > * on the boot time > * on a regular basis (e.g. once per day via a system cron job) > * based on the file age > The symptom is that after the process (in our case, a service) using spark is > running for a while (a few days), it may not be able to load files anymore, > since the top-level scratch directories are not there and > _DiskBlockManager.getFile_ crashes. > Please note that this is different from people arbitrarily removing files > manually. 
> We have both the facts that _/tmp_ is the default in the spark config and > that the system has the right to tamper with its contents, and will do so > with high probability after some period of time.
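A defensive variant of the lookup can be sketched as follows (assumed names and signature, not the real `DiskBlockManager` code): recreate the sub-directory on demand instead of assuming it survived the OS's `/tmp` cleanup.

```java
import java.io.File;
import java.io.IOException;

// Sketch: recreate the scratch sub-directory if the OS has cleaned it up,
// instead of failing with "Failed to create local dir".
class ScratchDirs {
    static File getFile(File root, int subDirHash, String filename) throws IOException {
        File subDir = new File(root, String.format("%02x", subDirHash));
        // mkdirs() is a no-op if the directory exists; re-check isDirectory()
        // to tolerate a concurrent creator racing with us.
        if (!subDir.isDirectory() && !subDir.mkdirs() && !subDir.isDirectory()) {
            throw new IOException("Failed to create local dir in " + subDir);
        }
        return new File(subDir, filename);
    }
}
```

Whether to recover like this or fail fast (as suggested in the comments below) is the actual design question in the ticket; this only shows the recovery option.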
[jira] [Commented] (SPARK-21905) ClassCastException when call sqlContext.sql on temp table
[ https://issues.apache.org/jira/browse/SPARK-21905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158496#comment-16158496 ] Kazuaki Ishizaki commented on SPARK-21905: -- While I ran the following code (I do not have PointUDT and Point classes), I cannot see the exception using master branch or branch-2.2. {code} ... import org.apache.spark.sql.catalyst.encoders._ ... import org.apache.spark.sql.types._ test("SPARK-21905") { val schema = StructType(List( StructField("name", DataTypes.StringType, true), StructField("location", new ExamplePointUDT, true))) val rowRdd = sqlContext.sparkContext.parallelize(Seq("bluejoe", "alex"), 4) .map({ x: String => Row.fromSeq(Seq(x, new ExamplePoint(100, 100))) }) val dataFrame = sqlContext.createDataFrame(rowRdd, schema) dataFrame.createOrReplaceTempView("person") sqlContext.sql("SELECT * FROM person").foreach(println(_)) } {code} > ClassCastException when call sqlContext.sql on temp table > - > > Key: SPARK-21905 > URL: https://issues.apache.org/jira/browse/SPARK-21905 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: bluejoe > > {code:java} > val schema = StructType(List( > StructField("name", DataTypes.StringType, true), > StructField("location", new PointUDT, true))) > val rowRdd = sqlContext.sparkContext.parallelize(Seq("bluejoe", "alex"), > 4).map({ x: String ⇒ Row.fromSeq(Seq(x, Point(100, 100))) }); > val dataFrame = sqlContext.createDataFrame(rowRdd, schema) > dataFrame.createOrReplaceTempView("person"); > sqlContext.sql("SELECT * FROM person").foreach(println(_)); > {code} > the last statement throws exception: > {code:java} > Caused by: java.lang.ClassCastException: > org.apache.spark.sql.catalyst.expressions.GenericRow cannot be cast to > org.apache.spark.sql.catalyst.InternalRow > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.evalIfFalseExpr1$(Unknown > Source) > at > 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287) > ... 18 more > {code}
[jira] [Reopened] (SPARK-21951) Unable to add the new column and writing into the Hive using spark
[ https://issues.apache.org/jira/browse/SPARK-21951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jalendhar Baddam reopened SPARK-21951: -- This still exists, and it is raising an AnalysisException. > Unable to add the new column and writing into the Hive using spark > -- > > Key: SPARK-21951 > URL: https://issues.apache.org/jira/browse/SPARK-21951 > Project: Spark > Issue Type: Bug > Components: Java API > Affects Versions: 2.1.1 > Reporter: jalendhar Baddam > > I am creating one new column on an existing Dataset and am unable to write it into > Hive using Spark. > Ex: Dataset ds = spark.sql("select * from Table"); > ds = ds.withColumn("newColumn", newColumnValues); > ds.write().mode("overwrite").format("parquet").saveAsTable("Table"); > // Here I am getting the Exception > I am loading the table from Hive using Spark, adding the new column to > that Dataset, and writing the same table back into Hive with the "overwrite" > option.
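This looks like the familiar case of overwriting a table that is also the query's input, which Spark refuses with an AnalysisException. The usual workaround is to materialize the result somewhere else first and then swap it in. The shape of that pattern, illustrated with plain files rather than Spark API (names and the file-based setting are purely illustrative):

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.function.UnaryOperator;

// "Write to a temporary location, then replace" — the pattern commonly used
// to avoid overwriting a dataset that is also the job's input.
class SafeOverwrite {
    static void rewrite(String path, UnaryOperator<String> transform) throws Exception {
        Path target = Paths.get(path);
        // Read the input fully before any writing happens.
        String data = new String(Files.readAllBytes(target), StandardCharsets.UTF_8);
        Path tmp = Paths.get(path + ".tmp");
        Files.write(tmp, transform.apply(data).getBytes(StandardCharsets.UTF_8));
        // Only after the new copy is complete does the original get replaced.
        Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

In Spark terms the equivalent would be saving to a staging table or path and then renaming/inserting, rather than `saveAsTable` over the source table directly.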
[jira] [Reopened] (SPARK-21952) Unable to load the csv file into Dataset using Spark with java
[ https://issues.apache.org/jira/browse/SPARK-21952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] jalendhar Baddam reopened SPARK-21952: -- This still exists, please re-check. > Unable to load the csv file into Dataset using Spark with java > --- > > Key: SPARK-21952 > URL: https://issues.apache.org/jira/browse/SPARK-21952 > Project: Spark > Issue Type: Bug > Components: Java API > Affects Versions: 2.1.1 > Reporter: jalendhar Baddam > > Hi, > I am trying to load a csv file using Spark with Java. The csv file > contains one row with two line endings. I am attaching the csv file and placing > the sample csv content here.
[jira] [Commented] (SPARK-4502) Spark SQL reads unneccesary nested fields from Parquet
[ https://issues.apache.org/jira/browse/SPARK-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158444#comment-16158444 ] Andriy Kushnir commented on SPARK-4502: --- Just tried this patch on Spark 2.2.0. There is a *really huge* performance boost, roughly 5× to 40×. [~michael], thanks! > Spark SQL reads unneccesary nested fields from Parquet > -- > > Key: SPARK-4502 > URL: https://issues.apache.org/jira/browse/SPARK-4502 > Project: Spark > Issue Type: Improvement > Components: SQL > Affects Versions: 1.1.0 > Reporter: Liwen Sun > Priority: Critical > > When reading a field of a nested column from Parquet, Spark SQL reads and > assembles all the fields of that nested column. This is unnecessary, as > Parquet supports fine-grained field reads out of a nested column. This may > degrade the performance significantly when a nested column has many fields. > For example, I loaded json tweets data into SparkSQL and ran the following > query: > {{SELECT User.contributors_enabled from Tweets;}} > User is a nested structure that has 38 primitive fields (for Tweets schema, > see: https://dev.twitter.com/overview/api/tweets), here is the log message: > {{14/11/19 16:36:49 INFO InternalParquetRecordReader: Assembled and processed > 385779 records from 38 columns in 3976 ms: 97.02691 rec/ms, 3687.0227 > cell/ms}} > For comparison, I also ran: > {{SELECT User FROM Tweets;}} > And here is the log message: > {{14/11/19 16:45:40 INFO InternalParquetRecordReader: Assembled and processed > 385779 records from 38 columns in 9461 ms: 40.77571 rec/ms, 1549.477 cell/ms}} > So both queries load 38 columns from Parquet, while the first query only > needs 1 column. I also measured the bytes read within Parquet. In these two > cases, the same number of bytes (99365194 bytes) were read.
[jira] [Comment Edited] (SPARK-21944) Watermark on window column is wrong
[ https://issues.apache.org/jira/browse/SPARK-21944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158406#comment-16158406 ] Marco Gaido edited comment on SPARK-21944 at 9/8/17 10:31 AM: -- [~KevinZwx] you should define the watermark on the column {{"time"}}, not the column {{"window"}} was (Author: mgaido): [~KevinZwx] you should define the watermark on the column `"time"`, not the column `"window"` > Watermark on window column is wrong > --- > > Key: SPARK-21944 > URL: https://issues.apache.org/jira/browse/SPARK-21944 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Kevin Zhang > > When I use a watermark with dropDuplicates in the following way, the > watermark is calculated wrong > {code:java} > val counts = events.select(window($"time", "5 seconds"), $"time", $"id") > .withWatermark("window", "10 seconds") > .dropDuplicates("id", "window") > .groupBy("window") > .count > {code} > where events is a dataframe with a timestamp column "time" and long column > "id". 
> I registered a listener to print the event time stats in each batch, and the > results is like the following > {code:shell} > --- > Batch: 0 > --- > +-+-+ > > |window |count| > +-+-+ > |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|3| > +-+-+ > {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, > watermark=1970-01-01T00:00:00.000Z, max=1970-01-01T19:05:19.476Z} > {watermark=1970-01-01T00:00:00.000Z} > {watermark=1970-01-01T00:00:00.000Z} > {watermark=1970-01-01T00:00:00.000Z} > {watermark=1970-01-01T00:00:00.000Z} > --- > Batch: 1 > --- > +-+-+ > > |window |count| > +-+-+ > |[2017-09-07 16:55:40.0,2017-09-07 16:55:45.0]|1| > |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|3| > +-+-+ > {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, > watermark=1970-01-01T19:05:09.476Z, max=1970-01-01T19:05:19.476Z} > {watermark=1970-01-01T19:05:09.476Z} > --- > Batch: 2 > --- > +-+-+ > > |window |count| > +-+-+ > |[2017-09-07 16:55:40.0,2017-09-07 16:55:45.0]|1| > |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|4| > +-+-+ > {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, > watermark=1970-01-01T19:05:09.476Z, max=1970-01-01T19:05:19.476Z} > {watermark=1970-01-01T19:05:09.476Z} > {code} > As can be seen, the event time stats are wrong which are always in > 1970-01-01, so the watermark is calculated wrong. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-19090) Dynamic Resource Allocation not respecting spark.executor.cores
[ https://issues.apache.org/jira/browse/SPARK-19090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16157932#comment-16157932 ] Carlos Vicenti edited comment on SPARK-19090 at 9/8/17 10:19 AM: - I have found the same issue while using Hive On Spark (on Yarn) and spark.dynamicAllocation.enabled set to true {noformat} SET spark.executor.cores=4; SET spark.executor.memory=21G; SET spark.yarn.executor.memoryOverhead=3813; {noformat} >From the application logs: {noformat} 17/09/08 00:30:34 INFO yarn.YarnAllocator: Will request 1 executor containers, each with 6 cores and 25317 MB memory including 3813 MB overhead {noformat} As mentioned above. This does not happen if I set spark.dynamicAllocation.enabled to false. I'm using v1.6 was (Author: cvicenti): I have found the same issue while using Hive On Spark (on Yarn) and spark.dynamicAllocation.enabled set to true {noformat} SET spark.executor.cores=4; SET spark.executor.memory=21G; SET spark.yarn.executor.memoryOverhead=3813; {noformat} >From the application logs: {noformat} 17/09/08 00:30:34 INFO yarn.YarnAllocator: Will request 1 executor containers, each with 6 cores and 25317 MB memory including 3813 MB overhead {noformat} As mentioned above. This does not happen if I set spark.dynamicAllocation.enabled to false > Dynamic Resource Allocation not respecting spark.executor.cores > --- > > Key: SPARK-19090 > URL: https://issues.apache.org/jira/browse/SPARK-19090 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.2, 1.6.1, 2.0.1 >Reporter: nirav patel > > When enabling dynamic scheduling with yarn I see that all executors are using > only 1 core even if I specify "spark.executor.cores" to 6. If dynamic > scheduling is disabled then each executors will have 6 cores. i.e. it > respects "spark.executor.cores". I have tested this against spark 1.5 . I > think it will be the same behavior with 2.x as well. 
[jira] [Comment Edited] (SPARK-21944) Watermark on window column is wrong
[ https://issues.apache.org/jira/browse/SPARK-21944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158406#comment-16158406 ] Marco Gaido edited comment on SPARK-21944 at 9/8/17 9:57 AM: - [~KevinZwx] you should define the watermark on the column `"time"`, not the column `"window"` was (Author: mgaido): [~kevinzhang] you should define the watermark on the column `"time"`, not the column `"window"` > Watermark on window column is wrong > --- > > Key: SPARK-21944 > URL: https://issues.apache.org/jira/browse/SPARK-21944 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Kevin Zhang > > When I use a watermark with dropDuplicates in the following way, the > watermark is calculated wrong > {code:java} > val counts = events.select(window($"time", "5 seconds"), $"time", $"id") > .withWatermark("window", "10 seconds") > .dropDuplicates("id", "window") > .groupBy("window") > .count > {code} > where events is a dataframe with a timestamp column "time" and long column > "id". 
> I registered a listener to print the event time stats in each batch, and the > results is like the following > {code:shell} > --- > Batch: 0 > --- > +-+-+ > > |window |count| > +-+-+ > |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|3| > +-+-+ > {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, > watermark=1970-01-01T00:00:00.000Z, max=1970-01-01T19:05:19.476Z} > {watermark=1970-01-01T00:00:00.000Z} > {watermark=1970-01-01T00:00:00.000Z} > {watermark=1970-01-01T00:00:00.000Z} > {watermark=1970-01-01T00:00:00.000Z} > --- > Batch: 1 > --- > +-+-+ > > |window |count| > +-+-+ > |[2017-09-07 16:55:40.0,2017-09-07 16:55:45.0]|1| > |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|3| > +-+-+ > {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, > watermark=1970-01-01T19:05:09.476Z, max=1970-01-01T19:05:19.476Z} > {watermark=1970-01-01T19:05:09.476Z} > --- > Batch: 2 > --- > +-+-+ > > |window |count| > +-+-+ > |[2017-09-07 16:55:40.0,2017-09-07 16:55:45.0]|1| > |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|4| > +-+-+ > {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, > watermark=1970-01-01T19:05:09.476Z, max=1970-01-01T19:05:19.476Z} > {watermark=1970-01-01T19:05:09.476Z} > {code} > As can be seen, the event time stats are wrong which are always in > 1970-01-01, so the watermark is calculated wrong. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21944) Watermark on window column is wrong
[ https://issues.apache.org/jira/browse/SPARK-21944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158406#comment-16158406 ] Marco Gaido commented on SPARK-21944: - [~kevinzhang] you should define the watermark on the column `"time"`, not the column `"window"` > Watermark on window column is wrong > --- > > Key: SPARK-21944 > URL: https://issues.apache.org/jira/browse/SPARK-21944 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.2.0 >Reporter: Kevin Zhang > > When I use a watermark with dropDuplicates in the following way, the > watermark is calculated wrong > {code:java} > val counts = events.select(window($"time", "5 seconds"), $"time", $"id") > .withWatermark("window", "10 seconds") > .dropDuplicates("id", "window") > .groupBy("window") > .count > {code} > where events is a dataframe with a timestamp column "time" and long column > "id". > I registered a listener to print the event time stats in each batch, and the > results is like the following > {code:shell} > --- > Batch: 0 > --- > +-+-+ > > |window |count| > +-+-+ > |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|3| > +-+-+ > {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, > watermark=1970-01-01T00:00:00.000Z, max=1970-01-01T19:05:19.476Z} > {watermark=1970-01-01T00:00:00.000Z} > {watermark=1970-01-01T00:00:00.000Z} > {watermark=1970-01-01T00:00:00.000Z} > {watermark=1970-01-01T00:00:00.000Z} > --- > Batch: 1 > --- > +-+-+ > > |window |count| > +-+-+ > |[2017-09-07 16:55:40.0,2017-09-07 16:55:45.0]|1| > |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|3| > +-+-+ > {min=1970-01-01T19:05:19.476Z, avg=1970-01-01T19:05:19.476Z, > watermark=1970-01-01T19:05:09.476Z, max=1970-01-01T19:05:19.476Z} > {watermark=1970-01-01T19:05:09.476Z} > --- > Batch: 2 > --- > +-+-+ > > |window |count| > +-+-+ > |[2017-09-07 16:55:40.0,2017-09-07 16:55:45.0]|1| > |[2017-09-07 16:55:20.0,2017-09-07 16:55:25.0]|4| > +-+-+ > {min=1970-01-01T19:05:19.476Z, 
avg=1970-01-01T19:05:19.476Z, > watermark=1970-01-01T19:05:09.476Z, max=1970-01-01T19:05:19.476Z} > {watermark=1970-01-01T19:05:09.476Z} > {code} > As can be seen, the event time stats are wrong which are always in > 1970-01-01, so the watermark is calculated wrong. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
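The epoch-stuck stats above match how the watermark is derived: it trails the maximum observed event time by the allowed delay, and when `withWatermark` is placed on a column that carries no usable event time (such as the struct produced by `window()`), the stats never advance. A toy model of that derivation (illustrative only, not Spark internals):

```java
// watermark = max observed event time minus the allowed lateness; with no
// usable event times observed, it stays at the epoch (1970-01-01).
class WatermarkModel {
    static long watermark(long[] observedEventTimesMs, long delayMs) {
        if (observedEventTimesMs.length == 0) return 0L; // epoch
        long max = Long.MIN_VALUE;
        for (long t : observedEventTimesMs) max = Math.max(max, t);
        return Math.max(0L, max - delayMs);
    }
}
```

This is why the advice in the comment is to define the watermark on the `"time"` column: only then does the stream observe real event times to subtract the delay from.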
[jira] [Commented] (SPARK-21955) OneForOneStreamManager may leak memory when network is poor
[ https://issues.apache.org/jira/browse/SPARK-21955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158402#comment-16158402 ] Sean Owen commented on SPARK-21955: --- You might be on to something, but this is poorly described. Can you revise the description, attach the image, and specify the change you are suggesting? > OneForOneStreamManager may leak memory when network is poor > --- > > Key: SPARK-21955 > URL: https://issues.apache.org/jira/browse/SPARK-21955 > Project: Spark > Issue Type: Bug > Components: Block Manager >Affects Versions: 1.6.1 > Environment: hdp 2.4.2.0-258 > spark 1.6 >Reporter: poseidon > > just in my way to know how stream , chunk , block works in netty found some > nasty case. > process OpenBlocks message registerStream Stream in OneForOneStreamManager > org.apache.spark.network.server.OneForOneStreamManager#registerStream > fill with streamState with app & buber > process ChunkFetchRequest registerChannel > org.apache.spark.network.server.OneForOneStreamManager#registerChannel > fill with streamState with channel > In > org.apache.spark.network.shuffle.OneForOneBlockFetcher#start > OpenBlocks -> ChunkFetchRequest come in sequnce. > If network down in OpenBlocks process, no more ChunkFetchRequest message > then. > So, we can see some leaked Buffer in OneForOneStreamManager > !attachment-name.jpg|thumbnail! > if > org.apache.spark.network.server.OneForOneStreamManager.StreamState#associatedChannel > is not set, then after search the code , it will remain in memory forever. > Because the only way to release it was in channel close , or someone read the > last piece of block. > OneForOneStreamManager#registerStream we can set channel in this method, just > in case of this case. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21955) OneForOneStreamManager may leak memory when network is poor
poseidon created SPARK-21955: Summary: OneForOneStreamManager may leak memory when network is poor Key: SPARK-21955 URL: https://issues.apache.org/jira/browse/SPARK-21955 Project: Spark Issue Type: Bug Components: Block Manager Affects Versions: 1.6.1 Environment: hdp 2.4.2.0-258 spark 1.6 Reporter: poseidon While learning how streams, chunks and blocks work in Netty, I found a nasty case. Processing an OpenBlocks message registers a stream in OneForOneStreamManager (org.apache.spark.network.server.OneForOneStreamManager#registerStream), filling the StreamState with the app and buffers. Processing a ChunkFetchRequest registers the channel (org.apache.spark.network.server.OneForOneStreamManager#registerChannel), filling the StreamState with the channel. In org.apache.spark.network.shuffle.OneForOneBlockFetcher#start, OpenBlocks -> ChunkFetchRequest arrive in sequence. If the network goes down during OpenBlocks processing, no ChunkFetchRequest message ever arrives. So we can see leaked buffers in OneForOneStreamManager. !attachment-name.jpg|thumbnail! If org.apache.spark.network.server.OneForOneStreamManager.StreamState#associatedChannel is never set then, after searching the code, the state will remain in memory forever, because the only ways to release it are when the channel closes or when someone reads the last piece of a block. In OneForOneStreamManager#registerStream we can set the channel in this method, to guard against this case.
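The suggested change can be modeled in miniature (illustrative names, not the real `OneForOneStreamManager`): attach the owning channel when the stream is registered, so connection teardown can reap the stream state even if no ChunkFetchRequest ever arrives.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Model of the fix: the channel is attached at registration time, not
// later when the first chunk is fetched, so nothing is orphaned.
class StreamRegistry {
    static class StreamState {
        final String appId;
        final String channel; // set at registration, never left unset
        StreamState(String appId, String channel) { this.appId = appId; this.channel = channel; }
    }

    final Map<Long, StreamState> streams = new HashMap<>();
    private long nextId = 0L;

    long registerStream(String appId, String channel) {
        streams.put(++nextId, new StreamState(appId, channel));
        return nextId;
    }

    // Channel close can now release every stream it owned, even if the
    // peer never sent a ChunkFetchRequest.
    void connectionTerminated(String channel) {
        for (Iterator<Map.Entry<Long, StreamState>> it = streams.entrySet().iterator(); it.hasNext(); ) {
            if (it.next().getValue().channel.equals(channel)) it.remove();
        }
    }
}
```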
[jira] [Commented] (SPARK-21942) DiskBlockManager crashing when a root local folder has been externally deleted by OS
[ https://issues.apache.org/jira/browse/SPARK-21942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158320#comment-16158320 ] Saisai Shao commented on SPARK-21942: - Personally I would like to fail fast if such things happened, here it happened to clean the root folder and using {{mkdirs}} can handle this issue, but if some persistent block or shuffle index file is removed (because it is closed), I think there's no way to handle it. So instead of trying to workaround it, exposing an exception to user might be more useful, and will let user to know the issue earlier. > DiskBlockManager crashing when a root local folder has been externally > deleted by OS > > > Key: SPARK-21942 > URL: https://issues.apache.org/jira/browse/SPARK-21942 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, > 2.2.0, 2.2.1, 2.3.0, 3.0.0 >Reporter: Ruslan Shestopalyuk >Priority: Minor > Labels: storage > Fix For: 2.3.0 > > > _DiskBlockManager_ has a notion of a "scratch" local folder(s), which can be > configured via _spark.local.dir_ option, and which defaults to the system's > _/tmp_. The hierarchy is two-level, e.g. _/blockmgr-XXX.../YY_, where the > _YY_ part is a hash bit, to spread files evenly. > Function _DiskBlockManager.getFile_ expects the top level directories > (_blockmgr-XXX..._) to always exist (they get created once, when the spark > context is first created), otherwise it would fail with a message like: > {code} > ... java.io.IOException: Failed to create local dir in /tmp/blockmgr-XXX.../YY > {code} > However, this may not always be the case. > In particular, *if it's the default _/tmp_ folder*, there can be different > strategies of automatically removing files from it, depending on the OS: > * on the boot time > * on a regular basis (e.g. 
once per day via a system cron job) > * based on the file age > The symptom is that after the process (in our case, a service) using spark is > running for a while (a few days), it may not be able to load files anymore, > since the top-level scratch directories are not there and > _DiskBlockManager.getFile_ crashes. > Please note that this is different from people arbitrarily removing files > manually. > We have both the facts that _/tmp_ is the default in the spark config and > that the system has the right to tamper with its contents, and will do it > with a high probability, after some period of time.
[jira] [Commented] (SPARK-21942) DiskBlockManager crashing when a root local folder has been externally deleted by OS
[ https://issues.apache.org/jira/browse/SPARK-21942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158304#comment-16158304 ] Ruslan Shestopalyuk commented on SPARK-21942: - [~jerryshao] I believe the only objective reason here would be to make the Spark code more robust. Regarding the rest - I agree it's not a valid issue, since if problem like this happens, one can always spend some time debugging the Spark code and realize what a workaround could be. Also, hopefully this very page gets indexed in the search engines, so maybe even that won't be needed :) > DiskBlockManager crashing when a root local folder has been externally > deleted by OS > > > Key: SPARK-21942 > URL: https://issues.apache.org/jira/browse/SPARK-21942 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, > 2.2.0, 2.2.1, 2.3.0, 3.0.0 >Reporter: Ruslan Shestopalyuk >Priority: Minor > Labels: storage > Fix For: 2.3.0 > > > _DiskBlockManager_ has a notion of a "scratch" local folder(s), which can be > configured via _spark.local.dir_ option, and which defaults to the system's > _/tmp_. The hierarchy is two-level, e.g. _/blockmgr-XXX.../YY_, where the > _YY_ part is a hash bit, to spread files evenly. > Function _DiskBlockManager.getFile_ expects the top level directories > (_blockmgr-XXX..._) to always exist (they get created once, when the spark > context is first created), otherwise it would fail with a message like: > {code} > ... java.io.IOException: Failed to create local dir in /tmp/blockmgr-XXX.../YY > {code} > However, this may not always be the case. > In particular, *if it's the default _/tmp_ folder*, there can be different > strategies of automatically removing files from it, depending on the OS: > * on the boot time > * on a regular basis (e.g. 
once per day via a system cron job) > * based on the file age > The symptom is that after the process (in our case, a service) using spark is > running for a while (a few days), it may not be able to load files anymore, > since the top-level scratch directories are not there and > _DiskBlockManager.getFile_ crashes. > Please note that this is different from people arbitrarily removing files > manually. > We have both the facts that _/tmp_ is the default in the spark config and > that the system has the right to tamper with its contents, and will do it > with a high probability, after some period of time.
[jira] [Commented] (SPARK-21954) JacksonUtils should verify MapType's value type instead of key type
[ https://issues.apache.org/jira/browse/SPARK-21954?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158300#comment-16158300 ] Apache Spark commented on SPARK-21954: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/19167 > JacksonUtils should verify MapType's value type instead of key type > --- > > Key: SPARK-21954 > URL: https://issues.apache.org/jira/browse/SPARK-21954 > Project: Spark > Issue Type: Bug > Components: SQL > Affects Versions: 2.2.0 > Reporter: Liang-Chi Hsieh > > {{JacksonUtils.verifySchema}} verifies if a data type can be converted to > JSON. For {{MapType}}, it now verifies the key type. However, in > {{JacksonGenerator}}, when converting a map to JSON, we only care about its > values and create a writer for the values. The keys in a map are treated as > strings by calling {{toString}} on the keys. > Thus, we should change {{JacksonUtils.verifySchema}} to verify the value type > of {{MapType}}.
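The point of the fix in miniature: since JacksonGenerator renders map keys via `toString()`, only a MapType's value side must be JSON-convertible. A toy schema checker (illustrative types, not Spark's `DataType` hierarchy):

```java
// Verify JSON convertibility: for maps, only the value type matters,
// because keys are rendered with toString() regardless of their type.
class JsonSchemaCheck {
    interface DType {}
    static final class StringT implements DType {}
    static final class IntT implements DType {}
    static final class UnsupportedT implements DType {} // stands in for a non-JSON type
    static final class MapT implements DType {
        final DType key, value;
        MapT(DType key, DType value) { this.key = key; this.value = value; }
    }

    static boolean verify(DType t) {
        if (t instanceof StringT || t instanceof IntT) return true;
        if (t instanceof MapT) return verify(((MapT) t).value); // key type ignored
        return false;
    }
}
```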
[jira] [Assigned] (SPARK-21954) JacksonUtils should verify MapType's value type instead of key type
[ https://issues.apache.org/jira/browse/SPARK-21954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21954: Assignee: (was: Apache Spark) > JacksonUtils should verify MapType's value type instead of key type > --- > > Key: SPARK-21954 > URL: https://issues.apache.org/jira/browse/SPARK-21954 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Liang-Chi Hsieh > > {{JacksonUtils.verifySchema}} verifies if a data type can be converted to > JSON. For {{MapType}}, it now verifies the key type. However, in > {{JacksonGenerator}}, when converting a map to JSON, we only care about its > values and create a writer for the values. The keys in a map are treated as > strings by calling {{toString}} on the keys. > Thus, we should change {{JacksonUtils.verifySchema}} to verify the value type > of {{MapType}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21954) JacksonUtils should verify MapType's value type instead of key type
[ https://issues.apache.org/jira/browse/SPARK-21954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21954: Assignee: Apache Spark > JacksonUtils should verify MapType's value type instead of key type > --- > > Key: SPARK-21954 > URL: https://issues.apache.org/jira/browse/SPARK-21954 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Liang-Chi Hsieh >Assignee: Apache Spark > > {{JacksonUtils.verifySchema}} verifies if a data type can be converted to > JSON. For {{MapType}}, it now verifies the key type. However, in > {{JacksonGenerator}}, when converting a map to JSON, we only care about its > values and create a writer for the values. The keys in a map are treated as > strings by calling {{toString}} on the keys. > Thus, we should change {{JacksonUtils.verifySchema}} to verify the value type > of {{MapType}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21954) JacksonUtils should verify MapType's value type instead of key type
Liang-Chi Hsieh created SPARK-21954: --- Summary: JacksonUtils should verify MapType's value type instead of key type Key: SPARK-21954 URL: https://issues.apache.org/jira/browse/SPARK-21954 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.2.0 Reporter: Liang-Chi Hsieh {{JacksonUtils.verifySchema}} verifies if a data type can be converted to JSON. For {{MapType}}, it now verifies the key type. However, in {{JacksonGenerator}}, when converting a map to JSON, we only care about its values and create a writer for the values. The keys in a map are treated as strings by calling {{toString}} on the keys. Thus, we should change {{JacksonUtils.verifySchema}} to verify the value type of {{MapType}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
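The point of the report — map keys are rendered via {{toString}} while only values need a type-aware writer — mirrors how JSON serializers generally treat object keys. A small illustrative Python sketch (not Spark's JacksonGenerator) of the same idea:

```python
import json

def map_to_json(m):
    # JSON object keys must be strings, so keys of any type are
    # coerced via str() -- only the *values* need typed serialization,
    # which is why verifySchema should check the value type.
    return json.dumps({str(k): v for k, v in m.items()}, sort_keys=True)

print(map_to_json({1: [10, 20], 2: [30]}))
# the integer keys 1 and 2 become the strings "1" and "2"
```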
[jira] [Commented] (SPARK-21128) Running R tests multiple times failed due to pre-existing "spark-warehouse" / "metastore_db"
[ https://issues.apache.org/jira/browse/SPARK-21128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158276#comment-16158276 ] Apache Spark commented on SPARK-21128: -- User 'felixcheung' has created a pull request for this issue: https://github.com/apache/spark/pull/19166 > Running R tests multiple times failed due to pre-exiting "spark-warehouse" / > "metastore_db" > --- > > Key: SPARK-21128 > URL: https://issues.apache.org/jira/browse/SPARK-21128 > Project: Spark > Issue Type: Test > Components: SparkR >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Minor > Fix For: 2.3.0 > > > Currently, running R tests multiple times fails due to pre-exiting > "spark-warehouse" / "metastore_db" as below: > {code} > SparkSQL functions: Spark package found in SPARK_HOME: .../spark > ...1234... > {code} > {code} > Failed > - > 1. Failure: No extra files are created in SPARK_HOME by starting session and > making calls (@test_sparkSQL.R#3384) > length(list1) not equal to length(list2). > 1/1 mismatches > [1] 25 - 23 == 2 > 2. Failure: No extra files are created in SPARK_HOME by starting session and > making calls (@test_sparkSQL.R#3384) > sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE). > 10/25 mismatches > x[16]: "metastore_db" > y[16]: "pkg" > x[17]: "pkg" > y[17]: "R" > x[18]: "R" > y[18]: "README.md" > x[19]: "README.md" > y[19]: "run-tests.sh" > x[20]: "run-tests.sh" > y[20]: "SparkR_2.2.0.tar.gz" > x[21]: "metastore_db" > y[21]: "pkg" > x[22]: "pkg" > y[22]: "R" > x[23]: "R" > y[23]: "README.md" > x[24]: "README.md" > y[24]: "run-tests.sh" > x[25]: "run-tests.sh" > y[25]: "SparkR_2.2.0.tar.gz" > 3. Failure: No extra files are created in SPARK_HOME by starting session and > making calls (@test_sparkSQL.R#3388) > length(list1) not equal to length(list2). > 1/1 mismatches > [1] 25 - 23 == 2 > 4. 
Failure: No extra files are created in SPARK_HOME by starting session and > making calls (@test_sparkSQL.R#3388) > sort(list1, na.last = TRUE) not equal to sort(list2, na.last = TRUE). > 10/25 mismatches > x[16]: "metastore_db" > y[16]: "pkg" > x[17]: "pkg" > y[17]: "R" > x[18]: "R" > y[18]: "README.md" > x[19]: "README.md" > y[19]: "run-tests.sh" > x[20]: "run-tests.sh" > y[20]: "SparkR_2.2.0.tar.gz" > x[21]: "metastore_db" > y[21]: "pkg" > x[22]: "pkg" > y[22]: "R" > x[23]: "R" > y[23]: "README.md" > x[24]: "README.md" > y[24]: "run-tests.sh" > x[25]: "run-tests.sh" > y[25]: "SparkR_2.2.0.tar.gz" > DONE > === > {code} > It looks we should remove both "spark-warehouse" and "metastore_db" _before_ > listing files into {{sparkRFilesBefore}}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
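The suggested fix — deleting leftover "spark-warehouse" / "metastore_db" directories before taking the "before" snapshot — can be sketched as follows (a Python stand-in for the R test helper, using a temp directory for demonstration):

```python
import os
import shutil
import tempfile

def snapshot_after_cleanup(spark_home, stale=("spark-warehouse", "metastore_db")):
    # Remove leftovers from earlier runs *before* listing files, so the
    # before/after file-list comparison is not polluted by stale state.
    for name in stale:
        path = os.path.join(spark_home, name)
        if os.path.isdir(path):
            shutil.rmtree(path)
    return sorted(os.listdir(spark_home))

# demonstration with a throwaway SPARK_HOME
home = tempfile.mkdtemp()
os.makedirs(os.path.join(home, "metastore_db"))
open(os.path.join(home, "README.md"), "w").close()
files = snapshot_after_cleanup(home)
print(files)  # ['README.md']
```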
[jira] [Commented] (SPARK-21942) DiskBlockManager crashing when a root local folder has been externally deleted by OS
[ https://issues.apache.org/jira/browse/SPARK-21942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158271#comment-16158271 ] Saisai Shao commented on SPARK-21942: - {quote} https://github.com/search?utf8=%E2%9C%93=filename%3Aspark-defaults.conf++NOT+spark.local.dir=Code shows 2000+ repos that omit the `spark.local.dir` setting altogether, which means they are using `/tmp`, even though it's not a good default choice. Which of course does not prove anything, since those are not necessarily "production environments". {quote} [~rshest] you can always find out reasons, but I don't think this is a valid issue. > DiskBlockManager crashing when a root local folder has been externally > deleted by OS > > > Key: SPARK-21942 > URL: https://issues.apache.org/jira/browse/SPARK-21942 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, > 2.2.0, 2.2.1, 2.3.0, 3.0.0 >Reporter: Ruslan Shestopalyuk >Priority: Minor > Labels: storage > Fix For: 2.3.0 > > > _DiskBlockManager_ has a notion of a "scratch" local folder(s), which can be > configured via _spark.local.dir_ option, and which defaults to the system's > _/tmp_. The hierarchy is two-level, e.g. _/blockmgr-XXX.../YY_, where the > _YY_ part is a hash bit, to spread files evenly. > Function _DiskBlockManager.getFile_ expects the top level directories > (_blockmgr-XXX..._) to always exist (they get created once, when the spark > context is first created), otherwise it would fail with a message like: > {code} > ... java.io.IOException: Failed to create local dir in /tmp/blockmgr-XXX.../YY > {code} > However, this may not always be the case. > In particular, *if it's the default _/tmp_ folder*, there can be different > strategies of automatically removing files from it, depending on the OS: > * on the boot time > * on a regular basis (e.g. 
once per day via a system cron job) > * based on the file age > The symptom is that after the process (in our case, a service) using spark is > running for a while (a few days), it may not be able to load files anymore, > since the top-level scratch directories are not there and > _DiskBlockManager.getFile_ crashes. > Please note that this is different from people arbitrarily removing files > manually. > We have both the facts that _/tmp_ is the default in the spark config and > that the system has the right to tamper with its contents, and will do it > with a high probability, after some period of time. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
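One defensive approach (an assumption on my part, not the committed fix) is to recreate a missing two-level scratch subdirectory on demand instead of assuming it survives the OS's /tmp cleanup. A hedged Python sketch of a getFile-style lookup, where the "2e"-style bucket names follow the log in the description:

```python
import hashlib
import os
import tempfile

def get_block_file(root, block_id, subdirs=64):
    # Hash the block id into one of the two-level subdirectories
    # (e.g. ".../blockmgr-XXX/2e"), recreating the directory if the
    # OS has cleaned it out of /tmp since context startup.
    bucket = int(hashlib.md5(block_id.encode()).hexdigest(), 16) % subdirs
    subdir = os.path.join(root, "%02x" % bucket)
    os.makedirs(subdir, exist_ok=True)  # tolerate externally deleted dirs
    return os.path.join(subdir, block_id)

root = tempfile.mkdtemp()  # stand-in for a blockmgr-XXX scratch root
p = get_block_file(root, "broadcast_110")
print(os.path.isdir(os.path.dirname(p)))  # True
```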
[jira] [Commented] (SPARK-21953) Show both memory and disk bytes spilled if either is present
[ https://issues.apache.org/jira/browse/SPARK-21953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158263#comment-16158263 ] Apache Spark commented on SPARK-21953: -- User 'ash211' has created a pull request for this issue: https://github.com/apache/spark/pull/19164 > Show both memory and disk bytes spilled if either is present > > > Key: SPARK-21953 > URL: https://issues.apache.org/jira/browse/SPARK-21953 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.2.0 >Reporter: Andrew Ash >Priority: Minor > > https://github.com/apache/spark/commit/a1f0992faefbe042a9cb7a11842a817c958e4797#diff-fa4cfb2cce1b925f55f41f2dfa8c8501R61 > should be {{||}} not {{&&}} > As written now, there must be both memory and disk bytes spilled to show > either of them. If there is only one of those types of spill recorded, it > will be hidden. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21953) Show both memory and disk bytes spilled if either is present
[ https://issues.apache.org/jira/browse/SPARK-21953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21953: Assignee: Apache Spark > Show both memory and disk bytes spilled if either is present > > > Key: SPARK-21953 > URL: https://issues.apache.org/jira/browse/SPARK-21953 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.2.0 >Reporter: Andrew Ash >Assignee: Apache Spark >Priority: Minor > > https://github.com/apache/spark/commit/a1f0992faefbe042a9cb7a11842a817c958e4797#diff-fa4cfb2cce1b925f55f41f2dfa8c8501R61 > should be {{||}} not {{&&}} > As written now, there must be both memory and disk bytes spilled to show > either of them. If there is only one of those types of spill recorded, it > will be hidden. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-21953) Show both memory and disk bytes spilled if either is present
[ https://issues.apache.org/jira/browse/SPARK-21953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21953: Assignee: (was: Apache Spark) > Show both memory and disk bytes spilled if either is present > > > Key: SPARK-21953 > URL: https://issues.apache.org/jira/browse/SPARK-21953 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 2.2.0 >Reporter: Andrew Ash >Priority: Minor > > https://github.com/apache/spark/commit/a1f0992faefbe042a9cb7a11842a817c958e4797#diff-fa4cfb2cce1b925f55f41f2dfa8c8501R61 > should be {{||}} not {{&&}} > As written now, there must be both memory and disk bytes spilled to show > either of them. If there is only one of those types of spill recorded, it > will be hidden. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21953) Show both memory and disk bytes spilled if either is present
Andrew Ash created SPARK-21953: -- Summary: Show both memory and disk bytes spilled if either is present Key: SPARK-21953 URL: https://issues.apache.org/jira/browse/SPARK-21953 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 2.2.0 Reporter: Andrew Ash Priority: Minor https://github.com/apache/spark/commit/a1f0992faefbe042a9cb7a11842a817c958e4797#diff-fa4cfb2cce1b925f55f41f2dfa8c8501R61 should be {{||}} not {{&&}} As written now, there must be both memory and disk bytes spilled to show either of them. If there is only one of those types of spill recorded, it will be hidden. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
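The one-character {{&&}} vs {{||}} fix can be illustrated with the boolean logic alone (a sketch, not the actual Scala UI code):

```python
def show_spill_buggy(memory_spilled, disk_spilled):
    # BUG: requires *both* metrics to be non-zero (&&), so a stage
    # that spilled only to disk or only to memory shows nothing.
    return memory_spilled > 0 and disk_spilled > 0

def show_spill_fixed(memory_spilled, disk_spilled):
    # FIX: show the spill columns if *either* metric is present (||).
    return memory_spilled > 0 or disk_spilled > 0

print(show_spill_buggy(1024, 0))  # False: the spill row is hidden
print(show_spill_fixed(1024, 0))  # True
```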
[jira] [Commented] (SPARK-18085) SPIP: Better History Server scalability for many / large applications
[ https://issues.apache.org/jira/browse/SPARK-18085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158247#comment-16158247 ] jincheng commented on SPARK-18085: -- {code:java} com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException: Unrecognized field "metadata" (class org.apache.spark.sql.execution.SparkPlanInfo), not marked as ignorable (4 known properties: "simpleString", "nodeName", "children", "metrics"]) at [Source: {"Event":"org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart","executionId":0,"description":"json at NativeMethodAccessorImpl.java:0","details":"org.apache.spark.sql.DataFrameWriter.json(DataFrameWriter.scala:487)\nsun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)\nsun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)\nsun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)\njava.lang.reflect.Method.invoke(Method.java:498)\npy4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)\npy4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)\npy4j.Gateway.invoke(Gateway.java:280)\npy4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)\npy4j.commands.CallCommand.execute(CallCommand.java:79)\npy4j.GatewayConnection.run(GatewayConnection.java:214)\njava.lang.Thread.run(Thread.java:748)","physicalPlanDescription":"== Parsed Logical Plan ==\nRepartition 200, true\n+- LogicalRDD [uid#327L, gids#328]\n\n== Analyzed Logical Plan ==\nuid: bigint, gids: array\nRepartition 200, true\n+- LogicalRDD [uid#327L, gids#328]\n\n== Optimized Logical Plan ==\nRepartition 200, true\n+- LogicalRDD [uid#327L, gids#328]\n\n== Physical Plan ==\nExchange RoundRobinPartitioning(200)\n+- Scan ExistingRDD[uid#327L,gids#328]","sparkPlanInfo":{"nodeName":"Exchange","simpleString":"Exchange RoundRobinPartitioning(200)","children":[{"nodeName":"ExistingRDD","simpleString":"Scan 
ExistingRDD[uid#327L,gids#328]","children":[],"metadata":{},"metrics":[{"name":"number of output rows","accumulatorId":140,"metricType":"sum"}]}],"metadata":{},"metrics":[{"name":"data size total (min, med, max)","accumulatorId":139,"metricType":"size"}]},"time":1504837052948}; line: 1, column: 1622] (through reference chain: org.apache.spark.sql.execution.ui.SparkListenerSQLExecutionStart["sparkPlanInfo"]->org.apache.spark.sql.execution.SparkPlanInfo["children"]->com.fasterxml.jackson.module.scala.deser.BuilderWrapper[0]->org.apache.spark.sql.execution.SparkPlanInfo["metadata"]) at com.fasterxml.jackson.databind.exc.UnrecognizedPropertyException.from(UnrecognizedPropertyException.java:51) at com.fasterxml.jackson.databind.DeserializationContext.reportUnknownProperty(DeserializationContext.java:839) at com.fasterxml.jackson.databind.deser.std.StdDeserializer.handleUnknownProperty(StdDeserializer.java:1045) at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.handleUnknownProperty(BeanDeserializerBase.java:1352) at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.handleUnknownProperties(BeanDeserializerBase.java:1306) at com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeUsingPropertyBased(BeanDeserializer.java:399) at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.deserializeFromObjectUsingNonDefault(BeanDeserializerBase.java:1099) at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserializeFromObject(BeanDeserializer.java:296) at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:133) at com.fasterxml.jackson.databind.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:245) at com.fasterxml.jackson.databind.deser.std.CollectionDeserializer.deserialize(CollectionDeserializer.java:217) at com.fasterxml.jackson.module.scala.deser.SeqDeserializer.deserialize(SeqDeserializerModule.scala:76) at 
com.fasterxml.jackson.module.scala.deser.SeqDeserializer.deserialize(SeqDeserializerModule.scala:59) at com.fasterxml.jackson.databind.deser.SettableBeanProperty.deserialize(SettableBeanProperty.java:520) at com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeWithErrorWrapping(BeanDeserializer.java:463) at com.fasterxml.jackson.databind.deser.BeanDeserializer._deserializeUsingPropertyBased(BeanDeserializer.java:378) at com.fasterxml.jackson.databind.deser.BeanDeserializerBase.deserializeFromObjectUsingNonDefault(BeanDeserializerBase.java:1099) at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserializeFromObject(BeanDeserializer.java:296) at com.fasterxml.jackson.databind.deser.BeanDeserializer.deserialize(BeanDeserializer.java:133) at com.fasterxml.jackson.databind.deser.SettableBeanProperty.deserialize(SettableBeanProperty.java:520) at
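The failure above is Jackson rejecting a field ("metadata") that the reading side does not recognize. A forward-compatible reader simply drops unknown keys (the moral equivalent of disabling Jackson's FAIL_ON_UNKNOWN_PROPERTIES); a hedged Python sketch of that idea over the same event shape:

```python
import json

KNOWN = {"nodeName", "simpleString", "children", "metrics"}

def parse_plan_info(obj):
    # Keep only recognized properties instead of raising on unknown
    # ones, so newer event logs remain readable by an older parser.
    node = {k: v for k, v in obj.items() if k in KNOWN}
    node["children"] = [parse_plan_info(c) for c in obj.get("children", [])]
    return node

raw = json.loads('{"nodeName": "Exchange", "simpleString": "Exchange", '
                 '"children": [], "metadata": {}, "metrics": []}')
info = parse_plan_info(raw)
print("metadata" in info)  # False: the unknown field is ignored
```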
[jira] [Commented] (SPARK-21936) backward compatibility test framework for HiveExternalCatalog
[ https://issues.apache.org/jira/browse/SPARK-21936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158241#comment-16158241 ] Apache Spark commented on SPARK-21936: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/19163 > backward compatibility test framework for HiveExternalCatalog > - > > Key: SPARK-21936 > URL: https://issues.apache.org/jira/browse/SPARK-21936 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21726) Check for structural integrity of the plan in QO in test mode
[ https://issues.apache.org/jira/browse/SPARK-21726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158224#comment-16158224 ] Apache Spark commented on SPARK-21726: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/19161 > Check for structural integrity of the plan in QO in test mode > - > > Key: SPARK-21726 > URL: https://issues.apache.org/jira/browse/SPARK-21726 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.2.0 >Reporter: Reynold Xin >Assignee: Liang-Chi Hsieh > Fix For: 2.3.0 > > > Right now we don't have any checks in the optimizer to check for the > structural integrity of the plan (e.g. resolved). It would be great if in > test mode, we can check whether a plan is still resolved after the execution > of each rule, so we can catch rules that return invalid plans. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
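The requested feature — verifying the plan's structural integrity after every optimizer rule, but only in test mode — can be sketched generically (illustrative Python, with plans as plain dicts rather than Catalyst trees):

```python
def optimize(plan, rules, test_mode=True):
    # Apply each rewrite rule in order; in test mode, assert that the
    # plan is still resolved after every single rule, so a misbehaving
    # rule is caught immediately rather than far downstream.
    for rule in rules:
        plan = rule(plan)
        if test_mode:
            assert plan.get("resolved", False), (
                "rule %s produced an unresolved plan" % rule.__name__)
    return plan

def good_rule(p):
    # A well-behaved rule: rewrites the plan but keeps it resolved.
    return {**p, "cost": p.get("cost", 10) - 1}

plan = optimize({"resolved": True}, [good_rule])
print(plan["cost"])  # 9
```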
[jira] [Resolved] (SPARK-21931) add LNNVL function
[ https://issues.apache.org/jira/browse/SPARK-21931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21931. --- Resolution: Won't Fix > add LNNVL function > -- > > Key: SPARK-21931 > URL: https://issues.apache.org/jira/browse/SPARK-21931 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.2.0 >Reporter: Ruslan Dautkhanov >Priority: Minor > Attachments: Capture1.JPG > > > Purpose > LNNVL provides a concise way to evaluate a condition when one or both > operands of the condition may be null. The function can be used only in the > WHERE clause of a query. It takes as an argument a condition and returns TRUE > if the condition is FALSE or UNKNOWN and FALSE if the condition is TRUE. > LNNVL can be used anywhere a scalar expression can appear, even in contexts > where the IS (NOT) NULL, AND, or OR conditions are not valid but would > otherwise be required to account for potential nulls. Oracle Database > sometimes uses the LNNVL function internally in this way to rewrite NOT IN > conditions as NOT EXISTS conditions. In such cases, output from EXPLAIN PLAN > shows this operation in the plan table output. The condition can evaluate any > scalar values but cannot be a compound condition containing AND, OR, or > BETWEEN. > The table that follows shows what LNNVL returns given that a = 2 and b is > null. > !Capture1.JPG! > https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions078.htm -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
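LNNVL's contract (TRUE when the condition is FALSE or UNKNOWN, FALSE when it is TRUE) maps directly onto three-valued logic. An illustrative Python sketch, with None standing in for SQL's UNKNOWN:

```python
def lnnvl(cond):
    # cond is True, False, or None (SQL UNKNOWN, e.g. from a NULL operand).
    # LNNVL returns True unless the condition definitely held.
    return cond is not True

a, b = 2, None  # b is null, as in the example quoted above
print(lnnvl(a == 1))                         # True  (condition is FALSE)
print(lnnvl(None if b is None else b == 2))  # True  (condition is UNKNOWN)
print(lnnvl(a == 2))                         # False (condition is TRUE)
```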
[jira] [Resolved] (SPARK-21915) Model 1 and Model 2 ParamMaps Missing
[ https://issues.apache.org/jira/browse/SPARK-21915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21915. --- Resolution: Fixed Fix Version/s: 2.2.1 Issue resolved by pull request 19152 [https://github.com/apache/spark/pull/19152] > Model 1 and Model 2 ParamMaps Missing > - > > Key: SPARK-21915 > URL: https://issues.apache.org/jira/browse/SPARK-21915 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 1.6.2, 1.6.3, 2.0.0, > 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.2.0 >Reporter: Mark Tabladillo >Priority: Minor > Labels: easyfix > Fix For: 2.2.1 > > Original Estimate: 1h > Remaining Estimate: 1h > > Error in PySpark example code > [https://github.com/apache/spark/blob/master/examples/src/main/python/ml/estimator_transformer_param_example.py] > The original Scala code says > println("Model 2 was fit using parameters: " + model2.parent.extractParamMap) > The parent is lr > There is no method for accessing parent as is done in Scala. > > This code has been tested in Python, and returns values consistent with Scala > Proposing to call the lr variable instead of model1 or model2 > > This patch was tested with Spark 2.1.0 comparing the Scala and PySpark > results. Pyspark returns nothing at present for those two print lines. > The output for model2 in PySpark should be > {Param(parent='LogisticRegression_4187be538f744d5a9090', name='tol', doc='the > convergence tolerance for iterative algorithms (>= 0).'): 1e-06, > Param(parent='LogisticRegression_4187be538f744d5a9090', > name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, > 1]. For alpha = 0, the penalty is an L2 penalty. 
For alpha = 1, it is an L1 > penalty.'): 0.0, > Param(parent='LogisticRegression_4187be538f744d5a9090', name='predictionCol', > doc='prediction column name.'): 'prediction', > Param(parent='LogisticRegression_4187be538f744d5a9090', name='featuresCol', > doc='features column name.'): 'features', > Param(parent='LogisticRegression_4187be538f744d5a9090', name='labelCol', > doc='label column name.'): 'label', > Param(parent='LogisticRegression_4187be538f744d5a9090', > name='probabilityCol', doc='Column name for predicted class conditional > probabilities. Note: Not all models output well-calibrated probability > estimates! These probabilities should be treated as confidences, not precise > probabilities.'): 'myProbability', > Param(parent='LogisticRegression_4187be538f744d5a9090', > name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column > name.'): 'rawPrediction', > Param(parent='LogisticRegression_4187be538f744d5a9090', name='family', > doc='The name of family which is a description of the label distribution to > be used in the model. Supported options: auto, binomial, multinomial'): > 'auto', > Param(parent='LogisticRegression_4187be538f744d5a9090', name='fitIntercept', > doc='whether to fit an intercept term.'): True, > Param(parent='LogisticRegression_4187be538f744d5a9090', name='threshold', > doc='Threshold in binary classification prediction, in range [0, 1]. If > threshold and thresholds are both set, they must match.e.g. 
if threshold is > p, then thresholds must be equal to [1-p, p].'): 0.55, > Param(parent='LogisticRegression_4187be538f744d5a9090', > name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2).'): 2, > Param(parent='LogisticRegression_4187be538f744d5a9090', name='maxIter', > doc='max number of iterations (>= 0).'): 30, > Param(parent='LogisticRegression_4187be538f744d5a9090', name='regParam', > doc='regularization parameter (>= 0).'): 0.1, > Param(parent='LogisticRegression_4187be538f744d5a9090', > name='standardization', doc='whether to standardize the training features > before fitting the model.'): True} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
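The relationship the report relies on — a fitted model keeping a handle to the estimator that produced it, so the fit-time ParamMap can be recovered either from model2.parent (Scala) or directly from lr (the proposed PySpark fix) — can be sketched with minimal stand-in classes (plain Python, not the real pyspark.ml API):

```python
class Model:
    def __init__(self, parent):
        self.parent = parent  # handle back to the fitting estimator

class Estimator:
    def __init__(self, **params):
        self.params = dict(params)

    def fit(self, data, extra=None):
        # Fitting merges the extra ParamMap over the defaults,
        # then records this estimator as the model's parent.
        self.params.update(extra or {})
        return Model(self)

    def extract_param_map(self):
        return dict(self.params)

lr = Estimator(maxIter=10, regParam=0.01)
model2 = lr.fit([], extra={"maxIter": 30, "regParam": 0.1})
# Printing via the estimator gives the same map Scala reads via parent:
print(model2.parent.extract_param_map() == lr.extract_param_map())  # True
```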
[jira] [Resolved] (SPARK-21951) Unable to add the new column and writing into the Hive using spark
[ https://issues.apache.org/jira/browse/SPARK-21951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21951. --- Resolution: Invalid This doesn't express a problem, and questions should go to the mailing list > Unable to add the new column and writing into the Hive using spark > -- > > Key: SPARK-21951 > URL: https://issues.apache.org/jira/browse/SPARK-21951 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.1.1 >Reporter: jalendhar Baddam > > I am adding a new column to an existing Dataset and am unable to write it > into Hive using Spark. > Ex: Dataset<Row> ds = spark.sql("select * from Table"); > ds = ds.withColumn("newColumn", newColumnvalues); > ds.write().mode("overwrite").format("parquet").saveAsTable("Table"); > // Here I am getting the exception > I am loading the table from Hive using Spark, adding the new column to > that Dataset, and writing the same table back into Hive with the "overwrite" > option -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-21952) Unable to load the csv file into Dataset using Spark with java
[ https://issues.apache.org/jira/browse/SPARK-21952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-21952. --- Resolution: Invalid Spam > Unable to load the csv file into Dataset using Spark with java > --- > > Key: SPARK-21952 > URL: https://issues.apache.org/jira/browse/SPARK-21952 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 2.1.1 >Reporter: jalendhar Baddam > > Hi, > I am trying to load a CSV file using Spark with Java. The CSV file > contains one row with two line endings. I am attaching the CSV file and > placing the sample CSV file content. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21952) Unable to load the csv file into Dataset using Spark with java
jalendhar Baddam created SPARK-21952: Summary: Unable to load the csv file into Dataset using Spark with java Key: SPARK-21952 URL: https://issues.apache.org/jira/browse/SPARK-21952 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 2.1.1 Reporter: jalendhar Baddam Hi, I am trying to load a CSV file using Spark with Java. The CSV file contains one row with two line endings. I am attaching the CSV file and placing the sample CSV file content. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21951) Unable to add the new column and writing into the Hive using spark
jalendhar Baddam created SPARK-21951: Summary: Unable to add the new column and writing into the Hive using spark Key: SPARK-21951 URL: https://issues.apache.org/jira/browse/SPARK-21951 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 2.1.1 Reporter: jalendhar Baddam I am adding a new column to an existing Dataset and am unable to write it into Hive using Spark. Ex: Dataset<Row> ds = spark.sql("select * from Table"); ds = ds.withColumn("newColumn", newColumnvalues); ds.write().mode("overwrite").format("parquet").saveAsTable("Table"); // Here I am getting the exception I am loading the table from Hive using Spark, adding the new column to that Dataset, and writing the same table back into Hive with the "overwrite" option -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
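Overwriting the very table a Dataset was read from is a read-while-overwrite cycle. A common workaround (an assumption here, not advice given in this thread) is to materialize the new data to a temporary location first and then swap it in; a filesystem-level Python sketch of that pattern, with all names hypothetical:

```python
import os
import shutil
import tempfile

def overwrite_via_temp(table_dir, write_new_data):
    # Write the new contents beside the table, then replace the old
    # directory, so the source is never overwritten while being read.
    tmp = table_dir + ".tmp"
    write_new_data(tmp)
    shutil.rmtree(table_dir)
    os.rename(tmp, table_dir)

def write_new_data(d):
    os.makedirs(d)
    with open(os.path.join(d, "part-0"), "w") as f:
        f.write("newColumn")  # stands in for the augmented rows

base = tempfile.mkdtemp()
table = os.path.join(base, "Table")
os.makedirs(table)
overwrite_via_temp(table, write_new_data)
print(sorted(os.listdir(table)))  # ['part-0']
```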
[jira] [Commented] (SPARK-650) Add a "setup hook" API for running initialization code on each executor
[ https://issues.apache.org/jira/browse/SPARK-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158202#comment-16158202 ] yiming.xu commented on SPARK-650: - I need a hook too. In some cases, we need to initialize something, much like a Spring init bean :( > Add a "setup hook" API for running initialization code on each executor > --- > > Key: SPARK-650 > URL: https://issues.apache.org/jira/browse/SPARK-650 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Matei Zaharia >Priority: Minor > > Would be useful to configure things like reporting libraries -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21067) Thrift Server - CTAS fail with Unable to move source
[ https://issues.apache.org/jira/browse/SPARK-21067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158192#comment-16158192 ] xinzhang commented on SPARK-21067: -- Hi [~dricard], do you have any solutions now? Any suggestions would be helpful. > Thrift Server - CTAS fail with Unable to move source > > > Key: SPARK-21067 > URL: https://issues.apache.org/jira/browse/SPARK-21067 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.1, 2.2.0 > Environment: Yarn > Hive MetaStore > HDFS (HA) >Reporter: Dominic Ricard > > After upgrading our Thrift cluster to 2.1.1, we ran into an issue where CTAS > would fail, sometimes... > Most of the time, the CTAS would work only once, after starting the thrift > server. After that, dropping the table and re-issuing the same CTAS would > fail with the following message (sometimes it fails right away, sometimes it > works for a long period of time): > {noformat} > Error: org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source > hdfs://nameservice1//tmp/hive-staging/thrift_hive_2017-06-12_16-56-18_464_7598877199323198104-31/-ext-1/part-0 > to destination > hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; > (state=,code=0) > {noformat} > We have already found the following Jira > (https://issues.apache.org/jira/browse/SPARK-11021) which states that the > {{hive.exec.stagingdir}} had to be added in order for Spark to be able to > handle CREATE TABLE properly as of 2.0. 
As you can see in the error, we have > ours set to "/tmp/hive-staging/\{user.name\}" > Same issue with INSERT statements: > {noformat} > CREATE TABLE IF NOT EXISTS dricard.test (col1 int); INSERT INTO TABLE > dricard.test SELECT 1; > Error: org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source > hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-12_20-41-12_964_3086448130033637241-16/-ext-1/part-0 > to destination > hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; > (state=,code=0) > {noformat} > This worked fine in 1.6.2, which we currently run in our Production > Environment but since 2.0+, we haven't been able to CREATE TABLE consistently > on the cluster. > SQL to reproduce issue: > {noformat} > DROP SCHEMA IF EXISTS dricard CASCADE; > CREATE SCHEMA dricard; > CREATE TABLE dricard.test (col1 int); > INSERT INTO TABLE dricard.test SELECT 1; > SELECT * from dricard.test; > DROP TABLE dricard.test; > CREATE TABLE dricard.test AS select 1 as `col1`; > SELECT * from dricard.test > {noformat} > Thrift server usually fails at INSERT... > Tried the same procedure in a spark context using spark.sql() and didn't > encounter the same issue. 
> Full stack Trace: > {noformat} > 17/06/14 14:52:18 ERROR thriftserver.SparkExecuteStatementOperation: Error > executing query, currentState RUNNING, > org.apache.spark.sql.AnalysisException: > org.apache.hadoop.hive.ql.metadata.HiveException: Unable to move source > hdfs://nameservice1/tmp/hive-staging/thrift_hive_2017-06-14_14-52-18_521_5906917519254880890-5/-ext-1/part-0 > to desti > nation hdfs://nameservice1/user/hive/warehouse/dricard.db/test/part-0; > at > org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:106) > at > org.apache.spark.sql.hive.HiveExternalCatalog.loadTable(HiveExternalCatalog.scala:766) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult$lzycompute(InsertIntoHiveTable.scala:374) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.sideEffectResult(InsertIntoHiveTable.scala:221) > at > org.apache.spark.sql.hive.execution.InsertIntoHiveTable.doExecute(InsertIntoHiveTable.scala:407) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132) > at > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92) > at org.apache.spark.sql.Dataset.(Dataset.scala:185) > at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:64) > at
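For reference, the staging-directory setting the reporter mentions is normally supplied to the Thrift Server through hive-site.xml (or an equivalent --hiveconf flag at start-up). A sketch of the hive-site.xml form, using the reporter's own value; the exact value and placement depend on your deployment:

```xml
<!-- Hedged illustration (hive-site.xml): the hive.exec.stagingdir setting
     referenced by SPARK-11021; the value mirrors the reporter's
     "/tmp/hive-staging/{user.name}". -->
<property>
  <name>hive.exec.stagingdir</name>
  <value>/tmp/hive-staging/{user.name}</value>
</property>
```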
[jira] [Commented] (SPARK-18492) GeneratedIterator grows beyond 64 KB
[ https://issues.apache.org/jira/browse/SPARK-18492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158189#comment-16158189 ] cen yuhai commented on SPARK-18492: --- Spark 2.1.1 also has this problem > GeneratedIterator grows beyond 64 KB > > > Key: SPARK-18492 > URL: https://issues.apache.org/jira/browse/SPARK-18492 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.1 > Environment: CentOS release 6.7 (Final) >Reporter: Norris Merritt > > spark-submit fails with ERROR CodeGenerator: failed to compile: > org.codehaus.janino.JaninoRuntimeException: Code of method > "(I[Lscala/collection/Iterator;)V" of class > "org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator" > grows beyond 64 KB > Error message is followed by a huge dump of generated source code. > The generated code declares 1,454 field sequences like the following: > /* 036 */ private org.apache.spark.sql.catalyst.expressions.ScalaUDF > project_scalaUDF1; > /* 037 */ private scala.Function1 project_catalystConverter1; > /* 038 */ private scala.Function1 project_converter1; > /* 039 */ private scala.Function1 project_converter2; > /* 040 */ private scala.Function2 project_udf1; > (many omitted lines) ... > /* 6089 */ private org.apache.spark.sql.catalyst.expressions.ScalaUDF > project_scalaUDF1454; > /* 6090 */ private scala.Function1 project_catalystConverter1454; > /* 6091 */ private scala.Function1 project_converter1695; > /* 6092 */ private scala.Function1 project_udf1454; > It then proceeds to emit code for several methods (init, processNext), each of > which has totally repetitive sequences of statements pertaining to each of > the sequences of variables declared in the class. For example: > /* 6101 */ public void init(int index, scala.collection.Iterator inputs[]) { > The 64KB JVM per-method code limit is exceeded > because the code generator is using an incredibly naive strategy. 
It emits a > sequence like the one shown below for each of the 1,454 groups of variables > shown above, in > /* 6132 */ this.project_udf = > (scala.Function1)project_scalaUDF.userDefinedFunc(); > /* 6133 */ this.project_scalaUDF1 = > (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[10]; > /* 6134 */ this.project_catalystConverter1 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF1.dataType()); > /* 6135 */ this.project_converter1 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(0))).dataType()); > /* 6136 */ this.project_converter2 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToScalaConverter(((org.apache.spark.sql.catalyst.expressions.Expression)(((org.apache.spark.sql.catalyst.expressions.ScalaUDF)references[10]).getChildren().apply(1))).dataType()); > It blows up after emitting 230 such sequences, while trying to emit the 231st: > /* 7282 */ this.project_udf230 = > (scala.Function2)project_scalaUDF230.userDefinedFunc(); > /* 7283 */ this.project_scalaUDF231 = > (org.apache.spark.sql.catalyst.expressions.ScalaUDF) references[240]; > /* 7284 */ this.project_catalystConverter231 = > (scala.Function1)org.apache.spark.sql.catalyst.CatalystTypeConverters$.MODULE$.createToCatalystConverter(project_scalaUDF231.dataType()); > many omitted lines ... > Example of repetitive code sequences emitted for processNext method: > /* 12253 */ boolean project_isNull247 = project_result244 == null; > /* 12254 */ MapData project_value247 = null; > /* 12255 */ if (!project_isNull247) { > /* 12256 */ project_value247 = project_result244; > /* 12257 */ } > /* 12258 */ Object project_arg = sort_isNull5 ? 
null : > project_converter489.apply(sort_value5); > /* 12259 */ > /* 12260 */ ArrayData project_result249 = null; > /* 12261 */ try { > /* 12262 */ project_result249 = > (ArrayData)project_catalystConverter248.apply(project_udf248.apply(project_arg)); > /* 12263 */ } catch (Exception e) { > /* 12264 */ throw new > org.apache.spark.SparkException(project_scalaUDF248.udfErrorMessage(), e); > /* 12265 */ } > /* 12266 */ > /* 12267 */ boolean project_isNull252 = project_result249 == null; > /* 12268 */ ArrayData project_value252 = null; > /* 12269 */ if (!project_isNull252) { > /* 12270 */ project_value252 = project_result249; > /* 12271 */ } > /* 12272 */ Object project_arg1 = project_isNull252 ? null :
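The standard remedy for this class of failure (and the direction Spark's code generator later took) is to split the emitted statements across many small helper methods instead of one giant init()/processNext() body, so no single method's bytecode approaches the JVM's 64 KB per-method limit. A minimal, self-contained sketch of the idea, with hypothetical names — this is not Spark's actual CodeGenerator API:

```scala
// Hedged sketch: chunk generated Java statements into helper methods of at
// most `chunkSize` statements each, and have the public entry point delegate
// to them. Each helper stays far below the 64 KB bytecode ceiling.
object CodeSplitter {
  def splitIntoMethods(statements: Seq[String], chunkSize: Int): String = {
    // One private helper per chunk of statements.
    val helpers = statements.grouped(chunkSize).zipWithIndex.map {
      case (chunk, i) =>
        s"private void init_$i() {\n  " + chunk.mkString("\n  ") + "\n}"
    }.toSeq
    // The entry point just calls the helpers in order.
    val calls = helpers.indices.map(i => s"init_$i();").mkString("\n  ")
    "public void init() {\n  " + calls + "\n}\n" + helpers.mkString("\n")
  }
}
```

With 1,454 repetitive assignment groups, this turns one enormous init() into a sequence of cheap delegating calls plus many small helpers, which Janino can compile without hitting the limit.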
[jira] [Resolved] (SPARK-21936) backward compatibility test framework for HiveExternalCatalog
[ https://issues.apache.org/jira/browse/SPARK-21936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-21936. - Resolution: Fixed Fix Version/s: 2.3.0 > backward compatibility test framework for HiveExternalCatalog > - > > Key: SPARK-21936 > URL: https://issues.apache.org/jira/browse/SPARK-21936 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan > Fix For: 2.3.0 > >
[jira] [Assigned] (SPARK-21934) Expose Netty memory usage via Metrics System
[ https://issues.apache.org/jira/browse/SPARK-21934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21934: Assignee: (was: Apache Spark) > Expose Netty memory usage via Metrics System > > > Key: SPARK-21934 > URL: https://issues.apache.org/jira/browse/SPARK-21934 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Saisai Shao > > This is a follow-up work of SPARK-9104 to expose the Netty memory usage to > MetricsSystem. My initial thought is to only expose Shuffle memory usage, > since shuffle is a major part of memory usage in network communication > compared to RPC, file server, block transfer. > If user wants to also expose Netty memory usage for other modules, we could > add more metrics later.
[jira] [Commented] (SPARK-21934) Expose Netty memory usage via Metrics System
[ https://issues.apache.org/jira/browse/SPARK-21934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158184#comment-16158184 ] Apache Spark commented on SPARK-21934: -- User 'jerryshao' has created a pull request for this issue: https://github.com/apache/spark/pull/19160 > Expose Netty memory usage via Metrics System > > > Key: SPARK-21934 > URL: https://issues.apache.org/jira/browse/SPARK-21934 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Saisai Shao > > This is a follow-up work of SPARK-9104 to expose the Netty memory usage to > MetricsSystem. My initial thought is to only expose Shuffle memory usage, > since shuffle is a major part of memory usage in network communication > compared to RPC, file server, block transfer. > If user wants to also expose Netty memory usage for other modules, we could > add more metrics later.
[jira] [Assigned] (SPARK-21934) Expose Netty memory usage via Metrics System
[ https://issues.apache.org/jira/browse/SPARK-21934?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21934: Assignee: Apache Spark > Expose Netty memory usage via Metrics System > > > Key: SPARK-21934 > URL: https://issues.apache.org/jira/browse/SPARK-21934 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.3.0 >Reporter: Saisai Shao >Assignee: Apache Spark > > This is a follow-up work of SPARK-9104 to expose the Netty memory usage to > MetricsSystem. My initial thought is to only expose Shuffle memory usage, > since shuffle is a major part of memory usage in network communication > compared to RPC, file server, block transfer. > If user wants to also expose Netty memory usage for other modules, we could > add more metrics later.
[jira] [Assigned] (SPARK-21946) Flaky test: InMemoryCatalogedDDLSuite.`alter table: rename cached table`
[ https://issues.apache.org/jira/browse/SPARK-21946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21946: Assignee: (was: Apache Spark) > Flaky test: InMemoryCatalogedDDLSuite.`alter table: rename cached table` > > > Key: SPARK-21946 > URL: https://issues.apache.org/jira/browse/SPARK-21946 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.2.0 >Reporter: Dongjoon Hyun >Priority: Minor > > According to the [Apache Spark Jenkins > History|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/lastCompletedBuild/testReport/org.apache.spark.sql.execution.command/InMemoryCatalogedDDLSuite/alter_table__rename_cached_table/history/] > InMemoryCatalogedDDLSuite.`alter table: rename cached table` is very flaky. > We had better stabilize this. > {code} > - alter table: rename cached table !!! CANCELED !!! > Array([2,2], [1,1]) did not equal Array([1,1], [2,2]) bad test: wrong data > (DDLSuite.scala:786) > {code}
[jira] [Assigned] (SPARK-21946) Flaky test: InMemoryCatalogedDDLSuite.`alter table: rename cached table`
[ https://issues.apache.org/jira/browse/SPARK-21946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-21946: Assignee: Apache Spark > Flaky test: InMemoryCatalogedDDLSuite.`alter table: rename cached table` > > > Key: SPARK-21946 > URL: https://issues.apache.org/jira/browse/SPARK-21946 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.2.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Minor > > According to the [Apache Spark Jenkins > History|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/lastCompletedBuild/testReport/org.apache.spark.sql.execution.command/InMemoryCatalogedDDLSuite/alter_table__rename_cached_table/history/] > InMemoryCatalogedDDLSuite.`alter table: rename cached table` is very flaky. > We had better stabilize this. > {code} > - alter table: rename cached table !!! CANCELED !!! > Array([2,2], [1,1]) did not equal Array([1,1], [2,2]) bad test: wrong data > (DDLSuite.scala:786) > {code}
[jira] [Commented] (SPARK-21946) Flaky test: InMemoryCatalogedDDLSuite.`alter table: rename cached table`
[ https://issues.apache.org/jira/browse/SPARK-21946?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16158176#comment-16158176 ] Apache Spark commented on SPARK-21946: -- User 'kiszk' has created a pull request for this issue: https://github.com/apache/spark/pull/19159 > Flaky test: InMemoryCatalogedDDLSuite.`alter table: rename cached table` > > > Key: SPARK-21946 > URL: https://issues.apache.org/jira/browse/SPARK-21946 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 2.2.0 >Reporter: Dongjoon Hyun >Priority: Minor > > According to the [Apache Spark Jenkins > History|https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/lastCompletedBuild/testReport/org.apache.spark.sql.execution.command/InMemoryCatalogedDDLSuite/alter_table__rename_cached_table/history/] > InMemoryCatalogedDDLSuite.`alter table: rename cached table` is very flaky. > We had better stabilize this. > {code} > - alter table: rename cached table !!! CANCELED !!! > Array([2,2], [1,1]) did not equal Array([1,1], [2,2]) bad test: wrong data > (DDLSuite.scala:786) > {code}
[jira] [Resolved] (SPARK-21726) Check for structural integrity of the plan in QO in test mode
[ https://issues.apache.org/jira/browse/SPARK-21726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-21726. - Resolution: Fixed Assignee: Liang-Chi Hsieh Fix Version/s: 2.3.0 > Check for structural integrity of the plan in QO in test mode > - > > Key: SPARK-21726 > URL: https://issues.apache.org/jira/browse/SPARK-21726 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.2.0 >Reporter: Reynold Xin >Assignee: Liang-Chi Hsieh > Fix For: 2.3.0 > > > Right now we don't have any checks in the optimizer to check for the > structural integrity of the plan (e.g. resolved). It would be great if in > test mode, we can check whether a plan is still resolved after the execution > of each rule, so we can catch rules that return invalid plans.
[jira] [Resolved] (SPARK-21949) Tables created in unit tests should be dropped after use
[ https://issues.apache.org/jira/browse/SPARK-21949?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-21949. - Resolution: Fixed Assignee: liuxian Fix Version/s: 2.3.0 > Tables created in unit tests should be dropped after use > > > Key: SPARK-21949 > URL: https://issues.apache.org/jira/browse/SPARK-21949 > Project: Spark > Issue Type: Test > Components: Tests >Affects Versions: 2.3.0 >Reporter: liuxian >Assignee: liuxian >Priority: Trivial > Fix For: 2.3.0 > > > Tables should be dropped after use in unit tests.