[jira] [Resolved] (SPARK-4264) SQL HashJoin induces "refCnt = 0" error in ShuffleBlockFetcherIterator
[ https://issues.apache.org/jira/browse/SPARK-4264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson resolved SPARK-4264. --- Resolution: Fixed > SQL HashJoin induces "refCnt = 0" error in ShuffleBlockFetcherIterator > -- > > Key: SPARK-4264 > URL: https://issues.apache.org/jira/browse/SPARK-4264 > Project: Spark > Issue Type: Bug >Affects Versions: 1.2.0 >Reporter: Aaron Davidson >Assignee: Aaron Davidson >Priority: Blocker > > This is because it calls hasNext twice, which invokes the completion iterator > twice, unintuitively. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
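The double invocation described above is easy to reproduce in miniature. Below is an illustrative Python sketch (Spark's actual CompletionIterator is Scala, and the names here are hypothetical): a wrapper fires a completion callback, such as releasing a fetched buffer, when the underlying iterator is exhausted. Without the idempotence guard shown, calling hasNext a second time after exhaustion, as the SQL HashJoin did, would release the buffer twice and drive its refCnt to 0.

```python
class CompletionIterator:
    """Sketch of an iterator wrapper that fires `on_complete` once the
    underlying iterator is exhausted. The `_completed` flag makes the
    callback idempotent; without it, a second has_next() call after
    exhaustion would run the completion (e.g. a buffer release) twice."""

    def __init__(self, underlying, on_complete):
        self._underlying = iter(underlying)
        self._buffered = []          # at most one look-ahead element
        self._completed = False
        self._on_complete = on_complete

    def has_next(self):
        if self._buffered:
            return True
        try:
            self._buffered.append(next(self._underlying))
            return True
        except StopIteration:
            if not self._completed:  # guard: fire the callback only once
                self._completed = True
                self._on_complete()
            return False

    def next(self):
        if not self.has_next():
            raise StopIteration
        return self._buffered.pop()


releases = []
it = CompletionIterator([1, 2], lambda: releases.append("released"))
out = []
while it.has_next():
    out.append(it.next())
it.has_next()  # second call after exhaustion: callback must NOT fire again
```

With the guard, the release list holds a single entry no matter how many times `has_next()` is called after the data runs out.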
[jira] [Created] (SPARK-4277) Support external shuffle service on Worker
Aaron Davidson created SPARK-4277: - Summary: Support external shuffle service on Worker Key: SPARK-4277 URL: https://issues.apache.org/jira/browse/SPARK-4277 Project: Spark Issue Type: Bug Components: Deploy, Spark Core Reporter: Aaron Davidson Assignee: Aaron Davidson It's less useful to have an external shuffle service on the Spark Standalone Worker than on YARN or Mesos (as executor allocations tend to be more static), but it would be good to help test the code path. It would also make Spark more resilient to particular executor failures. Cool side-feature: When SPARK-4236 is fixed and integrated, the Worker will take care of cleaning up executor directories, which will mean executors terminate more quickly and that we don't leak data if the executor dies forcefully.
[jira] [Commented] (SPARK-4238) Perform network-level retry of shuffle file fetches
[ https://issues.apache.org/jira/browse/SPARK-4238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200526#comment-14200526 ] Aaron Davidson commented on SPARK-4238: --- Oh, whoops. > Perform network-level retry of shuffle file fetches > --- > > Key: SPARK-4238 > URL: https://issues.apache.org/jira/browse/SPARK-4238 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Aaron Davidson >Assignee: Aaron Davidson >Priority: Critical > > During periods of high network (or GC) load, it is not uncommon that > IOExceptions crop up around connection failures when fetching shuffle files. > Unfortunately, when such a failure occurs, it is interpreted as an inability > to fetch the files, which causes us to mark the executor as lost and > recompute all of its shuffle outputs. > We should allow retrying at the network level in the event of an IOException > in order to avoid this circumstance.
[jira] [Updated] (SPARK-4188) Shuffle fetches should be retried at a lower level
[ https://issues.apache.org/jira/browse/SPARK-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson updated SPARK-4188: -- Description: During periods of high network (or GC) load, it is not uncommon that IOExceptions crop up around connection failures when fetching shuffle files. Unfortunately, when such a failure occurs, it is interpreted as an inability to fetch the files, which causes us to mark the executor as lost and recompute all of its shuffle outputs. We should allow retrying at the network level in the event of an IOException in order to avoid this circumstance. was:Sometimes fetches will fail due to garbage collection pauses or network load. A simple retry could save recomputation of a lot of shuffle data, especially if it's below the task level (i.e., on the level of a single fetch). > Shuffle fetches should be retried at a lower level > -- > > Key: SPARK-4188 > URL: https://issues.apache.org/jira/browse/SPARK-4188 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Aaron Davidson > > During periods of high network (or GC) load, it is not uncommon that > IOExceptions crop up around connection failures when fetching shuffle files. > Unfortunately, when such a failure occurs, it is interpreted as an inability > to fetch the files, which causes us to mark the executor as lost and > recompute all of its shuffle outputs. > We should allow retrying at the network level in the event of an IOException > in order to avoid this circumstance.
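The behavior the updated description proposes can be sketched as a small retry loop. This is an illustrative Python sketch, not Spark's actual implementation; `fetch`, `max_retries`, and `wait_secs` are hypothetical names. The point is that a transient IOError is retried at the transport level rather than surfacing as a fetch failure that marks the executor as lost.

```python
import time


def fetch_with_retry(fetch, max_retries=3, wait_secs=0.0):
    """Sketch of transport-level retry: re-attempt a fetch on IOError
    instead of failing the task. `fetch` is a zero-argument callable
    that returns the block data or raises IOError on a connection
    failure (a stand-in for the real network fetch)."""
    attempt = 0
    while True:
        try:
            return fetch()
        except IOError:
            attempt += 1
            if attempt > max_retries:
                raise  # out of retries: surface the failure as before
            time.sleep(wait_secs)  # back off before the next attempt


# Demo: a fetch that fails twice with connection errors, then succeeds.
attempts = {"n": 0}

def flaky_fetch():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise IOError("connection reset by peer")
    return b"shuffle-block"

result = fetch_with_retry(flaky_fetch, max_retries=3, wait_secs=0.0)
```

Under this scheme the two connection resets are absorbed by the retry loop and only a persistent failure (more than `max_retries` consecutive IOErrors) is reported upward.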
[jira] [Closed] (SPARK-4238) Perform network-level retry of shuffle file fetches
[ https://issues.apache.org/jira/browse/SPARK-4238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson closed SPARK-4238. - Resolution: Duplicate > Perform network-level retry of shuffle file fetches > --- > > Key: SPARK-4238 > URL: https://issues.apache.org/jira/browse/SPARK-4238 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Aaron Davidson >Assignee: Aaron Davidson >Priority: Critical > > During periods of high network (or GC) load, it is not uncommon that > IOExceptions crop up around connection failures when fetching shuffle files. > Unfortunately, when such a failure occurs, it is interpreted as an inability > to fetch the files, which causes us to mark the executor as lost and > recompute all of its shuffle outputs. > We should allow retrying at the network level in the event of an IOException > in order to avoid this circumstance.
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200519#comment-14200519 ] Aaron Davidson commented on SPARK-2468: --- [~zzcclp] Use of epoll mode is highly dependent on your environment, and I personally would not recommend it due to known netty bugs which may cause it to be less stable. We have found nio mode to be sufficiently performant in our testing (and netty actually still tries to use epoll if it's available as its selector). [~lianhuiwang] Could you please elaborate on what you mean? > Netty-based block server / client module > > > Key: SPARK-2468 > URL: https://issues.apache.org/jira/browse/SPARK-2468 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Critical > Fix For: 1.2.0 > > > Right now shuffle send goes through the block manager. This is inefficient > because it requires loading a block from disk into a kernel buffer, then into > a user space buffer, and then back to a kernel send buffer before it reaches > the NIC. It does multiple copies of the data and context switching between > kernel/user. It also creates unnecessary buffers in the JVM that increase GC pressure. > Instead, we should use FileChannel.transferTo, which handles this in the > kernel space with zero-copy. See > http://www.ibm.com/developerworks/library/j-zerocopy/ > One potential solution is to use Netty. Spark already has a Netty based > network module implemented (org.apache.spark.network.netty). However, it > lacks some functionality and is turned off by default.
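The zero-copy path described in the quoted issue can be demonstrated with `os.sendfile`, the POSIX analog of Java's `FileChannel.transferTo`. This is an illustrative Python sketch, not Spark's code, and it assumes a platform (e.g. Linux) where `os.sendfile` is available; the function name is hypothetical.

```python
import os
import socket
import tempfile


def send_file_zero_copy(path, sock):
    """Instead of read()-ing the file into a userspace buffer and then
    write()-ing it back out (extra copies plus kernel/user context
    switches), hand the transfer to the kernel: os.sendfile copies the
    file's pages straight into the socket's send buffer."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        offset = 0
        while offset < size:
            sent = os.sendfile(sock.fileno(), fd, offset, size - offset)
            if sent == 0:
                break
            offset += sent
        return offset
    finally:
        os.close(fd)


# Demo: ship a small "shuffle block" over a local socket pair.
with tempfile.NamedTemporaryFile(delete=False) as tf:
    tf.write(b"shuffle-block-data")
    tmp_path = tf.name

a, b = socket.socketpair()
sent_total = send_file_zero_copy(tmp_path, a)
a.close()
received = b.recv(64)
b.close()
os.unlink(tmp_path)
```

The loop is needed because `sendfile` may transfer fewer bytes than requested in one call; the offset argument lets the kernel resume where it left off.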
[jira] [Comment Edited] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199911#comment-14199911 ] Aaron Davidson edited comment on SPARK-2468 at 11/6/14 6:57 AM: This could be due to the netty transfer service allocating more off-heap byte buffers, which perhaps is accounted for differently by YARN. [PR #3101|https://github.com/apache/spark/pull/3101/files#diff-d2ce9b38bdc38ca9d7119f9c2cf79907R33], which should go in tomorrow, will include a way to avoid allocating off-heap buffers (by setting the spark config "spark.shuffle.io.preferDirectBufs=false"), which should either solve your problem or at least produce the more typical OutOfMemoryError. was (Author: ilikerps): This could be due to the netty transfer service allocating more off-heap byte buffers, which perhaps is accounted for differently by YARN. [PR #3101|https://github.com/apache/spark/pull/3101/files#diff-d2ce9b38bdc38ca9d7119f9c2cf79907R33], which should go in tomorrow, will include a way to avoid allocating off-heap buffers, which should either solve your problem or at least produce the more typical OutOfMemoryError. > Netty-based block server / client module > > > Key: SPARK-2468 > URL: https://issues.apache.org/jira/browse/SPARK-2468 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Critical > Fix For: 1.2.0 > > > Right now shuffle send goes through the block manager. This is inefficient > because it requires loading a block from disk into a kernel buffer, then into > a user space buffer, and then back to a kernel send buffer before it reaches > the NIC. It does multiple copies of the data and context switching between > kernel/user. It also creates unnecessary buffers in the JVM that increase GC pressure. > Instead, we should use FileChannel.transferTo, which handles this in the > kernel space with zero-copy. 
See > http://www.ibm.com/developerworks/library/j-zerocopy/ > One potential solution is to use Netty. Spark already has a Netty based > network module implemented (org.apache.spark.network.netty). However, it > lacks some functionality and is turned off by default.
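As a `spark-defaults.conf` fragment, the workaround named in the edited comment above would look like the following (a sketch; verify the key against your Spark version's configuration docs before relying on it):

```
# fall back to heap buffers in the Netty transfer service
spark.shuffle.io.preferDirectBufs   false
```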
[jira] [Commented] (SPARK-2468) Netty-based block server / client module
[ https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199911#comment-14199911 ] Aaron Davidson commented on SPARK-2468: --- This could be due to the netty transfer service allocating more off-heap byte buffers, which perhaps is accounted for differently by YARN. [PR #3101|https://github.com/apache/spark/pull/3101/files#diff-d2ce9b38bdc38ca9d7119f9c2cf79907R33], which should go in tomorrow, will include a way to avoid allocating off-heap buffers, which should either solve your problem or at least produce the more typical OutOfMemoryError. > Netty-based block server / client module > > > Key: SPARK-2468 > URL: https://issues.apache.org/jira/browse/SPARK-2468 > Project: Spark > Issue Type: Improvement > Components: Shuffle, Spark Core >Reporter: Reynold Xin >Assignee: Reynold Xin >Priority: Critical > Fix For: 1.2.0 > > > Right now shuffle send goes through the block manager. This is inefficient > because it requires loading a block from disk into a kernel buffer, then into > a user space buffer, and then back to a kernel send buffer before it reaches > the NIC. It does multiple copies of the data and context switching between > kernel/user. It also creates unnecessary buffers in the JVM that increase GC pressure. > Instead, we should use FileChannel.transferTo, which handles this in the > kernel space with zero-copy. See > http://www.ibm.com/developerworks/library/j-zerocopy/ > One potential solution is to use Netty. Spark already has a Netty based > network module implemented (org.apache.spark.network.netty). However, it > lacks some functionality and is turned off by default.
[jira] [Created] (SPARK-4264) SQL HashJoin induces "refCnt = 0" error in ShuffleBlockFetcherIterator
Aaron Davidson created SPARK-4264: - Summary: SQL HashJoin induces "refCnt = 0" error in ShuffleBlockFetcherIterator Key: SPARK-4264 URL: https://issues.apache.org/jira/browse/SPARK-4264 Project: Spark Issue Type: Bug Affects Versions: 1.2.0 Reporter: Aaron Davidson Assignee: Aaron Davidson Priority: Blocker This is because it calls hasNext twice, which invokes the completion iterator twice, unintuitively.
[jira] [Created] (SPARK-4242) Add SASL to external shuffle service
Aaron Davidson created SPARK-4242: - Summary: Add SASL to external shuffle service Key: SPARK-4242 URL: https://issues.apache.org/jira/browse/SPARK-4242 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Aaron Davidson Assignee: Aaron Davidson It's already added to NettyBlockTransferService; let's just add it to ExternalShuffleClient as well.
[jira] [Created] (SPARK-4238) Perform network-level retry of shuffle file fetches
Aaron Davidson created SPARK-4238: - Summary: Perform network-level retry of shuffle file fetches Key: SPARK-4238 URL: https://issues.apache.org/jira/browse/SPARK-4238 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Aaron Davidson Assignee: Aaron Davidson Priority: Critical During periods of high network (or GC) load, it is not uncommon that IOExceptions crop up around connection failures when fetching shuffle files. Unfortunately, when such a failure occurs, it is interpreted as an inability to fetch the files, which causes us to mark the executor as lost and recompute all of its shuffle outputs. We should allow retrying at the network level in the event of an IOException in order to avoid this circumstance.
[jira] [Created] (SPARK-4236) External shuffle service must cleanup its shuffle files
Aaron Davidson created SPARK-4236: - Summary: External shuffle service must cleanup its shuffle files Key: SPARK-4236 URL: https://issues.apache.org/jira/browse/SPARK-4236 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Aaron Davidson Priority: Critical When external shuffle service is enabled, executors no longer clean up their own shuffle files, as the external server can continue to serve them. The external server must clean these files up once it knows that they will never be used again (i.e., when the application terminates or when Spark's ContextCleaner requests that they be deleted).
[jira] [Resolved] (SPARK-4163) When fetching blocks unsuccessfully, Web UI only displays "Fetch failure"
[ https://issues.apache.org/jira/browse/SPARK-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson resolved SPARK-4163. --- Resolution: Fixed Fix Version/s: 1.2.0 Assignee: Shixiong Zhu > When fetching blocks unsuccessfully, Web UI only displays "Fetch failure" > - > > Key: SPARK-4163 > URL: https://issues.apache.org/jira/browse/SPARK-4163 > Project: Spark > Issue Type: Improvement > Components: Spark Core, Web UI >Affects Versions: 1.0.0, 1.1.0 >Reporter: Shixiong Zhu >Assignee: Shixiong Zhu >Priority: Minor > Fix For: 1.2.0 > > > When fetching blocks unsuccessfully, Web UI only displays "Fetch failure". > It's hard to find out the cause of the fetch failure. Web UI should display > the stack trace for the fetch failure.
[jira] [Created] (SPARK-4198) Refactor BlockManager's doPut and doGetLocal into smaller pieces
Aaron Davidson created SPARK-4198: - Summary: Refactor BlockManager's doPut and doGetLocal into smaller pieces Key: SPARK-4198 URL: https://issues.apache.org/jira/browse/SPARK-4198 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Aaron Davidson Assignee: Aaron Davidson Currently the BlockManager is pretty hard to grok due to a lot of complicated, interconnected details. Two particularly smelly pieces of code are doPut and doGetLocal, which are both around 200 lines long, with heavily nested statements and lots of orthogonal logic. We should break this up into smaller pieces.
[jira] [Created] (SPARK-4189) FileSegmentManagedBuffer should have a configurable memory map threshold
Aaron Davidson created SPARK-4189: - Summary: FileSegmentManagedBuffer should have a configurable memory map threshold Key: SPARK-4189 URL: https://issues.apache.org/jira/browse/SPARK-4189 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Aaron Davidson One size does not fit all; it would be useful if there were a configuration to change the threshold at which we memory-map shuffle files.
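The threshold logic the issue asks to make configurable can be sketched as follows. This is an illustrative Python sketch, not Spark's FileSegmentManagedBuffer; the constant name and the 2 MiB default are assumptions for the example. Below the threshold a plain buffered read is cheaper than setting up a mapping; above it, memory-mapping avoids an extra copy.

```python
import mmap

# Assumed default for illustration; the issue's point is that this
# should be a configuration value rather than a hard-coded constant.
MEMORY_MAP_THRESHOLD = 2 * 1024 * 1024  # 2 MiB

def read_segment(path, offset, length, threshold=MEMORY_MAP_THRESHOLD):
    """Read `length` bytes at `offset` from a shuffle file segment,
    choosing between a buffered read and a memory map by size."""
    with open(path, "rb") as f:
        if length < threshold:
            f.seek(offset)
            return f.read(length)  # small segment: plain read()
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            return mm[offset:offset + length]  # large segment: map the file
        finally:
            mm.close()


# Demo: both code paths return the same bytes for the same segment.
import os
import tempfile

with tempfile.NamedTemporaryFile(delete=False) as tf:
    tf.write(b"0123456789")
    seg_path = tf.name

small = read_segment(seg_path, 2, 5)               # below threshold: read()
large = read_segment(seg_path, 2, 5, threshold=1)  # forced mmap path
os.unlink(seg_path)
```

Making the cutoff configurable lets deployments tune it for their page size, file sizes, and fetch patterns instead of accepting one fixed value.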
[jira] [Created] (SPARK-4188) Shuffle fetches should be retried at a lower level
Aaron Davidson created SPARK-4188: - Summary: Shuffle fetches should be retried at a lower level Key: SPARK-4188 URL: https://issues.apache.org/jira/browse/SPARK-4188 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.2.0 Reporter: Aaron Davidson Sometimes fetches will fail due to garbage collection pauses or network load. A simple retry could save recomputation of a lot of shuffle data, especially if it's below the task level (i.e., on the level of a single fetch).
[jira] [Created] (SPARK-4187) External shuffle service should not use Java serializer
Aaron Davidson created SPARK-4187: - Summary: External shuffle service should not use Java serializer Key: SPARK-4187 URL: https://issues.apache.org/jira/browse/SPARK-4187 Project: Spark Issue Type: Improvement Reporter: Aaron Davidson The Java serializer does not make a good solution for talking between JVMs, as the format may change between versions of the code, even if the class itself does not change in an incompatible manner.
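The alternative to JVM serialization is an explicit, versionable wire layout: a message is written field by field with fixed-width lengths, so its format is decoupled from any class's internals. The following Python sketch illustrates the idea; the message tag, field layout, and function names are hypothetical, not Spark's actual external shuffle protocol.

```python
import struct

OPEN_BLOCKS = 1  # illustrative message-type tag, not a Spark constant

def encode_open_blocks(app_id, block_ids):
    """Encode an 'open blocks' request with an explicit layout:
    [tag: u8][app_id len: u32][app_id][n blocks: u32]([len: u32][id])...
    Unlike Java serialization, nothing here depends on class internals,
    so old and new JVMs can agree on the format."""
    out = bytearray([OPEN_BLOCKS])
    encoded_app = app_id.encode()
    out += struct.pack(">I", len(encoded_app)) + encoded_app
    out += struct.pack(">I", len(block_ids))
    for block_id in block_ids:
        eb = block_id.encode()
        out += struct.pack(">I", len(eb)) + eb
    return bytes(out)

def decode_open_blocks(buf):
    """Decode the layout written by encode_open_blocks."""
    assert buf[0] == OPEN_BLOCKS
    pos = 1
    (alen,) = struct.unpack_from(">I", buf, pos); pos += 4
    app_id = buf[pos:pos + alen].decode(); pos += alen
    (n,) = struct.unpack_from(">I", buf, pos); pos += 4
    block_ids = []
    for _ in range(n):
        (blen,) = struct.unpack_from(">I", buf, pos); pos += 4
        block_ids.append(buf[pos:pos + blen].decode()); pos += blen
    return app_id, block_ids


# Demo: round-trip a request through the explicit encoding.
msg = encode_open_blocks("app-20141105", ["shuffle_0_1_2", "shuffle_0_3_4"])
app, blocks = decode_open_blocks(msg)
```

Because every field is written explicitly, adding a new message type or field later only requires bumping the tag or extending the layout, rather than hoping two JVM versions serialize a class identically.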
[jira] [Updated] (SPARK-4187) External shuffle service should not use Java serializer
[ https://issues.apache.org/jira/browse/SPARK-4187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson updated SPARK-4187: -- Component/s: Spark Core Affects Version/s: 1.2.0 > External shuffle service should not use Java serializer > --- > > Key: SPARK-4187 > URL: https://issues.apache.org/jira/browse/SPARK-4187 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.2.0 >Reporter: Aaron Davidson > > The Java serializer does not make a good solution for talking between JVMs, > as the format may change between versions of the code, even if the class > itself does not change in an incompatible manner.
[jira] [Created] (SPARK-4183) Enable Netty-based BlockTransferService by default
Aaron Davidson created SPARK-4183: - Summary: Enable Netty-based BlockTransferService by default Key: SPARK-4183 URL: https://issues.apache.org/jira/browse/SPARK-4183 Project: Spark Issue Type: Bug Reporter: Aaron Davidson Assignee: Aaron Davidson Spark's NettyBlockTransferService relies on Netty to achieve high performance and zero-copy IO, which is a big simplification over the manual connection management that's done in today's NioBlockTransferService. Additionally, the core functionality of the NettyBlockTransferService has been extracted out of spark core into the "network" package, with the intention of reusing this code for SPARK-3796 ([PR #3001|https://github.com/apache/spark/pull/3001/files#diff-54]). We should turn NettyBlockTransferService on by default in order to improve debuggability and stability of the network layer (which has historically been more challenging with the current BlockTransferService).
[jira] [Commented] (SPARK-4125) Serializer should work on ManagedBuffers as well
[ https://issues.apache.org/jira/browse/SPARK-4125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187696#comment-14187696 ] Aaron Davidson commented on SPARK-4125: --- cc [~rxin] > Serializer should work on ManagedBuffers as well > > > Key: SPARK-4125 > URL: https://issues.apache.org/jira/browse/SPARK-4125 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Aaron Davidson > > As an API nice-to-have, we should probably try to make ManagedBuffers a > first-class citizen throughout Spark to minimize the allocation of > unnecessary buffers. As part of this, we should make Serializer accept > ManagedBuffer instead of just ByteBuffer.
[jira] [Created] (SPARK-4125) Serializer should work on ManagedBuffers as well
Aaron Davidson created SPARK-4125: - Summary: Serializer should work on ManagedBuffers as well Key: SPARK-4125 URL: https://issues.apache.org/jira/browse/SPARK-4125 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Aaron Davidson As an API nice-to-have, we should probably try to make ManagedBuffers a first-class citizen throughout Spark to minimize the allocation of unnecessary buffers. As part of this, we should make Serializer accept ManagedBuffer instead of just ByteBuffer.
[jira] [Resolved] (SPARK-4084) Reuse sort key in Sorter
[ https://issues.apache.org/jira/browse/SPARK-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson resolved SPARK-4084. --- Resolution: Fixed Fix Version/s: 1.2.0 > Reuse sort key in Sorter > > > Key: SPARK-4084 > URL: https://issues.apache.org/jira/browse/SPARK-4084 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > Fix For: 1.2.0 > > > Sorter uses generic-typed key for sorting. When data is large, it creates > lots of key objects, which is not efficient. We should reuse the key in > Sorter for memory efficiency. This change is part of the petabyte sort > implementation from [~rxin].
[jira] [Created] (SPARK-4106) Shuffle write and spill to disk metrics are incorrect
Aaron Davidson created SPARK-4106: - Summary: Shuffle write and spill to disk metrics are incorrect Key: SPARK-4106 URL: https://issues.apache.org/jira/browse/SPARK-4106 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Aaron Davidson Priority: Critical I have encountered a job which has some disk spilled (memory), but the disk spilled (disk) is 0, as is the shuffle write. If I switch to hash-based shuffle, where there happens to be no disk spilling, then the shuffle write is correct. I can get more info on a workload to repro this situation, but perhaps that description of events is sufficient.
[jira] [Created] (SPARK-3994) countByKey / countByValue do not go through Aggregator
Aaron Davidson created SPARK-3994: - Summary: countByKey / countByValue do not go through Aggregator Key: SPARK-3994 URL: https://issues.apache.org/jira/browse/SPARK-3994 Project: Spark Issue Type: Bug Reporter: Aaron Davidson The implementations of these methods are historical remnants of Spark from a time when the shuffle may have been nonexistent. Now, they can be simplified by plugging into reduceByKey(), potentially seeing performance and stability improvements.
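The proposed simplification is that countByKey is just "map every value to 1, then reduceByKey with addition". The following plain-Python sketch shows the equivalence on an in-memory list of pairs; `reduce_by_key` here is a minimal stand-in for the distributed RDD operation, not Spark's implementation.

```python
def reduce_by_key(pairs, func):
    """Minimal in-memory stand-in for RDD.reduceByKey: merge the values
    of each key with `func` (illustrative, not Spark's distributed code)."""
    acc = {}
    for key, value in pairs:
        acc[key] = func(acc[key], value) if key in acc else value
    return acc

def count_by_key(pairs):
    """countByKey expressed via reduceByKey, as the issue proposes:
    replace each value with 1, then sum per key."""
    ones = ((key, 1) for key, _ in pairs)
    return reduce_by_key(ones, lambda a, b: a + b)


# Demo: two records under "a", one under "b".
counts = count_by_key([("a", 10), ("b", 20), ("a", 30)])
```

Routing the count through the same aggregation path as reduceByKey means it inherits that path's combiners and spilling behavior instead of maintaining a separate code path.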
[jira] [Assigned] (SPARK-3994) countByKey / countByValue do not go through Aggregator
[ https://issues.apache.org/jira/browse/SPARK-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson reassigned SPARK-3994: - Assignee: Aaron Davidson > countByKey / countByValue do not go through Aggregator > -- > > Key: SPARK-3994 > URL: https://issues.apache.org/jira/browse/SPARK-3994 > Project: Spark > Issue Type: Bug >Reporter: Aaron Davidson >Assignee: Aaron Davidson > > The implementations of these methods are historical remnants of Spark from a > time when the shuffle may have been nonexistent. Now, they can be simplified > by plugging into reduceByKey(), potentially seeing performance and stability > improvements.
[jira] [Commented] (SPARK-3950) Completed time is blank for some successful tasks
[ https://issues.apache.org/jira/browse/SPARK-3950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14171563#comment-14171563 ] Aaron Davidson commented on SPARK-3950: --- [~andrewor14] > Completed time is blank for some successful tasks > - > > Key: SPARK-3950 > URL: https://issues.apache.org/jira/browse/SPARK-3950 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.1.1 >Reporter: Aaron Davidson > > In the Spark web UI, some tasks appear to have a blank Duration column. It's > possible that these ran for <.5 seconds, but if so, we should use > milliseconds like we do for GC time.
[jira] [Created] (SPARK-3950) Completed time is blank for some successful tasks
Aaron Davidson created SPARK-3950: - Summary: Completed time is blank for some successful tasks Key: SPARK-3950 URL: https://issues.apache.org/jira/browse/SPARK-3950 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.1.1 Reporter: Aaron Davidson In the Spark web UI, some tasks appear to have a blank Duration column. It's possible that these ran for <.5 seconds, but if so, we should use milliseconds like we do for GC time.
[jira] [Commented] (SPARK-3923) All Standalone Mode services time out with each other
[ https://issues.apache.org/jira/browse/SPARK-3923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169044#comment-14169044 ] Aaron Davidson commented on SPARK-3923: --- I did a little digging hoping to find some post about this, no particular luck. I did find [this post|https://groups.google.com/forum/#!topic/akka-user/X3xzpTCbEFs] which recommends using an interval time < pause, which we are not doing. This doesn't seem to explain the services all timing out after the heartbeat interval time (which is currently 1000 seconds), but may be good to know in the future. > All Standalone Mode services time out with each other > - > > Key: SPARK-3923 > URL: https://issues.apache.org/jira/browse/SPARK-3923 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.2.0 >Reporter: Aaron Davidson >Priority: Blocker > > I'm seeing an issue where it seems that components in Standalone Mode > (Worker, Master, Driver, and Executor) all seem to time out with each other > after around 1000 seconds. Here is an example log: > {code} > 14/10/13 06:43:55 INFO Master: Registering worker > ip-10-0-147-189.us-west-2.compute.internal:38922 with 4 cores, 29.0 GB RAM > 14/10/13 06:43:55 INFO Master: Registering worker > ip-10-0-175-214.us-west-2.compute.internal:42918 with 4 cores, 59.0 GB RAM > 14/10/13 06:43:56 INFO Master: Registering app Databricks Shell > 14/10/13 06:43:56 INFO Master: Registered app Databricks Shell with ID > app-20141013064356- > ... precisely 1000 seconds later ... > 14/10/13 07:00:35 WARN ReliableDeliverySupervisor: Association with remote > system > [akka.tcp://sparkwor...@ip-10-0-147-189.us-west-2.compute.internal:38922] has > failed, address is now gated for [5000] ms. Reason is: [Disassociated]. > 14/10/13 07:00:35 INFO Master: > akka.tcp://sparkwor...@ip-10-0-147-189.us-west-2.compute.internal:38922 got > disassociated, removing it. 
> 14/10/13 07:00:35 INFO LocalActorRef: Message > [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from > Actor[akka://sparkMaster/deadLetters] to > Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.147.189%3A54956-1#1529980245] > was not delivered. [2] dead letters encountered. This logging can be turned > off or adjusted with configuration settings 'akka.log-dead-letters' and > 'akka.log-dead-letters-during-shutdown'. > 14/10/13 07:00:35 INFO Master: > akka.tcp://sparkwor...@ip-10-0-175-214.us-west-2.compute.internal:42918 got > disassociated, removing it. > 14/10/13 07:00:35 INFO Master: Removing worker > worker-20141013064354-ip-10-0-175-214.us-west-2.compute.internal-42918 on > ip-10-0-175-214.us-west-2.compute.internal:42918 > 14/10/13 07:00:35 INFO Master: Telling app of lost executor: 1 > 14/10/13 07:00:35 INFO Master: > akka.tcp://sparkwor...@ip-10-0-175-214.us-west-2.compute.internal:42918 got > disassociated, removing it. > 14/10/13 07:00:35 WARN ReliableDeliverySupervisor: Association with remote > system > [akka.tcp://sparkwor...@ip-10-0-175-214.us-west-2.compute.internal:42918] has > failed, address is now gated for [5000] ms. Reason is: [Disassociated]. > 14/10/13 07:00:35 INFO LocalActorRef: Message > [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from > Actor[akka://sparkMaster/deadLetters] to > Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.214%3A35958-2#314633324] > was not delivered. [3] dead letters encountered. This logging can be turned > off or adjusted with configuration settings 'akka.log-dead-letters' and > 'akka.log-dead-letters-during-shutdown'. 
> 14/10/13 07:00:35 INFO LocalActorRef: Message > [akka.remote.transport.AssociationHandle$Disassociated] from > Actor[akka://sparkMaster/deadLetters] to > Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.214%3A35958-2#314633324] > was not delivered. [4] dead letters encountered. This logging can be turned > off or adjusted with configuration settings 'akka.log-dead-letters' and > 'akka.log-dead-letters-during-shutdown'. > 14/10/13 07:00:36 INFO ProtocolStateActor: No response from remote. Handshake > timed out or transport failure detector triggered. > 14/10/13 07:00:36 INFO Master: > akka.tcp://sparkdri...@ip-10-0-175-215.us-west-2.compute.internal:58259 got > disassociated, removing it. > 14/10/13 07:00:36 INFO LocalActorRef: Message > [akka.remote.transport.AssociationHandle$InboundPayload] from > Actor[akka://sparkMaster/deadLetters] to > Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.215%3A41987-3#194
[jira] [Created] (SPARK-3923) All Standalone Mode services time out with each other
Aaron Davidson created SPARK-3923: - Summary: All Standalone Mode services time out with each other Key: SPARK-3923 URL: https://issues.apache.org/jira/browse/SPARK-3923 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.2.0 Reporter: Aaron Davidson Priority: Blocker I'm seeing an issue where it seems that components in Standalone Mode (Worker, Master, Driver, and Executor) all seem to time out with each other after around 1000 seconds. Here is an example log: {code} 14/10/13 06:43:55 INFO Master: Registering worker ip-10-0-147-189.us-west-2.compute.internal:38922 with 4 cores, 29.0 GB RAM 14/10/13 06:43:55 INFO Master: Registering worker ip-10-0-175-214.us-west-2.compute.internal:42918 with 4 cores, 59.0 GB RAM 14/10/13 06:43:56 INFO Master: Registering app Databricks Shell 14/10/13 06:43:56 INFO Master: Registered app Databricks Shell with ID app-20141013064356- ... precisely 1000 seconds later ... 14/10/13 07:00:35 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkwor...@ip-10-0-147-189.us-west-2.compute.internal:38922] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 14/10/13 07:00:35 INFO Master: akka.tcp://sparkwor...@ip-10-0-147-189.us-west-2.compute.internal:38922 got disassociated, removing it. 14/10/13 07:00:35 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.147.189%3A54956-1#1529980245] was not delivered. [2] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. 14/10/13 07:00:35 INFO Master: akka.tcp://sparkwor...@ip-10-0-175-214.us-west-2.compute.internal:42918 got disassociated, removing it. 
14/10/13 07:00:35 INFO Master: Removing worker worker-20141013064354-ip-10-0-175-214.us-west-2.compute.internal-42918 on ip-10-0-175-214.us-west-2.compute.internal:42918 14/10/13 07:00:35 INFO Master: Telling app of lost executor: 1 14/10/13 07:00:35 INFO Master: akka.tcp://sparkwor...@ip-10-0-175-214.us-west-2.compute.internal:42918 got disassociated, removing it. 14/10/13 07:00:35 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkwor...@ip-10-0-175-214.us-west-2.compute.internal:42918] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 14/10/13 07:00:35 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.214%3A35958-2#314633324] was not delivered. [3] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. 14/10/13 07:00:35 INFO LocalActorRef: Message [akka.remote.transport.AssociationHandle$Disassociated] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.214%3A35958-2#314633324] was not delivered. [4] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. 14/10/13 07:00:36 INFO ProtocolStateActor: No response from remote. Handshake timed out or transport failure detector triggered. 14/10/13 07:00:36 INFO Master: akka.tcp://sparkdri...@ip-10-0-175-215.us-west-2.compute.internal:58259 got disassociated, removing it. 
14/10/13 07:00:36 INFO LocalActorRef: Message [akka.remote.transport.AssociationHandle$InboundPayload] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.215%3A41987-3#1944377249] was not delivered. [5] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'. 14/10/13 07:00:36 INFO Master: Removing app app-20141013064356- 14/10/13 07:00:36 WARN ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkdri...@ip-10-0-175-215.us-west-2.compute.internal:58259] has failed, address is now gated for [5000] ms. Reason is: [Disassociated]. 14/10/13 07:00:36 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2
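The Akka-thread advice referenced in the comment above (keep the heartbeat interval below the acceptable pause) can be expressed as a configuration sketch. These are the Spark 1.x Akka transport properties; the values are illustrative only and deployment-dependent, not a recommended fix for this issue:

```properties
# Illustrative spark-defaults.conf sketch for Spark's Akka-based 1.x transport.
# Per the linked Akka discussion, the heartbeat interval should be well below
# the acceptable pause, unlike the 1000-second interval discussed in this issue.
# Values are in seconds and are examples, not tuned recommendations.
spark.akka.heartbeat.interval 100
spark.akka.heartbeat.pauses 6000
```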
[jira] [Updated] (SPARK-3921) WorkerWatcher in Standalone mode fails to come up due to invalid workerUrl
[ https://issues.apache.org/jira/browse/SPARK-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson updated SPARK-3921: -- Description: As of [this commit|https://github.com/apache/spark/commit/79e45c9323455a51f25ed9acd0edd8682b4bbb88#diff-79391110e9f26657e415aa169a004998R153], standalone mode appears to have lost its WorkerWatcher, because of the swapped workerUrl and appId parameters. We still put workerUrl before appId when we start standalone executors, and the Executor misinterprets the appId as the workerUrl and fails to create the WorkerWatcher. Note that this does not seem to crash the Standalone executor mode, despite the failing of the WorkerWatcher during its constructor. was:As of [this commit|https://github.com/apache/spark/commit/79e45c9323455a51f25ed9acd0edd8682b4bbb88#diff-79391110e9f26657e415aa169a004998R153], standalone mode appears to be broken, because of the swapped workerUrl and appId parameters. We still put workerUrl before appId when we start standalone executors, and the Executor misinterprets the appId as the workerUrl and fails to create the WorkerWatcher. Summary: WorkerWatcher in Standalone mode fails to come up due to invalid workerUrl (was: Executors in Standalone mode fail to come up due to invalid workerUrl) > WorkerWatcher in Standalone mode fails to come up due to invalid workerUrl > - > > Key: SPARK-3921 > URL: https://issues.apache.org/jira/browse/SPARK-3921 > Project: Spark > Issue Type: Bug >Reporter: Aaron Davidson >Assignee: Aaron Davidson >Priority: Critical > > As of [this > commit|https://github.com/apache/spark/commit/79e45c9323455a51f25ed9acd0edd8682b4bbb88#diff-79391110e9f26657e415aa169a004998R153], > standalone mode appears to have lost its WorkerWatcher, because of the > swapped workerUrl and appId parameters. We still put workerUrl before appId > when we start standalone executors, and the Executor misinterprets the appId > as the workerUrl and fails to create the WorkerWatcher. 
> Note that this does not seem to crash the Standalone executor mode, despite > the failing of the WorkerWatcher during its constructor. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3921) Executors in Standalone mode fail to come up due to invalid workerUrl
Aaron Davidson created SPARK-3921: - Summary: Executors in Standalone mode fail to come up due to invalid workerUrl Key: SPARK-3921 URL: https://issues.apache.org/jira/browse/SPARK-3921 Project: Spark Issue Type: Bug Reporter: Aaron Davidson Assignee: Aaron Davidson Priority: Critical As of [this commit|https://github.com/apache/spark/commit/79e45c9323455a51f25ed9acd0edd8682b4bbb88#diff-79391110e9f26657e415aa169a004998R153], standalone mode appears to be broken, because of the swapped workerUrl and appId parameters. We still put workerUrl before appId when we start standalone executors, and the Executor misinterprets the appId as the workerUrl and fails to create the WorkerWatcher. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
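The failure mode described in this issue -- two adjacent String parameters whose arguments are passed in the wrong order -- type-checks fine and only surfaces at runtime when one side validates its input. A minimal sketch with hypothetical names (not Spark's actual executor-backend signature):

```java
// Hypothetical sketch, not Spark's actual classes: swapping two adjacent
// String arguments still compiles, and the bug only shows up when one side
// validates its input, as a WorkerWatcher-style check does with workerUrl.
public class WatcherSketch {
    // Returns an error message for an invalid URL, mimicking the
    // WorkerWatcher failing during its constructor.
    public static String connect(String workerUrl, String appId) {
        if (!workerUrl.startsWith("akka.tcp://")) {
            return "invalid worker URL: " + workerUrl;
        }
        return "watching " + workerUrl;
    }
}
```

Calling `connect(appId, workerUrl)` with the arguments reversed compiles cleanly but fails the runtime check, matching the symptom in the description.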
[jira] [Commented] (SPARK-3889) JVM dies with SIGBUS, resulting in ConnectionManager failed ACK
[ https://issues.apache.org/jira/browse/SPARK-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167647#comment-14167647 ] Aaron Davidson commented on SPARK-3889: --- Sorry, it was not linked: https://github.com/apache/spark/pull/2742 > JVM dies with SIGBUS, resulting in ConnectionManager failed ACK > --- > > Key: SPARK-3889 > URL: https://issues.apache.org/jira/browse/SPARK-3889 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Aaron Davidson >Assignee: Aaron Davidson >Priority: Critical > Fix For: 1.2.0 > > > Here's the first part of the core dump, possibly caused by a job which > shuffles a lot of very small partitions. > {code} > # > # A fatal error has been detected by the Java Runtime Environment: > # > # SIGBUS (0x7) at pc=0x7fa5885fcdb0, pid=488, tid=140343502632704 > # > # JRE version: 7.0_25-b30 > # Java VM: OpenJDK 64-Bit Server VM (23.7-b01 mixed mode linux-amd64 > compressed oops) > # Problematic frame: > # v ~StubRoutines::jbyte_disjoint_arraycopy > # > # Failed to write core dump. Core dumps have been disabled. To enable core > dumping, try "ulimit -c unlimited" before starting Java again > # > # If you would like to submit a bug report, please include > # instructions on how to reproduce the bug and visit: > # https://bugs.launchpad.net/ubuntu/+source/openjdk-7/ > # > --- T H R E A D --- > Current thread (0x7fa4b0631000): JavaThread "Executor task launch > worker-170" daemon [_thread_in_Java, id=6783, > stack(0x7fa4448ef000,0x7fa4449f)] > siginfo:si_signo=SIGBUS: si_errno=0, si_code=2 (BUS_ADRERR), > si_addr=0x7fa428f79000 > {code} > Here is the only useful content I can find related to JVM and SIGBUS from > Google: https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=976664 > It appears it may be related to disposing byte buffers, which we do in the > ConnectionManager -- we mmap shuffle files via ManagedBuffer and dispose of > them in BufferMessage. 
[jira] [Updated] (SPARK-3889) JVM dies with SIGBUS, resulting in ConnectionManager failed ACK
[ https://issues.apache.org/jira/browse/SPARK-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson updated SPARK-3889: -- Description: Here's the first part of the core dump, possibly caused by a job which shuffles a lot of very small partitions. {code} # # A fatal error has been detected by the Java Runtime Environment: # # SIGBUS (0x7) at pc=0x7fa5885fcdb0, pid=488, tid=140343502632704 # # JRE version: 7.0_25-b30 # Java VM: OpenJDK 64-Bit Server VM (23.7-b01 mixed mode linux-amd64 compressed oops) # Problematic frame: # v ~StubRoutines::jbyte_disjoint_arraycopy # # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again # # If you would like to submit a bug report, please include # instructions on how to reproduce the bug and visit: # https://bugs.launchpad.net/ubuntu/+source/openjdk-7/ # --- T H R E A D --- Current thread (0x7fa4b0631000): JavaThread "Executor task launch worker-170" daemon [_thread_in_Java, id=6783, stack(0x7fa4448ef000,0x7fa4449f)] siginfo:si_signo=SIGBUS: si_errno=0, si_code=2 (BUS_ADRERR), si_addr=0x7fa428f79000 {code} Here is the only useful content I can find related to JVM and SIGBUS from Google: https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=976664 It appears it may be related to disposing byte buffers, which we do in the ConnectionManager -- we mmap shuffle files via ManagedBuffer and dispose of them in BufferMessage. was: Here's the first part of the core dump: {code} # # A fatal error has been detected by the Java Runtime Environment: # # SIGBUS (0x7) at pc=0x7fa5885fcdb0, pid=488, tid=140343502632704 # # JRE version: 7.0_25-b30 # Java VM: OpenJDK 64-Bit Server VM (23.7-b01 mixed mode linux-amd64 compressed oops) # Problematic frame: # v ~StubRoutines::jbyte_disjoint_arraycopy # # Failed to write core dump. Core dumps have been disabled. 
To enable core dumping, try "ulimit -c unlimited" before starting Java again # # If you would like to submit a bug report, please include # instructions on how to reproduce the bug and visit: # https://bugs.launchpad.net/ubuntu/+source/openjdk-7/ # --- T H R E A D --- Current thread (0x7fa4b0631000): JavaThread "Executor task launch worker-170" daemon [_thread_in_Java, id=6783, stack(0x7fa4448ef000,0x7fa4449f)] siginfo:si_signo=SIGBUS: si_errno=0, si_code=2 (BUS_ADRERR), si_addr=0x7fa428f79000 {code} Here is the only useful content I can find related to JVM and SIGBUS from Google: https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=976664 It appears it may be related to disposing byte buffers, which we do in the ConnectionManager -- we mmap shuffle files via ManagedBuffer and dispose of them in BufferMessage. > JVM dies with SIGBUS, resulting in ConnectionManager failed ACK > --- > > Key: SPARK-3889 > URL: https://issues.apache.org/jira/browse/SPARK-3889 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Aaron Davidson >Assignee: Aaron Davidson >Priority: Critical > > Here's the first part of the core dump, possibly caused by a job which > shuffles a lot of very small partitions. > {code} > # > # A fatal error has been detected by the Java Runtime Environment: > # > # SIGBUS (0x7) at pc=0x7fa5885fcdb0, pid=488, tid=140343502632704 > # > # JRE version: 7.0_25-b30 > # Java VM: OpenJDK 64-Bit Server VM (23.7-b01 mixed mode linux-amd64 > compressed oops) > # Problematic frame: > # v ~StubRoutines::jbyte_disjoint_arraycopy > # > # Failed to write core dump. Core dumps have been disabled. 
To enable core > dumping, try "ulimit -c unlimited" before starting Java again > # > # If you would like to submit a bug report, please include > # instructions on how to reproduce the bug and visit: > # https://bugs.launchpad.net/ubuntu/+source/openjdk-7/ > # > --- T H R E A D --- > Current thread (0x7fa4b0631000): JavaThread "Executor task launch > worker-170" daemon [_thread_in_Java, id=6783, > stack(0x7fa4448ef000,0x7fa4449f)] > siginfo:si_signo=SIGBUS: si_errno=0, si_code=2 (BUS_ADRERR), > si_addr=0x7fa428f79000 > {code} > Here is the only useful content I can find related to JVM and SIGBUS from > Google: https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=976664 > It appears it may be related to disposing byte buffers, which we do in the > ConnectionManager -- we mmap shuffle files via ManagedBuffer and dispose of > them in BufferMessage. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (SPARK-3889) JVM dies with SIGBUS, resulting in ConnectionManager failed ACK
Aaron Davidson created SPARK-3889: - Summary: JVM dies with SIGBUS, resulting in ConnectionManager failed ACK Key: SPARK-3889 URL: https://issues.apache.org/jira/browse/SPARK-3889 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.1.0 Reporter: Aaron Davidson Assignee: Aaron Davidson Priority: Critical Here's the first part of the core dump: {code} # # A fatal error has been detected by the Java Runtime Environment: # # SIGBUS (0x7) at pc=0x7fa5885fcdb0, pid=488, tid=140343502632704 # # JRE version: 7.0_25-b30 # Java VM: OpenJDK 64-Bit Server VM (23.7-b01 mixed mode linux-amd64 compressed oops) # Problematic frame: # v ~StubRoutines::jbyte_disjoint_arraycopy # # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again # # If you would like to submit a bug report, please include # instructions on how to reproduce the bug and visit: # https://bugs.launchpad.net/ubuntu/+source/openjdk-7/ # --- T H R E A D --- Current thread (0x7fa4b0631000): JavaThread "Executor task launch worker-170" daemon [_thread_in_Java, id=6783, stack(0x7fa4448ef000,0x7fa4449f)] siginfo:si_signo=SIGBUS: si_errno=0, si_code=2 (BUS_ADRERR), si_addr=0x7fa428f79000 {code} Here is the only useful content I can find related to JVM and SIGBUS from Google: https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=976664 It appears it may be related to disposing byte buffers, which we do in the ConnectionManager -- we mmap shuffle files via ManagedBuffer and dispose of them in BufferMessage. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
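For background on the mmap-and-dispose pattern the description refers to: shuffle files are memory-mapped via FileChannel.map, and a mapping that is explicitly unmapped (or whose backing file shrinks) while a reader still holds the buffer faults at the OS level, killing the JVM with SIGBUS rather than throwing a Java exception. A sketch of the mapping side only (illustrative, not Spark's ManagedBuffer code):

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;

// Illustrative sketch of memory-mapping a file, the mechanism behind
// ManagedBuffer's shuffle-file reads. The mapped buffer outlives the
// channel; accessing it is only safe while the mapping and backing file
// remain intact -- after an explicit dispose/unmap, access is a SIGBUS.
public class MmapSketch {
    public static byte firstByte(Path file) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r");
             FileChannel ch = raf.getChannel()) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            return buf.get(0);  // safe here; unsafe after unmap or truncation
        }
    }
}
```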
[jira] [Updated] (SPARK-3796) Create shuffle service for external block storage
[ https://issues.apache.org/jira/browse/SPARK-3796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson updated SPARK-3796: -- Description: This task will be broken up into two parts -- the first is to refactor our internal shuffle service to use a BlockTransferService, which we can easily extract out into its own service; the second is to actually do the extraction. Here is the design document for the low-level service, nicknamed "Sluice", on top of which will be Spark's BlockTransferService API: https://docs.google.com/document/d/1zKf3qloBu3dmv2AFyQTwEpumWRPUT5bcAUKB5PGNfx0 > Create shuffle service for external block storage > - > > Key: SPARK-3796 > URL: https://issues.apache.org/jira/browse/SPARK-3796 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Reporter: Patrick Wendell >Assignee: Aaron Davidson > > This task will be broken up into two parts -- the first is to refactor > our internal shuffle service to use a BlockTransferService, which we can > easily extract out into its own service; the second is to actually > do the extraction. > Here is the design document for the low-level service, nicknamed "Sluice", on > top of which will be Spark's BlockTransferService API: > https://docs.google.com/document/d/1zKf3qloBu3dmv2AFyQTwEpumWRPUT5bcAUKB5PGNfx0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3805) Enable Standalone worker cleanup by default
[ https://issues.apache.org/jira/browse/SPARK-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164485#comment-14164485 ] Aaron Davidson commented on SPARK-3805: --- Upon further thought, I also agree with Patrick. My initial inclination was to go back to the original behavior after the issue was resolved, but it seems more dangerous to delete user data than to not delete it. A user who sees an accumulation of state will look for a way to remove it (and find this feature), but any user who needs to find now-deleted data is going to be irreversibly sad. > Enable Standalone worker cleanup by default > --- > > Key: SPARK-3805 > URL: https://issues.apache.org/jira/browse/SPARK-3805 > Project: Spark > Issue Type: Task > Components: Deploy >Reporter: Andrew Ash > > Now that SPARK-1860 is fixed we should be able to turn on > {{spark.worker.cleanup.enabled}} by default -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
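The setting under discussion stays opt-in, for the reason given above: deleting user data is worse than leaking it. A configuration sketch of the relevant Spark 1.x properties (values illustrative):

```properties
# Sketch of the Standalone worker cleanup settings (Spark 1.x).
# Values are illustrative. The TTL below is 7 days (7 * 24 * 3600 seconds),
# matching the default window discussed in SPARK-1860.
spark.worker.cleanup.enabled true
spark.worker.cleanup.interval 1800
spark.worker.cleanup.appDataTtl 604800
```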
[jira] [Commented] (SPARK-1860) Standalone Worker cleanup should not clean up running executors
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14159343#comment-14159343 ] Aaron Davidson commented on SPARK-1860: --- Agreed, that sounds good. Would you or [~mccheah] be able to create a quick PR for this? > Standalone Worker cleanup should not clean up running executors > --- > > Key: SPARK-1860 > URL: https://issues.apache.org/jira/browse/SPARK-1860 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.0.0 >Reporter: Aaron Davidson >Priority: Blocker > > The default values of the standalone worker cleanup code cleanup all > application data every 7 days. This includes jars that were added to any > executors that happen to be running for longer than 7 days, hitting streaming > jobs especially hard. > Executor's log/data folders should not be cleaned up if they're still > running. Until then, this behavior should not be enabled by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1860) Standalone Worker cleanup should not clean up running executors
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson resolved SPARK-1860. --- Resolution: Fixed Fixed by mccheah in https://github.com/apache/spark/pull/2609 > Standalone Worker cleanup should not clean up running executors > --- > > Key: SPARK-1860 > URL: https://issues.apache.org/jira/browse/SPARK-1860 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.0.0 >Reporter: Aaron Davidson >Priority: Blocker > > The default values of the standalone worker cleanup code cleanup all > application data every 7 days. This includes jars that were added to any > executors that happen to be running for longer than 7 days, hitting streaming > jobs especially hard. > Executor's log/data folders should not be cleaned up if they're still > running. Until then, this behavior should not be enabled by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1860) Standalone Worker cleanup should not clean up running executors
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14155303#comment-14155303 ] Aaron Davidson commented on SPARK-1860: --- The Worker itself is solely a Standalone mode construct, so AFAIK, this is not an issue. > Standalone Worker cleanup should not clean up running executors > --- > > Key: SPARK-1860 > URL: https://issues.apache.org/jira/browse/SPARK-1860 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.0.0 >Reporter: Aaron Davidson >Priority: Blocker > > The default values of the standalone worker cleanup code cleanup all > application data every 7 days. This includes jars that were added to any > executors that happen to be running for longer than 7 days, hitting streaming > jobs especially hard. > Executor's log/data folders should not be cleaned up if they're still > running. Until then, this behavior should not be enabled by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1860) Standalone Worker cleanup should not clean up running executors
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14154228#comment-14154228 ] Aaron Davidson commented on SPARK-1860: --- Your logic SGTM, but I would add one additional check to avoid deleting the directory for an Application which still has running Executors on that node, just to make absolutely sure that we don't delete app directories that just happen to sit idle for a while. This check can be performed by iterating over the "executors" map in Worker.scala and matching the appId with the app directory's name. > Standalone Worker cleanup should not clean up running executors > --- > > Key: SPARK-1860 > URL: https://issues.apache.org/jira/browse/SPARK-1860 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.0.0 >Reporter: Aaron Davidson >Priority: Blocker > > The default values of the standalone worker cleanup code cleanup all > application data every 7 days. This includes jars that were added to any > executors that happen to be running for longer than 7 days, hitting streaming > jobs especially hard. > Executor's log/data folders should not be cleaned up if they're still > running. Until then, this behavior should not be enabled by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
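The additional check proposed in the comment above -- skip an aged application directory if any running executor still belongs to that app -- can be sketched as follows. The names are hypothetical, not Worker.scala's exact fields; it only illustrates matching appIds from an executors map against directory names:

```java
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative sketch of the proposed cleanup guard: before deleting an
// aged application directory, skip it if any running executor's appId
// matches the directory's name.
public class CleanupGuard {
    // dirName is e.g. "app-20141013064356-0000"; executors mirrors a
    // Worker-style map keyed by "appId/execId".
    public static boolean safeToDelete(String dirName, Map<String, String> executors) {
        Set<String> runningApps = executors.keySet().stream()
                .map(k -> k.split("/")[0])   // strip the executor id
                .collect(Collectors.toSet());
        return !runningApps.contains(dirName);
    }
}
```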
[jira] [Commented] (SPARK-1860) Standalone Worker cleanup should not clean up running executors
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152839#comment-14152839 ] Aaron Davidson commented on SPARK-1860: --- The Executor could clean up its own jars when it terminates normally, that seems fine. The impact of this seems limited, though, and it's a good idea to limit the scope of shutdown hooks as much as possible. There are three classes of things to delete: 1. Shuffle files / block manager blocks -- large -- deleted by graceful Executor termination. Can be deleted immediately. 2. Uploaded jars / files -- usually small -- deleted by Worker cleanup. Can be deleted immediately. 3. Logs -- small to medium -- deleted by Worker cleanup. Should not be deleted immediately. Number 1 is most critical in terms of impact on the system. Numbers 2 and 3 are of the same order of magnitude in size, so cleaning up 2 and not 3 is not expected to improve the system's stability by more than a factor of ~2x applications. Note that the intentions of this particular JIRA are very simple: cleanup 2 and 3 for all executors several days after they have terminated, rather than after they have started. If you wish to expand the scope of the Worker or Executor cleanup, that should be covered in a separate JIRA (which is welcome -- I just want to make sure we're on the same page about this particular issue!). > Standalone Worker cleanup should not clean up running executors > --- > > Key: SPARK-1860 > URL: https://issues.apache.org/jira/browse/SPARK-1860 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.0.0 >Reporter: Aaron Davidson >Priority: Blocker > > The default values of the standalone worker cleanup code cleanup all > application data every 7 days. This includes jars that were added to any > executors that happen to be running for longer than 7 days, hitting streaming > jobs especially hard. 
> Executor's log/data folders should not be cleaned up if they're still > running. Until then, this behavior should not be enabled by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1860) Standalone Worker cleanup should not clean up running executors
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152744#comment-14152744 ] Aaron Davidson commented on SPARK-1860: --- Note that there are two separate forms of cleanup: application data cleanup (jars and logs) and shuffle data cleanup. Standalone Worker cleanup deals with the former, Executor termination handlers deal with the latter. The purpose is not to deal with executors that have terminated ungracefully, but to actually clean up old application directories. Here the idea is that a Worker may be running for a very long time (weeks, months) and over time accumulates hundreds of application directories. We want to delete these directories after several days of them being terminated (today we'll clean them up whether or not they're terminated, which loses their jars and logs), after which we presumably don't care anymore. We do not want to clean them up immediately after application termination. The Worker performing shuffle data cleanup for ungracefully terminated Executors is not a bad idea, but is a (smallish) feature onto itself, as the Worker does not currently know where a particular Executor is storing its data. > Standalone Worker cleanup should not clean up running executors > --- > > Key: SPARK-1860 > URL: https://issues.apache.org/jira/browse/SPARK-1860 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.0.0 >Reporter: Aaron Davidson >Priority: Blocker > > The default values of the standalone worker cleanup code cleanup all > application data every 7 days. This includes jars that were added to any > executors that happen to be running for longer than 7 days, hitting streaming > jobs especially hard. > Executor's log/data folders should not be cleaned up if they're still > running. Until then, this behavior should not be enabled by default. 
[jira] [Updated] (SPARK-2973) Add a way to show tables without executing a job
[ https://issues.apache.org/jira/browse/SPARK-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson updated SPARK-2973: -- Assignee: Michael Armbrust (was: Cheng Lian) > Add a way to show tables without executing a job > > > Key: SPARK-2973 > URL: https://issues.apache.org/jira/browse/SPARK-2973 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Aaron Davidson >Assignee: Michael Armbrust >Priority: Critical > Fix For: 1.2.0 > > > Right now, sql("show tables").collect() will start a Spark job which shows up > in the UI. There should be a way to get these without this step. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-2973) Add a way to show tables without executing a job
[ https://issues.apache.org/jira/browse/SPARK-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson reopened SPARK-2973: --- Reopening this because sql("show tables").take(1) still starts a job. > Add a way to show tables without executing a job > > > Key: SPARK-2973 > URL: https://issues.apache.org/jira/browse/SPARK-2973 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Aaron Davidson >Assignee: Cheng Lian >Priority: Critical > Fix For: 1.2.0 > > > Right now, sql("show tables").collect() will start a Spark job which shows up > in the UI. There should be a way to get these without this step. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
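The intent behind this issue is that a metadata-only command should be answered from the catalog's local state, so that neither collect() nor take(1) on its result schedules a cluster job. A generic sketch of that idea (illustrative names, not Spark SQL's API):

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Generic sketch: a SHOW TABLES-style command is served straight from
// local catalog state, with no distributed work involved, regardless of
// how much of the result the caller consumes.
public class CatalogSketch {
    private final Set<String> tables = new TreeSet<>();  // sorted names

    public void register(String name) { tables.add(name); }

    public List<String> showTables() { return List.copyOf(tables); }
}
```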
[jira] [Commented] (SPARK-3267) Deadlock between ScalaReflectionLock and Data type initialization
[ https://issues.apache.org/jira/browse/SPARK-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146039#comment-14146039 ] Aaron Davidson commented on SPARK-3267: --- I don't have it anymore, unfortunately. Michael and I did a little digging at the time, and I think we found the reason for the deadlock, shown in the stack traces above, but decided it was a very unlikely scenario. Indeed, the query did not consistently deadlock; this only occurred a single time. > Deadlock between ScalaReflectionLock and Data type initialization > - > > Key: SPARK-3267 > URL: https://issues.apache.org/jira/browse/SPARK-3267 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.1.0 >Reporter: Aaron Davidson >Priority: Critical > > Deadlock here: > {code} > "Executor task launch worker-0" daemon prio=10 tid=0x7fab50036000 > nid=0x27a in Object.wait() [0x7fab60c2e000] > java.lang.Thread.State: RUNNABLE > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.defaultPrimitive(CodeGenerator.scala:565) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:202) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:195) > at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218) > at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:493) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$Evaluate2$2.evaluateAs(CodeGenerator.scala:175) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:304) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:195) > at 
scala.PartialFunction$Lifted.apply(PartialFunction.scala:218) > at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:493) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:314) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:195) > at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218) > at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:493) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:313) > at > org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:195) > at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218) > at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214) > ... > {code} > and > {code} > "Executor task launch worker-2" daemon prio=10 tid=0x7fab100f0800 > nid=0x27e in Object.wait() [0x7fab0eeec000] > java.lang.Thread.State: RUNNABLE > at > org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:250) > - locked <0x00064e5d9a48> (a > org.apache.spark.sql.catalyst.expressions.Cast) > at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247) > at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263) > at > org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2$$anonfun$6.apply(ParquetTableOperations.scala:139) > at > org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2$$anonfun$6.apply(ParquetTableOperations.scala:139) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > scala.collection.TraversableLike$class.map(TraversableLike.scala:244) > at scala.collection.AbstractTraversable.map(Traversable.scala:105) > at > org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2.apply(ParquetTableOperations.scala:139) > at > org.apache.spark.sql.parquet.ParquetTableScan$$anonfu
[jira] [Commented] (SPARK-3032) Potential bug when running sort-based shuffle with sorting using TimSort
[ https://issues.apache.org/jira/browse/SPARK-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143792#comment-14143792 ] Aaron Davidson commented on SPARK-3032: --- [~matei] any thoughts on this issue? > Potential bug when running sort-based shuffle with sorting using TimSort > > > Key: SPARK-3032 > URL: https://issues.apache.org/jira/browse/SPARK-3032 > Project: Spark > Issue Type: Bug > Components: Shuffle >Affects Versions: 1.1.0 >Reporter: Saisai Shao > > When using SparkPerf's aggregate-by-key workload to test sort-based shuffle, > data type for key and value is (String, String), always meet this issue: > {noformat} > java.lang.IllegalArgumentException: Comparison method violates its general > contract! > at > org.apache.spark.util.collection.Sorter$SortState.mergeLo(Sorter.java:755) > at > org.apache.spark.util.collection.Sorter$SortState.mergeAt(Sorter.java:493) > at > org.apache.spark.util.collection.Sorter$SortState.mergeCollapse(Sorter.java:420) > at > org.apache.spark.util.collection.Sorter$SortState.access$200(Sorter.java:294) > at org.apache.spark.util.collection.Sorter.sort(Sorter.java:128) > at > org.apache.spark.util.collection.SizeTrackingPairBuffer.destructiveSortedIterator(SizeTrackingPairBuffer.scala:83) > at > org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:323) > at > org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:271) > at > org.apache.spark.util.collection.ExternalSorter.maybeSpill(ExternalSorter.scala:249) > at > org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:220) > at > org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:85) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) > at org.apache.spark.scheduler.Task.run(Task.scala:54) > at > 
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) > at java.lang.Thread.run(Thread.java:722) > {noformat} > It seems the current partitionKeyComparator, which uses the hashcode of the > String as the key comparator, breaks the sorting contract. > I also tested using Int as the key data type, and that passes the test, since > the hashcode of an Int is the Int itself. So I think the combination of partitionDiff + the hashcode > of the String may potentially break the sorting contract. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
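The root cause in this report needed its own investigation, but the quoted "Comparison method violates its general contract!" is TimSort detecting a comparator that breaks its requirements, and a classic way integer-based (e.g. hashcode-based) comparisons go wrong is subtraction overflow. A minimal illustration — this is not Spark's actual partitionKeyComparator, just the failure mode:

```java
import java.util.Comparator;

public class ComparatorContract {
    // Tempting but broken: the subtraction overflows when the two values
    // (e.g. two hash codes) have opposite signs and large magnitudes.
    static final Comparator<Integer> BY_DIFFERENCE = (a, b) -> a - b;

    // Correct: Integer.compare is overflow-safe.
    static final Comparator<Integer> BY_COMPARE = Integer::compare;
}
```

With inputs like -2000000000 and 2000000000, `BY_DIFFERENCE` reports the wrong sign, which is exactly the kind of intransitive ordering TimSort rejects at merge time.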
[jira] [Commented] (SPARK-3333) Large number of partitions causes OOM
[ https://issues.apache.org/jira/browse/SPARK-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117903#comment-14117903 ] Aaron Davidson commented on SPARK-3333: --- Anyone have a jmap on the driver during this time? Perhaps also a GC log? To be clear, a "jmap -histo:live" and a "jmap -histo" should be sufficient, rather than a full heap dump. I wonder if something changed with the MapOutputTracker, since that state grows with O(M * R), which is extremely high in this situation. > Large number of partitions causes OOM > - > > Key: SPARK-3333 > URL: https://issues.apache.org/jira/browse/SPARK-3333 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 1.1.0 > Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances >Reporter: Nicholas Chammas > > Here’s a repro for PySpark: > {code} > a = sc.parallelize(["Nick", "John", "Bob"]) > a = a.repartition(24000) > a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1) > {code} > This code runs fine on 1.0.2. It returns the following result in just over a > minute: > {code} > [(4, 'NickJohn')] > {code} > However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it > runs for a very, very long time (at least > 45 min) and then fails with > {{java.lang.OutOfMemoryError: Java heap space}}. 
> Here is a stack trace taken from a run on 1.1.0-rc2: > {code} > >>> a = sc.parallelize(["Nick", "John", "Bob"]) > >>> a = a.repartition(24000) > >>> a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1) > 14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent > heart beats: 175143ms exceeds 45000ms > 14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent > heart beats: 175359ms exceeds 45000ms > 14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent > heart beats: 173061ms exceeds 45000ms > 14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent > heart beats: 176816ms exceeds 45000ms > 14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent > heart beats: 182241ms exceeds 45000ms > 14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager > BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent > heart beats: 178406ms exceeds 45000ms > 14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver > thread-3 > java.lang.OutOfMemoryError: Java heap space > at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296) > at > com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35) > at > com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18) > at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699) > at > com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611) > at > 
com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221) > at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729) > at > org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162) > at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79) > at > org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514) > at > org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47) > at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311) > at > org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) > at java.lang.Thread.run(Thread.java:745) > Exception in thread "Result resolver thread-3" 14/08/29 21:56:26 ERROR > SendingConnection: Exception while reading SendingConnection to > ConnectionManagerId(ip-10-73-142-
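A quick back-of-envelope on the O(M * R) point above: if the tracker keeps one entry per (map task, reducer) pair — an assumption for illustration, not a statement about the exact data structure — then repartition(24000) feeding a 24000-partition reduce implies over half a billion entries on the driver:

```java
public class MapOutputState {
    // Driver-side shuffle metadata assumed to grow as one entry per
    // (map task, reducer) pair, i.e. M * R entries overall.
    static long entries(long mapTasks, long reduceTasks) {
        return mapTasks * reduceTasks;
    }
}
```

Even at a single byte per entry that is hundreds of megabytes of metadata, which is consistent with a driver-side OutOfMemoryError at this partition count.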
[jira] [Created] (SPARK-3267) Deadlock between ScalaReflectionLock and Data type initialization
Aaron Davidson created SPARK-3267: - Summary: Deadlock between ScalaReflectionLock and Data type initialization Key: SPARK-3267 URL: https://issues.apache.org/jira/browse/SPARK-3267 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Aaron Davidson Deadlock here: {code} "Executor task launch worker-0" daemon prio=10 tid=0x7fab50036000 nid=0x27a in Object.wait() [0x7fab60c2e000] java.lang.Thread.State: RUNNABLE at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.defaultPrimitive(CodeGenerator.scala:565) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:202) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:195) at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218) at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:493) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$Evaluate2$2.evaluateAs(CodeGenerator.scala:175) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:304) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:195) at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218) at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:493) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:314) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:195) at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218) at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214) at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:493) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:313) at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scala:195) at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218) at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214) ... {code} and {code} "Executor task launch worker-2" daemon prio=10 tid=0x7fab100f0800 nid=0x27e in Object.wait() [0x7fab0eeec000] java.lang.Thread.State: RUNNABLE at org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:250) - locked <0x00064e5d9a48> (a org.apache.spark.sql.catalyst.expressions.Cast) at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247) at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263) at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2$$anonfun$6.apply(ParquetTableOperations.scala:139) at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2$$anonfun$6.apply(ParquetTableOperations.scala:139) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2.apply(ParquetTableOperations.scala:139) at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2.apply(ParquetTableOperations.scala:126) at org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.compute(NewHadoopRDD.scala:197) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262) at org.apache.spark.rdd.RDD.iterator(RDD.scala:229) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Ta
[jira] [Created] (SPARK-3236) Reading Parquet tables from Metastore mangles location
Aaron Davidson created SPARK-3236: - Summary: Reading Parquet tables from Metastore mangles location Key: SPARK-3236 URL: https://issues.apache.org/jira/browse/SPARK-3236 Project: Spark Issue Type: Bug Components: SQL Reporter: Aaron Davidson Assignee: Aaron Davidson Currently we do "relation.hiveQlTable.getDataLocation.getPath", which returns the path-part of the URI (e.g., s3n://my-bucket/my-path => /my-path). We should do "relation.hiveQlTable.getDataLocation.toString" instead, as a URI's toString returns a faithful representation of the full URI, which can later be passed into a Hadoop Path. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
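The getPath-versus-toString distinction described above is plain java.net.URI behavior: getPath() drops the scheme and authority (here the bucket), while toString() preserves the full URI. A quick demonstration:

```java
import java.net.URI;

public class UriDemo {
    public static void main(String[] args) {
        URI location = URI.create("s3n://my-bucket/my-path");
        // getPath() keeps only the path component, losing scheme and bucket.
        System.out.println(location.getPath());
        // toString() is a faithful representation of the full URI, which can
        // later be handed to a Hadoop Path constructor intact.
        System.out.println(location.toString());
    }
}
```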
[jira] [Resolved] (SPARK-3093) masterLock in Worker is no longer needed
[ https://issues.apache.org/jira/browse/SPARK-3093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson resolved SPARK-3093. --- Resolution: Fixed Assignee: Chen Chao Target Version/s: 1.2.0 > masterLock in Worker is no longer needed > -- > > Key: SPARK-3093 > URL: https://issues.apache.org/jira/browse/SPARK-3093 > Project: Spark > Issue Type: Improvement >Reporter: Chen Chao >Assignee: Chen Chao > > there's no need to use masterLock in the Worker now, since all communication > happens within the Akka actor -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3029) Disable local execution of Spark jobs by default
Aaron Davidson created SPARK-3029: - Summary: Disable local execution of Spark jobs by default Key: SPARK-3029 URL: https://issues.apache.org/jira/browse/SPARK-3029 Project: Spark Issue Type: Improvement Reporter: Aaron Davidson Assignee: Aaron Davidson Currently, local execution of Spark jobs is only used by take(), and it can be problematic as it can load a significant amount of data onto the driver. The worst-case scenarios occur if the RDD is cached (guaranteed to load the whole partition), has very large elements, or the partition is just large and we apply a filter with high selectivity or computational overhead. Additionally, jobs that run locally in this manner do not show up in the web UI, which makes them harder to track and their behavior harder to understand. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-2973) Add a way to show tables without executing a job
Aaron Davidson created SPARK-2973: - Summary: Add a way to show tables without executing a job Key: SPARK-2973 URL: https://issues.apache.org/jira/browse/SPARK-2973 Project: Spark Issue Type: Improvement Components: SQL Reporter: Aaron Davidson Right now, sql("show tables").collect() will start a Spark job which shows up in the UI. There should be a way to get these without this step. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2936) Migrate Netty network module from Java to Scala
[ https://issues.apache.org/jira/browse/SPARK-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson resolved SPARK-2936. --- Resolution: Fixed > Migrate Netty network module from Java to Scala > --- > > Key: SPARK-2936 > URL: https://issues.apache.org/jira/browse/SPARK-2936 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 1.1.0 >Reporter: Reynold Xin >Assignee: Reynold Xin > > The netty network module was originally written when Scala 2.9.x had a bug > that prevents a pure Scala implementation, and a subset of the files were > done in Java. We have since upgraded to Scala 2.10, and can migrate all Java > files now to Scala. > https://github.com/netty/netty/issues/781 > https://github.com/mesos/spark/pull/522 -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-2949) SparkContext does not fate-share with ActorSystem
Aaron Davidson created SPARK-2949: - Summary: SparkContext does not fate-share with ActorSystem Key: SPARK-2949 URL: https://issues.apache.org/jira/browse/SPARK-2949 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Aaron Davidson It appears that an uncaught fatal error in Spark's Driver ActorSystem does not cause the SparkContext to terminate. We observed an issue in production that caused a PermGen error, but it just kept throwing this error: {code} 14/08/09 15:07:24 ERROR ActorSystemImpl: Uncaught fatal error from thread [spark-akka.actor.default-dispatcher-26] shutting down ActorSystem [spark] java.lang.OutOfMemoryError: PermGen space {code} We should probably do something similar to what we did in the DAGScheduler and ensure that we call SparkContext#stop() if the entire ActorSystem dies with a fatal error. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
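A minimal sketch of the fate-sharing guard proposed above, assuming a handler that stops the context whenever a fatal Error escapes; `FateSharing`, `isFatal`, and `install` are illustrative names, not Spark's API.

```java
public class FateSharing {
    // Mirrors the DAGScheduler-style guard: any uncaught Error
    // (OutOfMemoryError, PermGen exhaustion, ...) is treated as fatal.
    static boolean isFatal(Throwable t) {
        return t instanceof Error;
    }

    // Illustrative wiring: on a fatal uncaught error anywhere in the JVM,
    // run the shutdown hook, e.g. a call to SparkContext#stop().
    static void install(Runnable stopSparkContext) {
        Thread.setDefaultUncaughtExceptionHandler((thread, error) -> {
            if (isFatal(error)) {
                stopSparkContext.run();
            }
        });
    }
}
```

The point of the issue is exactly this wiring: the error is already logged by the ActorSystem, but nothing currently ties that fatal condition back to the SparkContext's lifecycle.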
[jira] [Created] (SPARK-2825) Allow creating external tables in metastore
Aaron Davidson created SPARK-2825: - Summary: Allow creating external tables in metastore Key: SPARK-2825 URL: https://issues.apache.org/jira/browse/SPARK-2825 Project: Spark Issue Type: Bug Components: SQL Reporter: Aaron Davidson External tables are useful for creating a metastore entry that points at pre-existing data. There is an easy workaround to use registerAsTable within SparkSQL itself, but this is only transient, and has to be re-run for every Spark session. -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-2824) Allow saving Parquet files to the HiveMetastore
Aaron Davidson created SPARK-2824: - Summary: Allow saving Parquet files to the HiveMetastore Key: SPARK-2824 URL: https://issues.apache.org/jira/browse/SPARK-2824 Project: Spark Issue Type: Bug Components: SQL Reporter: Aaron Davidson Currently, we use one code path for reading all data from the Hive Metastore, and this precludes writing or loading data as custom ParquetRelations (they are always MetastoreRelations). -- This message was sent by Atlassian JIRA (v6.2#6252) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2557) createTaskScheduler should be consistent between local and local-n-failures
[ https://issues.apache.org/jira/browse/SPARK-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson resolved SPARK-2557. --- Resolution: Fixed Fix Version/s: 1.1.0 > createTaskScheduler should be consistent between local and local-n-failures > > > Key: SPARK-2557 > URL: https://issues.apache.org/jira/browse/SPARK-2557 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Ye Xianjin >Priority: Minor > Labels: starter > Fix For: 1.1.0 > > Original Estimate: 2h > Remaining Estimate: 2h > > In SparkContext.createTaskScheduler, we can use {code}local[*]{code} to > estimate the number of cores on the machine. I think we should also be able > to use * in the local-n-failures mode. > And judging from the LOCAL_N_REGEX pattern-matching code, I > believe the regular expression of LOCAL_N_REGEX is wrong. LOCAL_N_REGEX > should be > {code} > """local\[([0-9]+|\*)\]""".r > {code} > rather than > {code} > """local\[([0-9\*]+)\]""".r > {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
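The difference between the two patterns in the report is easy to demonstrate: the existing character class happily matches malformed master strings such as local[1*2], while the proposed alternation accepts only digits or a lone asterisk. A quick check (Java regex syntax, semantically equivalent to the Scala patterns above):

```java
import java.util.regex.Pattern;

public class LocalNRegex {
    // Current (buggy) pattern: the character class [0-9\*] mixes digits and
    // '*' freely, so strings like "local[1*2]" slip through.
    static final Pattern BUGGY = Pattern.compile("local\\[([0-9\\*]+)\\]");

    // Proposed pattern: either a run of digits or exactly one asterisk.
    static final Pattern FIXED = Pattern.compile("local\\[([0-9]+|\\*)\\]");
}
```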
[jira] [Updated] (SPARK-1860) Standalone Worker cleanup should not clean up running executors
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson updated SPARK-1860: -- Summary: Standalone Worker cleanup should not clean up running executors (was: Standalone Worker cleanup should not clean up running applications) > Standalone Worker cleanup should not clean up running executors > --- > > Key: SPARK-1860 > URL: https://issues.apache.org/jira/browse/SPARK-1860 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.0.0 >Reporter: Aaron Davidson >Priority: Critical > Fix For: 1.1.0 > > > The default values of the standalone worker cleanup code cleanup all > application data every 7 days. This includes jars that were added to any > applications that happen to be running for longer than 7 days, hitting > streaming jobs especially hard. > Applications should not be cleaned up if they're still running. Until then, > this behavior should not be enabled by default. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1860) Standalone Worker cleanup should not clean up running executors
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson updated SPARK-1860: -- Description: The default values of the standalone worker cleanup code cleanup all application data every 7 days. This includes jars that were added to any executors that happen to be running for longer than 7 days, hitting streaming jobs especially hard. Executor's log/data folders should not be cleaned up if they're still running. Until then, this behavior should not be enabled by default. was: The default values of the standalone worker cleanup code cleanup all application data every 7 days. This includes jars that were added to any applications that happen to be running for longer than 7 days, hitting streaming jobs especially hard. Applications should not be cleaned up if they're still running. Until then, this behavior should not be enabled by default. > Standalone Worker cleanup should not clean up running executors > --- > > Key: SPARK-1860 > URL: https://issues.apache.org/jira/browse/SPARK-1860 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.0.0 >Reporter: Aaron Davidson >Priority: Critical > Fix For: 1.1.0 > > > The default values of the standalone worker cleanup code cleanup all > application data every 7 days. This includes jars that were added to any > executors that happen to be running for longer than 7 days, hitting streaming > jobs especially hard. > Executor's log/data folders should not be cleaned up if they're still > running. Until then, this behavior should not be enabled by default. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1860) Standalone Worker cleanup should not clean up running applications
[ https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076403#comment-14076403 ] Aaron Davidson commented on SPARK-1860: --- There's not an easy way to tell if an application is still running. However, the Worker has state about which executors are still running. This is really what I intended originally -- we must not clean up an Executor's own state from underneath it. I will change the title to reflect this intention. > Standalone Worker cleanup should not clean up running applications > -- > > Key: SPARK-1860 > URL: https://issues.apache.org/jira/browse/SPARK-1860 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.0.0 >Reporter: Aaron Davidson >Priority: Critical > Fix For: 1.1.0 > > > The default values of the standalone worker cleanup code cleanup all > application data every 7 days. This includes jars that were added to any > applications that happen to be running for longer than 7 days, hitting > streaming jobs especially hard. > Applications should not be cleaned up if they're still running. Until then, > this behavior should not be enabled by default. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2707) Upgrade to Akka 2.3
[ https://issues.apache.org/jira/browse/SPARK-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075744#comment-14075744 ] Aaron Davidson commented on SPARK-2707: --- That doesn't sound like a bad idea -- actually sounds significantly more straightforward than depending on a version of Akka that only shades the internal protobuf usage. > Upgrade to Akka 2.3 > --- > > Key: SPARK-2707 > URL: https://issues.apache.org/jira/browse/SPARK-2707 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Yardena > > Upgrade Akka from 2.2 to 2.3. We want to be able to use new Akka and Spray > features directly in the same project. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2707) Upgrade to Akka 2.3
[ https://issues.apache.org/jira/browse/SPARK-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075652#comment-14075652 ] Aaron Davidson commented on SPARK-2707: --- It does sound mostly mechanical and I believe we don't use most of those features. Perhaps just getting it to compile (while re-shading protobuf) would be sufficient to make it work. > Upgrade to Akka 2.3 > --- > > Key: SPARK-2707 > URL: https://issues.apache.org/jira/browse/SPARK-2707 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Yardena > > Upgrade Akka from 2.2 to 2.3. We want to be able to use new Akka and Spray > features directly in the same project. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2707) Upgrade to Akka 2.3
[ https://issues.apache.org/jira/browse/SPARK-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075645#comment-14075645 ] Aaron Davidson commented on SPARK-2707: --- Are there any anticipated issues in upgrading from 2.2 to 2.3? Are there source or behaviorally incompatible changes? > Upgrade to Akka 2.3 > --- > > Key: SPARK-2707 > URL: https://issues.apache.org/jira/browse/SPARK-2707 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Yardena > > Upgrade Akka from 2.2 to 2.3. We want to be able to use new Akka and Spray > features directly in the same project. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-1264) Documentation for setting heap sizes across all configurations
[ https://issues.apache.org/jira/browse/SPARK-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson updated SPARK-1264: -- Assignee: (was: Aaron Davidson) > Documentation for setting heap sizes across all configurations > -- > > Key: SPARK-1264 > URL: https://issues.apache.org/jira/browse/SPARK-1264 > Project: Spark > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.0.0 >Reporter: Andrew Ash > > As a user, there are lots of places to configure heap sizes, and it takes a > bit of trial and error to figure out how to configure what you want. > We need some more clear documentation on how set these for the cross product > of Spark components (master, worker, executor, driver, shell) and deployment > modes (Standalone, YARN, Mesos, EC2?). > I'm happy to do the authoring if someone can help pull together the relevant > details. > Here's the best I've got so far: > {noformat} > # Standalone cluster > Master - SPARK_DAEMON_MEMORY - default: 512mb > Worker - SPARK_DAEMON_MEMORY vs SPARK_WORKER_MEMORY? - default: ? See > WorkerArguments.inferDefaultMemory() > Executor - spark.executor.memory > Driver - SPARK_DRIVER_MEMORY - default: 512mb > Shell - A pre-built driver so SPARK_DRIVER_MEMORY - default: 512mb > # EC2 cluster > Master - ? > Worker - ? > Executor - ? > Driver - ? > Shell - ? > # Mesos cluster > Master - SPARK_DAEMON_MEMORY > Worker - SPARK_DAEMON_MEMORY > Executor - SPARK_EXECUTOR_MEMORY > Driver - SPARK_DRIVER_MEMORY > Shell - A pre-built driver so SPARK_DRIVER_MEMORY > # YARN cluster > Master - SPARK_MASTER_MEMORY ? > Worker - SPARK_WORKER_MEMORY ? > Executor - SPARK_EXECUTOR_MEMORY > Driver - SPARK_DRIVER_MEMORY > Shell - A pre-built driver so SPARK_DRIVER_MEMORY > {noformat} -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2660) Enable pretty-printing SchemaRDD Rows
Aaron Davidson created SPARK-2660: - Summary: Enable pretty-printing SchemaRDD Rows Key: SPARK-2660 URL: https://issues.apache.org/jira/browse/SPARK-2660 Project: Spark Issue Type: Improvement Components: SQL Reporter: Aaron Davidson Right now, "printing a Row" results in something like "[a,b,c]". It would be cool if there was a way to pretty-print them similar to JSON, into something like {code} { Col1 : a, Col2 : b, Col3 : c } {code} -- This message was sent by Atlassian JIRA (v6.2#6252)
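The requested output format can be sketched in a few lines of Python. This is a hypothetical illustration only; SchemaRDD's real Row type and its field-name access are not modeled here, so column names and values are passed in explicitly.

```python
import json

# Hypothetical helper: render a row's values against its column names as
# indented JSON, the style the issue asks for.
def pretty_row(columns, values):
    return json.dumps(dict(zip(columns, values)), indent=2)

print(pretty_row(["Col1", "Col2", "Col3"], ["a", "b", "c"]))
```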
[jira] [Commented] (SPARK-2282) PySpark crashes if too many tasks complete quickly
[ https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071192#comment-14071192 ] Aaron Davidson commented on SPARK-2282: --- [~pwendell] That would in general be the right solution, but this particular change hasn't been merged yet (referring to the second PR on this bug, which is a more complete fix). > PySpark crashes if too many tasks complete quickly > -- > > Key: SPARK-2282 > URL: https://issues.apache.org/jira/browse/SPARK-2282 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 0.9.1, 1.0.0, 1.0.1 >Reporter: Aaron Davidson >Assignee: Aaron Davidson > Fix For: 0.9.2, 1.0.0, 1.0.1 > > > Upon every task completion, PythonAccumulatorParam constructs a new socket to > the Accumulator server running inside the pyspark daemon. This can cause a > buildup of used ephemeral ports from sockets in the TIME_WAIT termination > stage, which will cause the SparkContext to crash if too many tasks complete > too quickly. We ran into this bug with 17k tasks completing in 15 seconds. > This bug can be fixed outside of Spark by ensuring these properties are set > (on a linux server); > echo "1" > /proc/sys/net/ipv4/tcp_tw_reuse > echo "1" > /proc/sys/net/ipv4/tcp_tw_recycle > or by adding the SO_REUSEADDR option to the Socket creation within Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2282) PySpark crashes if too many tasks complete quickly
[ https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14070528#comment-14070528 ] Aaron Davidson commented on SPARK-2282: --- Great to hear! These files haven't been changed since the 1.0.1 release besides this patch, so it should be fine to just drop them in. (A generally safer option would be to do a git merge, though, against Spark's refs/pull/1503/head branch.) > PySpark crashes if too many tasks complete quickly > -- > > Key: SPARK-2282 > URL: https://issues.apache.org/jira/browse/SPARK-2282 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 0.9.1, 1.0.0, 1.0.1 >Reporter: Aaron Davidson >Assignee: Aaron Davidson > Fix For: 0.9.2, 1.0.0, 1.0.1 > > > Upon every task completion, PythonAccumulatorParam constructs a new socket to > the Accumulator server running inside the pyspark daemon. This can cause a > buildup of used ephemeral ports from sockets in the TIME_WAIT termination > stage, which will cause the SparkContext to crash if too many tasks complete > too quickly. We ran into this bug with 17k tasks completing in 15 seconds. > This bug can be fixed outside of Spark by ensuring these properties are set > (on a linux server); > echo "1" > /proc/sys/net/ipv4/tcp_tw_reuse > echo "1" > /proc/sys/net/ipv4/tcp_tw_recycle > or by adding the SO_REUSEADDR option to the Socket creation within Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Comment Edited] (SPARK-1767) Prefer HDFS-cached replicas when scheduling data-local tasks
[ https://issues.apache.org/jira/browse/SPARK-1767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997761#comment-13997761 ] Aaron Davidson edited comment on SPARK-1767 at 7/21/14 7:46 PM: -One simple workaround to this is to just make sure that partitions that are in memory are ordered first in the list of partitions, as Spark will try to place executors based on the order in this list.- This is, of course, not a complete solution, as we would not utilize the locality-wait logic within Spark and would immediately fallback to a non-cached node if the cached node was busy, rather than waiting for some period of time for the cached node to become available. Edit: I was wrong about how Spark schedules partitions -- ordering is not sufficient. was (Author: ilikerps): One simple workaround to this is to just make sure that partitions that are in memory are ordered first in the list of partitions, as Spark will try to place executors based on the order in this list. This is, of course, not a complete solution, as we would not utilize the locality-wait logic within Spark and would immediately fallback to a non-cached node if the cached node was busy, rather than waiting for some period of time for the cached node to become available. > Prefer HDFS-cached replicas when scheduling data-local tasks > > > Key: SPARK-1767 > URL: https://issues.apache.org/jira/browse/SPARK-1767 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Sandy Ryza > -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2282) PySpark crashes if too many tasks complete quickly
[ https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068083#comment-14068083 ] Aaron Davidson commented on SPARK-2282: --- Hey Ken, I created [PR 1503|https://github.com/apache/spark/pull/1503] to implement the solution I mentioned. It would be great if you could try testing out this patch on your cluster. While testing, I noticed that the ephemeral ports were still growing with number of tasks due to how we launch new tasks on the PySpark daemon. However, this should only affect workers, and the rate of buildup should be divided by the number of workers. In other words, it should only ever be a problem on a very small cluster. > PySpark crashes if too many tasks complete quickly > -- > > Key: SPARK-2282 > URL: https://issues.apache.org/jira/browse/SPARK-2282 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 0.9.1, 1.0.0, 1.0.1 >Reporter: Aaron Davidson >Assignee: Aaron Davidson > Fix For: 0.9.2, 1.0.0, 1.0.1 > > > Upon every task completion, PythonAccumulatorParam constructs a new socket to > the Accumulator server running inside the pyspark daemon. This can cause a > buildup of used ephemeral ports from sockets in the TIME_WAIT termination > stage, which will cause the SparkContext to crash if too many tasks complete > too quickly. We ran into this bug with 17k tasks completing in 15 seconds. > This bug can be fixed outside of Spark by ensuring these properties are set > (on a linux server); > echo "1" > /proc/sys/net/ipv4/tcp_tw_reuse > echo "1" > /proc/sys/net/ipv4/tcp_tw_recycle > or by adding the SO_REUSEADDR option to the Socket creation within Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2282) PySpark crashes if too many tasks complete quickly
[ https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14065306#comment-14065306 ] Aaron Davidson commented on SPARK-2282: --- This problem is kinda silly because we're accumulating these updates from a single thread in the DAGScheduler, so we should only really have one socket open at a time, but it's very short lived. We could just reuse the connection with a relatively minor refactor of accumulators.py and PythonAccumulatorParam. > PySpark crashes if too many tasks complete quickly > -- > > Key: SPARK-2282 > URL: https://issues.apache.org/jira/browse/SPARK-2282 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 0.9.1, 1.0.0, 1.0.1 >Reporter: Aaron Davidson >Assignee: Aaron Davidson > Fix For: 0.9.2, 1.0.0, 1.0.1 > > > Upon every task completion, PythonAccumulatorParam constructs a new socket to > the Accumulator server running inside the pyspark daemon. This can cause a > buildup of used ephemeral ports from sockets in the TIME_WAIT termination > stage, which will cause the SparkContext to crash if too many tasks complete > too quickly. We ran into this bug with 17k tasks completing in 15 seconds. > This bug can be fixed outside of Spark by ensuring these properties are set > (on a linux server); > echo "1" > /proc/sys/net/ipv4/tcp_tw_reuse > echo "1" > /proc/sys/net/ipv4/tcp_tw_recycle > or by adding the SO_REUSEADDR option to the Socket creation within Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
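The connection-reuse idea in the comment above can be sketched as follows. This is not the actual accumulators.py/PythonAccumulatorParam code; the class name, the fixed-width long protocol, and the use of a local socket pair as a stand-in for the pyspark daemon's accumulator server are all assumptions for illustration.

```python
import socket
import struct

class AccumulatorClient:
    """Hypothetical sketch of the proposed refactor: keep one connection to
    the accumulator server open and reuse it for every update, instead of
    opening a new socket per task completion (each of which leaves a
    TIME_WAIT socket behind)."""

    def __init__(self, sock):
        self.sock = sock  # connected once, reused for the process lifetime

    def send_update(self, value):
        # one big-endian 8-byte long per update; a stand-in wire format
        self.sock.sendall(struct.pack(">q", value))

# A local socket pair stands in for the connection to the pyspark daemon.
server_side, client_side = socket.socketpair()
client = AccumulatorClient(client_side)
client.send_update(1)
client.send_update(2)  # same socket: no new ephemeral port consumed

# read back both updates on the "server" end
data = b""
while len(data) < 16:
    data += server_side.recv(16 - len(data))
updates = [struct.unpack(">q", data[i:i + 8])[0] for i in range(0, len(data), 8)]
```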
[jira] [Commented] (SPARK-2282) PySpark crashes if too many tasks complete quickly
[ https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14065121#comment-14065121 ] Aaron Davidson commented on SPARK-2282: --- This problem does look identical. I think I gave you the wrong netstat command, as "-l" only show listening sockets. Try with "-a" instead to see all open connections to confirm this, but the rest of your symptoms align perfectly. I did a little Googling around for your specific kernel version, and it turns out [someone else|http://lists.openwall.net/netdev/2011/07/13/39] has had success with tcp_tw_recycle on 2.6.32. Could you try to make absolutely sure that the sysctl is taking effect? Perhaps you can try adding "net.ipv4.tcp_tw_recycle = 1" to /etc/sysctl.conf and then running a "sysctl -p" before restarting pyspark. > PySpark crashes if too many tasks complete quickly > -- > > Key: SPARK-2282 > URL: https://issues.apache.org/jira/browse/SPARK-2282 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 0.9.1, 1.0.0, 1.0.1 >Reporter: Aaron Davidson >Assignee: Aaron Davidson > Fix For: 0.9.2, 1.0.0, 1.0.1 > > > Upon every task completion, PythonAccumulatorParam constructs a new socket to > the Accumulator server running inside the pyspark daemon. This can cause a > buildup of used ephemeral ports from sockets in the TIME_WAIT termination > stage, which will cause the SparkContext to crash if too many tasks complete > too quickly. We ran into this bug with 17k tasks completing in 15 seconds. > This bug can be fixed outside of Spark by ensuring these properties are set > (on a linux server); > echo "1" > /proc/sys/net/ipv4/tcp_tw_reuse > echo "1" > /proc/sys/net/ipv4/tcp_tw_recycle > or by adding the SO_REUSEADDR option to the Socket creation within Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2545) Add a diagnosis mode for closures to figure out what they're bringing in
Aaron Davidson created SPARK-2545: - Summary: Add a diagnosis mode for closures to figure out what they're bringing in Key: SPARK-2545 URL: https://issues.apache.org/jira/browse/SPARK-2545 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Aaron Davidson Today, it's pretty hard to figure out why your closure is bigger than expected, because it's not obvious what objects are being included or who is including them. We should have some sort of diagnosis available to users with very large closures that displays the contents of the closure. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2282) PySpark crashes if too many tasks complete quickly
[ https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063211#comment-14063211 ] Aaron Davidson commented on SPARK-2282: --- This should actually only be necessary on the master. Use of the SO_REUSEADDR property (equivalently, sysctl tcp_tw_reuse) means that the number of used sockets will increase to the maximum number of ephemeral ports, but then should remain constant. It's possible that if another process tries to allocate an ephemeral port during this time, it will fail. While tcp_tw_reuse is generally considered "safe", setting *tcp_tw_recycle* can lead to unexpected packet arrival from closed streams (though it's very unlikely), but is a more guaranteed solution. This should cause the connections to be recycled immediately after the TCP teardown, and thus no buildup of sockets should occur. Please let me know if setting either of these parameters helps on the driver machine. You can also verify that this problem is occurring by doing a {{netstat -lpn}} during execution, iirc, which should display an inordinate number of open connections on the Spark Driver process and on a Python daemon one. > PySpark crashes if too many tasks complete quickly > -- > > Key: SPARK-2282 > URL: https://issues.apache.org/jira/browse/SPARK-2282 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 0.9.1, 1.0.0, 1.0.1 >Reporter: Aaron Davidson >Assignee: Aaron Davidson > Fix For: 0.9.2, 1.0.0, 1.0.1 > > > Upon every task completion, PythonAccumulatorParam constructs a new socket to > the Accumulator server running inside the pyspark daemon. This can cause a > buildup of used ephemeral ports from sockets in the TIME_WAIT termination > stage, which will cause the SparkContext to crash if too many tasks complete > too quickly. We ran into this bug with 17k tasks completing in 15 seconds. 
> This bug can be fixed outside of Spark by ensuring these properties are set > (on a linux server); > echo "1" > /proc/sys/net/ipv4/tcp_tw_reuse > echo "1" > /proc/sys/net/ipv4/tcp_tw_recycle > or by adding the SO_REUSEADDR option to the Socket creation within Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
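The SO_REUSEADDR option mentioned in the description can be illustrated with a minimal Python snippet. Note this is only a sketch of setting the option at socket creation; SO_REUSEADDR chiefly lets a subsequent bind succeed while an old socket on the same address sits in TIME_WAIT, which is why the description pairs it with the tcp_tw_reuse/tcp_tw_recycle sysctls.

```python
import socket

# Create a TCP socket and opt into address reuse before any bind/connect.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
reuse = s.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR)
s.close()
```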
[jira] [Resolved] (SPARK-2485) Usage of HiveClient not threadsafe.
[ https://issues.apache.org/jira/browse/SPARK-2485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson resolved SPARK-2485. --- Resolution: Fixed https://github.com/apache/spark/pull/1412 > Usage of HiveClient not threadsafe. > --- > > Key: SPARK-2485 > URL: https://issues.apache.org/jira/browse/SPARK-2485 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Michael Armbrust >Assignee: Michael Armbrust > > When making concurrent queries against the hive metastore, sometimes we get > an exception that includes the following stack trace: > {code} > Caused by: java.lang.Throwable: get_table failed: out of sequence response > at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:76) > at > org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_table(ThriftHiveMetastore.java:936) > {code} > Likely, we need to synchronize our use of HiveClient. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-2154) Worker goes down.
[ https://issues.apache.org/jira/browse/SPARK-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060813#comment-14060813 ] Aaron Davidson commented on SPARK-2154: --- Created this PR to hopefully fix that: https://github.com/apache/spark/pull/1405 > Worker goes down. > - > > Key: SPARK-2154 > URL: https://issues.apache.org/jira/browse/SPARK-2154 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 0.8.1, 0.9.0, 1.0.0 > Environment: Spark on cluster of three nodes on Ubuntu 12.04.4 LTS >Reporter: siva venkat gogineni > Labels: patch > Attachments: Sccreenhot at various states of driver ..jpg > > > Worker dies when i try to submit drivers more than the allocated cores. When > I submit 9 drivers with one core for each driver on a cluster having 8 cores > all together the worker dies as soon as i submit the 9 the driver. It works > fine until it reaches 8 cores, As soon as i submit 9th driver the driver > status remains "Submitted" and the worker crashes. I understand that we > cannot run drivers more than the allocated cores but the problem here is > instead of the 9th driver being in queue it is being executed and as a result > it is crashing the worker. Let me know if there is a way to get around this > issue or is it being fixed in the upcoming version? > Cluster Details: > Spark 1.00 > 2 nodes with 4 cores each. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Closed] (SPARK-2453) Compound lines in spark-shell cause compilation errors
[ https://issues.apache.org/jira/browse/SPARK-2453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson closed SPARK-2453. - Resolution: Duplicate https://issues.apache.org/jira/browse/SPARK-2452 > Compound lines in spark-shell cause compilation errors > -- > > Key: SPARK-2453 > URL: https://issues.apache.org/jira/browse/SPARK-2453 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.1 >Reporter: Aaron Davidson >Priority: Critical > > repro: > {code} > scala> val x = 3; def z() = x + 4 > x: Int = 3 > z: ()Int > scala> z() > :11: error: $VAL5 is already defined as value $VAL5 > val $VAL5 = INSTANCE; > {code} > Everything's cool if the def is put on its own line. > This was caused by > https://github.com/apache/spark/commit/d43415075b3468fe8aa56de5d2907d409bb96347 > ([~prashant_]). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2453) Compound lines in spark-shell cause compilation errors
Aaron Davidson created SPARK-2453: - Summary: Compound lines in spark-shell cause compilation errors Key: SPARK-2453 URL: https://issues.apache.org/jira/browse/SPARK-2453 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Aaron Davidson Priority: Critical repro: {code} scala> val x = 3; def z() = x + 4 x: Int = 3 z: ()Int scala> z() :11: error: $VAL5 is already defined as value $VAL5 val $VAL5 = INSTANCE; {code} Everything's cool if the def is put on its own line. This was caused by https://github.com/apache/spark/commit/d43415075b3468fe8aa56de5d2907d409bb96347 ([~prashant_]). -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2412) CoalescedRDD throws exception with certain pref locs
Aaron Davidson created SPARK-2412: - Summary: CoalescedRDD throws exception with certain pref locs Key: SPARK-2412 URL: https://issues.apache.org/jira/browse/SPARK-2412 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Aaron Davidson Assignee: Aaron Davidson If the first pass of CoalescedRDD does not find the target number of locations AND the second pass finds new locations, an exception is thrown, as "groupHash.get(nxt_replica).get" is not valid. The fix is just to add an ArrayBuffer to groupHash for that replica if it didn't already exist. -- This message was sent by Atlassian JIRA (v6.2#6252)
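The fix described above amounts to creating an empty group on first sight of a replica rather than assuming one exists. A minimal sketch in Python terms (the host names are made up; the real code uses a Scala mutable map of ArrayBuffers):

```python
# The Scala expression groupHash.get(nxt_replica).get throws when the second
# pass meets a replica the first pass never saw. dict.setdefault plays the
# role of "add an ArrayBuffer for that replica if it didn't already exist".
group_hash = {}

def add_partition(replica, partition):
    group_hash.setdefault(replica, []).append(partition)

add_partition("host-a", 0)  # location found in the first pass
add_partition("host-b", 1)  # new location in the second pass: no exception
```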
[jira] [Resolved] (SPARK-2403) Spark stuck when class is not registered with Kryo
[ https://issues.apache.org/jira/browse/SPARK-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson resolved SPARK-2403. --- Resolution: Fixed Fix Version/s: 1.0.2 1.1.0 > Spark stuck when class is not registered with Kryo > -- > > Key: SPARK-2403 > URL: https://issues.apache.org/jira/browse/SPARK-2403 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Daniel Darabos > Fix For: 1.1.0, 1.0.2 > > > We are using Kryo and require registering classes. When trying to serialize > something containing an unregistered class, Kryo will raise an exception. > DAGScheduler.submitMissingTasks runs in the scheduler thread and checks if > the contents of the task can be serialized by trying to serialize it: > https://github.com/apache/spark/blob/v1.0.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L767 > It catches NotSerializableException and aborts the task with an error when > this happens. > The problem is, Kryo does not raise NotSerializableException for unregistered > classes. It raises IllegalArgumentException instead. This exception is not > caught and kills the scheduler thread. The application then hangs, waiting > indefinitely for the job to finish. > Catching IllegalArgumentException also is a quick fix. I'll send a pull > request for it if you agree. Thanks! -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2349) Fix NPE in ExternalAppendOnlyMap
[ https://issues.apache.org/jira/browse/SPARK-2349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson resolved SPARK-2349. --- Resolution: Fixed https://github.com/apache/spark/pull/1288 > Fix NPE in ExternalAppendOnlyMap > > > Key: SPARK-2349 > URL: https://issues.apache.org/jira/browse/SPARK-2349 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.1.0 >Reporter: Andrew Or > > It throws an NPE on null keys. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2324) SparkContext should not exit directly when spark.local.dir is a list of multiple paths and one of them has error
[ https://issues.apache.org/jira/browse/SPARK-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson resolved SPARK-2324. --- Resolution: Fixed Resolved by https://github.com/apache/spark/pull/1274 > SparkContext should not exit directly when spark.local.dir is a list of > multiple paths and one of them has error > > > Key: SPARK-2324 > URL: https://issues.apache.org/jira/browse/SPARK-2324 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: YanTang Zhai > > The spark.local.dir is configured as a list of multiple paths as follows > /data1/sparkenv/local,/data2/sparkenv/local. If the disk data2 of the driver > node has error, the application will exit since DiskBlockManager exits > directly at createLocalDirs. If the disk data2 of the worker node has error, > the executor will exit either. > DiskBlockManager should not exit directly at createLocalDirs if one of > spark.local.dir has error. Since spark.local.dir has multiple paths, a > problem should not affect the overall situation. > I think DiskBlockManager could ignore the bad directory at createLocalDirs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2282) PySpark crashes if too many tasks complete quickly
Aaron Davidson created SPARK-2282: - Summary: PySpark crashes if too many tasks complete quickly Key: SPARK-2282 URL: https://issues.apache.org/jira/browse/SPARK-2282 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0, 1.0.1 Reporter: Aaron Davidson Assignee: Aaron Davidson Upon every task completion, PythonAccumulatorParam constructs a new socket to the Accumulator server running inside the pyspark daemon. This can cause a buildup of used ephemeral ports from sockets in the TIME_WAIT termination stage, which will cause the SparkContext to crash if too many tasks complete too quickly. We ran into this bug with 17k tasks completing in 15 seconds. This bug can be fixed outside of Spark by ensuring these properties are set (on a linux server); echo "1" > /proc/sys/net/ipv4/tcp_tw_reuse echo "1" > /proc/sys/net/ipv4/tcp_tw_recycle or by adding the SO_REUSEADDR option to the Socket creation within Spark. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2203) PySpark does not infer default numPartitions in same way as Spark
Aaron Davidson created SPARK-2203: - Summary: PySpark does not infer default numPartitions in same way as Spark Key: SPARK-2203 URL: https://issues.apache.org/jira/browse/SPARK-2203 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.0.0 Reporter: Aaron Davidson Assignee: Aaron Davidson For shuffle-based operators, such as rdd.groupBy() or rdd.sortByKey(), PySpark will always assume that the default parallelism to use for the reduce side is ctx.defaultParallelism, which is a constant typically determined by the number of cores in cluster. In contrast, Spark's Partitioner#defaultPartitioner will use the same number of reduce partitions as map partitions unless the defaultParallelism config is explicitly set. This tends to be a better default in order to avoid OOMs, and should also be the behavior of PySpark. -- This message was sent by Atlassian JIRA (v6.2#6252)
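The defaulting rule the issue asks PySpark to adopt can be sketched as a small function. This is an approximation of Spark's Partitioner#defaultPartitioner behavior, not its actual code: honor an explicitly configured parallelism if present, otherwise follow the upstream partition count rather than a cluster-wide constant.

```python
# Sketch of the proposed defaulting: prefer the explicit config, else match
# the largest parent RDD's partition count (helps avoid reduce-side OOMs).
def default_num_partitions(parent_partition_counts, configured_parallelism=None):
    if configured_parallelism is not None:
        return configured_parallelism
    return max(parent_partition_counts)
```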
[jira] [Resolved] (SPARK-2147) Master UI forgets about Executors when application exits cleanly
[ https://issues.apache.org/jira/browse/SPARK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson resolved SPARK-2147. --- Resolution: Fixed https://github.com/apache/spark/pull/1102 > Master UI forgets about Executors when application exits cleanly > > > Key: SPARK-2147 > URL: https://issues.apache.org/jira/browse/SPARK-2147 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.0.0 >Reporter: Aaron Davidson >Assignee: Andrew Or > > When an application exits cleanly, the Master will remove all executors from > the application's ApplicationInfo, causing the historic "Completed > Applications" page to report that there were no executors associated with > that application. > On the contrary, if the application exits uncleanly, then the Master will > remove the application FIRST, and will not actually remove the executors from > the ApplicationInfo page. This causes the executors to show up correctly in > the "Completed Applications" page. > The correct behavior would probably be to gather a history of all executors > (so we'd retain executors that we had at one point but were removed during > the job), and not remove lost executors. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-2161) UI should remember executors that have been removed
[ https://issues.apache.org/jira/browse/SPARK-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson resolved SPARK-2161. --- Resolution: Fixed https://github.com/apache/spark/pull/1102 > UI should remember executors that have been removed > --- > > Key: SPARK-2161 > URL: https://issues.apache.org/jira/browse/SPARK-2161 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or > Fix For: 1.0.1 > > > This applies to all of SparkUI, MasterWebUI, and WorkerWebUI. If an executor > fails, it just disappears from these UIs. It would be helpful if you can see > the logs for why they failed on the UIs. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Updated] (SPARK-937) Executors that exit cleanly should not have KILLED status
[ https://issues.apache.org/jira/browse/SPARK-937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson updated SPARK-937: - Fix Version/s: 1.0.1 > Executors that exit cleanly should not have KILLED status > - > > Key: SPARK-937 > URL: https://issues.apache.org/jira/browse/SPARK-937 > Project: Spark > Issue Type: Improvement >Affects Versions: 0.7.3 >Reporter: Aaron Davidson >Assignee: Kan Zhang >Priority: Critical > Fix For: 1.0.1, 1.1.0 > > > This is an unintuitive and overloaded status message when Executors are > killed during normal termination of an application. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Resolved] (SPARK-937) Executors that exit cleanly should not have KILLED status
[ https://issues.apache.org/jira/browse/SPARK-937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson resolved SPARK-937. -- Resolution: Fixed > Executors that exit cleanly should not have KILLED status > - > > Key: SPARK-937 > URL: https://issues.apache.org/jira/browse/SPARK-937 > Project: Spark > Issue Type: Improvement >Affects Versions: 0.7.3 >Reporter: Aaron Davidson >Assignee: Kan Zhang >Priority: Critical > Fix For: 1.1.0 > > > This is an unintuitive and overloaded status message when Executors are > killed during normal termination of an application. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (SPARK-2147) Master UI forgets about Executors when application exits cleanly
Aaron Davidson created SPARK-2147: - Summary: Master UI forgets about Executors when application exits cleanly Key: SPARK-2147 URL: https://issues.apache.org/jira/browse/SPARK-2147 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 1.0.0 Reporter: Aaron Davidson Assignee: Andrew Or When an application exits cleanly, the Master will remove all executors from the application's ApplicationInfo, causing the historic "Completed Applications" page to report that there were no executors associated with that application. On the contrary, if the application exits uncleanly, then the Master will remove the application FIRST, and will not actually remove the executors from the ApplicationInfo page. This causes the executors to show up correctly in the "Completed Applications" page. The correct behavior would probably be to gather a history of all executors (so we'd retain executors that we had at one point but were removed during the job), and not remove lost executors. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-983) Support external sorting for RDD#sortByKey()
[ https://issues.apache.org/jira/browse/SPARK-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029498#comment-14029498 ] Aaron Davidson commented on SPARK-983: -- The idea for SizeTrackingAppendOnlyMap is that we can estimate the size of an object with SizeEstimator, but doing so can be relatively costly. Rather than running this estimation on every element added to the map, we amortize the cost by sampling exponentially less often (similar to amortization of adding to an ArrayList). If you expect your sublists to be relatively large, it may be perfectly fine to just call SizeEstimator on each one. > Support external sorting for RDD#sortByKey() > > > Key: SPARK-983 > URL: https://issues.apache.org/jira/browse/SPARK-983 > Project: Spark > Issue Type: New Feature >Affects Versions: 0.9.0 >Reporter: Reynold Xin >Assignee: Madhu Siddalingaiah > > Currently, RDD#sortByKey() is implemented by a mapPartitions which creates a > buffer to hold the entire partition, then sorts it. This will cause an OOM if > an entire partition cannot fit in memory, which is especially problematic for > skewed data. Rather than OOMing, the behavior should be similar to the > [ExternalAppendOnlyMap|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala], > where we fallback to disk if we detect memory pressure. -- This message was sent by Atlassian JIRA (v6.2#6252)
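The amortized-sampling idea in the comment above can be sketched with a toy collection. The growth rate and the use of sys.getsizeof as a stand-in for Spark's SizeEstimator are assumptions for illustration; the real SizeTrackingAppendOnlyMap is a Scala map, not a list.

```python
import sys

class SizeTrackingList:
    """Toy version of the amortization described above: run the costly size
    estimate only at exponentially spaced insertion counts, so its cost is
    amortized the way ArrayList doubling amortizes copying."""

    SAMPLE_GROWTH_RATE = 1.1  # assumed constant for this sketch

    def __init__(self):
        self.items = []
        self.next_sample_at = 1
        self.bytes_per_item = 0.0

    def append(self, item):
        self.items.append(item)
        if len(self.items) >= self.next_sample_at:
            # stand-in for SizeEstimator: average shallow size of elements
            total = sum(sys.getsizeof(x) for x in self.items)
            self.bytes_per_item = total / len(self.items)
            self.next_sample_at = int(self.next_sample_at * self.SAMPLE_GROWTH_RATE) + 1

    def estimated_size(self):
        # cheap between samples: scale the last per-item estimate
        return self.bytes_per_item * len(self.items)
```

Between samples the estimate is a constant-time multiplication, so per-append overhead stays O(1) amortized even though each individual estimate scans the collection.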
[jira] [Created] (SPARK-2063) Creating a SchemaRDD via sql() does not correctly resolve nested types
Aaron Davidson created SPARK-2063: - Summary: Creating a SchemaRDD via sql() does not correctly resolve nested types Key: SPARK-2063 URL: https://issues.apache.org/jira/browse/SPARK-2063 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.0.0 Reporter: Aaron Davidson Assignee: Michael Armbrust For example, from the typical twitter dataset:
{code}
scala> val popularTweets = sql("SELECT retweeted_status.text, MAX(retweeted_status.retweet_count) AS s FROM tweets WHERE retweeted_status is not NULL GROUP BY retweeted_status.text ORDER BY s DESC LIMIT 30")
scala> popularTweets.toString
14/06/06 21:27:48 INFO analysis.Analyzer: Max iterations (2) reached for batch MultiInstanceRelations
14/06/06 21:27:48 INFO analysis.Analyzer: Max iterations (2) reached for batch CaseInsensitiveAttributeReferences
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to qualifiers on unresolved object, tree: 'retweeted_status.text
	at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:51)
	at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:47)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:67)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:65)
	at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:65)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$3$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:100)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$3$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:97)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionDown$1(QueryPlan.scala:51)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:69)
	at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:40)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$3.applyOrElse(Analyzer.scala:97)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$3.applyOrElse(Analyzer.scala:94)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:217)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:94)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:93)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$
	... (scala.collection frames elided; trace truncated in the original message)
{code}
[jira] [Created] (SPARK-2028) Users of HadoopRDD cannot access the partition InputSplits
Aaron Davidson created SPARK-2028: - Summary: Users of HadoopRDD cannot access the partition InputSplits Key: SPARK-2028 URL: https://issues.apache.org/jira/browse/SPARK-2028 Project: Spark Issue Type: Bug Reporter: Aaron Davidson Assignee: Aaron Davidson If a user creates a HadoopRDD (e.g., via textFile), there is no way to find out which file a given partition came from, though this information is contained in the InputSplit within the RDD. We should find a way to expose this publicly.
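What the issue asks for is per-partition provenance: each record should be traceable back to the file its InputSplit covers. As a language-neutral illustration (plain Python with no Spark dependency; `read_with_provenance` is a hypothetical name, not a Spark API), the idea is simply to carry the source filename alongside each record instead of discarding it:

```python
import os
import tempfile

def read_with_provenance(paths):
    """Yield (filename, line) pairs, mimicking what exposing the
    InputSplit would let a HadoopRDD user recover per partition."""
    for path in paths:
        with open(path) as f:
            for line in f:
                yield (os.path.basename(path), line.rstrip("\n"))

if __name__ == "__main__":
    # Build two small input files to stand in for HDFS splits.
    tmpdir = tempfile.mkdtemp()
    paths = []
    for name, body in [("a.txt", "alpha\n"), ("b.txt", "beta\n")]:
        p = os.path.join(tmpdir, name)
        with open(p, "w") as f:
            f.write(body)
        paths.append(p)
    for origin, record in read_with_provenance(paths):
        print(origin, record)
```

In Hadoop terms, the filename here plays the role of `FileSplit.getPath()`; the point of the ticket is that Spark held this information internally but gave users no public hook to reach it.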
[jira] [Resolved] (SPARK-1899) Default log4j.properties incorrectly sends all output to stderr and none to stdout
[ https://issues.apache.org/jira/browse/SPARK-1899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson resolved SPARK-1899. --- Resolution: Won't Fix See discussion at: https://github.com/apache/spark/pull/852 > Default log4j.properties incorrectly sends all output to stderr and none to > stdout > -- > > Key: SPARK-1899 > URL: https://issues.apache.org/jira/browse/SPARK-1899 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.0.0 >Reporter: Andrew Ash > > Standard practice for unix applications is to send most output to stdout, and > errors to stderr. The default log4j.properties file that Spark includes > incorrectly sends all output to stderr and none to stdout. Update the > default log4j to split the output appropriately.
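For readers who want the split behavior the report describes, a log4j 1.x configuration can route INFO-and-below to stdout and WARN-and-above to stderr using two console appenders. This is an illustrative sketch, not the file Spark ships; note that `LevelRangeFilter` is generally only configurable through the XML format in log4j 1.2, which is one reason the properties-file default is hard to fix cleanly:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE log4j:configuration SYSTEM "log4j.dtd">
<log4j:configuration xmlns:log4j="http://jakarta.apache.org/log4j/">
  <!-- INFO and below go to stdout... -->
  <appender name="stdout" class="org.apache.log4j.ConsoleAppender">
    <param name="Target" value="System.out"/>
    <layout class="org.apache.log4j.PatternLayout">
      <param name="ConversionPattern" value="%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n"/>
    </layout>
    <filter class="org.apache.log4j.varia.LevelRangeFilter">
      <param name="LevelMax" value="INFO"/>
    </filter>
  </appender>
  <!-- ...while WARN and above go to stderr. -->
  <appender name="stderr" class="org.apache.log4j.ConsoleAppender">
    <param name="Target" value="System.err"/>
    <param name="Threshold" value="WARN"/>
    <layout class="org.apache.log4j.PatternLayout">
      <param name="ConversionPattern" value="%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n"/>
    </layout>
  </appender>
  <root>
    <priority value="INFO"/>
    <appender-ref ref="stdout"/>
    <appender-ref ref="stderr"/>
  </root>
</log4j:configuration>
```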
[jira] [Created] (SPARK-2027) spark-ec2 puts Hadoop's log4j ahead of Spark's in classpath
Aaron Davidson created SPARK-2027: - Summary: spark-ec2 puts Hadoop's log4j ahead of Spark's in classpath Key: SPARK-2027 URL: https://issues.apache.org/jira/browse/SPARK-2027 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Aaron Davidson Assignee: Aaron Davidson
[jira] [Resolved] (SPARK-1901) Standalone worker updates executor's state ahead of executor process exit
[ https://issues.apache.org/jira/browse/SPARK-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson resolved SPARK-1901. --- Resolution: Fixed > Standalone worker updates executor's state ahead of executor process exit > --- > > Key: SPARK-1901 > URL: https://issues.apache.org/jira/browse/SPARK-1901 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 0.9.0 > Environment: spark-1.0 rc10 >Reporter: Zhen Peng >Assignee: Zhen Peng > Fix For: 1.0.1 > > > The standalone Worker updates an executor's state prematurely, leaving the resource accounting inconsistent until the executor process has actually exited. > In our cluster, we found this situation may cause newly submitted applications to be removed by the Master because their executors fail to launch.
[jira] [Created] (SPARK-1966) Cannot cancel tasks running locally
Aaron Davidson created SPARK-1966: - Summary: Cannot cancel tasks running locally Key: SPARK-1966 URL: https://issues.apache.org/jira/browse/SPARK-1966 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 0.9.1, 1.0.0 Reporter: Aaron Davidson
[jira] [Comment Edited] (SPARK-983) Support external sorting for RDD#sortByKey()
[ https://issues.apache.org/jira/browse/SPARK-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010190#comment-14010190 ] Aaron Davidson edited comment on SPARK-983 at 5/27/14 9:54 PM: --- Does sound reasonable. For some reason it does not allow me to assign the issue to you, though. Edit: Figured it out, thanks [~pwendell]! was (Author: ilikerps): Does sound reasonable. For some reason it does not allow me to assign the issue to you, though. > Support external sorting for RDD#sortByKey() > > > Key: SPARK-983 > URL: https://issues.apache.org/jira/browse/SPARK-983 > Project: Spark > Issue Type: New Feature >Affects Versions: 0.9.0 >Reporter: Reynold Xin >Assignee: Madhu Siddalingaiah > > Currently, RDD#sortByKey() is implemented by a mapPartitions which creates a > buffer to hold the entire partition, then sorts it. This will cause an OOM if > an entire partition cannot fit in memory, which is especially problematic for > skewed data. Rather than OOMing, the behavior should be similar to the > [ExternalAppendOnlyMap|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala], > where we fall back to disk if we detect memory pressure.
[jira] [Updated] (SPARK-983) Support external sorting for RDD#sortByKey()
[ https://issues.apache.org/jira/browse/SPARK-983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aaron Davidson updated SPARK-983: - Assignee: Madhu Siddalingaiah > Support external sorting for RDD#sortByKey() > > > Key: SPARK-983 > URL: https://issues.apache.org/jira/browse/SPARK-983 > Project: Spark > Issue Type: New Feature >Affects Versions: 0.9.0 >Reporter: Reynold Xin >Assignee: Madhu Siddalingaiah
[jira] [Commented] (SPARK-983) Support external sorting for RDD#sortByKey()
[ https://issues.apache.org/jira/browse/SPARK-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010190#comment-14010190 ] Aaron Davidson commented on SPARK-983: -- Does sound reasonable. For some reason it does not allow me to assign the issue to you, though. > Support external sorting for RDD#sortByKey() > > > Key: SPARK-983 > URL: https://issues.apache.org/jira/browse/SPARK-983 > Project: Spark > Issue Type: New Feature >Affects Versions: 0.9.0 >Reporter: Reynold Xin
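The ExternalAppendOnlyMap-style behavior SPARK-983 asks for — buffer records in memory, spill sorted runs to disk when a threshold is hit, then lazily merge the runs — can be sketched in a few lines. This is a generic external merge sort in plain Python, not Spark's implementation; the fixed record count stands in for real memory-pressure detection:

```python
import heapq
import os
import pickle
import tempfile

def external_sort(records, max_in_memory=1000):
    """Sort an iterable that may not fit in memory by spilling
    sorted runs to disk and k-way merging them with a heap."""
    runs = []  # paths of spilled, individually sorted runs
    buf = []

    def spill():
        buf.sort()
        fd, path = tempfile.mkstemp()
        with os.fdopen(fd, "wb") as f:
            for item in buf:
                pickle.dump(item, f)
        runs.append(path)
        buf.clear()

    for rec in records:
        buf.append(rec)
        if len(buf) >= max_in_memory:  # crude stand-in for memory pressure
            spill()
    buf.sort()  # the final, unspilled tail stays in memory

    def read_run(path):
        with open(path, "rb") as f:
            while True:
                try:
                    yield pickle.load(f)
                except EOFError:
                    return

    # heapq.merge lazily merges the sorted on-disk runs with the tail.
    yield from heapq.merge(buf, *map(read_run, runs))
```

With `max_in_memory=2` and input `[5, 1, 4, 2, 3]`, `list(external_sort(...))` yields `[1, 2, 3, 4, 5]` while never buffering more than two records in memory, which is exactly the OOM-avoidance property the ticket wants for skewed partitions.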
[jira] [Commented] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing
[ https://issues.apache.org/jira/browse/SPARK-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14009293#comment-14009293 ] Aaron Davidson commented on SPARK-1855: --- I agree that significant improvements can be made to Spark's block replication model, but there's no reason it shouldn't "work" (albeit with potentially poor write performance and fewer guarantees than one would like) if you increase the replication level higher than 2, which is possible using [StorageLevel#apply|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/StorageLevel.scala#L155]. > Provide memory-and-local-disk RDD checkpointing > --- > > Key: SPARK-1855 > URL: https://issues.apache.org/jira/browse/SPARK-1855 > Project: Spark > Issue Type: New Feature > Components: MLlib, Spark Core >Affects Versions: 1.0.0 >Reporter: Xiangrui Meng > > Checkpointing is used to cut long lineage while maintaining fault tolerance. > The current implementation is HDFS-based. Using the BlockRDD we can create > in-memory-and-local-disk (with replication) checkpoints that are not as > reliable as the HDFS-based solution but faster. > It can help applications that require many iterations.
[jira] [Commented] (SPARK-983) Support external sorting for RDD#sortByKey()
[ https://issues.apache.org/jira/browse/SPARK-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14008557#comment-14008557 ] Aaron Davidson commented on SPARK-983: -- [~pwendell] or [~matei], any opinions on memory management best practices? Adding a new memoryFraction for sorting will only exacerbate the problems we see with them, but I'm not sure we can rely on Runtime.freeMemory() as even an intermediary solution. Perhaps this feature could draw from the same pool as shuffle.memoryFraction, as it's used for a similar purpose, and that pool already implements some notion of memory sharing. > Support external sorting for RDD#sortByKey() > > > Key: SPARK-983 > URL: https://issues.apache.org/jira/browse/SPARK-983 > Project: Spark > Issue Type: New Feature >Affects Versions: 0.9.0 >Reporter: Reynold Xin
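The "draw from the same pool" idea in the comment above can be sketched as one shared accounting object: every consumer (shuffle, sort, ...) requests bytes against a single fraction-sized pool, and a denied request is the caller's signal to spill. This is a plain-Python illustration with invented names (`SharedMemoryPool`, `SpillingBuffer`), not Spark's actual memory manager:

```python
import threading

class SharedMemoryPool:
    """One byte budget shared by several consumers; a failed acquire is
    the caller's cue to spill to disk rather than keep growing in memory."""
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.lock = threading.Lock()

    def try_acquire(self, nbytes):
        with self.lock:
            if self.used + nbytes > self.capacity:
                return False  # denied: caller should spill
            self.used += nbytes
            return True

    def release(self, nbytes):
        with self.lock:
            self.used = max(0, self.used - nbytes)

class SpillingBuffer:
    """A consumer that asks the pool for room before buffering a record."""
    def __init__(self, pool, record_size=8):
        self.pool, self.record_size = pool, record_size
        self.in_memory, self.spills = [], 0

    def append(self, rec):
        if not self.pool.try_acquire(self.record_size):
            # Memory pressure: hand the buffer's bytes back to the pool
            # (a real implementation would write the buffer to disk here).
            self.pool.release(self.record_size * len(self.in_memory))
            self.in_memory.clear()
            self.spills += 1
            self.pool.try_acquire(self.record_size)
        self.in_memory.append(rec)
```

The appeal over a dedicated sort memoryFraction is visible even in the sketch: whichever consumer happens to be busy can use the whole budget, and no new tuning knob is introduced.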