[jira] [Resolved] (SPARK-4264) SQL HashJoin induces "refCnt = 0" error in ShuffleBlockFetcherIterator

2014-11-06 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson resolved SPARK-4264.
---
Resolution: Fixed

> SQL HashJoin induces "refCnt = 0" error in ShuffleBlockFetcherIterator
> --
>
> Key: SPARK-4264
> URL: https://issues.apache.org/jira/browse/SPARK-4264
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Aaron Davidson
>Assignee: Aaron Davidson
>Priority: Blocker
>
> This is because the HashJoin calls hasNext twice, which unintuitively invokes 
> the completion iterator twice.
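
For illustration only (a hedged sketch, not Spark's actual CompletionIterator or the committed fix): a completion callback wired to hasNext fires twice when hasNext is called again after exhaustion, unless it is guarded.

{code}
// Sketch: guard the completion callback so a second hasNext call after
// exhaustion does not run the cleanup (and release buffers) a second time.
class GuardedCompletionIterator[A](sub: Iterator[A], completion: () => Unit)
  extends Iterator[A] {
  private var completed = false
  override def hasNext: Boolean = {
    val more = sub.hasNext
    if (!more && !completed) { completed = true; completion() }
    more
  }
  override def next(): A = sub.next()
}
{code}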






[jira] [Created] (SPARK-4277) Support external shuffle service on Worker

2014-11-06 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-4277:
-

 Summary: Support external shuffle service on Worker
 Key: SPARK-4277
 URL: https://issues.apache.org/jira/browse/SPARK-4277
 Project: Spark
  Issue Type: Bug
  Components: Deploy, Spark Core
Reporter: Aaron Davidson
Assignee: Aaron Davidson


It's less useful to have an external shuffle service on the Spark Standalone 
Worker than on YARN or Mesos (as executor allocations tend to be more static), 
but it would be good to help test the code path. It would also make Spark more 
resilient to particular executor failures.

Cool side-feature: When SPARK-4236 is fixed and integrated, the Worker will 
take care of cleaning up executor directories, which will mean executors 
terminate more quickly and that we don't leak data if the executor dies 
forcefully.
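
As a hedged sketch of how this would be used (the flag below is Spark's existing external shuffle service setting; hosting the service inside the Standalone Worker is the new part proposed here):

{code}
import org.apache.spark.SparkConf

// Assumed usage: enable the external shuffle service; with this ticket the
// Standalone Worker would be the process that hosts it.
val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")
{code}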






[jira] [Commented] (SPARK-4238) Perform network-level retry of shuffle file fetches

2014-11-06 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4238?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200526#comment-14200526
 ] 

Aaron Davidson commented on SPARK-4238:
---

Oh, whoops.

> Perform network-level retry of shuffle file fetches
> ---
>
> Key: SPARK-4238
> URL: https://issues.apache.org/jira/browse/SPARK-4238
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Aaron Davidson
>Assignee: Aaron Davidson
>Priority: Critical
>
> During periods of high network (or GC) load, it is not uncommon that 
> IOExceptions crop up around connection failures when fetching shuffle files. 
> Unfortunately, when such a failure occurs, it is interpreted as an inability 
> to fetch the files, which causes us to mark the executor as lost and 
> recompute all of its shuffle outputs.
> We should allow retrying at the network level in the event of an IOException 
> in order to avoid this circumstance.
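
A minimal sketch of the idea (a generic retry wrapper, not Spark's actual fetch code; the retry count is an arbitrary illustration):

{code}
import java.io.IOException

// Illustrative only: retry an IO action a few times so that a transient
// connection failure is not escalated into marking the executor as lost.
def withRetries[T](retriesLeft: Int = 3)(fetch: => T): T =
  try fetch
  catch {
    case _: IOException if retriesLeft > 0 => withRetries(retriesLeft - 1)(fetch)
  }
{code}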






[jira] [Updated] (SPARK-4188) Shuffle fetches should be retried at a lower level

2014-11-06 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson updated SPARK-4188:
--
Description: 
During periods of high network (or GC) load, it is not uncommon that 
IOExceptions crop up around connection failures when fetching shuffle files. 
Unfortunately, when such a failure occurs, it is interpreted as an inability to 
fetch the files, which causes us to mark the executor as lost and recompute all 
of its shuffle outputs.
We should allow retrying at the network level in the event of an IOException in 
order to avoid this circumstance.


  was:Sometimes fetches will fail due to garbage collection pauses or network 
load. A simple retry could save recomputation of a lot of shuffle data, 
especially if it's below the task level (i.e., on the level of a single fetch).


> Shuffle fetches should be retried at a lower level
> --
>
> Key: SPARK-4188
> URL: https://issues.apache.org/jira/browse/SPARK-4188
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Aaron Davidson
>
> During periods of high network (or GC) load, it is not uncommon that 
> IOExceptions crop up around connection failures when fetching shuffle files. 
> Unfortunately, when such a failure occurs, it is interpreted as an inability 
> to fetch the files, which causes us to mark the executor as lost and 
> recompute all of its shuffle outputs.
> We should allow retrying at the network level in the event of an IOException 
> in order to avoid this circumstance.






[jira] [Closed] (SPARK-4238) Perform network-level retry of shuffle file fetches

2014-11-06 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson closed SPARK-4238.
-
Resolution: Duplicate

> Perform network-level retry of shuffle file fetches
> ---
>
> Key: SPARK-4238
> URL: https://issues.apache.org/jira/browse/SPARK-4238
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Aaron Davidson
>Assignee: Aaron Davidson
>Priority: Critical
>
> During periods of high network (or GC) load, it is not uncommon that 
> IOExceptions crop up around connection failures when fetching shuffle files. 
> Unfortunately, when such a failure occurs, it is interpreted as an inability 
> to fetch the files, which causes us to mark the executor as lost and 
> recompute all of its shuffle outputs.
> We should allow retrying at the network level in the event of an IOException 
> in order to avoid this circumstance.






[jira] [Commented] (SPARK-2468) Netty-based block server / client module

2014-11-06 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14200519#comment-14200519
 ] 

Aaron Davidson commented on SPARK-2468:
---

[~zzcclp] Use of epoll mode is highly dependent on your environment, and I 
personally would not recommend it due to known netty bugs which may cause it to 
be less stable. We have found nio mode to be sufficiently performant in our 
testing (and netty actually still tries to use epoll if it's available as its 
selector).

[~lianhuiwang] Could you please elaborate on what you mean?

> Netty-based block server / client module
> 
>
> Key: SPARK-2468
> URL: https://issues.apache.org/jira/browse/SPARK-2468
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Critical
> Fix For: 1.2.0
>
>
> Right now shuffle send goes through the block manager. This is inefficient 
> because it requires loading a block from disk into a kernel buffer, then into 
> a user space buffer, and then back to a kernel send buffer before it reaches 
> the NIC. It does multiple copies of the data and context switching between 
> kernel/user. It also creates unnecessary buffers in the JVM that increase GC pressure.
> Instead, we should use FileChannel.transferTo, which handles this in the 
> kernel space with zero-copy. See 
> http://www.ibm.com/developerworks/library/j-zerocopy/
> One potential solution is to use Netty.  Spark already has a Netty based 
> network module implemented (org.apache.spark.network.netty). However, it 
> lacks some functionality and is turned off by default. 
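
For reference, the zero-copy path mentioned above looks roughly like this (plain NIO, not Spark code):

{code}
import java.io.FileInputStream
import java.nio.channels.WritableByteChannel

// Sketch: let the kernel move a file segment straight to the target channel,
// skipping the disk -> kernel -> user -> kernel copy chain described above.
def sendSegment(path: String, offset: Long, length: Long,
                target: WritableByteChannel): Unit = {
  val channel = new FileInputStream(path).getChannel
  try {
    var written = 0L
    while (written < length) {
      written += channel.transferTo(offset + written, length - written, target)
    }
  } finally {
    channel.close()
  }
}
{code}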






[jira] [Comment Edited] (SPARK-2468) Netty-based block server / client module

2014-11-05 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199911#comment-14199911
 ] 

Aaron Davidson edited comment on SPARK-2468 at 11/6/14 6:57 AM:


This could be due to the netty transfer service allocating more off-heap byte 
buffers, which perhaps is accounted for differently by YARN. [PR 
#3101|https://github.com/apache/spark/pull/3101/files#diff-d2ce9b38bdc38ca9d7119f9c2cf79907R33],
 which should go in tomorrow, will include a way to avoid allocating off-heap 
buffers (by setting the spark config 
"spark.shuffle.io.preferDirectBufs=false"), which should either solve your 
problem or at least produce the more typical OutOfMemoryError.
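
For example, once that change lands it could be set like this (sketch; the key name is taken from the comment above):

{code}
import org.apache.spark.SparkConf

// Opt out of off-heap (direct) buffer allocation in the Netty transfer service,
// so memory pressure surfaces as an ordinary OutOfMemoryError instead.
val conf = new SparkConf()
  .set("spark.shuffle.io.preferDirectBufs", "false")
{code}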


was (Author: ilikerps):
This could be due to the netty transfer service allocating more off-heap byte 
buffers, which perhaps is accounted for differently by YARN. [PR 
#3101|https://github.com/apache/spark/pull/3101/files#diff-d2ce9b38bdc38ca9d7119f9c2cf79907R33],
 which should go in tomorrow, will include a way to avoid allocating off-heap 
buffers, which should either solve your problem or at least produce the more 
typical OutOfMemoryError.

> Netty-based block server / client module
> 
>
> Key: SPARK-2468
> URL: https://issues.apache.org/jira/browse/SPARK-2468
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Critical
> Fix For: 1.2.0
>
>
> Right now shuffle send goes through the block manager. This is inefficient 
> because it requires loading a block from disk into a kernel buffer, then into 
> a user space buffer, and then back to a kernel send buffer before it reaches 
> the NIC. It does multiple copies of the data and context switching between 
> kernel/user. It also creates unnecessary buffers in the JVM that increase GC pressure.
> Instead, we should use FileChannel.transferTo, which handles this in the 
> kernel space with zero-copy. See 
> http://www.ibm.com/developerworks/library/j-zerocopy/
> One potential solution is to use Netty.  Spark already has a Netty based 
> network module implemented (org.apache.spark.network.netty). However, it 
> lacks some functionality and is turned off by default. 






[jira] [Commented] (SPARK-2468) Netty-based block server / client module

2014-11-05 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14199911#comment-14199911
 ] 

Aaron Davidson commented on SPARK-2468:
---

This could be due to the netty transfer service allocating more off-heap byte 
buffers, which perhaps is accounted for differently by YARN. [PR 
#3101|https://github.com/apache/spark/pull/3101/files#diff-d2ce9b38bdc38ca9d7119f9c2cf79907R33],
 which should go in tomorrow, will include a way to avoid allocating off-heap 
buffers, which should either solve your problem or at least produce the more 
typical OutOfMemoryError.

> Netty-based block server / client module
> 
>
> Key: SPARK-2468
> URL: https://issues.apache.org/jira/browse/SPARK-2468
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Critical
> Fix For: 1.2.0
>
>
> Right now shuffle send goes through the block manager. This is inefficient 
> because it requires loading a block from disk into a kernel buffer, then into 
> a user space buffer, and then back to a kernel send buffer before it reaches 
> the NIC. It does multiple copies of the data and context switching between 
> kernel/user. It also creates unnecessary buffers in the JVM that increase GC pressure.
> Instead, we should use FileChannel.transferTo, which handles this in the 
> kernel space with zero-copy. See 
> http://www.ibm.com/developerworks/library/j-zerocopy/
> One potential solution is to use Netty.  Spark already has a Netty based 
> network module implemented (org.apache.spark.network.netty). However, it 
> lacks some functionality and is turned off by default. 






[jira] [Created] (SPARK-4264) SQL HashJoin induces "refCnt = 0" error in ShuffleBlockFetcherIterator

2014-11-05 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-4264:
-

 Summary: SQL HashJoin induces "refCnt = 0" error in 
ShuffleBlockFetcherIterator
 Key: SPARK-4264
 URL: https://issues.apache.org/jira/browse/SPARK-4264
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.2.0
Reporter: Aaron Davidson
Assignee: Aaron Davidson
Priority: Blocker


This is because the HashJoin calls hasNext twice, which unintuitively invokes the 
completion iterator twice.






[jira] [Created] (SPARK-4242) Add SASL to external shuffle service

2014-11-04 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-4242:
-

 Summary: Add SASL to external shuffle service
 Key: SPARK-4242
 URL: https://issues.apache.org/jira/browse/SPARK-4242
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Aaron Davidson
Assignee: Aaron Davidson


It's already added to NettyBlockTransferService; let's just add it to 
ExternalShuffleClient as well.
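
A sketch of the settings this would key off (these are Spark's existing shared-secret auth options; wiring them through ExternalShuffleClient is the work proposed here, and the secret value is a placeholder):

{code}
import org.apache.spark.SparkConf

// Assumed configuration for SASL: enable authentication with a shared secret.
val conf = new SparkConf()
  .set("spark.authenticate", "true")
  .set("spark.authenticate.secret", "<shared-secret>")
{code}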






[jira] [Created] (SPARK-4238) Perform network-level retry of shuffle file fetches

2014-11-04 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-4238:
-

 Summary: Perform network-level retry of shuffle file fetches
 Key: SPARK-4238
 URL: https://issues.apache.org/jira/browse/SPARK-4238
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Aaron Davidson
Assignee: Aaron Davidson
Priority: Critical


During periods of high network (or GC) load, it is not uncommon that 
IOExceptions crop up around connection failures when fetching shuffle files. 
Unfortunately, when such a failure occurs, it is interpreted as an inability to 
fetch the files, which causes us to mark the executor as lost and recompute all 
of its shuffle outputs.

We should allow retrying at the network level in the event of an IOException in 
order to avoid this circumstance.






[jira] [Created] (SPARK-4236) External shuffle service must cleanup its shuffle files

2014-11-04 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-4236:
-

 Summary: External shuffle service must cleanup its shuffle files
 Key: SPARK-4236
 URL: https://issues.apache.org/jira/browse/SPARK-4236
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Aaron Davidson
Priority: Critical


When the external shuffle service is enabled, executors no longer clean up their own 
shuffle files, as the external server can continue to serve them. The external 
server must clean these files up once it knows that they will never be used 
again (i.e., when the application terminates or when Spark's ContextCleaner 
requests that they be deleted).
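
A rough sketch of the cleanup hook this implies (names and data structures here are hypothetical, not the actual shuffle service API):

{code}
import java.io.File

// Hypothetical: when the external shuffle server learns that an application has
// terminated (or the ContextCleaner asks), delete the local dirs it registered.
def applicationRemoved(appId: String, registeredDirs: Map[String, Seq[File]]): Unit = {
  def deleteRecursively(f: File): Unit = {
    if (f.isDirectory) Option(f.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
    f.delete()
  }
  registeredDirs.getOrElse(appId, Seq.empty).foreach(deleteRecursively)
}
{code}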






[jira] [Resolved] (SPARK-4163) When fetching blocks unsuccessfully, Web UI only displays "Fetch failure"

2014-11-02 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson resolved SPARK-4163.
---
   Resolution: Fixed
Fix Version/s: 1.2.0
 Assignee: Shixiong Zhu

> When fetching blocks unsuccessfully, Web UI only displays "Fetch failure"
> -
>
> Key: SPARK-4163
> URL: https://issues.apache.org/jira/browse/SPARK-4163
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, Web UI
>Affects Versions: 1.0.0, 1.1.0
>Reporter: Shixiong Zhu
>Assignee: Shixiong Zhu
>Priority: Minor
> Fix For: 1.2.0
>
>
> When fetching blocks unsuccessfully, Web UI only displays "Fetch failure". 
> It's hard to find out the cause of the fetch failure. Web UI should display 
> the stack track for the fetch failure.






[jira] [Created] (SPARK-4198) Refactor BlockManager's doPut and doGetLocal into smaller pieces

2014-11-02 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-4198:
-

 Summary: Refactor BlockManager's doPut and doGetLocal into smaller 
pieces
 Key: SPARK-4198
 URL: https://issues.apache.org/jira/browse/SPARK-4198
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Aaron Davidson
Assignee: Aaron Davidson


Currently the BlockManager is pretty hard to grok due to a lot of complicated, 
interconnected details. Two particularly smelly pieces of code are doPut and 
doGetLocal, which are both around 200 lines long, with heavily nested 
statements and lots of orthogonal logic. We should break this up into smaller 
pieces.






[jira] [Created] (SPARK-4189) FileSegmentManagedBuffer should have a configurable memory map threshold

2014-11-01 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-4189:
-

 Summary: FileSegmentManagedBuffer should have a configurable 
memory map threshold
 Key: SPARK-4189
 URL: https://issues.apache.org/jira/browse/SPARK-4189
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Aaron Davidson


One size does not fit all; it would be useful to have a configuration for 
changing the threshold at which we memory-map shuffle files.
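
A sketch of what the configurable threshold would look like (the cutoff is passed in from configuration; the exact config key is left open here):

{code}
import java.io.{File, FileInputStream}
import java.nio.ByteBuffer
import java.nio.channels.FileChannel.MapMode

// Illustrative: memory-map large segments, but read small ones into a heap
// buffer, with the cutoff coming from configuration rather than a constant.
def readSegment(file: File, offset: Long, length: Long, mmapThreshold: Long): ByteBuffer = {
  val channel = new FileInputStream(file).getChannel
  try {
    if (length >= mmapThreshold) {
      channel.map(MapMode.READ_ONLY, offset, length)
    } else {
      val buf = ByteBuffer.allocate(length.toInt)
      channel.position(offset)
      while (buf.hasRemaining && channel.read(buf) >= 0) {}
      buf.flip()
      buf
    }
  } finally {
    channel.close()
  }
}
{code}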






[jira] [Created] (SPARK-4188) Shuffle fetches should be retried at a lower level

2014-11-01 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-4188:
-

 Summary: Shuffle fetches should be retried at a lower level
 Key: SPARK-4188
 URL: https://issues.apache.org/jira/browse/SPARK-4188
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Aaron Davidson


Sometimes fetches will fail due to garbage collection pauses or network load. A 
simple retry could save recomputation of a lot of shuffle data, especially if 
it's below the task level (i.e., on the level of a single fetch).






[jira] [Created] (SPARK-4187) External shuffle service should not use Java serializer

2014-11-01 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-4187:
-

 Summary: External shuffle service should not use Java serializer
 Key: SPARK-4187
 URL: https://issues.apache.org/jira/browse/SPARK-4187
 Project: Spark
  Issue Type: Improvement
Reporter: Aaron Davidson


The Java serializer is not a good solution for talking between JVMs, as its 
format may change between versions of the code even if the class itself does 
not change in an incompatible way.
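
A sketch of the alternative (an explicit, hand-written wire format instead of ObjectOutputStream; the message type is hypothetical):

{code}
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets.UTF_8

// Hypothetical message with a stable, explicit encoding: the bytes on the wire
// do not change just because the class file or JVM version changes.
case class ShuffleRequest(appId: String, blockId: String) {
  def toByteBuffer: ByteBuffer = {
    val app = appId.getBytes(UTF_8)
    val blk = blockId.getBytes(UTF_8)
    val buf = ByteBuffer.allocate(8 + app.length + blk.length)
    buf.putInt(app.length).put(app).putInt(blk.length).put(blk)
    buf.flip()
    buf
  }
}
{code}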






[jira] [Updated] (SPARK-4187) External shuffle service should not use Java serializer

2014-11-01 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson updated SPARK-4187:
--
  Component/s: Spark Core
Affects Version/s: 1.2.0

> External shuffle service should not use Java serializer
> ---
>
> Key: SPARK-4187
> URL: https://issues.apache.org/jira/browse/SPARK-4187
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.2.0
>Reporter: Aaron Davidson
>
> The Java serializer is not a good solution for talking between JVMs, as its 
> format may change between versions of the code even if the class itself does 
> not change in an incompatible way.






[jira] [Created] (SPARK-4183) Enable Netty-based BlockTransferService by default

2014-11-01 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-4183:
-

 Summary: Enable Netty-based BlockTransferService by default
 Key: SPARK-4183
 URL: https://issues.apache.org/jira/browse/SPARK-4183
 Project: Spark
  Issue Type: Bug
Reporter: Aaron Davidson
Assignee: Aaron Davidson


Spark's NettyBlockTransferService relies on Netty to achieve high performance 
and zero-copy IO, which is a big simplification over the manual connection 
management that's done in today's NioBlockTransferService. Additionally, the 
core functionality of the NettyBlockTransferService has been extracted out of 
spark core into the "network" package, with the intention of reusing this code 
for SPARK-3796 ([PR 
#3001|https://github.com/apache/spark/pull/3001/files#diff-54]).

We should turn NettyBlockTransferService on by default in order to improve 
debuggability and stability of the network layer (which has historically been 
more challenging with the current BlockTransferService).
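
A sketch of opting in ahead of the default change (the setting name is hedged as the 1.2-era switch between the two transfer services):

{code}
import org.apache.spark.SparkConf

// Assumed: select the Netty-based transfer service explicitly.
val conf = new SparkConf()
  .set("spark.shuffle.blockTransferService", "netty")
{code}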






[jira] [Commented] (SPARK-4125) Serializer should work on ManagedBuffers as well

2014-10-28 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4125?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187696#comment-14187696
 ] 

Aaron Davidson commented on SPARK-4125:
---

cc [~rxin]

> Serializer should work on ManagedBuffers as well
> 
>
> Key: SPARK-4125
> URL: https://issues.apache.org/jira/browse/SPARK-4125
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Aaron Davidson
>
> As an API nice-to-have, we should probably try to make ManagedBuffers a 
> first-class citizen throughout Spark to minimize the allocation of 
> unnecessary buffers. As part of this, we should make Serializer accept 
> ManagedBuffer instead of just ByteBuffer.
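
A hypothetical sketch of the API shape being suggested (names and signatures are illustrative only, not Spark's Serializer):

{code}
import java.nio.ByteBuffer

// Illustrative stand-ins: a managed buffer and a serializer that can consume it
// directly instead of requiring a ByteBuffer up front.
trait ManagedBufferLike { def nioByteBuffer(): ByteBuffer }

trait BufferAwareSerializerInstance {
  def deserialize[T](bytes: ByteBuffer): T
  // the nice-to-have from this ticket: accept the managed buffer directly
  def deserialize[T](buffer: ManagedBufferLike): T = deserialize(buffer.nioByteBuffer())
}
{code}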






[jira] [Created] (SPARK-4125) Serializer should work on ManagedBuffers as well

2014-10-28 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-4125:
-

 Summary: Serializer should work on ManagedBuffers as well
 Key: SPARK-4125
 URL: https://issues.apache.org/jira/browse/SPARK-4125
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Aaron Davidson


As an API nice-to-have, we should probably try to make ManagedBuffers a 
first-class citizen throughout Spark to minimize the allocation of unnecessary 
buffers. As part of this, we should make Serializer accept ManagedBuffer 
instead of just ByteBuffer.






[jira] [Resolved] (SPARK-4084) Reuse sort key in Sorter

2014-10-28 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson resolved SPARK-4084.
---
   Resolution: Fixed
Fix Version/s: 1.2.0

> Reuse sort key in Sorter
> 
>
> Key: SPARK-4084
> URL: https://issues.apache.org/jira/browse/SPARK-4084
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
> Fix For: 1.2.0
>
>
> Sorter uses generic-typed key for sorting. When data is large, it creates 
> lots of key objects, which is not efficient. We should reuse the key in 
> Sorter for memory efficiency. This change is part of the petabyte sort 
> implementation from [~rxin].
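
A hedged sketch of the key-reuse idea (a mutable holder filled in per record; this mirrors, but does not reproduce, the Sorter API):

{code}
// Illustrative only: extract the sort key into a caller-provided mutable holder,
// so sorting N records does not allocate N key objects.
final class MutableKey(var value: Int)

def getKey(data: Array[Long], pos: Int, reuse: MutableKey): MutableKey = {
  reuse.value = (data(pos) >>> 32).toInt  // in this sketch the key is the upper 32 bits
  reuse
}
{code}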






[jira] [Created] (SPARK-4106) Shuffle write and spill to disk metrics are incorrect

2014-10-27 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-4106:
-

 Summary: Shuffle write and spill to disk metrics are incorrect
 Key: SPARK-4106
 URL: https://issues.apache.org/jira/browse/SPARK-4106
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.2.0
Reporter: Aaron Davidson
Priority: Critical


I have encountered a job which has some disk spilled (memory) but the disk 
spilled (disk) is 0, as well as the shuffle write. If I switch to hash-based 
shuffle, where there happens to be no disk spilling, then the shuffle write is 
correct.

I can get more info on a workload to repro this situation, but perhaps that 
state of events is sufficient.






[jira] [Created] (SPARK-3994) countByKey / countByValue do not go through Aggregator

2014-10-17 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-3994:
-

 Summary: countByKey / countByValue do not go through Aggregator
 Key: SPARK-3994
 URL: https://issues.apache.org/jira/browse/SPARK-3994
 Project: Spark
  Issue Type: Bug
Reporter: Aaron Davidson


The implementations of these methods are historical remnants of Spark from a 
time when the shuffle may have been nonexistent. Now, they can be simplified by 
plugging into reduceByKey(), potentially seeing performance and stability 
improvements.
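
The simplification being proposed is roughly this (hedged sketch, not the exact patch):

{code}
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

// Sketch: route countByKey through the shuffle-aware reduceByKey path instead
// of a bespoke implementation.
def countByKeyViaReduce[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)]): Map[K, Long] =
  rdd.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap
{code}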






[jira] [Assigned] (SPARK-3994) countByKey / countByValue do not go through Aggregator

2014-10-17 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson reassigned SPARK-3994:
-

Assignee: Aaron Davidson

> countByKey / countByValue do not go through Aggregator
> --
>
> Key: SPARK-3994
> URL: https://issues.apache.org/jira/browse/SPARK-3994
> Project: Spark
>  Issue Type: Bug
>Reporter: Aaron Davidson
>Assignee: Aaron Davidson
>
> The implementations of these methods are historical remnants of Spark from a 
> time when the shuffle may have been nonexistent. Now, they can be simplified 
> by plugging into reduceByKey(), potentially seeing performance and stability 
> improvements.






[jira] [Commented] (SPARK-3950) Completed time is blank for some successful tasks

2014-10-14 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14171563#comment-14171563
 ] 

Aaron Davidson commented on SPARK-3950:
---

[~andrewor14]

> Completed time is blank for some successful tasks
> -
>
> Key: SPARK-3950
> URL: https://issues.apache.org/jira/browse/SPARK-3950
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.1.1
>Reporter: Aaron Davidson
>
> In the Spark web UI, some tasks appear to have a blank Duration column. It's 
> possible that these ran for <.5 seconds, but if so, we should use 
> milliseconds like we do for GC time.






[jira] [Created] (SPARK-3950) Completed time is blank for some successful tasks

2014-10-14 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-3950:
-

 Summary: Completed time is blank for some successful tasks
 Key: SPARK-3950
 URL: https://issues.apache.org/jira/browse/SPARK-3950
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.1.1
Reporter: Aaron Davidson


In the Spark web UI, some tasks appear to have a blank Duration column. It's 
possible that these ran for <.5 seconds, but if so, we should use milliseconds 
like we do for GC time.






[jira] [Commented] (SPARK-3923) All Standalone Mode services time out with each other

2014-10-13 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169044#comment-14169044
 ] 

Aaron Davidson commented on SPARK-3923:
---

I did a little digging hoping to find some post about this, but had no particular luck. 
I did find [this 
post|https://groups.google.com/forum/#!topic/akka-user/X3xzpTCbEFs] which 
recommends using an interval time < pause, which we are not doing. This doesn't 
seem to explain the services all timing out after the heartbeat interval time 
(which is currently 1000 seconds), but may be good to know in the future.
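
For context, a sketch of the relevant knobs (values here are illustrative, not recommended defaults; the point from the linked post is to keep the interval below the acceptable pause):

{code}
import org.apache.spark.SparkConf

// Assumed settings of that era: heartbeat interval and acceptable pause, in seconds.
val conf = new SparkConf()
  .set("spark.akka.heartbeat.interval", "100")
  .set("spark.akka.heartbeat.pauses", "600")
{code}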

> All Standalone Mode services time out with each other
> -
>
> Key: SPARK-3923
> URL: https://issues.apache.org/jira/browse/SPARK-3923
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.2.0
>Reporter: Aaron Davidson
>Priority: Blocker
>
> I'm seeing an issue where it seems that components in Standalone Mode 
> (Worker, Master, Driver, and Executor) all seem to time out with each other 
> after around 1000 seconds. Here is an example log:
> {code}
> 14/10/13 06:43:55 INFO Master: Registering worker 
> ip-10-0-147-189.us-west-2.compute.internal:38922 with 4 cores, 29.0 GB RAM
> 14/10/13 06:43:55 INFO Master: Registering worker 
> ip-10-0-175-214.us-west-2.compute.internal:42918 with 4 cores, 59.0 GB RAM
> 14/10/13 06:43:56 INFO Master: Registering app Databricks Shell
> 14/10/13 06:43:56 INFO Master: Registered app Databricks Shell with ID 
> app-20141013064356-
> ... precisely 1000 seconds later ...
> 14/10/13 07:00:35 WARN ReliableDeliverySupervisor: Association with remote 
> system 
> [akka.tcp://sparkwor...@ip-10-0-147-189.us-west-2.compute.internal:38922] has 
> failed, address is now gated for [5000] ms. Reason is: [Disassociated].
> 14/10/13 07:00:35 INFO Master: 
> akka.tcp://sparkwor...@ip-10-0-147-189.us-west-2.compute.internal:38922 got 
> disassociated, removing it.
> 14/10/13 07:00:35 INFO LocalActorRef: Message 
> [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from 
> Actor[akka://sparkMaster/deadLetters] to 
> Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.147.189%3A54956-1#1529980245]
>  was not delivered. [2] dead letters encountered. This logging can be turned 
> off or adjusted with configuration settings 'akka.log-dead-letters' and 
> 'akka.log-dead-letters-during-shutdown'.
> 14/10/13 07:00:35 INFO Master: 
> akka.tcp://sparkwor...@ip-10-0-175-214.us-west-2.compute.internal:42918 got 
> disassociated, removing it.
> 14/10/13 07:00:35 INFO Master: Removing worker 
> worker-20141013064354-ip-10-0-175-214.us-west-2.compute.internal-42918 on 
> ip-10-0-175-214.us-west-2.compute.internal:42918
> 14/10/13 07:00:35 INFO Master: Telling app of lost executor: 1
> 14/10/13 07:00:35 INFO Master: 
> akka.tcp://sparkwor...@ip-10-0-175-214.us-west-2.compute.internal:42918 got 
> disassociated, removing it.
> 14/10/13 07:00:35 WARN ReliableDeliverySupervisor: Association with remote 
> system 
> [akka.tcp://sparkwor...@ip-10-0-175-214.us-west-2.compute.internal:42918] has 
> failed, address is now gated for [5000] ms. Reason is: [Disassociated].
> 14/10/13 07:00:35 INFO LocalActorRef: Message 
> [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from 
> Actor[akka://sparkMaster/deadLetters] to 
> Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.214%3A35958-2#314633324]
>  was not delivered. [3] dead letters encountered. This logging can be turned 
> off or adjusted with configuration settings 'akka.log-dead-letters' and 
> 'akka.log-dead-letters-during-shutdown'.
> 14/10/13 07:00:35 INFO LocalActorRef: Message 
> [akka.remote.transport.AssociationHandle$Disassociated] from 
> Actor[akka://sparkMaster/deadLetters] to 
> Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.214%3A35958-2#314633324]
>  was not delivered. [4] dead letters encountered. This logging can be turned 
> off or adjusted with configuration settings 'akka.log-dead-letters' and 
> 'akka.log-dead-letters-during-shutdown'.
> 14/10/13 07:00:36 INFO ProtocolStateActor: No response from remote. Handshake 
> timed out or transport failure detector triggered.
> 14/10/13 07:00:36 INFO Master: 
> akka.tcp://sparkdri...@ip-10-0-175-215.us-west-2.compute.internal:58259 got 
> disassociated, removing it.
> 14/10/13 07:00:36 INFO LocalActorRef: Message 
> [akka.remote.transport.AssociationHandle$InboundPayload] from 
> Actor[akka://sparkMaster/deadLetters] to 
> Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.215%3A41987-3#194

[jira] [Created] (SPARK-3923) All Standalone Mode services time out with each other

2014-10-13 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-3923:
-

 Summary: All Standalone Mode services time out with each other
 Key: SPARK-3923
 URL: https://issues.apache.org/jira/browse/SPARK-3923
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.2.0
Reporter: Aaron Davidson
Priority: Blocker


I'm seeing an issue where it seems that components in Standalone Mode (Worker, 
Master, Driver, and Executor) all seem to time out with each other after around 
1000 seconds. Here is an example log:

{code}
14/10/13 06:43:55 INFO Master: Registering worker 
ip-10-0-147-189.us-west-2.compute.internal:38922 with 4 cores, 29.0 GB RAM
14/10/13 06:43:55 INFO Master: Registering worker 
ip-10-0-175-214.us-west-2.compute.internal:42918 with 4 cores, 59.0 GB RAM
14/10/13 06:43:56 INFO Master: Registering app Databricks Shell
14/10/13 06:43:56 INFO Master: Registered app Databricks Shell with ID 
app-20141013064356-

... precisely 1000 seconds later ...

14/10/13 07:00:35 WARN ReliableDeliverySupervisor: Association with remote 
system 
[akka.tcp://sparkwor...@ip-10-0-147-189.us-west-2.compute.internal:38922] has 
failed, address is now gated for [5000] ms. Reason is: [Disassociated].
14/10/13 07:00:35 INFO Master: 
akka.tcp://sparkwor...@ip-10-0-147-189.us-west-2.compute.internal:38922 got 
disassociated, removing it.
14/10/13 07:00:35 INFO LocalActorRef: Message 
[akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from 
Actor[akka://sparkMaster/deadLetters] to 
Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.147.189%3A54956-1#1529980245]
 was not delivered. [2] dead letters encountered. This logging can be turned 
off or adjusted with configuration settings 'akka.log-dead-letters' and 
'akka.log-dead-letters-during-shutdown'.
14/10/13 07:00:35 INFO Master: 
akka.tcp://sparkwor...@ip-10-0-175-214.us-west-2.compute.internal:42918 got 
disassociated, removing it.
14/10/13 07:00:35 INFO Master: Removing worker 
worker-20141013064354-ip-10-0-175-214.us-west-2.compute.internal-42918 on 
ip-10-0-175-214.us-west-2.compute.internal:42918
14/10/13 07:00:35 INFO Master: Telling app of lost executor: 1
14/10/13 07:00:35 INFO Master: 
akka.tcp://sparkwor...@ip-10-0-175-214.us-west-2.compute.internal:42918 got 
disassociated, removing it.
14/10/13 07:00:35 WARN ReliableDeliverySupervisor: Association with remote 
system 
[akka.tcp://sparkwor...@ip-10-0-175-214.us-west-2.compute.internal:42918] has 
failed, address is now gated for [5000] ms. Reason is: [Disassociated].
14/10/13 07:00:35 INFO LocalActorRef: Message 
[akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from 
Actor[akka://sparkMaster/deadLetters] to 
Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.214%3A35958-2#314633324]
 was not delivered. [3] dead letters encountered. This logging can be turned 
off or adjusted with configuration settings 'akka.log-dead-letters' and 
'akka.log-dead-letters-during-shutdown'.
14/10/13 07:00:35 INFO LocalActorRef: Message 
[akka.remote.transport.AssociationHandle$Disassociated] from 
Actor[akka://sparkMaster/deadLetters] to 
Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.214%3A35958-2#314633324]
 was not delivered. [4] dead letters encountered. This logging can be turned 
off or adjusted with configuration settings 'akka.log-dead-letters' and 
'akka.log-dead-letters-during-shutdown'.
14/10/13 07:00:36 INFO ProtocolStateActor: No response from remote. Handshake 
timed out or transport failure detector triggered.
14/10/13 07:00:36 INFO Master: 
akka.tcp://sparkdri...@ip-10-0-175-215.us-west-2.compute.internal:58259 got 
disassociated, removing it.
14/10/13 07:00:36 INFO LocalActorRef: Message 
[akka.remote.transport.AssociationHandle$InboundPayload] from 
Actor[akka://sparkMaster/deadLetters] to 
Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.0.175.215%3A41987-3#1944377249]
 was not delivered. [5] dead letters encountered. This logging can be turned 
off or adjusted with configuration settings 'akka.log-dead-letters' and 
'akka.log-dead-letters-during-shutdown'.
14/10/13 07:00:36 INFO Master: Removing app app-20141013064356-
14/10/13 07:00:36 WARN ReliableDeliverySupervisor: Association with remote 
system 
[akka.tcp://sparkdri...@ip-10-0-175-215.us-west-2.compute.internal:58259] has 
failed, address is now gated for [5000] ms. Reason is: [Disassociated].
14/10/13 07:00:36 INFO LocalActorRef: Message 
[akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from 
Actor[akka://sparkMaster/deadLetters] to 
Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2

[jira] [Updated] (SPARK-3921) WorkerWatcher in Standalone mode fail to come up due to invalid workerUrl

2014-10-12 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson updated SPARK-3921:
--
Description: 
As of [this 
commit|https://github.com/apache/spark/commit/79e45c9323455a51f25ed9acd0edd8682b4bbb88#diff-79391110e9f26657e415aa169a004998R153],
 standalone mode appears to have lost its WorkerWatcher, because of the swapped 
workerUrl and appId parameters. We still put workerUrl before appId when we 
start standalone executors, and the Executor misinterprets the appId as the 
workerUrl and fails to create the WorkerWatcher.

Note that this does not seem to crash the Standalone-mode executor, despite the 
WorkerWatcher failing in its constructor.

  was:As of [this 
commit|https://github.com/apache/spark/commit/79e45c9323455a51f25ed9acd0edd8682b4bbb88#diff-79391110e9f26657e415aa169a004998R153],
 standalone mode appears to be broken, because of the swapped workerUrl and 
appId parameters. We still put workerUrl before appId when we start standalone 
executors, and the Executor misinterprets the appId as the workerUrl and fails 
to create the WorkerWatcher.

Summary: WorkerWatcher in Standalone mode fail to come up due to 
invalid workerUrl  (was: Executors in Standalone mode fail to come up due to 
invalid workerUrl)

> WorkerWatcher in Standalone mode fail to come up due to invalid workerUrl
> -
>
> Key: SPARK-3921
> URL: https://issues.apache.org/jira/browse/SPARK-3921
> Project: Spark
>  Issue Type: Bug
>Reporter: Aaron Davidson
>Assignee: Aaron Davidson
>Priority: Critical
>
> As of [this 
> commit|https://github.com/apache/spark/commit/79e45c9323455a51f25ed9acd0edd8682b4bbb88#diff-79391110e9f26657e415aa169a004998R153],
>  standalone mode appears to have lost its WorkerWatcher, because of the 
> swapped workerUrl and appId parameters. We still put workerUrl before appId 
> when we start standalone executors, and the Executor misinterprets the appId 
> as the workerUrl and fails to create the WorkerWatcher.
> Note that this does not seem to crash the Standalone-mode executor, despite 
> the WorkerWatcher failing in its constructor.






[jira] [Created] (SPARK-3921) Executors in Standalone mode fail to come up due to invalid workerUrl

2014-10-12 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-3921:
-

 Summary: Executors in Standalone mode fail to come up due to 
invalid workerUrl
 Key: SPARK-3921
 URL: https://issues.apache.org/jira/browse/SPARK-3921
 Project: Spark
  Issue Type: Bug
Reporter: Aaron Davidson
Assignee: Aaron Davidson
Priority: Critical


As of [this 
commit|https://github.com/apache/spark/commit/79e45c9323455a51f25ed9acd0edd8682b4bbb88#diff-79391110e9f26657e415aa169a004998R153],
 standalone mode appears to be broken, because of the swapped workerUrl and 
appId parameters. We still put workerUrl before appId when we start standalone 
executors, and the Executor misinterprets the appId as the workerUrl and fails 
to create the WorkerWatcher.






[jira] [Commented] (SPARK-3889) JVM dies with SIGBUS, resulting in ConnectionManager failed ACK

2014-10-10 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14167647#comment-14167647
 ] 

Aaron Davidson commented on SPARK-3889:
---

Sorry, it was not linked: https://github.com/apache/spark/pull/2742

> JVM dies with SIGBUS, resulting in ConnectionManager failed ACK
> ---
>
> Key: SPARK-3889
> URL: https://issues.apache.org/jira/browse/SPARK-3889
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Aaron Davidson
>Assignee: Aaron Davidson
>Priority: Critical
> Fix For: 1.2.0
>
>
> Here's the first part of the core dump, possibly caused by a job which 
> shuffles a lot of very small partitions.
> {code}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGBUS (0x7) at pc=0x7fa5885fcdb0, pid=488, tid=140343502632704
> #
> # JRE version: 7.0_25-b30
> # Java VM: OpenJDK 64-Bit Server VM (23.7-b01 mixed mode linux-amd64 
> compressed oops)
> # Problematic frame:
> # v  ~StubRoutines::jbyte_disjoint_arraycopy
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core 
> dumping, try "ulimit -c unlimited" before starting Java again
> #
> # If you would like to submit a bug report, please include
> # instructions on how to reproduce the bug and visit:
> #   https://bugs.launchpad.net/ubuntu/+source/openjdk-7/
> #
> ---  T H R E A D  ---
> Current thread (0x7fa4b0631000):  JavaThread "Executor task launch 
> worker-170" daemon [_thread_in_Java, id=6783, 
> stack(0x7fa4448ef000,0x7fa4449f)]
> siginfo:si_signo=SIGBUS: si_errno=0, si_code=2 (BUS_ADRERR), 
> si_addr=0x7fa428f79000
> {code}
> Here is the only useful content I can find related to JVM and SIGBUS from 
> Google: https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=976664
> It appears it may be related to disposing byte buffers, which we do in the 
> ConnectionManager -- we mmap shuffle files via ManagedBuffer and dispose of 
> them in BufferMessage.
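
For illustration, the disposal referred to above looks roughly like this (a sketch using the JDK-internal cleaner; reading the buffer after this point touches an unmapped page, which surfaces as a hard fault such as SIGBUS rather than an exception):

{code}
import java.nio.MappedByteBuffer

// Sketch only: forcibly unmap a MappedByteBuffer via the internal cleaner.
def dispose(buffer: MappedByteBuffer): Unit = buffer match {
  case db: sun.nio.ch.DirectBuffer if db.cleaner() != null => db.cleaner().clean()
  case _ => // nothing to do otherwise
}
{code}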






[jira] [Updated] (SPARK-3889) JVM dies with SIGBUS, resulting in ConnectionManager failed ACK

2014-10-09 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson updated SPARK-3889:
--
Description: 
Here's the first part of the core dump, possibly caused by a job which shuffles 
a lot of very small partitions.

{code}
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0x7fa5885fcdb0, pid=488, tid=140343502632704
#
# JRE version: 7.0_25-b30
# Java VM: OpenJDK 64-Bit Server VM (23.7-b01 mixed mode linux-amd64 compressed 
oops)
# Problematic frame:
# v  ~StubRoutines::jbyte_disjoint_arraycopy
#
# Failed to write core dump. Core dumps have been disabled. To enable core 
dumping, try "ulimit -c unlimited" before starting Java again
#
# If you would like to submit a bug report, please include
# instructions on how to reproduce the bug and visit:
#   https://bugs.launchpad.net/ubuntu/+source/openjdk-7/
#

---  T H R E A D  ---

Current thread (0x7fa4b0631000):  JavaThread "Executor task launch 
worker-170" daemon [_thread_in_Java, id=6783, 
stack(0x7fa4448ef000,0x7fa4449f)]

siginfo:si_signo=SIGBUS: si_errno=0, si_code=2 (BUS_ADRERR), 
si_addr=0x7fa428f79000
{code}

Here is the only useful content I can find related to JVM and SIGBUS from 
Google: https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=976664

It appears it may be related to disposing byte buffers, which we do in the 
ConnectionManager -- we mmap shuffle files via ManagedBuffer and dispose of 
them in BufferMessage.

  was:
Here's the first part of the core dump:

{code}
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0x7fa5885fcdb0, pid=488, tid=140343502632704
#
# JRE version: 7.0_25-b30
# Java VM: OpenJDK 64-Bit Server VM (23.7-b01 mixed mode linux-amd64 compressed 
oops)
# Problematic frame:
# v  ~StubRoutines::jbyte_disjoint_arraycopy
#
# Failed to write core dump. Core dumps have been disabled. To enable core 
dumping, try "ulimit -c unlimited" before starting Java again
#
# If you would like to submit a bug report, please include
# instructions on how to reproduce the bug and visit:
#   https://bugs.launchpad.net/ubuntu/+source/openjdk-7/
#

---  T H R E A D  ---

Current thread (0x7fa4b0631000):  JavaThread "Executor task launch 
worker-170" daemon [_thread_in_Java, id=6783, 
stack(0x7fa4448ef000,0x7fa4449f)]

siginfo:si_signo=SIGBUS: si_errno=0, si_code=2 (BUS_ADRERR), 
si_addr=0x7fa428f79000
{code}

Here is the only useful content I can find related to JVM and SIGBUS from 
Google: https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=976664

It appears it may be related to disposing byte buffers, which we do in the 
ConnectionManager -- we mmap shuffle files via ManagedBuffer and dispose of 
them in BufferMessage.


> JVM dies with SIGBUS, resulting in ConnectionManager failed ACK
> ---
>
> Key: SPARK-3889
> URL: https://issues.apache.org/jira/browse/SPARK-3889
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Aaron Davidson
>Assignee: Aaron Davidson
>Priority: Critical
>
> Here's the first part of the core dump, possibly caused by a job which 
> shuffles a lot of very small partitions.
> {code}
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGBUS (0x7) at pc=0x7fa5885fcdb0, pid=488, tid=140343502632704
> #
> # JRE version: 7.0_25-b30
> # Java VM: OpenJDK 64-Bit Server VM (23.7-b01 mixed mode linux-amd64 
> compressed oops)
> # Problematic frame:
> # v  ~StubRoutines::jbyte_disjoint_arraycopy
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core 
> dumping, try "ulimit -c unlimited" before starting Java again
> #
> # If you would like to submit a bug report, please include
> # instructions on how to reproduce the bug and visit:
> #   https://bugs.launchpad.net/ubuntu/+source/openjdk-7/
> #
> ---  T H R E A D  ---
> Current thread (0x7fa4b0631000):  JavaThread "Executor task launch 
> worker-170" daemon [_thread_in_Java, id=6783, 
> stack(0x7fa4448ef000,0x7fa4449f)]
> siginfo:si_signo=SIGBUS: si_errno=0, si_code=2 (BUS_ADRERR), 
> si_addr=0x7fa428f79000
> {code}
> Here is the only useful content I can find related to JVM and SIGBUS from 
> Google: https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=976664
> It appears it may be related to disposing byte buffers, which we do in the 
> ConnectionManager -- we mmap shuffle files via ManagedBuffer and dispose of 
> them in BufferMessage.





[jira] [Created] (SPARK-3889) JVM dies with SIGBUS, resulting in ConnectionManager failed ACK

2014-10-09 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-3889:
-

 Summary: JVM dies with SIGBUS, resulting in ConnectionManager 
failed ACK
 Key: SPARK-3889
 URL: https://issues.apache.org/jira/browse/SPARK-3889
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Aaron Davidson
Assignee: Aaron Davidson
Priority: Critical


Here's the first part of the core dump:

{code}
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGBUS (0x7) at pc=0x7fa5885fcdb0, pid=488, tid=140343502632704
#
# JRE version: 7.0_25-b30
# Java VM: OpenJDK 64-Bit Server VM (23.7-b01 mixed mode linux-amd64 compressed 
oops)
# Problematic frame:
# v  ~StubRoutines::jbyte_disjoint_arraycopy
#
# Failed to write core dump. Core dumps have been disabled. To enable core 
dumping, try "ulimit -c unlimited" before starting Java again
#
# If you would like to submit a bug report, please include
# instructions on how to reproduce the bug and visit:
#   https://bugs.launchpad.net/ubuntu/+source/openjdk-7/
#

---  T H R E A D  ---

Current thread (0x7fa4b0631000):  JavaThread "Executor task launch 
worker-170" daemon [_thread_in_Java, id=6783, 
stack(0x7fa4448ef000,0x7fa4449f)]

siginfo:si_signo=SIGBUS: si_errno=0, si_code=2 (BUS_ADRERR), 
si_addr=0x7fa428f79000
{code}

Here is the only useful content I can find related to JVM and SIGBUS from 
Google: https://bugzilla.redhat.com/show_bug.cgi?format=multiple&id=976664

It appears it may be related to disposing byte buffers, which we do in the 
ConnectionManager -- we mmap shuffle files via ManagedBuffer and dispose of 
them in BufferMessage.






[jira] [Updated] (SPARK-3796) Create shuffle service for external block storage

2014-10-09 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3796?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson updated SPARK-3796:
--
Description: 
This task will be broken up into two parts -- the first, being to refactor our 
internal shuffle service to use a BlockTransferService which we can easily 
extract out into its own service, and then the second is to actually do the 
extraction.

Here is the design document for the low-level service, nicknamed "Sluice", on 
top of which will be Spark's BlockTransferService API:
https://docs.google.com/document/d/1zKf3qloBu3dmv2AFyQTwEpumWRPUT5bcAUKB5PGNfx0

> Create shuffle service for external block storage
> -
>
> Key: SPARK-3796
> URL: https://issues.apache.org/jira/browse/SPARK-3796
> Project: Spark
>  Issue Type: Sub-task
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Aaron Davidson
>
> This task will be broken up into two parts -- the first, being to refactor 
> our internal shuffle service to use a BlockTransferService which we can 
> easily extract out into its own service, and then the second is to actually 
> do the extraction.
> Here is the design document for the low-level service, nicknamed "Sluice", on 
> top of which will be Spark's BlockTransferService API:
> https://docs.google.com/document/d/1zKf3qloBu3dmv2AFyQTwEpumWRPUT5bcAUKB5PGNfx0






[jira] [Commented] (SPARK-3805) Enable Standalone worker cleanup by default

2014-10-08 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14164485#comment-14164485
 ] 

Aaron Davidson commented on SPARK-3805:
---

Upon further thought, I also agree with Patrick. My initial inclination was to 
go back to the original behavior after the issue was resolved, but it seems 
more dangerous to delete user data than to not delete it. A user who sees an 
accumulation of state will look for a way to remove it (and find this feature), 
but any user who needs to find now-deleted data is going to be irreversibly sad.

> Enable Standalone worker cleanup by default
> ---
>
> Key: SPARK-3805
> URL: https://issues.apache.org/jira/browse/SPARK-3805
> Project: Spark
>  Issue Type: Task
>  Components: Deploy
>Reporter: Andrew Ash
>
> Now that SPARK-1860 is fixed we should be able to turn on 
> {{spark.worker.cleanup.enabled}} by default






[jira] [Commented] (SPARK-1860) Standalone Worker cleanup should not clean up running executors

2014-10-04 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14159343#comment-14159343
 ] 

Aaron Davidson commented on SPARK-1860:
---

Agreed, that sounds good. Would you or [~mccheah] be able to create a quick PR 
for this?

> Standalone Worker cleanup should not clean up running executors
> ---
>
> Key: SPARK-1860
> URL: https://issues.apache.org/jira/browse/SPARK-1860
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.0.0
>Reporter: Aaron Davidson
>Priority: Blocker
>
> The default values of the standalone worker cleanup code clean up all 
> application data every 7 days. This includes jars that were added to any 
> executors that happen to be running for longer than 7 days, hitting streaming 
> jobs especially hard.
> Executor's log/data folders should not be cleaned up if they're still 
> running. Until then, this behavior should not be enabled by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1860) Standalone Worker cleanup should not clean up running executors

2014-10-03 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson resolved SPARK-1860.
---
Resolution: Fixed

Fixed by mccheah in https://github.com/apache/spark/pull/2609

> Standalone Worker cleanup should not clean up running executors
> ---
>
> Key: SPARK-1860
> URL: https://issues.apache.org/jira/browse/SPARK-1860
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.0.0
>Reporter: Aaron Davidson
>Priority: Blocker
>
> The default values of the standalone worker cleanup code cleanup all 
> application data every 7 days. This includes jars that were added to any 
> executors that happen to be running for longer than 7 days, hitting streaming 
> jobs especially hard.
> Executor's log/data folders should not be cleaned up if they're still 
> running. Until then, this behavior should not be enabled by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1860) Standalone Worker cleanup should not clean up running executors

2014-10-01 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14155303#comment-14155303
 ] 

Aaron Davidson commented on SPARK-1860:
---

The Worker itself is solely a Standalone mode construct, so AFAIK, this is not 
an issue.

> Standalone Worker cleanup should not clean up running executors
> ---
>
> Key: SPARK-1860
> URL: https://issues.apache.org/jira/browse/SPARK-1860
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.0.0
>Reporter: Aaron Davidson
>Priority: Blocker
>
> The default values of the standalone worker cleanup code cleanup all 
> application data every 7 days. This includes jars that were added to any 
> executors that happen to be running for longer than 7 days, hitting streaming 
> jobs especially hard.
> Executor's log/data folders should not be cleaned up if they're still 
> running. Until then, this behavior should not be enabled by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1860) Standalone Worker cleanup should not clean up running executors

2014-09-30 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14154228#comment-14154228
 ] 

Aaron Davidson commented on SPARK-1860:
---

Your logic SGTM, but I would add one additional check to avoid deleting the 
directory for an Application which still has running Executors on that node, 
just to make absolutely sure that we don't delete app directories that just 
happen to sit idle for a while. This check can be performed by iterating over 
the "executors" map in Worker.scala and matching the appId with the app 
directory's name.
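
For illustration, a minimal sketch of that extra check (names are illustrative; only 
the idea of matching the appId against the Worker's executors map comes from the 
comment above, and the use of lastModified() as an age proxy is an assumption):

{code}
// Hedged sketch of the proposed guard, not the actual Worker.scala code.
// `runningAppIds` stands in for the appIds of the Worker's `executors` map entries,
// and `ttlMs` for the existing cleanup TTL.
def shouldCleanUp(appDir: java.io.File,
                  runningAppIds: Set[String],
                  ttlMs: Long): Boolean = {
  val isRunning = runningAppIds.contains(appDir.getName)   // app dirs are named by appId
  val isExpired = System.currentTimeMillis() - appDir.lastModified() > ttlMs
  !isRunning && isExpired
}
{code}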


> Standalone Worker cleanup should not clean up running executors
> ---
>
> Key: SPARK-1860
> URL: https://issues.apache.org/jira/browse/SPARK-1860
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.0.0
>Reporter: Aaron Davidson
>Priority: Blocker
>
> The default values of the standalone worker cleanup code cleanup all 
> application data every 7 days. This includes jars that were added to any 
> executors that happen to be running for longer than 7 days, hitting streaming 
> jobs especially hard.
> Executor's log/data folders should not be cleaned up if they're still 
> running. Until then, this behavior should not be enabled by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1860) Standalone Worker cleanup should not clean up running executors

2014-09-29 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152839#comment-14152839
 ] 

Aaron Davidson commented on SPARK-1860:
---

The Executor could clean up its own jars when it terminates normally; that seems 
fine. The impact of this seems limited, though, and it's a good idea to limit the 
scope of shutdown hooks as much as possible.

There are three classes of things to delete:
1. Shuffle files / block manager blocks -- large -- deleted by graceful 
Executor termination. Can be deleted immediately.
2. Uploaded jars / files -- usually small -- deleted by Worker cleanup. Can be 
deleted immediately.
3. Logs -- small to medium -- deleted by Worker cleanup. Should not be deleted 
immediately.

Number 1 is the most critical in terms of impact on the system. Numbers 2 and 3 are 
of the same order of magnitude in size, so cleaning up 2 but not 3 is not expected 
to improve the system's stability by more than a factor of roughly 2.

Note that the intentions of this particular JIRA are very simple: clean up 2 and 3 
for all executors several days after they have terminated, rather than several days 
after they started. If you wish to expand the scope of the Worker or Executor 
cleanup, that should be covered in a separate JIRA (which is welcome -- I just 
want to make sure we're on the same page about this particular issue!).

> Standalone Worker cleanup should not clean up running executors
> ---
>
> Key: SPARK-1860
> URL: https://issues.apache.org/jira/browse/SPARK-1860
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.0.0
>Reporter: Aaron Davidson
>Priority: Blocker
>
> The default values of the standalone worker cleanup code cleanup all 
> application data every 7 days. This includes jars that were added to any 
> executors that happen to be running for longer than 7 days, hitting streaming 
> jobs especially hard.
> Executor's log/data folders should not be cleaned up if they're still 
> running. Until then, this behavior should not be enabled by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1860) Standalone Worker cleanup should not clean up running executors

2014-09-29 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14152744#comment-14152744
 ] 

Aaron Davidson commented on SPARK-1860:
---

Note that there are two separate forms of cleanup: application data cleanup 
(jars and logs) and shuffle data cleanup. Standalone Worker cleanup deals with 
the former, Executor termination handlers deal with the latter. The purpose is 
not to deal with executors that have terminated ungracefully, but to actually 
clean up old application directories.

Here the idea is that a Worker may be running for a very long time (weeks, 
months) and over time accumulates hundreds of application directories. We want 
to delete these directories several days after the applications have terminated, 
at which point we presumably don't care about them anymore (today we clean them 
up whether or not they have terminated, which loses their jars and logs). We do 
not want to clean them up immediately after application termination.

The Worker performing shuffle data cleanup for ungracefully terminated 
Executors is not a bad idea, but is a (smallish) feature onto itself, as the 
Worker does not currently know where a particular Executor is storing its data.

> Standalone Worker cleanup should not clean up running executors
> ---
>
> Key: SPARK-1860
> URL: https://issues.apache.org/jira/browse/SPARK-1860
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.0.0
>Reporter: Aaron Davidson
>Priority: Blocker
>
> The default values of the standalone worker cleanup code cleanup all 
> application data every 7 days. This includes jars that were added to any 
> executors that happen to be running for longer than 7 days, hitting streaming 
> jobs especially hard.
> Executor's log/data folders should not be cleaned up if they're still 
> running. Until then, this behavior should not be enabled by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2973) Add a way to show tables without executing a job

2014-09-26 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson updated SPARK-2973:
--
Assignee: Michael Armbrust  (was: Cheng Lian)

> Add a way to show tables without executing a job
> 
>
> Key: SPARK-2973
> URL: https://issues.apache.org/jira/browse/SPARK-2973
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Aaron Davidson
>Assignee: Michael Armbrust
>Priority: Critical
> Fix For: 1.2.0
>
>
> Right now, sql("show tables").collect() will start a Spark job which shows up 
> in the UI. There should be a way to get these without this step.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-2973) Add a way to show tables without executing a job

2014-09-26 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson reopened SPARK-2973:
---

Reopening this because sql("show tables").take(1) still starts a job.
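
For reference, the calls in question look like this in the 1.x API (mirroring the 
snippets in this issue); both schedule a Spark job that shows up in the UI, and the 
ask is for a catalog-level listing that avoids a job entirely:

{code}
// Both of these launch a job today; neither should need one just to list table names.
val all   = sqlContext.sql("show tables").collect()
val first = sqlContext.sql("show tables").take(1)
{code}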

> Add a way to show tables without executing a job
> 
>
> Key: SPARK-2973
> URL: https://issues.apache.org/jira/browse/SPARK-2973
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Aaron Davidson
>Assignee: Cheng Lian
>Priority: Critical
> Fix For: 1.2.0
>
>
> Right now, sql("show tables").collect() will start a Spark job which shows up 
> in the UI. There should be a way to get these without this step.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3267) Deadlock between ScalaReflectionLock and Data type initialization

2014-09-24 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146039#comment-14146039
 ] 

Aaron Davidson commented on SPARK-3267:
---

I don't have it anymore, unfortunately. Michael and I did a little digging at 
the time, and I think we found the reason for the deadlock, shown in the stack 
traces above, but decided it was a very unlikely scenario. Indeed, the query 
did not consistently deadlock; this only occurred a single time.

> Deadlock between ScalaReflectionLock and Data type initialization
> -
>
> Key: SPARK-3267
> URL: https://issues.apache.org/jira/browse/SPARK-3267
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
>Reporter: Aaron Davidson
>Priority: Critical
>
> Deadlock here:
> {code}
> "Executor task launch worker-0" daemon prio=10 tid=0x7fab50036000 
> nid=0x27a in Object.wait() [0x7fab60c2e000
> ]
>java.lang.Thread.State: RUNNABLE
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.defaultPrimitive(CodeGenerator.scala:565)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
> a:202)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
> a:195)
> at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218)
> at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:4
> 93)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$Evaluate2$2.evaluateAs(CodeGenerator.scal
> a:175)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
> a:304)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
> a:195)
> at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218)
> at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:4
> 93)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
> a:314)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
> a:195)
> at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218)
> at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:4
> 93)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
> a:313)
> at 
> org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
> a:195)
> at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218)
> at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214)
> ...
> {code}
> and
> {code}
> "Executor task launch worker-2" daemon prio=10 tid=0x7fab100f0800 
> nid=0x27e in Object.wait() [0x7fab0eeec000
> ]
>java.lang.Thread.State: RUNNABLE
> at 
> org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:250)
> - locked <0x00064e5d9a48> (a 
> org.apache.spark.sql.catalyst.expressions.Cast)
> at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
> at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
> at 
> org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2$$anonfun$6.apply(ParquetTableOperations.
> scala:139)
> at 
> org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2$$anonfun$6.apply(ParquetTableOperations.
> scala:139)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> at scala.collection.AbstractTraversable.map(Traversable.scala:105)
> at 
> org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2.apply(ParquetTableOperations.scala:139)
> at 
> org.apache.spark.sql.parquet.ParquetTableScan$$anonfu

[jira] [Commented] (SPARK-3032) Potential bug when running sort-based shuffle with sorting using TimSort

2014-09-22 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14143792#comment-14143792
 ] 

Aaron Davidson commented on SPARK-3032:
---

[~matei] any thoughts on this issue?

> Potential bug when running sort-based shuffle with sorting using TimSort
> 
>
> Key: SPARK-3032
> URL: https://issues.apache.org/jira/browse/SPARK-3032
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle
>Affects Versions: 1.1.0
>Reporter: Saisai Shao
>
> When using SparkPerf's aggregate-by-key workload to test sort-based shuffle, 
> where the key and value types are (String, String), we always hit this issue:
> {noformat}
> java.lang.IllegalArgumentException: Comparison method violates its general 
> contract!
> at 
> org.apache.spark.util.collection.Sorter$SortState.mergeLo(Sorter.java:755)
> at 
> org.apache.spark.util.collection.Sorter$SortState.mergeAt(Sorter.java:493)
> at 
> org.apache.spark.util.collection.Sorter$SortState.mergeCollapse(Sorter.java:420)
> at 
> org.apache.spark.util.collection.Sorter$SortState.access$200(Sorter.java:294)
> at org.apache.spark.util.collection.Sorter.sort(Sorter.java:128)
> at 
> org.apache.spark.util.collection.SizeTrackingPairBuffer.destructiveSortedIterator(SizeTrackingPairBuffer.scala:83)
> at 
> org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:323)
> at 
> org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:271)
> at 
> org.apache.spark.util.collection.ExternalSorter.maybeSpill(ExternalSorter.scala:249)
> at 
> org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:220)
> at 
> org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:85)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:54)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
> at java.lang.Thread.run(Thread.java:722)
> {noformat}
> It seems the current partitionKeyComparator, which uses the hashcode of the 
> String as the key comparator, breaks some sorting contract. 
> I also tested with Int as the key type, and that passes the test, since the 
> hashcode of an Int is itself. So I suspect that partitionDiff + the hashcode 
> of a String may break the sorting contract.
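
For illustration, a hedged sketch of one common way a hashcode-based comparator can 
violate TimSort's contract; whether this is the exact failure mode of the 
partition+hashcode comparator here is an assumption:

{code}
// Subtracting hash codes can overflow Int, so the sign of the result -- and therefore
// the ordering -- is not transitive, which is exactly the contract TimSort verifies.
val broken = new java.util.Comparator[String] {
  def compare(a: String, b: String): Int = a.hashCode - b.hashCode   // overflow-prone
}
// A subtraction-free comparison keeps the ordering consistent.
val safer = new java.util.Comparator[String] {
  def compare(a: String, b: String): Int = Integer.compare(a.hashCode, b.hashCode)
}
{code}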



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3333) Large number of partitions causes OOM

2014-09-01 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14117903#comment-14117903
 ] 

Aaron Davidson commented on SPARK-3333:
---

Does anyone have a jmap from the driver during this time? Perhaps also a GC log? To be 
clear, a "jmap -histo:live" and a "jmap -histo" should be sufficient; a full heap dump 
isn't needed. I wonder if something changed with the MapOutputTracker, since that state 
grows with O(M * R), which is extremely high in this situation (on the order of 24,000 
map partitions x 24,000 reduce partitions, i.e. hundreds of millions of entries).

> Large number of partitions causes OOM
> -
>
> Key: SPARK-3333
> URL: https://issues.apache.org/jira/browse/SPARK-3333
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.1.0
> Environment: EC2; 1 master; 1 slave; {{m3.xlarge}} instances
>Reporter: Nicholas Chammas
>
> Here’s a repro for PySpark:
> {code}
> a = sc.parallelize(["Nick", "John", "Bob"])
> a = a.repartition(24000)
> a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
> {code}
> This code runs fine on 1.0.2. It returns the following result in just over a 
> minute:
> {code}
> [(4, 'NickJohn')]
> {code}
> However, when I try this with 1.1.0-rc3 on an identically-sized cluster, it 
> runs for a very, very long time (at least > 45 min) and then fails with 
> {{java.lang.OutOfMemoryError: Java heap space}}.
> Here is a stack trace taken from a run on 1.1.0-rc2:
> {code}
> >>> a = sc.parallelize(["Nick", "John", "Bob"])
> >>> a = a.repartition(24000)
> >>> a.keyBy(lambda x: len(x)).reduceByKey(lambda x,y: x + y).take(1)
> 14/08/29 21:53:40 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(0, ip-10-138-29-167.ec2.internal, 46252, 0) with no recent 
> heart beats: 175143ms exceeds 45000ms
> 14/08/29 21:53:50 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(10, ip-10-138-18-106.ec2.internal, 33711, 0) with no recent 
> heart beats: 175359ms exceeds 45000ms
> 14/08/29 21:54:02 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(19, ip-10-139-36-207.ec2.internal, 52208, 0) with no recent 
> heart beats: 173061ms exceeds 45000ms
> 14/08/29 21:54:13 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(5, ip-10-73-142-70.ec2.internal, 56162, 0) with no recent 
> heart beats: 176816ms exceeds 45000ms
> 14/08/29 21:54:22 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(7, ip-10-236-145-200.ec2.internal, 40959, 0) with no recent 
> heart beats: 182241ms exceeds 45000ms
> 14/08/29 21:54:40 WARN BlockManagerMasterActor: Removing BlockManager 
> BlockManagerId(4, ip-10-139-1-195.ec2.internal, 49221, 0) with no recent 
> heart beats: 178406ms exceeds 45000ms
> 14/08/29 21:54:41 ERROR Utils: Uncaught exception in thread Result resolver 
> thread-3
> java.lang.OutOfMemoryError: Java heap space
> at com.esotericsoftware.kryo.io.Input.readBytes(Input.java:296)
> at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:35)
> at 
> com.esotericsoftware.kryo.serializers.DefaultArraySerializers$ByteArraySerializer.read(DefaultArraySerializers.java:18)
> at com.esotericsoftware.kryo.Kryo.readObjectOrNull(Kryo.java:699)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.read(FieldSerializer.java:611)
> at 
> com.esotericsoftware.kryo.serializers.FieldSerializer.read(FieldSerializer.java:221)
> at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:729)
> at 
> org.apache.spark.serializer.KryoSerializerInstance.deserialize(KryoSerializer.scala:162)
> at org.apache.spark.scheduler.DirectTaskResult.value(TaskResult.scala:79)
> at 
> org.apache.spark.scheduler.TaskSetManager.handleSuccessfulTask(TaskSetManager.scala:514)
> at 
> org.apache.spark.scheduler.TaskSchedulerImpl.handleSuccessfulTask(TaskSchedulerImpl.scala:355)
> at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:68)
> at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
> at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2$$anonfun$run$1.apply(TaskResultGetter.scala:47)
> at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
> at 
> org.apache.spark.scheduler.TaskResultGetter$$anon$2.run(TaskResultGetter.scala:46)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> Exception in thread "Result resolver thread-3" 14/08/29 21:56:26 ERROR 
> SendingConnection: Exception while reading SendingConnection to 
> ConnectionManagerId(ip-10-73-142-

[jira] [Created] (SPARK-3267) Deadlock between ScalaReflectionLock and Data type initialization

2014-08-27 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-3267:
-

 Summary: Deadlock between ScalaReflectionLock and Data type 
initialization
 Key: SPARK-3267
 URL: https://issues.apache.org/jira/browse/SPARK-3267
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.1.0
Reporter: Aaron Davidson


Deadlock here:

{code}
"Executor task launch worker-0" daemon prio=10 tid=0x7fab50036000 nid=0x27a 
in Object.wait() [0x7fab60c2e000
]
   java.lang.Thread.State: RUNNABLE
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.defaultPrimitive(CodeGenerator.scala:565)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
a:202)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
a:195)
at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218)
at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:4
93)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$Evaluate2$2.evaluateAs(CodeGenerator.scal
a:175)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
a:304)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
a:195)
at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218)
at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:4
93)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
a:314)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
a:195)
at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218)
at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator.expressionEvaluator(CodeGenerator.scala:4
93)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
a:313)
at 
org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anonfun$1.applyOrElse(CodeGenerator.scal
a:195)
at scala.PartialFunction$Lifted.apply(PartialFunction.scala:218)
at scala.PartialFunction$Lifted.apply(PartialFunction.scala:214)
...
{code}

and

{code}
"Executor task launch worker-2" daemon prio=10 tid=0x7fab100f0800 nid=0x27e 
in Object.wait() [0x7fab0eeec000
]
   java.lang.Thread.State: RUNNABLE
at 
org.apache.spark.sql.catalyst.expressions.Cast.cast$lzycompute(Cast.scala:250)
- locked <0x00064e5d9a48> (a 
org.apache.spark.sql.catalyst.expressions.Cast)
at org.apache.spark.sql.catalyst.expressions.Cast.cast(Cast.scala:247)
at org.apache.spark.sql.catalyst.expressions.Cast.eval(Cast.scala:263)
at 
org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2$$anonfun$6.apply(ParquetTableOperations.
scala:139)
at 
org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2$$anonfun$6.apply(ParquetTableOperations.
scala:139)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2.apply(ParquetTableOperations.scala:139)
at 
org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$2.apply(ParquetTableOperations.scala:126)
at 
org.apache.spark.rdd.NewHadoopRDD$NewHadoopMapPartitionsWithSplitRDD.compute(NewHadoopRDD.scala:197)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Ta

[jira] [Created] (SPARK-3236) Reading Parquet tables from Metastore mangles location

2014-08-26 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-3236:
-

 Summary: Reading Parquet tables from Metastore mangles location
 Key: SPARK-3236
 URL: https://issues.apache.org/jira/browse/SPARK-3236
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Aaron Davidson
Assignee: Aaron Davidson


Currently we do "relation.hiveQlTable.getDataLocation.getPath", which returns 
the path-part of the URI (e.g., s3n://my-bucket/my-path => /my-path). We should 
do "relation.hiveQlTable.getDataLocation.toString" instead, as a URI's toString 
returns a faithful representation of the full URI, which can later be passed 
into a Hadoop Path.
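
For illustration, the URI behavior being described, using java.net.URI directly (the 
bucket and path are made up):

{code}
val loc = new java.net.URI("s3n://my-bucket/my-path")
loc.getPath    // "/my-path"                 -- the scheme and bucket are lost
loc.toString   // "s3n://my-bucket/my-path"  -- full form, safe to wrap in a Hadoop Path
{code}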



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3093) masterLock in Worker is no longer need

2014-08-18 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson resolved SPARK-3093.
---

  Resolution: Fixed
Assignee: Chen Chao
Target Version/s: 1.2.0

> masterLock in Worker is no longer need
> --
>
> Key: SPARK-3093
> URL: https://issues.apache.org/jira/browse/SPARK-3093
> Project: Spark
>  Issue Type: Improvement
>Reporter: Chen Chao
>Assignee: Chen Chao
>
> there's no need to use masterLock in Worker now, since all communication 
> happens within the Akka actor



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3029) Disable local execution of Spark jobs by default

2014-08-13 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-3029:
-

 Summary: Disable local execution of Spark jobs by default
 Key: SPARK-3029
 URL: https://issues.apache.org/jira/browse/SPARK-3029
 Project: Spark
  Issue Type: Improvement
Reporter: Aaron Davidson
Assignee: Aaron Davidson


Currently, local execution of Spark jobs is only used by take(), and it can be 
problematic as it can load a significant amount of data onto the driver. The 
worst-case scenarios occur if the RDD is cached (guaranteed to load the whole 
partition), has very large elements, or the partition is just large and we 
apply a filter with high selectivity or computational overhead.

Additionally, jobs that run locally in this manner do not show up in the web 
UI, making them harder to track and making it less clear what is occurring.
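
If local execution becomes opt-in, usage might look like the sketch below; the 
property name spark.localExecution.enabled is an assumption tied to this change, 
not settled API:

{code}
// Hedged sketch: explicitly opting back in to local execution of take()/first().
val conf = new org.apache.spark.SparkConf()
  .setAppName("local-execution-opt-in")
  .set("spark.localExecution.enabled", "true")   // assumed property name
val sc = new org.apache.spark.SparkContext(conf)
sc.parallelize(1 to 100).take(1)
{code}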



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2973) Add a way to show tables without executing a job

2014-08-11 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-2973:
-

 Summary: Add a way to show tables without executing a job
 Key: SPARK-2973
 URL: https://issues.apache.org/jira/browse/SPARK-2973
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Aaron Davidson


Right now, sql("show tables").collect() will start a Spark job which shows up 
in the UI. There should be a way to get these without this step.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2936) Migrate Netty network module from Java to Scala

2014-08-10 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson resolved SPARK-2936.
---

Resolution: Fixed

> Migrate Netty network module from Java to Scala
> ---
>
> Key: SPARK-2936
> URL: https://issues.apache.org/jira/browse/SPARK-2936
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 1.1.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> The netty network module was originally written when Scala 2.9.x had a bug 
> that prevents a pure Scala implementation, and a subset of the files were 
> done in Java. We have since upgraded to Scala 2.10, and can migrate all Java 
> files now to Scala.
> https://github.com/netty/netty/issues/781
> https://github.com/mesos/spark/pull/522



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2949) SparkContext does not fate-share with ActorSystem

2014-08-09 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-2949:
-

 Summary: SparkContext does not fate-share with ActorSystem
 Key: SPARK-2949
 URL: https://issues.apache.org/jira/browse/SPARK-2949
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Aaron Davidson


It appears that an uncaught fatal error in Spark's Driver ActorSystem does not 
cause the SparkContext to terminate. We observed an issue in production that 
caused a PermGen error, but it just kept throwing this error:

{code}
14/08/09 15:07:24 ERROR ActorSystemImpl: Uncaught fatal error from thread 
[spark-akka.actor.default-dispatcher-26] shutting down ActorSystem [spark]
java.lang.OutOfMemoryError: PermGen space
{code}

We should probably do something similar to what we did in the DAGScheduler and 
ensure that we call SparkContext#stop() if the entire ActorSystem dies with a 
fatal error.
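
A hedged sketch of the fate-sharing idea (not the actual fix; the real change would 
live where the driver ActorSystem is set up, and `sc` here is just whatever context 
the driver created):

{code}
def fateShare(sc: org.apache.spark.SparkContext): Unit = {
  import scala.util.control.NonFatal
  Thread.setDefaultUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler {
    override def uncaughtException(t: Thread, e: Throwable): Unit = {
      if (!NonFatal(e)) {                        // e.g. OutOfMemoryError: PermGen space
        try sc.stop() finally Runtime.getRuntime.halt(1)
      }
    }
  })
}
{code}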



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2825) Allow creating external tables in metastore

2014-08-04 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-2825:
-

 Summary: Allow creating external tables in metastore
 Key: SPARK-2825
 URL: https://issues.apache.org/jira/browse/SPARK-2825
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Aaron Davidson


External tables are useful for creating a metastore entry that points at 
pre-existing data. There is an easy workaround of using registerAsTable within 
Spark SQL itself, but this is only transient and has to be re-run for every 
Spark session.
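
For illustration, a hedged sketch contrasting the transient workaround with the 
desired metastore entry (1.0-era API names; the paths, schema, and `sc` are 
assumptions):

{code}
import org.apache.spark.sql.hive.HiveContext
val hiveContext = new HiveContext(sc)

// Transient workaround: the table is visible only while this SparkContext is alive.
val events = hiveContext.parquetFile("/data/events")
events.registerAsTable("events")

// Desired: a persistent external table in the metastore pointing at pre-existing data.
hiveContext.hql(
  """CREATE EXTERNAL TABLE events_ext (id INT, name STRING)
    |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    |LOCATION '/data/events_csv'""".stripMargin)
{code}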



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2824) Allow saving Parquet files to the HiveMetastore

2014-08-04 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-2824:
-

 Summary: Allow saving Parquet files to the HiveMetastore
 Key: SPARK-2824
 URL: https://issues.apache.org/jira/browse/SPARK-2824
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Aaron Davidson


Currently, we use one code path for reading all data from the Hive Metastore, 
and this precludes writing or loading data as custom ParquetRelations (they are 
always MetastoreRelations).



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2557) createTaskScheduler should be consistent between local and local-n-failures

2014-08-01 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson resolved SPARK-2557.
---

   Resolution: Fixed
Fix Version/s: 1.1.0

> createTaskScheduler should be consistent between local and local-n-failures 
> 
>
> Key: SPARK-2557
> URL: https://issues.apache.org/jira/browse/SPARK-2557
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Ye Xianjin
>Priority: Minor
>  Labels: starter
> Fix For: 1.1.0
>
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> In SparkContext.createTaskScheduler, we can use {code}local[*]{code} to 
> estimate the number of cores on the machine. I think we should also be able 
> to use * in the local-n-failures mode.
> Also, according to the LOCAL_N_REGEX pattern-matching code, I believe the 
> regular expression is wrong. LOCAL_N_REGEX should be 
> {code}
> """local\[([0-9]+|\*)\]""".r
> {code} 
> rather than
> {code}
>  """local\[([0-9\*]+)\]""".r
> {code}
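
For illustration, the practical difference between the two expressions above (the 
test strings are made up):

{code}
val loose = """local\[([0-9\*]+)\]""".r
val tight = """local\[([0-9]+|\*)\]""".r

loose.findFirstIn("local[2*]")   // Some("local[2*]") -- accepted, but meaningless
tight.findFirstIn("local[2*]")   // None
tight.findFirstIn("local[*]")    // Some("local[*]")
{code}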



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1860) Standalone Worker cleanup should not clean up running executors

2014-07-28 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson updated SPARK-1860:
--

Summary: Standalone Worker cleanup should not clean up running executors  
(was: Standalone Worker cleanup should not clean up running applications)

> Standalone Worker cleanup should not clean up running executors
> ---
>
> Key: SPARK-1860
> URL: https://issues.apache.org/jira/browse/SPARK-1860
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.0.0
>Reporter: Aaron Davidson
>Priority: Critical
> Fix For: 1.1.0
>
>
> The default values of the standalone worker cleanup code cleanup all 
> application data every 7 days. This includes jars that were added to any 
> applications that happen to be running for longer than 7 days, hitting 
> streaming jobs especially hard.
> Applications should not be cleaned up if they're still running. Until then, 
> this behavior should not be enabled by default.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1860) Standalone Worker cleanup should not clean up running executors

2014-07-28 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson updated SPARK-1860:
--

Description: 
The default values of the standalone worker cleanup code cleanup all 
application data every 7 days. This includes jars that were added to any 
executors that happen to be running for longer than 7 days, hitting streaming 
jobs especially hard.

Executor's log/data folders should not be cleaned up if they're still running. 
Until then, this behavior should not be enabled by default.

  was:
The default values of the standalone worker cleanup code cleanup all 
application data every 7 days. This includes jars that were added to any 
applications that happen to be running for longer than 7 days, hitting 
streaming jobs especially hard.

Applications should not be cleaned up if they're still running. Until then, 
this behavior should not be enabled by default.


> Standalone Worker cleanup should not clean up running executors
> ---
>
> Key: SPARK-1860
> URL: https://issues.apache.org/jira/browse/SPARK-1860
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.0.0
>Reporter: Aaron Davidson
>Priority: Critical
> Fix For: 1.1.0
>
>
> The default values of the standalone worker cleanup code cleanup all 
> application data every 7 days. This includes jars that were added to any 
> executors that happen to be running for longer than 7 days, hitting streaming 
> jobs especially hard.
> Executor's log/data folders should not be cleaned up if they're still 
> running. Until then, this behavior should not be enabled by default.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1860) Standalone Worker cleanup should not clean up running applications

2014-07-28 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14076403#comment-14076403
 ] 

Aaron Davidson commented on SPARK-1860:
---

There's not an easy way to tell if an application is still running. However, 
the Worker has state about which executors are still running. This is really 
what I intended originally -- we must not clean up an Executor's own state from 
underneath it. I will change the title to reflect this intention.

> Standalone Worker cleanup should not clean up running applications
> --
>
> Key: SPARK-1860
> URL: https://issues.apache.org/jira/browse/SPARK-1860
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.0.0
>Reporter: Aaron Davidson
>Priority: Critical
> Fix For: 1.1.0
>
>
> The default values of the standalone worker cleanup code cleanup all 
> application data every 7 days. This includes jars that were added to any 
> applications that happen to be running for longer than 7 days, hitting 
> streaming jobs especially hard.
> Applications should not be cleaned up if they're still running. Until then, 
> this behavior should not be enabled by default.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2707) Upgrade to Akka 2.3

2014-07-27 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075744#comment-14075744
 ] 

Aaron Davidson commented on SPARK-2707:
---

That doesn't sound like a bad idea -- actually sounds significantly more 
straightforward than depending on a version of Akka that only shades the 
internal protobuf usage.

> Upgrade to Akka 2.3
> ---
>
> Key: SPARK-2707
> URL: https://issues.apache.org/jira/browse/SPARK-2707
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Yardena
>
> Upgrade Akka from 2.2 to 2.3. We want to be able to use new Akka and Spray 
> features directly in the same project.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2707) Upgrade to Akka 2.3

2014-07-27 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075652#comment-14075652
 ] 

Aaron Davidson commented on SPARK-2707:
---

It does sound mostly mechanical and I believe we don't use most of those 
features. Perhaps just getting it to compile (while re-shading protobuf) would 
be sufficient to make it work.

> Upgrade to Akka 2.3
> ---
>
> Key: SPARK-2707
> URL: https://issues.apache.org/jira/browse/SPARK-2707
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Yardena
>
> Upgrade Akka from 2.2 to 2.3. We want to be able to use new Akka and Spray 
> features directly in the same project.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2707) Upgrade to Akka 2.3

2014-07-27 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14075645#comment-14075645
 ] 

Aaron Davidson commented on SPARK-2707:
---

Are there any anticipated issues in upgrading from 2.2 to 2.3? Are there source 
or behaviorally incompatible changes? 

> Upgrade to Akka 2.3
> ---
>
> Key: SPARK-2707
> URL: https://issues.apache.org/jira/browse/SPARK-2707
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Yardena
>
> Upgrade Akka from 2.2 to 2.3. We want to be able to use new Akka and Spray 
> features directly in the same project.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-1264) Documentation for setting heap sizes across all configurations

2014-07-24 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson updated SPARK-1264:
--

Assignee: (was: Aaron Davidson)

> Documentation for setting heap sizes across all configurations
> --
>
> Key: SPARK-1264
> URL: https://issues.apache.org/jira/browse/SPARK-1264
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.0.0
>Reporter: Andrew Ash
>
> As a user, there are lots of places to configure heap sizes, and it takes a 
> bit of trial and error to figure out how to configure what you want.
> We need some clearer documentation on how to set these for the cross product 
> of Spark components (master, worker, executor, driver, shell) and deployment 
> modes (Standalone, YARN, Mesos, EC2?).
> I'm happy to do the authoring if someone can help pull together the relevant 
> details.
> Here's the best I've got so far:
> {noformat}
> # Standalone cluster
> Master - SPARK_DAEMON_MEMORY - default: 512mb
> Worker - SPARK_DAEMON_MEMORY vs SPARK_WORKER_MEMORY? - default: ?  See 
> WorkerArguments.inferDefaultMemory()
> Executor - spark.executor.memory
> Driver - SPARK_DRIVER_MEMORY - default: 512mb
> Shell - A pre-built driver so SPARK_DRIVER_MEMORY - default: 512mb
> # EC2 cluster
> Master - ?
> Worker - ?
> Executor - ?
> Driver - ?
> Shell - ?
> # Mesos cluster
> Master - SPARK_DAEMON_MEMORY
> Worker - SPARK_DAEMON_MEMORY
> Executor - SPARK_EXECUTOR_MEMORY
> Driver - SPARK_DRIVER_MEMORY
> Shell - A pre-built driver so SPARK_DRIVER_MEMORY
> # YARN cluster
> Master - SPARK_MASTER_MEMORY ?
> Worker - SPARK_WORKER_MEMORY ?
> Executor - SPARK_EXECUTOR_MEMORY
> Driver - SPARK_DRIVER_MEMORY
> Shell - A pre-built driver so SPARK_DRIVER_MEMORY
> {noformat}
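
For illustration, the per-application knob from the table above set from code (a 
hedged example; daemon, driver, and shell memory generally have to be set before 
the JVM starts, e.g. in spark-env.sh, so they are not shown here):

{code}
val conf = new org.apache.spark.SparkConf()
  .setAppName("memory-example")
  .set("spark.executor.memory", "4g")   // executor heap, per the table above
{code}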



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2660) Enable pretty-printing SchemaRDD Rows

2014-07-23 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-2660:
-

 Summary: Enable pretty-printing SchemaRDD Rows
 Key: SPARK-2660
 URL: https://issues.apache.org/jira/browse/SPARK-2660
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Aaron Davidson


Right now, "printing a Row" results in something like "[a,b,c]". It would be 
cool if there was a way to pretty-print them similar to JSON, into something 
like

{code}
{
  Col1 : a,
  Col2 : b,
  Col3 : c
}
{code}
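
For illustration, a hedged sketch of the kind of helper being requested; the API 
names are from the 1.x SchemaRDD era and the helper itself is illustrative, not an 
existing Spark method:

{code}
// Pair the Row's values with the schema's field names and render one field per line.
def prettyRow(fieldNames: Seq[String], row: org.apache.spark.sql.Row): String =
  fieldNames.zip(row.toSeq)
    .map { case (name, value) => s"  $name : $value" }
    .mkString("{\n", ",\n", "\n}")
{code}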



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2282) PySpark crashes if too many tasks complete quickly

2014-07-22 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071192#comment-14071192
 ] 

Aaron Davidson commented on SPARK-2282:
---

[~pwendell] That would in general be the right solution, but this particular 
change hasn't been merged yet (referring to the second PR on this bug, which is 
a more complete fix).

> PySpark crashes if too many tasks complete quickly
> --
>
> Key: SPARK-2282
> URL: https://issues.apache.org/jira/browse/SPARK-2282
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 0.9.1, 1.0.0, 1.0.1
>Reporter: Aaron Davidson
>Assignee: Aaron Davidson
> Fix For: 0.9.2, 1.0.0, 1.0.1
>
>
> Upon every task completion, PythonAccumulatorParam constructs a new socket to 
> the Accumulator server running inside the pyspark daemon. This can cause a 
> buildup of used ephemeral ports from sockets in the TIME_WAIT termination 
> stage, which will cause the SparkContext to crash if too many tasks complete 
> too quickly. We ran into this bug with 17k tasks completing in 15 seconds.
> This bug can be fixed outside of Spark by ensuring these properties are set 
> (on a linux server);
> echo "1" > /proc/sys/net/ipv4/tcp_tw_reuse
> echo "1" > /proc/sys/net/ipv4/tcp_tw_recycle
> or by adding the SO_REUSEADDR option to the Socket creation within Spark.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2282) PySpark crashes if too many tasks complete quickly

2014-07-22 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14070528#comment-14070528
 ] 

Aaron Davidson commented on SPARK-2282:
---

Great to hear! These files haven't been changed since the 1.0.1 release besides 
this patch, so it should be fine to just drop them in. (A generally safer 
option would be to do a git merge, though, against Spark's refs/pull/1503/head 
branch.)

> PySpark crashes if too many tasks complete quickly
> --
>
> Key: SPARK-2282
> URL: https://issues.apache.org/jira/browse/SPARK-2282
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 0.9.1, 1.0.0, 1.0.1
>Reporter: Aaron Davidson
>Assignee: Aaron Davidson
> Fix For: 0.9.2, 1.0.0, 1.0.1
>
>
> Upon every task completion, PythonAccumulatorParam constructs a new socket to 
> the Accumulator server running inside the pyspark daemon. This can cause a 
> buildup of used ephemeral ports from sockets in the TIME_WAIT termination 
> stage, which will cause the SparkContext to crash if too many tasks complete 
> too quickly. We ran into this bug with 17k tasks completing in 15 seconds.
> This bug can be fixed outside of Spark by ensuring these properties are set 
> (on a linux server);
> echo "1" > /proc/sys/net/ipv4/tcp_tw_reuse
> echo "1" > /proc/sys/net/ipv4/tcp_tw_recycle
> or by adding the SO_REUSEADDR option to the Socket creation within Spark.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-1767) Prefer HDFS-cached replicas when scheduling data-local tasks

2014-07-21 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997761#comment-13997761
 ] 

Aaron Davidson edited comment on SPARK-1767 at 7/21/14 7:46 PM:


-One simple workaround to this is to just make sure that partitions that are in 
memory are ordered first in the list of partitions, as Spark will try to place 
executors based on the order in this list.- This is, of course, not a complete 
solution, as we would not utilize the locality-wait logic within Spark and 
would immediately fallback to a non-cached node if the cached node was busy, 
rather than waiting for some period of time for the cached node to become 
available.

Edit: I was wrong about how Spark schedules partitions -- ordering is not 
sufficient.


was (Author: ilikerps):
One simple workaround to this is to just make sure that partitions that are in 
memory are ordered first in the list of partitions, as Spark will try to place 
executors based on the order in this list. This is, of course, not a complete 
solution, as we would not utilize the locality-wait logic within Spark and 
would immediately fallback to a non-cached node if the cached node was busy, 
rather than waiting for some period of time for the cached node to become 
available.

> Prefer HDFS-cached replicas when scheduling data-local tasks
> 
>
> Key: SPARK-1767
> URL: https://issues.apache.org/jira/browse/SPARK-1767
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Sandy Ryza
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2282) PySpark crashes if too many tasks complete quickly

2014-07-20 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068083#comment-14068083
 ] 

Aaron Davidson commented on SPARK-2282:
---

Hey Ken,

I created [PR 1503|https://github.com/apache/spark/pull/1503] to implement the 
solution I mentioned. It would be great if you could try testing out this patch 
on your cluster.

While testing, I noticed that the ephemeral ports were still growing with the 
number of tasks, due to how we launch new tasks on the PySpark daemon. However, 
this should only affect workers, and the rate of buildup should be divided by 
the number of workers. In other words, it should only ever be a problem on a 
very small cluster.

> PySpark crashes if too many tasks complete quickly
> --
>
> Key: SPARK-2282
> URL: https://issues.apache.org/jira/browse/SPARK-2282
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 0.9.1, 1.0.0, 1.0.1
>Reporter: Aaron Davidson
>Assignee: Aaron Davidson
> Fix For: 0.9.2, 1.0.0, 1.0.1
>
>
> Upon every task completion, PythonAccumulatorParam constructs a new socket to 
> the Accumulator server running inside the pyspark daemon. This can cause a 
> buildup of used ephemeral ports from sockets in the TIME_WAIT termination 
> stage, which will cause the SparkContext to crash if too many tasks complete 
> too quickly. We ran into this bug with 17k tasks completing in 15 seconds.
> This bug can be fixed outside of Spark by ensuring these properties are set 
> (on a linux server);
> echo "1" > /proc/sys/net/ipv4/tcp_tw_reuse
> echo "1" > /proc/sys/net/ipv4/tcp_tw_recycle
> or by adding the SO_REUSEADDR option to the Socket creation within Spark.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2282) PySpark crashes if too many tasks complete quickly

2014-07-17 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14065306#comment-14065306
 ] 

Aaron Davidson commented on SPARK-2282:
---

This problem is kinda silly because we're accumulating these updates from a 
single thread in the DAGScheduler, so we should only really have one socket 
open at a time, but it's very short lived. We could just reuse the connection 
with a relatively minor refactor of accumulators.py and PythonAccumulatorParam.
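
For illustration, a hedged sketch of that reuse on the JVM side (names are 
illustrative, not the actual PythonAccumulatorParam code): keep one socket per 
(host, port) instead of opening a fresh one per update.

{code}
class AccumulatorClient(host: String, port: Int) {
  private var socket: java.net.Socket = null
  // Reuse the existing connection when possible; reconnect only if it has been closed.
  def connection(): java.net.Socket = synchronized {
    if (socket == null || socket.isClosed) {
      socket = new java.net.Socket(host, port)
    }
    socket
  }
}
{code}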

> PySpark crashes if too many tasks complete quickly
> --
>
> Key: SPARK-2282
> URL: https://issues.apache.org/jira/browse/SPARK-2282
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 0.9.1, 1.0.0, 1.0.1
>Reporter: Aaron Davidson
>Assignee: Aaron Davidson
> Fix For: 0.9.2, 1.0.0, 1.0.1
>
>
> Upon every task completion, PythonAccumulatorParam constructs a new socket to 
> the Accumulator server running inside the pyspark daemon. This can cause a 
> buildup of used ephemeral ports from sockets in the TIME_WAIT termination 
> stage, which will cause the SparkContext to crash if too many tasks complete 
> too quickly. We ran into this bug with 17k tasks completing in 15 seconds.
> This bug can be fixed outside of Spark by ensuring these properties are set 
> (on a linux server);
> echo "1" > /proc/sys/net/ipv4/tcp_tw_reuse
> echo "1" > /proc/sys/net/ipv4/tcp_tw_recycle
> or by adding the SO_REUSEADDR option to the Socket creation within Spark.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2282) PySpark crashes if too many tasks complete quickly

2014-07-17 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14065121#comment-14065121
 ] 

Aaron Davidson commented on SPARK-2282:
---

This problem does look identical. I think I gave you the wrong netstat command, 
as "-l" only shows listening sockets. Try with "-a" instead to see all open 
connections to confirm this, but the rest of your symptoms align perfectly.

I did a little Googling around for your specific kernel version, and it turns 
out [someone else|http://lists.openwall.net/netdev/2011/07/13/39] has had 
success with tcp_tw_recycle on 2.6.32. Could you try to make absolutely sure 
that the sysctl is taking effect? Perhaps you can try adding 
"net.ipv4.tcp_tw_recycle = 1" to /etc/sysctl.conf and then running a "sysctl 
-p" before restarting pyspark.

> PySpark crashes if too many tasks complete quickly
> --
>
> Key: SPARK-2282
> URL: https://issues.apache.org/jira/browse/SPARK-2282
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 0.9.1, 1.0.0, 1.0.1
>Reporter: Aaron Davidson
>Assignee: Aaron Davidson
> Fix For: 0.9.2, 1.0.0, 1.0.1
>
>
> Upon every task completion, PythonAccumulatorParam constructs a new socket to 
> the Accumulator server running inside the pyspark daemon. This can cause a 
> buildup of used ephemeral ports from sockets in the TIME_WAIT termination 
> stage, which will cause the SparkContext to crash if too many tasks complete 
> too quickly. We ran into this bug with 17k tasks completing in 15 seconds.
> This bug can be fixed outside of Spark by ensuring these properties are set 
> (on a linux server);
> echo "1" > /proc/sys/net/ipv4/tcp_tw_reuse
> echo "1" > /proc/sys/net/ipv4/tcp_tw_recycle
> or by adding the SO_REUSEADDR option to the Socket creation within Spark.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2545) Add a diagnosis mode for closures to figure out what they're bringing in

2014-07-16 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-2545:
-

 Summary: Add a diagnosis mode for closures to figure out what 
they're bringing in
 Key: SPARK-2545
 URL: https://issues.apache.org/jira/browse/SPARK-2545
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Aaron Davidson


Today, it's pretty hard to figure out why your closure is bigger than expected, 
because it's not obvious what objects are being included or who is including 
them. We should have some sort of diagnosis available to users with very large 
closures that displays the contents of the closure.
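
A rough sketch of what such a diagnosis could look like, using plain Java 
serialization plus reflection to report a closure's serialized size and the 
fields it drags along (illustrative only, not a proposed Spark API):

{code}
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Illustrative helper: serialize a closure and list its captured fields so
// users can see which (possibly large) objects it pulls in.
object ClosureDiagnosisSketch {
  def describe(closure: AnyRef): String = {
    val bytes = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(bytes)
    oos.writeObject(closure)   // assumes the closure is serializable at all
    oos.close()

    val fields = closure.getClass.getDeclaredFields.map { f =>
      f.setAccessible(true)
      s"  ${f.getName}: ${f.getType.getSimpleName} = ${String.valueOf(f.get(closure)).take(60)}"
    }
    s"closure ${closure.getClass.getName} serializes to ${bytes.size()} bytes\n" +
      fields.mkString("\n")
  }
}
{code}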



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2282) PySpark crashes if too many tasks complete quickly

2014-07-15 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14063211#comment-14063211
 ] 

Aaron Davidson commented on SPARK-2282:
---

This should actually only be necessary on the master. Use of the SO_REUSEADDR 
property (equivalently, sysctl tcp_tw_reuse) means that the number of used 
sockets will increase to the maximum number of ephemeral ports, but then should 
remain constant. It's possible that if another process tries to allocate an 
ephemeral port during this time, it will fail.

While tcp_tw_reuse is generally considered "safe", setting *tcp_tw_recycle* can 
lead to unexpected packet arrivals from closed streams (though that's very 
unlikely); it is, however, a more reliable solution. It should cause connections 
to be recycled immediately after the TCP teardown, so no buildup of 
sockets should occur.

Please let me know if setting either of these parameters helps on the driver 
machine. You can also verify that this problem is occurring by doing a 
{{netstat -lpn}} during execution, iirc, which should display an inordinate 
number of open connections on the Spark Driver process and on a Python daemon 
one.

> PySpark crashes if too many tasks complete quickly
> --
>
> Key: SPARK-2282
> URL: https://issues.apache.org/jira/browse/SPARK-2282
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 0.9.1, 1.0.0, 1.0.1
>Reporter: Aaron Davidson
>Assignee: Aaron Davidson
> Fix For: 0.9.2, 1.0.0, 1.0.1
>
>
> Upon every task completion, PythonAccumulatorParam constructs a new socket to 
> the Accumulator server running inside the pyspark daemon. This can cause a 
> buildup of used ephemeral ports from sockets in the TIME_WAIT termination 
> stage, which will cause the SparkContext to crash if too many tasks complete 
> too quickly. We ran into this bug with 17k tasks completing in 15 seconds.
> This bug can be fixed outside of Spark by ensuring these properties are set 
> (on a Linux server):
> echo "1" > /proc/sys/net/ipv4/tcp_tw_reuse
> echo "1" > /proc/sys/net/ipv4/tcp_tw_recycle
> or by adding the SO_REUSEADDR option to the Socket creation within Spark.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2485) Usage of HiveClient not threadsafe.

2014-07-15 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson resolved SPARK-2485.
---

Resolution: Fixed

https://github.com/apache/spark/pull/1412

> Usage of HiveClient not threadsafe.
> ---
>
> Key: SPARK-2485
> URL: https://issues.apache.org/jira/browse/SPARK-2485
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Michael Armbrust
>
> When making concurrent queries against the hive metastore, sometimes we get 
> an exception that includes the following stack trace: 
> {code}
> Caused by: java.lang.Throwable: get_table failed: out of sequence response
>   at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:76)
>   at 
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_table(ThriftHiveMetastore.java:936)
> {code}
> Likely, we need to synchronize our use of HiveClient.
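
One straightforward way to serialize that access, sketched as a hypothetical 
wrapper (the actual change in the linked PR may look different):

{code}
// Hypothetical wrapper: funnel every metastore call through one lock so that
// concurrent queries cannot interleave Thrift requests on the shared client.
class SynchronizedHiveClient[C](client: C) {
  private val lock = new Object

  def withClient[T](body: C => T): T = lock.synchronized {
    body(client)
  }
}

// Usage sketch:
//   val hive  = new SynchronizedHiveClient(rawMetastoreClient)
//   val table = hive.withClient(_.getTable("default", "src"))
{code}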



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2154) Worker goes down.

2014-07-14 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14060813#comment-14060813
 ] 

Aaron Davidson commented on SPARK-2154:
---

Created this PR to hopefully fix that: https://github.com/apache/spark/pull/1405

> Worker goes down.
> -
>
> Key: SPARK-2154
> URL: https://issues.apache.org/jira/browse/SPARK-2154
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.8.1, 0.9.0, 1.0.0
> Environment: Spark on cluster of three nodes on Ubuntu 12.04.4 LTS
>Reporter: siva venkat gogineni
>  Labels: patch
> Attachments: Sccreenhot at various states of driver ..jpg
>
>
> The worker dies when I try to submit more drivers than the allocated cores. When 
> I submit 9 drivers, with one core per driver, on a cluster having 8 cores 
> altogether, the worker dies as soon as I submit the 9th driver. It works 
> fine until it reaches 8 cores; as soon as I submit the 9th driver, the driver 
> status remains "Submitted" and the worker crashes. I understand that we 
> cannot run more drivers than the allocated cores, but the problem here is that 
> instead of the 9th driver being queued it is executed, and as a result 
> it crashes the worker. Let me know if there is a way to get around this 
> issue, or whether it is being fixed in an upcoming version.
> Cluster Details:
> Spark 1.00
> 2 nodes with 4 cores each.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Closed] (SPARK-2453) Compound lines in spark-shell cause compilation errors

2014-07-11 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson closed SPARK-2453.
-

Resolution: Duplicate

https://issues.apache.org/jira/browse/SPARK-2452

> Compound lines in spark-shell cause compilation errors
> --
>
> Key: SPARK-2453
> URL: https://issues.apache.org/jira/browse/SPARK-2453
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.1
>Reporter: Aaron Davidson
>Priority: Critical
>
> repro:
> {code}
> scala> val x = 3; def z() = x + 4
> x: Int = 3
> z: ()Int
> scala> z()
> :11: error: $VAL5 is already defined as value $VAL5
> val $VAL5 = INSTANCE;
> {code}
> Everything's cool if the def is put on its own line.
> This was caused by 
> https://github.com/apache/spark/commit/d43415075b3468fe8aa56de5d2907d409bb96347
>  ([~prashant_]).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2453) Compound lines in spark-shell cause compilation errors

2014-07-11 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-2453:
-

 Summary: Compound lines in spark-shell cause compilation errors
 Key: SPARK-2453
 URL: https://issues.apache.org/jira/browse/SPARK-2453
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.1
Reporter: Aaron Davidson
Priority: Critical


repro:

{code}
scala> val x = 3; def z() = x + 4
x: Int = 3
z: ()Int

scala> z()
:11: error: $VAL5 is already defined as value $VAL5
val $VAL5 = INSTANCE;
{code}

Everything's cool if the def is put on its own line.

This was caused by 
https://github.com/apache/spark/commit/d43415075b3468fe8aa56de5d2907d409bb96347 
([~prashant_]).



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2412) CoalescedRDD throws exception with certain pref locs

2014-07-08 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-2412:
-

 Summary: CoalescedRDD throws exception with certain pref locs
 Key: SPARK-2412
 URL: https://issues.apache.org/jira/browse/SPARK-2412
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Aaron Davidson
Assignee: Aaron Davidson


If the first pass of CoalescedRDD does not find the target number of locations 
AND the second pass finds new locations, an exception is thrown, as 
"groupHash.get(nxt_replica).get" is not valid.

The fix is just to add an ArrayBuffer to groupHash for that replica if it 
didn't already exist.
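
A sketch of that fix, assuming groupHash is a mutable map from preferred 
location to a buffer of partition indices (simplified from the real 
coalescing code):

{code}
import scala.collection.mutable

// Simplified sketch: instead of groupHash.get(nxtReplica).get, which throws
// when the second pass discovers a location the first pass never saw, create
// the buffer on demand.
object CoalescePatchSketch {
  val groupHash = mutable.Map[String, mutable.ArrayBuffer[Int]]()

  def addToGroup(nxtReplica: String, partitionIndex: Int): Unit = {
    groupHash.getOrElseUpdate(nxtReplica, mutable.ArrayBuffer[Int]()) += partitionIndex
  }
}
{code}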



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2403) Spark stuck when class is not registered with Kryo

2014-07-08 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2403?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson resolved SPARK-2403.
---

   Resolution: Fixed
Fix Version/s: 1.0.2
   1.1.0

> Spark stuck when class is not registered with Kryo
> --
>
> Key: SPARK-2403
> URL: https://issues.apache.org/jira/browse/SPARK-2403
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Daniel Darabos
> Fix For: 1.1.0, 1.0.2
>
>
> We are using Kryo and require registering classes. When trying to serialize 
> something containing an unregistered class, Kryo will raise an exception.
> DAGScheduler.submitMissingTasks runs in the scheduler thread and checks if 
> the contents of the task can be serialized by trying to serialize it:
> https://github.com/apache/spark/blob/v1.0.0/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L767
> It catches NotSerializableException and aborts the task with an error when 
> this happens.
> The problem is, Kryo does not raise NotSerializableException for unregistered 
> classes. It raises IllegalArgumentException instead. This exception is not 
> caught and kills the scheduler thread. The application then hangs, waiting 
> indefinitely for the job to finish.
> Catching IllegalArgumentException also is a quick fix. I'll send a pull 
> request for it if you agree. Thanks!
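
A sketch of that quick fix, with the serializer and stage-abort hooks passed in 
as plain functions (simplified; the real call site is 
DAGScheduler.submitMissingTasks):

{code}
import java.io.NotSerializableException

// Sketch: broaden the catch so Kryo's IllegalArgumentException for
// unregistered classes aborts the stage instead of killing the scheduler
// thread.
object SerializationCheckSketch {
  def checkSerializable(task: AnyRef,
                        serialize: AnyRef => Array[Byte],
                        abortStage: String => Unit): Unit = {
    try {
      serialize(task)
    } catch {
      case e: NotSerializableException =>
        abortStage(s"Task not serializable: $e")
      case e: IllegalArgumentException => // e.g. Kryo: "Class is not registered"
        abortStage(s"Task serialization failed: $e")
    }
  }
}
{code}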



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2349) Fix NPE in ExternalAppendOnlyMap

2014-07-03 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson resolved SPARK-2349.
---

Resolution: Fixed

https://github.com/apache/spark/pull/1288

> Fix NPE in ExternalAppendOnlyMap
> 
>
> Key: SPARK-2349
> URL: https://issues.apache.org/jira/browse/SPARK-2349
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Andrew Or
>
> It throws an NPE on null keys.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2324) SparkContext should not exit directly when spark.local.dir is a list of multiple paths and one of them has error

2014-07-03 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson resolved SPARK-2324.
---

Resolution: Fixed

Resolved by https://github.com/apache/spark/pull/1274

> SparkContext should not exit directly when spark.local.dir is a list of 
> multiple paths and one of them has error
> 
>
> Key: SPARK-2324
> URL: https://issues.apache.org/jira/browse/SPARK-2324
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: YanTang Zhai
>
> The spark.local.dir is configured as a list of multiple paths, for example 
> /data1/sparkenv/local,/data2/sparkenv/local. If the disk data2 of the driver 
> node has an error, the application will exit since DiskBlockManager exits 
> directly at createLocalDirs. If the disk data2 of a worker node has an error, 
> the executor will exit as well.
> DiskBlockManager should not exit directly at createLocalDirs if one of the 
> spark.local.dir paths has an error. Since spark.local.dir has multiple paths, a 
> problem with one of them should not affect the overall situation.
> I think DiskBlockManager could ignore the bad directory at createLocalDirs.
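
A sketch of that idea: attempt each configured path, keep the ones that can be 
created, and only fail if none survive (directory-creation details simplified):

{code}
import java.io.{File, IOException}
import scala.util.Try

// Sketch: tolerate individual bad entries in spark.local.dir and only give up
// when every configured path is unusable.
object LocalDirsSketch {
  def createLocalDirs(confValue: String): Array[File] = {
    val usable = confValue.split(",").flatMap { path =>
      Try {
        val dir = new File(path.trim, "spark-local-" + System.nanoTime())
        if (dir.exists() || dir.mkdirs()) dir
        else throw new IOException(s"Failed to create directory $dir")
      }.toOption
    }
    require(usable.nonEmpty, s"Failed to create any local dir from: $confValue")
    usable
  }
}
{code}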



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2282) PySpark crashes if too many tasks complete quickly

2014-06-25 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-2282:
-

 Summary: PySpark crashes if too many tasks complete quickly
 Key: SPARK-2282
 URL: https://issues.apache.org/jira/browse/SPARK-2282
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.0, 1.0.1
Reporter: Aaron Davidson
Assignee: Aaron Davidson


Upon every task completion, PythonAccumulatorParam constructs a new socket to 
the Accumulator server running inside the pyspark daemon. This can cause a 
buildup of used ephemeral ports from sockets in the TIME_WAIT termination 
stage, which will cause the SparkContext to crash if too many tasks complete 
too quickly. We ran into this bug with 17k tasks completing in 15 seconds.

This bug can be fixed outside of Spark by ensuring these properties are set (on 
a Linux server):
echo "1" > /proc/sys/net/ipv4/tcp_tw_reuse
echo "1" > /proc/sys/net/ipv4/tcp_tw_recycle

or by adding the SO_REUSEADDR option to the Socket creation within Spark.
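
For the in-Spark option, a minimal sketch of creating the client socket with 
SO_REUSEADDR set before connecting (illustrative only; the actual 
PythonAccumulatorParam code path is more involved):

{code}
import java.net.{InetSocketAddress, Socket}

// Sketch: set SO_REUSEADDR before connecting so locally bound ports stuck in
// TIME_WAIT can be reused instead of exhausting the ephemeral range.
object ReuseAddrSocketSketch {
  def openAccumulatorSocket(host: String, port: Int): Socket = {
    val socket = new Socket()
    socket.setReuseAddress(true)                      // SO_REUSEADDR
    socket.connect(new InetSocketAddress(host, port))
    socket
  }
}
{code}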



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2203) PySpark does not infer default numPartitions in same way as Spark

2014-06-19 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-2203:
-

 Summary: PySpark does not infer default numPartitions in same way 
as Spark
 Key: SPARK-2203
 URL: https://issues.apache.org/jira/browse/SPARK-2203
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.0
Reporter: Aaron Davidson
Assignee: Aaron Davidson


For shuffle-based operators, such as rdd.groupBy() or rdd.sortByKey(), PySpark 
will always assume that the default parallelism to use for the reduce side is 
ctx.defaultParallelism, which is a constant typically determined by the number 
of cores in the cluster.

In contrast, Spark's Partitioner#defaultPartitioner will use the same number of 
reduce partitions as map partitions unless the defaultParallelism config is 
explicitly set. This tends to be a better default in order to avoid OOMs, and 
should also be the behavior of PySpark.
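
A simplified sketch of the Scala-side rule described above (condensed from 
Partitioner.defaultPartitioner), which PySpark could mirror:

{code}
import org.apache.spark.{HashPartitioner, Partitioner, SparkContext}
import org.apache.spark.rdd.RDD

// Condensed sketch of the behavior described above: reuse an existing
// partitioner if a parent has one, otherwise size the reduce side from the
// largest parent unless spark.default.parallelism was set explicitly.
object DefaultPartitionerSketch {
  def defaultPartitioner(sc: SparkContext, rdds: RDD[_]*): Partitioner = {
    val withPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 0))
    if (withPartitioner.nonEmpty) {
      withPartitioner.maxBy(_.partitions.length).partitioner.get
    } else if (sc.getConf.contains("spark.default.parallelism")) {
      new HashPartitioner(sc.defaultParallelism)
    } else {
      new HashPartitioner(rdds.map(_.partitions.length).max)
    }
  }
}
{code}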



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2147) Master UI forgets about Executors when application exits cleanly

2014-06-17 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson resolved SPARK-2147.
---

Resolution: Fixed

https://github.com/apache/spark/pull/1102

> Master UI forgets about Executors when application exits cleanly
> 
>
> Key: SPARK-2147
> URL: https://issues.apache.org/jira/browse/SPARK-2147
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.0.0
>Reporter: Aaron Davidson
>Assignee: Andrew Or
>
> When an application exits cleanly, the Master will remove all executors from 
> the application's ApplicationInfo, causing the historic "Completed 
> Applications" page to report that there were no executors associated with 
> that application. 
> On the contrary, if the application exits uncleanly, then the Master will 
> remove the application FIRST, and will not actually remove the executors from 
> the ApplicationInfo page. This causes the executors to show up correctly in 
> the "Completed Applications" page.
> The correct behavior would probably be to gather a history of all executors 
> (so we'd retain executors that we had at one point but were removed during 
> the job), and not remove lost executors.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-2161) UI should remember executors that have been removed

2014-06-17 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson resolved SPARK-2161.
---

Resolution: Fixed

https://github.com/apache/spark/pull/1102

> UI should remember executors that have been removed
> ---
>
> Key: SPARK-2161
> URL: https://issues.apache.org/jira/browse/SPARK-2161
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Andrew Or
> Fix For: 1.0.1
>
>
> This applies to all of SparkUI, MasterWebUI, and WorkerWebUI. If an executor 
> fails, it just disappears from these UIs. It would be helpful if you could see 
> the logs for why they failed on the UIs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-937) Executors that exit cleanly should not have KILLED status

2014-06-15 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson updated SPARK-937:
-

Fix Version/s: 1.0.1

> Executors that exit cleanly should not have KILLED status
> -
>
> Key: SPARK-937
> URL: https://issues.apache.org/jira/browse/SPARK-937
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 0.7.3
>Reporter: Aaron Davidson
>Assignee: Kan Zhang
>Priority: Critical
> Fix For: 1.0.1, 1.1.0
>
>
> This is an unintuitive and overloaded status message when Executors are 
> killed during normal termination of an application.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-937) Executors that exit cleanly should not have KILLED status

2014-06-15 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson resolved SPARK-937.
--

Resolution: Fixed

> Executors that exit cleanly should not have KILLED status
> -
>
> Key: SPARK-937
> URL: https://issues.apache.org/jira/browse/SPARK-937
> Project: Spark
>  Issue Type: Improvement
>Affects Versions: 0.7.3
>Reporter: Aaron Davidson
>Assignee: Kan Zhang
>Priority: Critical
> Fix For: 1.1.0
>
>
> This is an unintuitive and overloaded status message when Executors are 
> killed during normal termination of an application.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2147) Master UI forgets about Executors when application exits cleanly

2014-06-14 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-2147:
-

 Summary: Master UI forgets about Executors when application exits 
cleanly
 Key: SPARK-2147
 URL: https://issues.apache.org/jira/browse/SPARK-2147
 Project: Spark
  Issue Type: Bug
  Components: Web UI
Affects Versions: 1.0.0
Reporter: Aaron Davidson
Assignee: Andrew Or


When an application exits cleanly, the Master will remove all executors from 
the application's ApplicationInfo, causing the historic "Completed 
Applications" page to report that there were no executors associated with that 
application. 

On the contrary, if the application exits uncleanly, then the Master will 
remove the application FIRST, and will not actually remove the executors from 
the ApplicationInfo page. This causes the executors to show up correctly in the 
"Completed Applications" page.

The correct behavior would probably be to gather a history of all executors (so 
we'd retain executors that we had at one point but were removed during the 
job), and not remove lost executors.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-983) Support external sorting for RDD#sortByKey()

2014-06-12 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14029498#comment-14029498
 ] 

Aaron Davidson commented on SPARK-983:
--

The idea for SizeTrackingAppendOnlyMap is that we can estimate the size of an 
object with SizeEstimator, but doing so can be relatively costly. Rather than 
running this estimation on every element added to the map, we amortize the cost 
by sampling exponentially less often (similar to amortization of adding to an 
ArrayList).

If you expect your sublists to be relatively large, it may be perfectly fine to 
just call SizeEstimator on each one.
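
A standalone sketch of that amortization pattern, with the expensive estimate 
plugged in as a function (so it doesn't depend on Spark's SizeEstimator 
directly):

{code}
// Sketch of the sampling trick described above: re-run the expensive size
// estimate only at exponentially growing update counts, and extrapolate a
// bytes-per-update rate in between.
class SampledSizeTracker(estimateBytes: () => Long, growthRate: Double = 1.1) {
  private var numUpdates: Long = 0
  private var nextSampleAt: Long = 1
  private var bytesPerUpdate: Double = 0.0
  private var lastSampledBytes: Long = 0
  private var lastSampledUpdates: Long = 0

  def afterUpdate(): Unit = {
    numUpdates += 1
    if (numUpdates >= nextSampleAt) {
      val bytes = estimateBytes()                       // the costly call
      bytesPerUpdate = (bytes - lastSampledBytes).toDouble /
        math.max(1, numUpdates - lastSampledUpdates)
      lastSampledBytes = bytes
      lastSampledUpdates = numUpdates
      nextSampleAt = math.ceil(numUpdates * growthRate).toLong
    }
  }

  def estimateSize(): Long =
    lastSampledBytes + (bytesPerUpdate * (numUpdates - lastSampledUpdates)).toLong
}
{code}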

> Support external sorting for RDD#sortByKey()
> 
>
> Key: SPARK-983
> URL: https://issues.apache.org/jira/browse/SPARK-983
> Project: Spark
>  Issue Type: New Feature
>Affects Versions: 0.9.0
>Reporter: Reynold Xin
>Assignee: Madhu Siddalingaiah
>
> Currently, RDD#sortByKey() is implemented by a mapPartitions which creates a 
> buffer to hold the entire partition, then sorts it. This will cause an OOM if 
> an entire partition cannot fit in memory, which is especially problematic for 
> skewed data. Rather than OOMing, the behavior should be similar to the 
> [ExternalAppendOnlyMap|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala],
>  where we fall back to disk if we detect memory pressure.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2063) Creating a SchemaRDD via sql() does not correctly resolve nested types

2014-06-06 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-2063:
-

 Summary: Creating a SchemaRDD via sql() does not correctly resolve 
nested types
 Key: SPARK-2063
 URL: https://issues.apache.org/jira/browse/SPARK-2063
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.0.0
Reporter: Aaron Davidson
Assignee: Michael Armbrust


For example, from the typical twitter dataset:

{code}
scala> val popularTweets = sql("SELECT retweeted_status.text, 
MAX(retweeted_status.retweet_count) AS s FROM tweets WHERE retweeted_status is 
not NULL GROUP BY retweeted_status.text ORDER BY s DESC LIMIT 30")

scala> popularTweets.toString
14/06/06 21:27:48 INFO analysis.Analyzer: Max iterations (2) reached for batch 
MultiInstanceRelations
14/06/06 21:27:48 INFO analysis.Analyzer: Max iterations (2) reached for batch 
CaseInsensitiveAttributeReferences
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to 
qualifiers on unresolved object, tree: 'retweeted_status.text
at 
org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:51)
at 
org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.qualifiers(unresolved.scala:47)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:67)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$2.apply(LogicalPlan.scala:65)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at 
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:65)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$3$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:100)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$3$$anonfun$applyOrElse$2.applyOrElse(Analyzer.scala:97)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionDown$1(QueryPlan.scala:51)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1$$anonfun$apply$1.apply(QueryPlan.scala:65)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at 
scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$1.apply(QueryPlan.scala:64)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionsDown(QueryPlan.scala:69)
at 
org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressions(QueryPlan.scala:40)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$3.applyOrElse(Analyzer.scala:97)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$$anonfun$apply$3.applyOrElse(Analyzer.scala:94)
at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:217)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:94)
at 
org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences$.apply(Analyzer.scala:93)
at 
org.apache.spark.sql.catalyst.rules.RuleExecutor$$anonfun$apply$1$$anonfun$apply$

[jira] [Created] (SPARK-2028) Users of HadoopRDD cannot access the partition InputSplits

2014-06-04 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-2028:
-

 Summary: Users of HadoopRDD cannot access the partition InputSplits
 Key: SPARK-2028
 URL: https://issues.apache.org/jira/browse/SPARK-2028
 Project: Spark
  Issue Type: Bug
Reporter: Aaron Davidson
Assignee: Aaron Davidson


If a user creates a HadoopRDD (e.g., via textFile), there is no way to find out 
which file it came from, though this information is contained in the InputSplit 
within the RDD. We should find a way to expose this publicly.
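
One shape such an API could take, sketched as a hypothetical helper (the 
signature and names are illustrative, not an existing Spark method):

{code}
import org.apache.hadoop.mapred.{FileSplit, InputSplit}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Hypothetical usage sketch: if the InputSplit were exposed per partition,
// users could recover the originating file name for each record.
object InputSplitExposureSketch {
  // Illustrative signature only -- not an existing Spark API.
  def mapWithSplit[T, U](rdd: RDD[T])(f: (InputSplit, Iterator[T]) => Iterator[U]): RDD[U] = ???

  def linesWithFilenames(sc: SparkContext): RDD[String] = {
    val lines = sc.textFile("hdfs:///data/logs")      // path is illustrative
    mapWithSplit(lines) { (split, iter) =>
      val file = split match {
        case fs: FileSplit => fs.getPath.toString
        case _             => "unknown"
      }
      iter.map(line => s"$file\t$line")
    }
  }
}
{code}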



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1899) Default log4j.properties incorrectly sends all output to stderr and none to stdout

2014-06-04 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson resolved SPARK-1899.
---

Resolution: Won't Fix

See discussion at: https://github.com/apache/spark/pull/852

> Default log4j.properties incorrectly sends all output to stderr and none to 
> stdout
> --
>
> Key: SPARK-1899
> URL: https://issues.apache.org/jira/browse/SPARK-1899
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.0.0
>Reporter: Andrew Ash
>
> Standard practice for unix applications is to send most output to stdout, and 
> errors to stderr.  The default log4j.properties file that Spark includes 
> incorrectly sends all output to stderr and none to stdout.  Update the 
> default log4j to split the output appropriately.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2027) spark-ec2 puts Hadoop's log4j ahead of Spark's in classpath

2014-06-04 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-2027:
-

 Summary: spark-ec2 puts Hadoop's log4j ahead of Spark's in 
classpath
 Key: SPARK-2027
 URL: https://issues.apache.org/jira/browse/SPARK-2027
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.0.0
Reporter: Aaron Davidson
Assignee: Aaron Davidson






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Resolved] (SPARK-1901) Standalone worker updates executor's state ahead of executor process exit

2014-05-30 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1901?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson resolved SPARK-1901.
---

Resolution: Fixed

> Standalone worker updates executor's state ahead of executor process exit
> ---
>
> Key: SPARK-1901
> URL: https://issues.apache.org/jira/browse/SPARK-1901
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 0.9.0
> Environment: spark-1.0 rc10
>Reporter: Zhen Peng
>Assignee: Zhen Peng
> Fix For: 1.0.1
>
>
> The standalone worker updates the executor's state prematurely, leaving the 
> resource status inconsistent until the executor process has really died.
> In our cluster, we found this situation may cause newly submitted applications 
> to be removed by the Master because launching an executor fails.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-1966) Cannot cancel tasks running locally

2014-05-29 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-1966:
-

 Summary: Cannot cancel tasks running locally
 Key: SPARK-1966
 URL: https://issues.apache.org/jira/browse/SPARK-1966
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 0.9.1, 1.0.0
Reporter: Aaron Davidson






--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Comment Edited] (SPARK-983) Support external sorting for RDD#sortByKey()

2014-05-27 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010190#comment-14010190
 ] 

Aaron Davidson edited comment on SPARK-983 at 5/27/14 9:54 PM:
---

Does sound reasonable. For some reason it does not allow me to assign the issue 
to you, though.

Edit: Figured it out, thanks [~pwendell]!


was (Author: ilikerps):
Does sound reasonable. For some reason it does not allow me to assign the issue 
to you, though.

> Support external sorting for RDD#sortByKey()
> 
>
> Key: SPARK-983
> URL: https://issues.apache.org/jira/browse/SPARK-983
> Project: Spark
>  Issue Type: New Feature
>Affects Versions: 0.9.0
>Reporter: Reynold Xin
>Assignee: Madhu Siddalingaiah
>
> Currently, RDD#sortByKey() is implemented by a mapPartitions which creates a 
> buffer to hold the entire partition, then sorts it. This will cause an OOM if 
> an entire partition cannot fit in memory, which is especially problematic for 
> skewed data. Rather than OOMing, the behavior should be similar to the 
> [ExternalAppendOnlyMap|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala],
>  where we fall back to disk if we detect memory pressure.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Updated] (SPARK-983) Support external sorting for RDD#sortByKey()

2014-05-27 Thread Aaron Davidson (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Aaron Davidson updated SPARK-983:
-

Assignee: Madhu Siddalingaiah

> Support external sorting for RDD#sortByKey()
> 
>
> Key: SPARK-983
> URL: https://issues.apache.org/jira/browse/SPARK-983
> Project: Spark
>  Issue Type: New Feature
>Affects Versions: 0.9.0
>Reporter: Reynold Xin
>Assignee: Madhu Siddalingaiah
>
> Currently, RDD#sortByKey() is implemented by a mapPartitions which creates a 
> buffer to hold the entire partition, then sorts it. This will cause an OOM if 
> an entire partition cannot fit in memory, which is especially problematic for 
> skewed data. Rather than OOMing, the behavior should be similar to the 
> [ExternalAppendOnlyMap|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala],
>  where we fall back to disk if we detect memory pressure.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-983) Support external sorting for RDD#sortByKey()

2014-05-27 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14010190#comment-14010190
 ] 

Aaron Davidson commented on SPARK-983:
--

Does sound reasonable. For some reason it does not allow me to assign the issue 
to you, though.

> Support external sorting for RDD#sortByKey()
> 
>
> Key: SPARK-983
> URL: https://issues.apache.org/jira/browse/SPARK-983
> Project: Spark
>  Issue Type: New Feature
>Affects Versions: 0.9.0
>Reporter: Reynold Xin
>
> Currently, RDD#sortByKey() is implemented by a mapPartitions which creates a 
> buffer to hold the entire partition, then sorts it. This will cause an OOM if 
> an entire partition cannot fit in memory, which is especially problematic for 
> skewed data. Rather than OOMing, the behavior should be similar to the 
> [ExternalAppendOnlyMap|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala],
>  where we fall back to disk if we detect memory pressure.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1855) Provide memory-and-local-disk RDD checkpointing

2014-05-26 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14009293#comment-14009293
 ] 

Aaron Davidson commented on SPARK-1855:
---

I agree that significant improvements can be made to Spark's block replication 
model, but there's no reason it shouldn't "work" (albeit with potentially poor 
write performance and fewer guarantees than one would like) if you increase the 
replication level higher than 2, which is possible using 
[StorageLevel#apply|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/StorageLevel.scala#L155].
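
For example, assuming the StorageLevel.apply overload that takes (useDisk, 
useMemory, deserialized, replication), a higher replication factor could be 
requested like this (sketch only):

{code}
import org.apache.spark.SparkContext
import org.apache.spark.storage.StorageLevel

// Sketch: request 3 in-memory replicas via the StorageLevel factory, assuming
// the (useDisk, useMemory, deserialized, replication) overload.
object HighReplicationSketch {
  def cacheWithReplication(sc: SparkContext): Unit = {
    val rdd = sc.parallelize(1 to 1000000)
    rdd.persist(StorageLevel(false, true, true, 3)) // useDisk, useMemory, deserialized, replication
    rdd.count()  // materialize the RDD so the replicas are actually written
  }
}
{code}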

> Provide memory-and-local-disk RDD checkpointing
> ---
>
> Key: SPARK-1855
> URL: https://issues.apache.org/jira/browse/SPARK-1855
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib, Spark Core
>Affects Versions: 1.0.0
>Reporter: Xiangrui Meng
>
> Checkpointing is used to cut long lineage while maintaining fault tolerance. 
> The current implementation is HDFS-based. Using the BlockRDD we can create 
> in-memory-and-local-disk (with replication) checkpoints that are not as 
> reliable as HDFS-based solution but faster.
> It can help applications that require many iterations.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-983) Support external sorting for RDD#sortByKey()

2014-05-25 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14008557#comment-14008557
 ] 

Aaron Davidson commented on SPARK-983:
--

[~pwendell] or [~matei], any opinions on memory management best practices? 
Adding a new memoryFraction for sorting will only exacerbate the problems we 
see with them, but I'm not sure we can rely on Runtime.freeMemory() as even an 
intermediary solution. 

Perhaps this feature could draw from the same pool as shuffle.memoryFraction, 
as it's used for a similar purpose, and that pool already implements some 
notion of memory sharing.

> Support external sorting for RDD#sortByKey()
> 
>
> Key: SPARK-983
> URL: https://issues.apache.org/jira/browse/SPARK-983
> Project: Spark
>  Issue Type: New Feature
>Affects Versions: 0.9.0
>Reporter: Reynold Xin
>
> Currently, RDD#sortByKey() is implemented by a mapPartitions which creates a 
> buffer to hold the entire partition, then sorts it. This will cause an OOM if 
> an entire partition cannot fit in memory, which is especially problematic for 
> skewed data. Rather than OOMing, the behavior should be similar to the 
> [ExternalAppendOnlyMap|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/util/collection/ExternalAppendOnlyMap.scala],
>  where we fall back to disk if we detect memory pressure.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

