[jira] [Commented] (SPARK-2926) Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle

2014-08-09 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14091684#comment-14091684
 ] 

Saisai Shao commented on SPARK-2926:


Hi Sandy,

Thanks a lot for your comments. The basic idea is the same as in MapReduce, but 
the implementation is a little different so that it stays compatible with 
Spark's code path.

The code may share some common parts with the map-side merge. I think we should 
first verify the pros and cons of this approach, and then do the refactoring to 
make it better.

For operations like groupByKey, I'm not sure whether by-key sorting performs 
better than the Aggregator approach, so I currently still use the Aggregator on 
the reduce side. We need to run some performance tests to verify whether by-key 
sorting is necessary there.

> Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
> --
>
> Key: SPARK-2926
> URL: https://issues.apache.org/jira/browse/SPARK-2926
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 1.1.0
>Reporter: Saisai Shao
> Attachments: SortBasedShuffleRead.pdf
>
>
> Spark has already integrated sort-based shuffle write, which greatly 
> improves IO performance and reduces memory consumption when the number of 
> reducers is very large. The reducer side, however, still uses the hash-based 
> shuffle reader, which ignores the ordering of the map output data in some 
> situations.
> Here we propose an MR-style, sort-merge-like shuffle reader for sort-based 
> shuffle to further improve its performance.
> Work-in-progress code and a performance test report will be posted once some 
> unit test bugs are fixed.
> Any comments would be greatly appreciated.
> Thanks a lot.






[jira] [Created] (SPARK-2938) Support SASL authentication in Netty network module

2014-08-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-2938:
--

 Summary: Support SASL authentication in Netty network module
 Key: SPARK-2938
 URL: https://issues.apache.org/jira/browse/SPARK-2938
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin









[jira] [Created] (SPARK-2939) Support fetching in-memory blocks for Netty network module

2014-08-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-2939:
--

 Summary: Support fetching in-memory blocks for Netty network module
 Key: SPARK-2939
 URL: https://issues.apache.org/jira/browse/SPARK-2939
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin









[jira] [Commented] (SPARK-2926) Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle

2014-08-09 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14091685#comment-14091685
 ] 

Matei Zaharia commented on SPARK-2926:
--

Hey Saisai, a couple of questions about this:
- Doesn't the ExternalAppendOnlyMap used on the reduce side already do a 
merge-sort? You won't do much better than that unless you assume that an 
Ordering is given for the key, which isn't actually something we receive at the 
ShuffleWriter from our APIs so far (though we have the opportunity to pass it 
through in Scala).
- What kind of data did you test key sorting on the map side with? The cost of 
comparisons will depend a lot on the data type, but will be much higher for 
things like strings or tuples.
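As a rough illustration of that last point, sorting the same values as Ints 
versus as Strings shows the gap (a quick sketch, not a careful benchmark):

{code}
import scala.util.Random

// Compare the cost of sorting Int keys vs. String keys (illustrative only).
val ints    = Array.fill(1000000)(Random.nextInt())
val strings = ints.map(_.toString)

def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println(s"$label: ${(System.nanoTime() - start) / 1e6} ms")
  result
}

time("sort Int keys")(ints.sorted)       // primitive comparisons
time("sort String keys")(strings.sorted) // per-character lexicographic comparisons
{code}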

> Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
> --
>
> Key: SPARK-2926
> URL: https://issues.apache.org/jira/browse/SPARK-2926
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 1.1.0
>Reporter: Saisai Shao
> Attachments: SortBasedShuffleRead.pdf
>
>
> Spark has already integrated sort-based shuffle write, which greatly 
> improves IO performance and reduces memory consumption when the number of 
> reducers is very large. The reducer side, however, still uses the hash-based 
> shuffle reader, which ignores the ordering of the map output data in some 
> situations.
> Here we propose an MR-style, sort-merge-like shuffle reader for sort-based 
> shuffle to further improve its performance.
> Work-in-progress code and a performance test report will be posted once some 
> unit test bugs are fixed.
> Any comments would be greatly appreciated.
> Thanks a lot.






[jira] [Updated] (SPARK-2468) zero-copy shuffle network communication

2014-08-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2468:
---

Description: 
Right now shuffle send goes through the block manager. This is inefficient 
because it requires loading a block from disk into a kernel buffer, then into a 
user-space buffer, and then back into a kernel send buffer before it reaches the 
NIC. This copies the data multiple times and incurs context switches between 
kernel and user space. It also creates unnecessary buffers in the JVM, which 
increases GC pressure.

Instead, we should use FileChannel.transferTo, which handles this in kernel 
space with zero-copy. See http://www.ibm.com/developerworks/library/j-zerocopy/
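For reference, a minimal sketch of what a transferTo-based send loop looks like 
(illustrative only, not Spark code):

{code}
import java.io.{File, FileInputStream}
import java.nio.channels.{FileChannel, WritableByteChannel}

// Push a file's bytes straight to the target channel (e.g. a socket channel)
// without copying them through a user-space buffer.
def sendFileZeroCopy(file: File, target: WritableByteChannel): Unit = {
  val channel: FileChannel = new FileInputStream(file).getChannel
  try {
    var position = 0L
    val size = channel.size()
    while (position < size) {
      position += channel.transferTo(position, size - position, target)
    }
  } finally {
    channel.close()
  }
}
{code}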

One potential solution is to use Netty. Spark already has a Netty-based 
network module implemented (org.apache.spark.network.netty). However, it lacks 
some functionality and is turned off by default.




  was:
Right now shuffle send goes through the block manager. This is inefficient 
because it requires loading a block from disk into a kernel buffer, then into a 
user-space buffer, and then back into a kernel send buffer before it reaches the 
NIC. This copies the data multiple times and incurs context switches between 
kernel and user space. It also creates unnecessary buffers in the JVM, which 
increases GC pressure.

Instead, we should use FileChannel.transferTo, which handles this in kernel 
space with zero-copy. See http://www.ibm.com/developerworks/library/j-zerocopy/

One potential solution is to use Netty NIO.





> zero-copy shuffle network communication
> ---
>
> Key: SPARK-2468
> URL: https://issues.apache.org/jira/browse/SPARK-2468
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Critical
>
> Right now shuffle send goes through the block manager. This is inefficient 
> because it requires loading a block from disk into a kernel buffer, then into 
> a user-space buffer, and then back into a kernel send buffer before it reaches 
> the NIC. This copies the data multiple times and incurs context switches 
> between kernel and user space. It also creates unnecessary buffers in the JVM, 
> which increases GC pressure.
> Instead, we should use FileChannel.transferTo, which handles this in kernel 
> space with zero-copy. See 
> http://www.ibm.com/developerworks/library/j-zerocopy/
> One potential solution is to use Netty. Spark already has a Netty-based 
> network module implemented (org.apache.spark.network.netty). However, it 
> lacks some functionality and is turned off by default.






[jira] [Created] (SPARK-2940) Support fetching multiple blocks in a single request in Netty network module

2014-08-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-2940:
--

 Summary: Support fetching multiple blocks in a single request in 
Netty network module
 Key: SPARK-2940
 URL: https://issues.apache.org/jira/browse/SPARK-2940
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


ShuffleCopier.getBlock gets one block at a time (in one request).

{code}
  def getBlock(host: String, port: Int, blockId: BlockId,
      resultCollectCallback: (BlockId, Long, ByteBuf) => Unit) {

    val handler = new ShuffleCopier.ShuffleClientHandler(resultCollectCallback)
    val connectTimeout = conf.getInt("spark.shuffle.netty.connect.timeout", 60000)
    val fc = new FileClient(handler, connectTimeout)

    try {
      fc.init()
      fc.connect(host, port)
      fc.sendRequest(blockId.name)
      fc.waitForClose()
      fc.close()
    } catch {
      // Handle any socket-related exceptions in FileClient
      case e: Exception => {
        logError("Shuffle copy of block " + blockId + " from " + host + ":" + port + " failed", e)
        handler.handleError(blockId)
      }
    }
  }
{code}
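A batched variant could look roughly like the sketch below. None of the batched 
method names or the request encoding shown here exist today; they are 
placeholders to illustrate the desired shape (one connection, one request, many 
blocks):

{code}
  // Hypothetical sketch only -- FileClient has no multi-block request today.
  def getBlocks(host: String, port: Int, blockIds: Seq[BlockId],
      resultCollectCallback: (BlockId, Long, ByteBuf) => Unit) {

    val handler = new ShuffleCopier.ShuffleClientHandler(resultCollectCallback)
    val connectTimeout = conf.getInt("spark.shuffle.netty.connect.timeout", 60000)
    val fc = new FileClient(handler, connectTimeout)

    try {
      fc.init()
      fc.connect(host, port)
      // One request carrying all block ids instead of one request per block.
      // The wire encoding here is a placeholder, not an existing protocol.
      fc.sendRequest(blockIds.map(_.name).mkString(","))
      fc.waitForClose()
      fc.close()
    } catch {
      case e: Exception =>
        logError("Shuffle copy of blocks from " + host + ":" + port + " failed", e)
        blockIds.foreach(handler.handleError)
    }
  }
{code}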






[jira] [Created] (SPARK-2941) Add config option to support NIO vs OIO in Netty network module

2014-08-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-2941:
--

 Summary: Add config option to support NIO vs OIO in Netty network 
module
 Key: SPARK-2941
 URL: https://issues.apache.org/jira/browse/SPARK-2941
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin









[jira] [Created] (SPARK-2942) Report error messages back from server to client

2014-08-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-2942:
--

 Summary: Report error messages back from server to client
 Key: SPARK-2942
 URL: https://issues.apache.org/jira/browse/SPARK-2942
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin


Similar to SPARK-2583, but for the Netty module.







[jira] [Updated] (SPARK-2936) Migrate Netty network module from Java to Scala

2014-08-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2936?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2936:
---

Issue Type: Sub-task  (was: Improvement)
Parent: SPARK-2468

> Migrate Netty network module from Java to Scala
> ---
>
> Key: SPARK-2936
> URL: https://issues.apache.org/jira/browse/SPARK-2936
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Affects Versions: 1.1.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> The Netty network module was originally written when Scala 2.9.x had a bug 
> that prevented a pure Scala implementation, so a subset of the files were 
> written in Java. We have since upgraded to Scala 2.10 and can now migrate all 
> of the Java files to Scala.
> https://github.com/netty/netty/issues/781
> https://github.com/mesos/spark/pull/522






[jira] [Updated] (SPARK-2939) Support fetching in-memory blocks for Netty network module

2014-08-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2939:
---

Target Version/s: 1.2.0

> Support fetching in-memory blocks for Netty network module
> --
>
> Key: SPARK-2939
> URL: https://issues.apache.org/jira/browse/SPARK-2939
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>







[jira] [Updated] (SPARK-2942) Report error messages back from server to client

2014-08-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2942:
---

Target Version/s: 1.2.0

> Report error messages back from server to client
> 
>
> Key: SPARK-2942
> URL: https://issues.apache.org/jira/browse/SPARK-2942
> Project: Spark
>  Issue Type: Sub-task
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>
> Similar to SPARK-2583, but for the Netty module.






[jira] [Created] (SPARK-2943) Create config options for Netty sendBufferSize and receiveBufferSize

2014-08-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-2943:
--

 Summary: Create config options for Netty sendBufferSize and 
receiveBufferSize
 Key: SPARK-2943
 URL: https://issues.apache.org/jira/browse/SPARK-2943
 Project: Spark
  Issue Type: Sub-task
Reporter: Reynold Xin









[jira] [Commented] (SPARK-2926) Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle

2014-08-09 Thread Matei Zaharia (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14091693#comment-14091693
 ] 

Matei Zaharia commented on SPARK-2926:
--

Basically because of these things, it would be great to see results from a 
prototype to decide whether we want a different code path here. One place where 
it might help a lot is sortByKey, but even there, the sorting algorithm we use 
(TimSort) actually does take advantage of partially sorted runs. And we have 
the problem of not passing an ordering on the map side yet. To me it's really 
non-obvious whether sorting on the map side will make the overall job faster -- 
it might just move CPU cost around from the reduce task to the map task.

> Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
> --
>
> Key: SPARK-2926
> URL: https://issues.apache.org/jira/browse/SPARK-2926
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 1.1.0
>Reporter: Saisai Shao
> Attachments: SortBasedShuffleRead.pdf
>
>
> Spark has already integrated sort-based shuffle write, which greatly 
> improves IO performance and reduces memory consumption when the number of 
> reducers is very large. The reducer side, however, still uses the hash-based 
> shuffle reader, which ignores the ordering of the map output data in some 
> situations.
> Here we propose an MR-style, sort-merge-like shuffle reader for sort-based 
> shuffle to further improve its performance.
> Work-in-progress code and a performance test report will be posted once some 
> unit test bugs are fixed.
> Any comments would be greatly appreciated.
> Thanks a lot.






[jira] [Created] (SPARK-2944) sc.makeRDD doesn't distribute partitions evenly

2014-08-09 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-2944:


 Summary: sc.makeRDD doesn't distribute partitions evenly
 Key: SPARK-2944
 URL: https://issues.apache.org/jira/browse/SPARK-2944
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.1.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng
Priority: Critical


On a 16-node EC2 cluster:

{code}
val rdd = sc.makeRDD(0 until 1e9, 1000).cache()
rdd.count()
{code}

Saw 156 partitions on one node while only 8 partitions on another.
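
A rough way to see where the cached partitions actually land (a sketch; task 
scheduling locality means this is only an approximation):

{code}
// Count how many partitions each host ends up processing after the cache.
val partitionsPerHost = rdd
  .mapPartitions { _ => Iterator(java.net.InetAddress.getLocalHost.getHostName) }
  .countByValue()
partitionsPerHost.toSeq.sortBy(-_._2).foreach { case (host, n) => println(s"$host: $n") }
{code}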







[jira] [Updated] (SPARK-2944) sc.makeRDD doesn't distribute partitions evenly

2014-08-09 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2944:
-

Description: 
On a 16-node EC2 cluster:

{code}
val rdd = sc.makeRDD(0 until 1e9.toInt, 1000).cache()
rdd.count()
{code}

Saw 156 partitions on one node while only 8 partitions on another.


  was:
On a 16-node EC2 cluster:

{code}
val rdd = sc.makeRDD(0 until 1e9, 1000).cache()
rdd.count()
{code}

Saw 156 partitions on one node while only 8 partitions on another.



> sc.makeRDD doesn't distribute partitions evenly
> ---
>
> Key: SPARK-2944
> URL: https://issues.apache.org/jira/browse/SPARK-2944
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> On a 16-node EC2 cluster:
> {code}
> val rdd = sc.makeRDD(0 until 1e9.toInt, 1000).cache()
> rdd.count()
> {code}
> Saw 156 partitions on one node while only 8 partitions on another.






[jira] [Resolved] (SPARK-2861) Doc comment of DoubleRDDFunctions.histogram is incorrect

2014-08-09 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2861.


   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 1786
[https://github.com/apache/spark/pull/1786]

> Doc comment of DoubleRDDFunctions.histogram is incorrect
> 
>
> Key: SPARK-2861
> URL: https://issues.apache.org/jira/browse/SPARK-2861
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0, 0.9.1, 1.0.0
>Reporter: Chandan Kumar
>Priority: Trivial
> Fix For: 1.1.0
>
>
> The documentation comment of the histogram method of the DoubleRDDFunctions 
> class in DoubleRDDFunctions.scala is inconsistent. This might confuse somebody 
> reading the documentation.
> Comment in question:
> {code}
>   /**
>* Compute a histogram using the provided buckets. The buckets are all open
>* to the left except for the last which is closed
>*  e.g. for the array
>*  [1, 10, 20, 50] the buckets are [1, 10) [10, 20) [20, 50]
>*  e.g 1<=x<10 , 10<=x<20, 20<=x<50
>*  And on the input of 1 and 50 we would have a histogram of 1, 0, 0
> {code}
> The buckets are actually all open to the right (NOT the left), except for the 
> last one, which is closed.
> For the example quoted, the last bucket should be 20<=x<=50.
> Also, the histogram result for the input of 1 and 50 would be 1, 0, 1 (NOT 
> 1, 0, 0). This works correctly in Spark; only the doc comment is incorrect.
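
A quick way to check the actual behavior (a sketch; assumes a running 
SparkContext {{sc}}, e.g. in spark-shell):

{code}
// Two values, 1.0 and 50.0, against the buckets from the doc comment.
val counts = sc.parallelize(Seq(1.0, 50.0)).histogram(Array(1.0, 10.0, 20.0, 50.0))
// counts == Array(1, 0, 1): the last bucket [20, 50] is closed on the right.
{code}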






[jira] [Updated] (SPARK-2861) Doc comment of DoubleRDDFunctions.histogram is incorrect

2014-08-09 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2861:
---

Assignee: Chandan Kumar

> Doc comment of DoubleRDDFunctions.histogram is incorrect
> 
>
> Key: SPARK-2861
> URL: https://issues.apache.org/jira/browse/SPARK-2861
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0, 0.9.1, 1.0.0
>Reporter: Chandan Kumar
>Assignee: Chandan Kumar
>Priority: Trivial
> Fix For: 1.1.0
>
>
> The documentation comment of the histogram method of the DoubleRDDFunctions 
> class in DoubleRDDFunctions.scala is inconsistent. This might confuse somebody 
> reading the documentation.
> Comment in question:
> {code}
>   /**
>* Compute a histogram using the provided buckets. The buckets are all open
>* to the left except for the last which is closed
>*  e.g. for the array
>*  [1, 10, 20, 50] the buckets are [1, 10) [10, 20) [20, 50]
>*  e.g 1<=x<10 , 10<=x<20, 20<=x<50
>*  And on the input of 1 and 50 we would have a histogram of 1, 0, 0
> {code}
> The buckets are actually all open to the right (NOT the left), except for the 
> last one, which is closed.
> For the example quoted, the last bucket should be 20<=x<=50.
> Also, the histogram result for the input of 1 and 50 would be 1, 0, 1 (NOT 
> 1, 0, 0). This works correctly in Spark; only the doc comment is incorrect.






[jira] [Commented] (SPARK-2944) sc.makeRDD doesn't distribute partitions evenly

2014-08-09 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14091699#comment-14091699
 ] 

Patrick Wendell commented on SPARK-2944:


Hey [~mengxr], do you know how the behavior differs from Spark 1.0? Also, if 
there is a clear difference, could you see if the behavior is modified by this 
patch?

https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=63bdb1f41b4895e3a9444f7938094438a94d3007

> sc.makeRDD doesn't distribute partitions evenly
> ---
>
> Key: SPARK-2944
> URL: https://issues.apache.org/jira/browse/SPARK-2944
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> On a 16-node EC2 cluster:
> {code}
> val rdd = sc.makeRDD(0 until 1e9.toInt, 1000).cache()
> rdd.count()
> {code}
> Saw 156 partitions on one node while only 8 partitions on another.






[jira] [Created] (SPARK-2945) Allow specifying num of executors in the context configuration

2014-08-09 Thread Shay Rojansky (JIRA)
Shay Rojansky created SPARK-2945:


 Summary: Allow specifying num of executors in the context 
configuration
 Key: SPARK-2945
 URL: https://issues.apache.org/jira/browse/SPARK-2945
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: Ubuntu precise, on YARN (CDH 5.1.0)
Reporter: Shay Rojansky


Running on YARN, the only way to specify the number of executors seems to be on 
the command line of spark-submit, via the --num-executors switch.

In many cases this is too early. Our Spark app receives some cmdline arguments 
which determine the amount of work that needs to be done - and that affects the 
number of executors it ideally requires. Ideally, the Spark context 
configuration would support specifying this like any other config param.

Our current workaround is a wrapper script that determines how much work is 
needed, and which itself launches spark-submit with the number passed to 
--num-executors - it's a shame to have to do this.







[jira] [Created] (SPARK-2946) Allow specifying * for --num-executors in YARN

2014-08-09 Thread Shay Rojansky (JIRA)
Shay Rojansky created SPARK-2946:


 Summary: Allow specifying * for --num-executors in YARN
 Key: SPARK-2946
 URL: https://issues.apache.org/jira/browse/SPARK-2946
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
 Environment: Ubuntu precise, on YARN (CDH 5.1.0)
Reporter: Shay Rojansky
Priority: Minor


It would be useful to allow specifying --num-executors * when submitting jobs 
to YARN, and to have Spark automatically determine how many total cores are 
available in the cluster by querying YARN.

Our scenario is multiple users running research batch jobs. We never want to 
have a situation where cluster resources aren't being used, so ideally users 
would specify * and let YARN scheduling and preemption ensure fairness.






[jira] [Commented] (SPARK-2931) getAllowedLocalityLevel() throws ArrayIndexOutOfBoundsException

2014-08-09 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14091700#comment-14091700
 ] 

Patrick Wendell commented on SPARK-2931:


[~matei] Hey Matei - IIRC you looked at this patch a bunch. Do you have any 
guesses as to what is causing this?

> getAllowedLocalityLevel() throws ArrayIndexOutOfBoundsException
> ---
>
> Key: SPARK-2931
> URL: https://issues.apache.org/jira/browse/SPARK-2931
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
> Environment: Spark EC2, spark-1.1.0-snapshot1, sort-by-key spark-perf 
> benchmark
>Reporter: Josh Rosen
>Priority: Blocker
> Fix For: 1.1.0
>
>
> When running Spark Perf's sort-by-key benchmark on EC2 with v1.1.0-snapshot, 
> I get the following errors (one per task):
> {code}
> 14/08/08 18:54:22 INFO scheduler.TaskSetManager: Starting task 39.0 in stage 
> 0.0 (TID 39, ip-172-31-14-30.us-west-2.compute.internal, PROCESS_LOCAL, 1003 
> bytes)
> 14/08/08 18:54:22 INFO cluster.SparkDeploySchedulerBackend: Registered 
> executor: 
> Actor[akka.tcp://sparkexecu...@ip-172-31-9-213.us-west-2.compute.internal:58901/user/Executor#1436065036]
>  with ID 0
> 14/08/08 18:54:22 ERROR actor.OneForOneStrategy: 1
> java.lang.ArrayIndexOutOfBoundsException: 1
>   at 
> org.apache.spark.scheduler.TaskSetManager.getAllowedLocalityLevel(TaskSetManager.scala:475)
>   at 
> org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:409)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7$$anonfun$apply$2.apply$mcVI$sp(TaskSchedulerImpl.scala:261)
>   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:257)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:254)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:254)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:254)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:254)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor.makeOffers(CoarseGrainedSchedulerBackend.scala:153)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:103)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}
> This causes the job to hang.
> I can deterministically reproduce this by re-running the test, either in 
> isolation or as part of the full performance testing suite.






[jira] [Commented] (SPARK-2945) Allow specifying num of executors in the context configuration

2014-08-09 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14091701#comment-14091701
 ] 

Patrick Wendell commented on SPARK-2945:


Hey [~roji] I believe this already exists - there is an option 
"spark.executor.instances". I think in the past we didn't document this, but we 
probably should. [~sandyr] should be able to confirm this as well.
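
For example, something along these lines should work (an untested sketch, 
assuming the YARN backend honors the setting as described above):

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Placeholder for the application logic that decides how much work there is.
def computeExecutorsForThisRun(): Int = 8

// Set the executor count on the context configuration instead of --num-executors.
val conf = new SparkConf()
  .setAppName("MyApp")
  .set("spark.executor.instances", computeExecutorsForThisRun().toString)
val sc = new SparkContext(conf)
{code}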

> Allow specifying num of executors in the context configuration
> --
>
> Key: SPARK-2945
> URL: https://issues.apache.org/jira/browse/SPARK-2945
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0
> Environment: Ubuntu precise, on YARN (CDH 5.1.0)
>Reporter: Shay Rojansky
>
> Running on YARN, the only way to specify the number of executors seems to be 
> on the command line of spark-submit, via the --num-executors switch.
> In many cases this is too early. Our Spark app receives some cmdline 
> arguments which determine the amount of work that needs to be done - and that 
> affects the number of executors it ideally requires. Ideally, the Spark 
> context configuration would support specifying this like any other config 
> param.
> Our current workaround is a wrapper script that determines how much work is 
> needed, and which itself launches spark-submit with the number passed to 
> --num-executors - it's a shame to have to do this.






[jira] [Updated] (SPARK-2945) Allow specifying num of executors in the context configuration

2014-08-09 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2945?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2945:
---

Component/s: YARN

> Allow specifying num of executors in the context configuration
> --
>
> Key: SPARK-2945
> URL: https://issues.apache.org/jira/browse/SPARK-2945
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 1.0.0
> Environment: Ubuntu precise, on YARN (CDH 5.1.0)
>Reporter: Shay Rojansky
>
> Running on YARN, the only way to specify the number of executors seems to be 
> on the command line of spark-submit, via the --num-executors switch.
> In many cases this is too early. Our Spark app receives some cmdline 
> arguments which determine the amount of work that needs to be done - and that 
> affects the number of executors it ideally requires. Ideally, the Spark 
> context configuration would support specifying this like any other config 
> param.
> Our current workaround is a wrapper script that determines how much work is 
> needed, and which itself launches spark-submit with the number passed to 
> --num-executors - it's a shame to have to do this.






[jira] [Commented] (SPARK-2945) Allow specifying num of executors in the context configuration

2014-08-09 Thread Shay Rojansky (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14091702#comment-14091702
 ] 

Shay Rojansky commented on SPARK-2945:
--

That would be great news indeed; I didn't find any documentation for this...

I'll test this tomorrow and confirm here whether it works as expected. Thanks!

> Allow specifying num of executors in the context configuration
> --
>
> Key: SPARK-2945
> URL: https://issues.apache.org/jira/browse/SPARK-2945
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 1.0.0
> Environment: Ubuntu precise, on YARN (CDH 5.1.0)
>Reporter: Shay Rojansky
>
> Running on YARN, the only way to specify the number of executors seems to be 
> on the command line of spark-submit, via the --num-executors switch.
> In many cases this is too early. Our Spark app receives some cmdline 
> arguments which determine the amount of work that needs to be done - and that 
> affects the number of executors it ideally requires. Ideally, the Spark 
> context configuration would support specifying this like any other config 
> param.
> Our current workaround is a wrapper script that determines how much work is 
> needed, and which itself launches spark-submit with the number passed to 
> --num-executors - it's a shame to have to do this.






[jira] [Commented] (SPARK-2931) getAllowedLocalityLevel() throws ArrayIndexOutOfBoundsException

2014-08-09 Thread Kay Ousterhout (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14091703#comment-14091703
 ] 

Kay Ousterhout commented on SPARK-2931:
---

Josh and I chatted a bit about this offline.  I suspect what's happening is:

Here we recompute myLocalityLevels: 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L693

...but don't reset currentLocalityLevel.

So it seems that if the number of locality levels decreases, 
currentLocalityLevel can end up being higher than the set of allowed levels. If 
this is indeed the problem, it looks like the code was last changed by the 
patch you mentioned, Patrick/Josh.
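
In other words, something like the guard below seems to be missing when the 
levels are recomputed. This is only a sketch of the idea, not an actual patch; 
the names follow TaskSetManager, but the helper method itself is hypothetical:

{code}
// After recomputing the valid locality levels, clamp the saved index so it
// can never point past the end of the new (possibly shorter) array.
def recomputeLocality(): Unit = {
  myLocalityLevels = computeValidLocalityLevels()
  localityWaits = myLocalityLevels.map(getLocalityWait)
  currentLocalityIndex = math.min(currentLocalityIndex, myLocalityLevels.length - 1)
}
{code}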

> getAllowedLocalityLevel() throws ArrayIndexOutOfBoundsException
> ---
>
> Key: SPARK-2931
> URL: https://issues.apache.org/jira/browse/SPARK-2931
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
> Environment: Spark EC2, spark-1.1.0-snapshot1, sort-by-key spark-perf 
> benchmark
>Reporter: Josh Rosen
>Priority: Blocker
> Fix For: 1.1.0
>
>
> When running Spark Perf's sort-by-key benchmark on EC2 with v1.1.0-snapshot, 
> I get the following errors (one per task):
> {code}
> 14/08/08 18:54:22 INFO scheduler.TaskSetManager: Starting task 39.0 in stage 
> 0.0 (TID 39, ip-172-31-14-30.us-west-2.compute.internal, PROCESS_LOCAL, 1003 
> bytes)
> 14/08/08 18:54:22 INFO cluster.SparkDeploySchedulerBackend: Registered 
> executor: 
> Actor[akka.tcp://sparkexecu...@ip-172-31-9-213.us-west-2.compute.internal:58901/user/Executor#1436065036]
>  with ID 0
> 14/08/08 18:54:22 ERROR actor.OneForOneStrategy: 1
> java.lang.ArrayIndexOutOfBoundsException: 1
>   at 
> org.apache.spark.scheduler.TaskSetManager.getAllowedLocalityLevel(TaskSetManager.scala:475)
>   at 
> org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:409)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7$$anonfun$apply$2.apply$mcVI$sp(TaskSchedulerImpl.scala:261)
>   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:257)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:254)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:254)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:254)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:254)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor.makeOffers(CoarseGrainedSchedulerBackend.scala:153)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:103)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}
> This causes the job to hang.
> I can deterministically reproduce this by re-running the test, either in 
> isolation or as part of the full performance testing suite.






[jira] [Commented] (SPARK-1406) PMML model evaluation support via MLib

2014-08-09 Thread Vincenzo Selvaggio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14091738#comment-14091738
 ] 

Vincenzo Selvaggio commented on SPARK-1406:
---

I agree with Sean; I could see export to PMML being quite useful, as it would 
decouple an application (wanting only to do scoring) from the evaluation of the 
model, which can run on a full-blown Spark cluster.

However, I am not sure about using JPMML to generate the PMML. It would 
certainly be the easier option, but what about licensing? 
https://github.com/jpmml/jpmml-model is BSD 3-Clause, while Spark is of course 
Apache 2.0.

> PMML model evaluation support via MLib
> --
>
> Key: SPARK-1406
> URL: https://issues.apache.org/jira/browse/SPARK-1406
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Thomas Darimont
>
> It would be useful if Spark provided support for the evaluation of PMML 
> models (http://www.dmg.org/v4-2/GeneralStructure.html).
> This would allow using analytical models that were created with a statistical 
> modeling tool like R, SAS, SPSS, etc. with Spark (MLlib), which would perform 
> the actual model evaluation for a given input tuple. The PMML model would then 
> just contain the "parameterization" of an analytical model.
> Other projects like JPMML-Evaluator do a similar thing.
> https://github.com/jpmml/jpmml/tree/master/pmml-evaluator






[jira] [Commented] (SPARK-1406) PMML model evaluation support via MLib

2014-08-09 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14091741#comment-14091741
 ] 

Sean Owen commented on SPARK-1406:
--

BSD 3-Clause is compatible with AL2. Spark's distribution already distributes 
components under this license. See 
http://www.apache.org/dev/licensing-howto.html#permissive-deps

The license issue with JPMML is that its openscoring module is AGPL, which 
isn't compatible. https://github.com/jpmml/openscoring

> PMML model evaluation support via MLib
> --
>
> Key: SPARK-1406
> URL: https://issues.apache.org/jira/browse/SPARK-1406
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Thomas Darimont
>
> It would be useful if Spark provided support for the evaluation of PMML 
> models (http://www.dmg.org/v4-2/GeneralStructure.html).
> This would allow using analytical models that were created with a statistical 
> modeling tool like R, SAS, SPSS, etc. with Spark (MLlib), which would perform 
> the actual model evaluation for a given input tuple. The PMML model would then 
> just contain the "parameterization" of an analytical model.
> Other projects like JPMML-Evaluator do a similar thing.
> https://github.com/jpmml/jpmml/tree/master/pmml-evaluator






[jira] [Commented] (SPARK-1406) PMML model evaluation support via MLib

2014-08-09 Thread Vincenzo Selvaggio (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14091747#comment-14091747
 ] 

Vincenzo Selvaggio commented on SPARK-1406:
---

Thanks for clarifying.

> PMML model evaluation support via MLib
> --
>
> Key: SPARK-1406
> URL: https://issues.apache.org/jira/browse/SPARK-1406
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Thomas Darimont
>
> It would be useful if Spark provided support for the evaluation of PMML 
> models (http://www.dmg.org/v4-2/GeneralStructure.html).
> This would allow using analytical models that were created with a statistical 
> modeling tool like R, SAS, SPSS, etc. with Spark (MLlib), which would perform 
> the actual model evaluation for a given input tuple. The PMML model would then 
> just contain the "parameterization" of an analytical model.
> Other projects like JPMML-Evaluator do a similar thing.
> https://github.com/jpmml/jpmml/tree/master/pmml-evaluator






[jira] [Commented] (SPARK-2931) getAllowedLocalityLevel() throws ArrayIndexOutOfBoundsException

2014-08-09 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14091746#comment-14091746
 ] 

Mridul Muralidharan commented on SPARK-2931:


[~kayousterhout] this is weird; I remember mentioning this exact same issue in 
some PR for 1.1 (trying to find which one, though not 1313 iirc), and I think 
it was supposed to have been addressed.
We had observed this issue of currentLocalityLevel running away when we merged 
the PR internally.

Strange that it was not addressed; it speaks volumes about me not following up 
on my reviews!

> getAllowedLocalityLevel() throws ArrayIndexOutOfBoundsException
> ---
>
> Key: SPARK-2931
> URL: https://issues.apache.org/jira/browse/SPARK-2931
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
> Environment: Spark EC2, spark-1.1.0-snapshot1, sort-by-key spark-perf 
> benchmark
>Reporter: Josh Rosen
>Priority: Blocker
> Fix For: 1.1.0
>
>
> When running Spark Perf's sort-by-key benchmark on EC2 with v1.1.0-snapshot, 
> I get the following errors (one per task):
> {code}
> 14/08/08 18:54:22 INFO scheduler.TaskSetManager: Starting task 39.0 in stage 
> 0.0 (TID 39, ip-172-31-14-30.us-west-2.compute.internal, PROCESS_LOCAL, 1003 
> bytes)
> 14/08/08 18:54:22 INFO cluster.SparkDeploySchedulerBackend: Registered 
> executor: 
> Actor[akka.tcp://sparkexecu...@ip-172-31-9-213.us-west-2.compute.internal:58901/user/Executor#1436065036]
>  with ID 0
> 14/08/08 18:54:22 ERROR actor.OneForOneStrategy: 1
> java.lang.ArrayIndexOutOfBoundsException: 1
>   at 
> org.apache.spark.scheduler.TaskSetManager.getAllowedLocalityLevel(TaskSetManager.scala:475)
>   at 
> org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:409)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7$$anonfun$apply$2.apply$mcVI$sp(TaskSchedulerImpl.scala:261)
>   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:257)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:254)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:254)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:254)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:254)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor.makeOffers(CoarseGrainedSchedulerBackend.scala:153)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:103)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}
> This causes the job to hang.
> I can deterministically reproduce this by re-running the test, either in 
> isolation or as part of the full performance testing suite.






[jira] [Commented] (SPARK-2926) Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle

2014-08-09 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14091753#comment-14091753
 ] 

Saisai Shao commented on SPARK-2926:


Hi Matei, thanks a lot for your comments.

The original point of this proposal is to merge the data directly, without 
re-sorting (as happens when spilling with the ExternalAppendOnlyMap, EAOM), when 
the map outputs are already sorted by key (with a key Ordering or a hash-code 
ordering). If no Ordering is available or needed, as for groupByKey, the current 
design still uses the EAOM to do the aggregation.
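
For illustration, the kind of k-way merge this relies on, over map output 
streams that are already sorted by key, looks roughly like the sketch below 
(generic code, not the prototype itself):

{code}
import scala.collection.mutable

// Merge already-sorted (K, V) iterators into one sorted iterator without re-sorting.
def mergeSorted[K, V](streams: Seq[Iterator[(K, V)]])
    (implicit ord: Ordering[K]): Iterator[(K, V)] = {
  // Order buffered iterators by their head key; reverse because PriorityQueue is a max-heap.
  val heap = new mutable.PriorityQueue[BufferedIterator[(K, V)]]()(
    Ordering.by[BufferedIterator[(K, V)], K](_.head._1).reverse)
  streams.map(_.buffered).filter(_.hasNext).foreach(heap.enqueue(_))

  new Iterator[(K, V)] {
    override def hasNext: Boolean = heap.nonEmpty
    override def next(): (K, V) = {
      val it = heap.dequeue()
      val kv = it.next()
      if (it.hasNext) heap.enqueue(it)   // re-insert with its new head key
      kv
    }
  }
}

// e.g. mergeSorted(Seq(Iterator(1 -> "a", 3 -> "c"), Iterator(2 -> "b"))).toList
// gives List((1,a), (2,b), (3,c)).
{code}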

I used the SparkPerf sort-by-key workload to test two shuffle implementations:

1. Sort shuffle write with hash shuffle read (the current sort-based shuffle 
implementation).
2. Sort shuffle write with sort-merge shuffle read (my prototype).

The test data type is String, the key and value length is 10, the record count 
is 2G, and the data is stored in HDFS. My rough test shows that my prototype may 
be slower for shuffle write (1.18x slower) because of the extra key comparison, 
but 2.6x faster than the HashShuffleReader on the reduce side.

I have to admit that sort-by-key alone cannot fully illustrate the necessity of 
this proposal, and the method favors the sortByKey scenario. I will continue 
with other workload tests to see whether this method is really necessary, and 
will post the results later.

Thanks again for your comments.

> Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
> --
>
> Key: SPARK-2926
> URL: https://issues.apache.org/jira/browse/SPARK-2926
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 1.1.0
>Reporter: Saisai Shao
> Attachments: SortBasedShuffleRead.pdf
>
>
> Spark has already integrated sort-based shuffle write, which greatly 
> improves IO performance and reduces memory consumption when the number of 
> reducers is very large. The reducer side, however, still uses the hash-based 
> shuffle reader, which ignores the ordering of the map output data in some 
> situations.
> Here we propose an MR-style, sort-merge-like shuffle reader for sort-based 
> shuffle to further improve its performance.
> Work-in-progress code and a performance test report will be posted once some 
> unit test bugs are fixed.
> Any comments would be greatly appreciated.
> Thanks a lot.






[jira] [Comment Edited] (SPARK-2926) Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle

2014-08-09 Thread Saisai Shao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14091753#comment-14091753
 ] 

Saisai Shao edited comment on SPARK-2926 at 8/9/14 2:09 PM:


Hi Matei, thanks a lot for your comments.

The original point of this proposal is to merge the data directly, without 
re-sorting (as happens when spilling with the ExternalAppendOnlyMap, EAOM), when 
the map outputs are already sorted by key (with a key Ordering or a hash-code 
ordering). If no Ordering is available or needed, as for groupByKey, the current 
design still uses the EAOM to do the aggregation.

I used the SparkPerf sort-by-key workload to test two shuffle implementations:

1. Sort shuffle write with hash shuffle read (the current sort-based shuffle 
implementation).
2. Sort shuffle write with sort-merge shuffle read (my prototype).

The test data type is String, the key and value length is 10, the record count 
is 2G, and the data is stored in HDFS. My rough test shows that my prototype may 
be slower for shuffle write (1.18x slower) because of the extra key comparison, 
but 2.6x faster than the HashShuffleReader on the reduce side.

I have to admit that sort-by-key alone cannot fully illustrate the necessity of 
this proposal, and the method favors the sortByKey scenario. I will continue 
with other workload tests to see whether this method is really necessary, and 
will post the results later.

At least I think this method can use memory more effectively and alleviate GC 
pressure, because it stores the map output partitions in memory as raw 
ByteBuffers and does not need to maintain a large hash map to do the 
aggregation.

Thanks again for your comments :).


was (Author: jerryshao):
Hi Matei, thanks a lot for your comments.

The original point of this proposal is to merge the data directly, without 
re-sorting (as happens when spilling with the ExternalAppendOnlyMap, EAOM), when 
the map outputs are already sorted by key (with a key Ordering or a hash-code 
ordering). If no Ordering is available or needed, as for groupByKey, the current 
design still uses the EAOM to do the aggregation.

I used the SparkPerf sort-by-key workload to test two shuffle implementations:

1. Sort shuffle write with hash shuffle read (the current sort-based shuffle 
implementation).
2. Sort shuffle write with sort-merge shuffle read (my prototype).

The test data type is String, the key and value length is 10, the record count 
is 2G, and the data is stored in HDFS. My rough test shows that my prototype may 
be slower for shuffle write (1.18x slower) because of the extra key comparison, 
but 2.6x faster than the HashShuffleReader on the reduce side.

I have to admit that sort-by-key alone cannot fully illustrate the necessity of 
this proposal, and the method favors the sortByKey scenario. I will continue 
with other workload tests to see whether this method is really necessary, and 
will post the results later.

Thanks again for your comments.

> Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
> --
>
> Key: SPARK-2926
> URL: https://issues.apache.org/jira/browse/SPARK-2926
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 1.1.0
>Reporter: Saisai Shao
> Attachments: SortBasedShuffleRead.pdf
>
>
> Spark has already integrated sort-based shuffle write, which greatly 
> improves IO performance and reduces memory consumption when the number of 
> reducers is very large. The reducer side, however, still uses the hash-based 
> shuffle reader, which ignores the ordering of the map output data in some 
> situations.
> Here we propose an MR-style, sort-merge-like shuffle reader for sort-based 
> shuffle to further improve its performance.
> Work-in-progress code and a performance test report will be posted once some 
> unit test bugs are fixed.
> Any comments would be greatly appreciated.
> Thanks a lot.






[jira] [Created] (SPARK-2947) DAGScheduler scheduling dead cycle

2014-08-09 Thread Guoqiang Li (JIRA)
Guoqiang Li created SPARK-2947:
--

 Summary: DAGScheduler scheduling dead cycle
 Key: SPARK-2947
 URL: https://issues.apache.org/jira/browse/SPARK-2947
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2, 1.0.0
Reporter: Guoqiang Li
Priority: Blocker


The stage is resubmitted more than 5 times.
This seems to be caused by {{FetchFailed.bmAddress}} being null.
I don't know how to reproduce it.

log:
{noformat}
14/08/09 21:50:17 INFO scheduler.TaskSetManager: Starting task 1.189:276 as TID 
52334 on executor 82: sanshan (PROCESS_LOCAL)
14/08/09 21:50:17 INFO scheduler.TaskSetManager: Serialized task 1.189:276 as 
3060 bytes in 0 ms
14/08/09 21:50:17 INFO scheduler.TaskSetManager: Starting task 1.189:277 as TID 
52335 on executor 78: tuan231 (PROCESS_LOCAL)
14/08/09 21:50:17 INFO scheduler.TaskSetManager: Serialized task 1.189:277 as 
3060 bytes in 0 ms
14/08/09 21:50:17 WARN scheduler.TaskSetManager: Lost TID 52199 (task 1.189:141)
14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch failure 
from null
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
DealCF.scala:215) for resubmision due to a fetch failure
14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 
2 (flatMap at DealCF.scala:207); marking it for resubmission
14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch failure 
from null
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
DealCF.scala:215) for resubmision due to a fetch failure
14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 
2 (flatMap at DealCF.scala:207); marking it for resubmission
14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch failure 
from null
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
DealCF.scala:215) for resubmision due to a fetch failure
14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 
2 (flatMap at DealCF.scala:207); marking it for resubmission
14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch failure 
from null
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
DealCF.scala:215) for resubmision due to a fetch failure
14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 
2 (flatMap at DealCF.scala:207); marking it for resubmission
14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch failure 
from null
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
DealCF.scala:215) for resubmision due to a fetch failure
14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 
2 (flatMap at DealCF.scala:207); marking it for resubmission
14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch failure 
from null
14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch failure 
from null
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
DealCF.scala:215) for resubmision due to a fetch failure
14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 
2 (flatMap at DealCF.scala:207); marking it for resubmission
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
DealCF.scala:215) for resubmision due to a fetch failure
14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 
2 (flatMap at DealCF.scala:207); marking it for resubmission
14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch failure 
from null
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
DealCF.scala:215) for resubmision due to a fetch failure
14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 
2 (flatMap at DealCF.scala:207); marking it for resubmission
14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch failure 
from null
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
DealCF.scala:215) for resubmision due to a fetch failure
14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 
2 (flatMap at DealCF.scala:207); marking it for resubmission
14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch failure 
from null
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
DealCF.scala:215) for resubmision due to a fetch failure
14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 
2 (flatMap at DealCF.scala:207); marking it for resubmission
14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch failure 
from null
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
DealCF.scala:215) for resubmision due to a fetch failure
14/08/09 21:50:17 INFO

[jira] [Updated] (SPARK-2947) DAGScheduler scheduling dead cycle

2014-08-09 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-2947:
---

Fix Version/s: 1.0.3

> DAGScheduler scheduling dead cycle
> --
>
> Key: SPARK-2947
> URL: https://issues.apache.org/jira/browse/SPARK-2947
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.0.2
>Reporter: Guoqiang Li
>Priority: Blocker
> Fix For: 1.1.0, 1.0.3
>
>
> The stage is resubmitted more than 5 times.
> This seems to be caused by {{FetchFailed.bmAddress}} being null.
> I don't know how to reproduce it.
> log:
> {noformat}
> 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Starting task 1.189:276 as 
> TID 52334 on executor 82: sanshan (PROCESS_LOCAL)
> 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Serialized task 1.189:276 as 
> 3060 bytes in 0 ms
> 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Starting task 1.189:277 as 
> TID 52335 on executor 78: tuan231 (PROCESS_LOCAL)
> 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Serialized task 1.189:277 as 
> 3060 bytes in 0 ms
> 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Lost TID 52199 (task 
> 1.189:141)
> 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch 
> failure from null
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
> DealCF.scala:215) for resubmision due to a fetch failure
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from 
> Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
> 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch 
> failure from null
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
> DealCF.scala:215) for resubmision due to a fetch failure
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from 
> Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
> 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch 
> failure from null
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
> DealCF.scala:215) for resubmision due to a fetch failure
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from 
> Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
> 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch 
> failure from null
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
> DealCF.scala:215) for resubmision due to a fetch failure
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from 
> Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
> 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch 
> failure from null
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
> DealCF.scala:215) for resubmision due to a fetch failure
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from 
> Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
> 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch 
> failure from null
> 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch 
> failure from null
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
> DealCF.scala:215) for resubmision due to a fetch failure
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from 
> Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
> DealCF.scala:215) for resubmision due to a fetch failure
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from 
> Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
> 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch 
> failure from null
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
> DealCF.scala:215) for resubmision due to a fetch failure
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from 
> Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
> 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch 
> failure from null
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
> DealCF.scala:215) for resubmision due to a fetch failure
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from 
> Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
> 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch 
> failure from null
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
> DealCF.scala:215) for resubmision due to a fetch failure
> 14/08/09 21:50:17 INFO 

[jira] [Updated] (SPARK-2947) DAGScheduler scheduling dead cycle

2014-08-09 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-2947:
---

Fix Version/s: 1.1.0

> DAGScheduler scheduling dead cycle
> --
>
> Key: SPARK-2947
> URL: https://issues.apache.org/jira/browse/SPARK-2947
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.0.2
>Reporter: Guoqiang Li
>Priority: Blocker
> Fix For: 1.1.0, 1.0.3
>
>
> The stage is resubmitted more than 5 times.
> This seems to be caused by {{FetchFailed.bmAddress}} being null.
> I don't know how to reproduce it.
> log:
> {noformat}
> 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Starting task 1.189:276 as 
> TID 52334 on executor 82: sanshan (PROCESS_LOCAL)
> 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Serialized task 1.189:276 as 
> 3060 bytes in 0 ms
> 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Starting task 1.189:277 as 
> TID 52335 on executor 78: tuan231 (PROCESS_LOCAL)
> 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Serialized task 1.189:277 as 
> 3060 bytes in 0 ms
> 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Lost TID 52199 (task 
> 1.189:141)
> 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch 
> failure from null
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
> DealCF.scala:215) for resubmision due to a fetch failure
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from 
> Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
> 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch 
> failure from null
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
> DealCF.scala:215) for resubmision due to a fetch failure
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from 
> Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
> 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch 
> failure from null
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
> DealCF.scala:215) for resubmision due to a fetch failure
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from 
> Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
> 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch 
> failure from null
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
> DealCF.scala:215) for resubmision due to a fetch failure
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from 
> Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
> 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch 
> failure from null
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
> DealCF.scala:215) for resubmision due to a fetch failure
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from 
> Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
> 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch 
> failure from null
> 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch 
> failure from null
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
> DealCF.scala:215) for resubmision due to a fetch failure
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from 
> Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
> DealCF.scala:215) for resubmision due to a fetch failure
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from 
> Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
> 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch 
> failure from null
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
> DealCF.scala:215) for resubmision due to a fetch failure
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from 
> Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
> 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch 
> failure from null
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
> DealCF.scala:215) for resubmision due to a fetch failure
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from 
> Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
> 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch 
> failure from null
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
> DealCF.scala:215) for resubmision due to a fetch failure
> 14/08/09 21:50:17 INFO 

[jira] [Updated] (SPARK-2947) DAGScheduler scheduling dead cycle

2014-08-09 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-2947:
---

Description: 
The stage is resubmitted more than 5 times.
This seems to be caused by {{FetchFailed.bmAddress}} being null.
I don't know how to reproduce it.

log:
{noformat}
14/08/09 21:50:17 INFO scheduler.TaskSetManager: Starting task 1.189:276 as TID 
52334 on executor 82: sanshan (PROCESS_LOCAL)
14/08/09 21:50:17 INFO scheduler.TaskSetManager: Serialized task 1.189:276 as 
3060 bytes in 0 ms
14/08/09 21:50:17 INFO scheduler.TaskSetManager: Starting task 1.189:277 as TID 
52335 on executor 78: tuan231 (PROCESS_LOCAL)
14/08/09 21:50:17 INFO scheduler.TaskSetManager: Serialized task 1.189:277 as 
3060 bytes in 0 ms
14/08/09 21:50:17 WARN scheduler.TaskSetManager: Lost TID 52199 (task 1.189:141)
14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch failure 
from null
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
DealCF.scala:215) for resubmision due to a fetch failure
14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 
2 (flatMap at DealCF.scala:207); marking it for resubmission
14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch failure 
from null
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
DealCF.scala:215) for resubmision due to a fetch failure
14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 
2 (flatMap at DealCF.scala:207); marking it for resubmission

 -- 5 times ---

14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
DealCF.scala:215) for resubmision due to a fetch failure
14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 
2 (flatMap at DealCF.scala:207); marking it for resubmission
14/08/09 21:50:17 INFO cluster.YarnClientClusterScheduler: Removed TaskSet 
1.189, whose tasks have all completed, from pool 
14/08/09 21:50:17 INFO scheduler.TaskSetManager: Finished TID 1869 in 87398 ms 
on jilin (progress: 280/280)
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Completed ShuffleMapTask(2, 269)
14/08/09 21:50:17 INFO cluster.YarnClientClusterScheduler: Removed TaskSet 2.1, 
whose tasks have all completed, from pool 
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Stage 2 (flatMap at 
DealCF.scala:207) finished in 129.544 s
{noformat}

  was:
The stage is resubmitted more than 5 times.
This seems to be caused by {{FetchFailed.bmAddress}} being null.
I don't know how to reproduce it.

log:
{noformat}
14/08/09 21:50:17 INFO scheduler.TaskSetManager: Starting task 1.189:276 as TID 
52334 on executor 82: sanshan (PROCESS_LOCAL)
14/08/09 21:50:17 INFO scheduler.TaskSetManager: Serialized task 1.189:276 as 
3060 bytes in 0 ms
14/08/09 21:50:17 INFO scheduler.TaskSetManager: Starting task 1.189:277 as TID 
52335 on executor 78: tuan231 (PROCESS_LOCAL)
14/08/09 21:50:17 INFO scheduler.TaskSetManager: Serialized task 1.189:277 as 
3060 bytes in 0 ms
14/08/09 21:50:17 WARN scheduler.TaskSetManager: Lost TID 52199 (task 1.189:141)
14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch failure 
from null
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
DealCF.scala:215) for resubmision due to a fetch failure
14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 
2 (flatMap at DealCF.scala:207); marking it for resubmission
14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch failure 
from null
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
DealCF.scala:215) for resubmision due to a fetch failure
14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 
2 (flatMap at DealCF.scala:207); marking it for resubmission
14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch failure 
from null
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
DealCF.scala:215) for resubmision due to a fetch failure
14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 
2 (flatMap at DealCF.scala:207); marking it for resubmission
14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch failure 
from null
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
DealCF.scala:215) for resubmision due to a fetch failure
14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 
2 (flatMap at DealCF.scala:207); marking it for resubmission
14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch failure 
from null
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
DealCF.scala:215) for resubmision due to a fetch failure
14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 
2 (flatMap at DealCF.scala:

[jira] [Commented] (SPARK-2944) sc.makeRDD doesn't distribute partitions evenly

2014-08-09 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14091784#comment-14091784
 ] 

Xiangrui Meng commented on SPARK-2944:
--

I checked that one first. It was okay after that commit, and it was bad before 
this one:

https://github.com/apache/spark/commit/28dbae85aaf6842e22cd7465cb11cb34d58fc56d

I didn't see anything suspicious in between; I'm doing a binary search now.

> sc.makeRDD doesn't distribute partitions evenly
> ---
>
> Key: SPARK-2944
> URL: https://issues.apache.org/jira/browse/SPARK-2944
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> 16 nodes EC2 cluster:
> {code}
> val rdd = sc.makeRDD(0 until 1e9.toInt, 1000).cache()
> rdd.count()
> {code}
> Saw 156 partitions on one node while only 8 partitions on another.
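
Not a fix, just a hedged PySpark sketch of one way to observe the skew from a 
shell session. It assumes {{sc}} and a cached, counted RDD named {{rdd}}; tasks 
over a cached RDD prefer the hosts holding its blocks, so the hostname seen per 
partition approximates where each partition lives:

{code}
import socket
from operator import add

# One (host, 1) record per partition, summed per host: roughly how many
# cached partitions each worker ended up with.
per_host = (rdd.mapPartitions(lambda it: [(socket.gethostname(), 1)])
               .reduceByKey(add)
               .collect())
print sorted(per_host, key=lambda kv: -kv[1])
{code}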



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2945) Allow specifying num of executors in the context configuration

2014-08-09 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2945?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14091817#comment-14091817
 ] 

Sandy Ryza commented on SPARK-2945:
---

spark.executor.instances apparently isn't used for anything other than calculating 
how long to wait for executors to register before running jobs.  So I believe some 
work would actually be required to make spark.executor.instances function this way.

> Allow specifying num of executors in the context configuration
> --
>
> Key: SPARK-2945
> URL: https://issues.apache.org/jira/browse/SPARK-2945
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, YARN
>Affects Versions: 1.0.0
> Environment: Ubuntu precise, on YARN (CDH 5.1.0)
>Reporter: Shay Rojansky
>
> Running on YARN, the only way to specify the number of executors seems to be 
> on the command line of spark-submit, via the --num-executors switch.
> In many cases this is too early. Our Spark app receives some command-line 
> arguments which determine the amount of work that needs to be done, and that 
> affects the number of executors it ideally requires. Ideally, the Spark 
> context configuration would support specifying this like any other config 
> param.
> Our current workaround is a wrapper script that determines how much work is 
> needed and then launches spark-submit with that number passed to 
> --num-executors; it's a shame to have to do this.
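
For illustration only, a minimal sketch of the kind of wrapper script described 
above; the sizing rule, class name, and jar name are all hypothetical:

{code}
#!/usr/bin/env python
# Hypothetical wrapper: derive --num-executors from the job's own arguments,
# then hand everything off to spark-submit.
import subprocess
import sys

def executors_needed(app_args):
    # Illustrative sizing rule: one executor per 1000 input items, minimum 2.
    items = int(app_args[0]) if app_args else 1000
    return max(2, items // 1000)

app_args = sys.argv[1:]
subprocess.check_call([
    "spark-submit",
    "--num-executors", str(executors_needed(app_args)),
    "--class", "com.example.MyApp",  # hypothetical main class
    "my-app.jar",                    # hypothetical primary resource
] + app_args)
{code}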



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2931) getAllowedLocalityLevel() throws ArrayIndexOutOfBoundsException

2014-08-09 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-2931:
--

Attachment: scala-sort-by-key.err

@ [~kayousterhout]: I can see how that code in {{executorLost()}} could be 
problematic, but I don't think that's what's triggering this bug since I don't 
see any "Re-queueing tasks for ..." messages in my log.

I've attached a full log from a failing test run.  If it helps, I can re-run at 
DEBUG logging or add extra logging statements.

> getAllowedLocalityLevel() throws ArrayIndexOutOfBoundsException
> ---
>
> Key: SPARK-2931
> URL: https://issues.apache.org/jira/browse/SPARK-2931
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
> Environment: Spark EC2, spark-1.1.0-snapshot1, sort-by-key spark-perf 
> benchmark
>Reporter: Josh Rosen
>Priority: Blocker
> Fix For: 1.1.0
>
> Attachments: scala-sort-by-key.err
>
>
> When running Spark Perf's sort-by-key benchmark on EC2 with v1.1.0-snapshot, 
> I get the following errors (one per task):
> {code}
> 14/08/08 18:54:22 INFO scheduler.TaskSetManager: Starting task 39.0 in stage 
> 0.0 (TID 39, ip-172-31-14-30.us-west-2.compute.internal, PROCESS_LOCAL, 1003 
> bytes)
> 14/08/08 18:54:22 INFO cluster.SparkDeploySchedulerBackend: Registered 
> executor: 
> Actor[akka.tcp://sparkexecu...@ip-172-31-9-213.us-west-2.compute.internal:58901/user/Executor#1436065036]
>  with ID 0
> 14/08/08 18:54:22 ERROR actor.OneForOneStrategy: 1
> java.lang.ArrayIndexOutOfBoundsException: 1
>   at 
> org.apache.spark.scheduler.TaskSetManager.getAllowedLocalityLevel(TaskSetManager.scala:475)
>   at 
> org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:409)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7$$anonfun$apply$2.apply$mcVI$sp(TaskSchedulerImpl.scala:261)
>   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:257)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:254)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:254)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:254)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:254)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor.makeOffers(CoarseGrainedSchedulerBackend.scala:153)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:103)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}
> This causes the job to hang.
> I can deterministically reproduce this by re-running the test, either in 
> isolation or as part of the full performance testing suite.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2931) getAllowedLocalityLevel() throws ArrayIndexOutOfBoundsException

2014-08-09 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14091849#comment-14091849
 ] 

Mridul Muralidharan commented on SPARK-2931:


Checking more, it might have been for an internal review: not sure, can't find 
an external reference.
So the issue is as [~kayousterhout] described: after the locality levels are 
updated (as part of an executor loss, etc.), the index into them is not updated 
accordingly (or reset).

The earlier code assumed the population of these data structures wouldn't change 
once created, which is no longer the case.
A test case to simulate this should be possible - rough sketch:
a) Add two executors: one PROCESS_LOCAL executor exec1 and one ANY executor 
exec2. The locality levels should now contain all levels, due to exec1.
b) Schedule a task on the PROCESS_LOCAL executor, and wait long enough that the 
current level has moved to RACK_LOCAL or ANY.
c) Fail the PROCESS_LOCAL executor, and ensure the corresponding cleanup is 
triggered; IIRC this currently will not cause a recomputation of the levels.
d) Add an ANY executor exec3 - this will trigger computeValidLocalityLevels in 
executorAdded (executor failure does not; we probably should add that).
e) Now, any invocation of getAllowedLocalityLevel will cause the exception Josh 
mentioned.


> getAllowedLocalityLevel() throws ArrayIndexOutOfBoundsException
> ---
>
> Key: SPARK-2931
> URL: https://issues.apache.org/jira/browse/SPARK-2931
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
> Environment: Spark EC2, spark-1.1.0-snapshot1, sort-by-key spark-perf 
> benchmark
>Reporter: Josh Rosen
>Priority: Blocker
> Fix For: 1.1.0
>
> Attachments: scala-sort-by-key.err
>
>
> When running Spark Perf's sort-by-key benchmark on EC2 with v1.1.0-snapshot, 
> I get the following errors (one per task):
> {code}
> 14/08/08 18:54:22 INFO scheduler.TaskSetManager: Starting task 39.0 in stage 
> 0.0 (TID 39, ip-172-31-14-30.us-west-2.compute.internal, PROCESS_LOCAL, 1003 
> bytes)
> 14/08/08 18:54:22 INFO cluster.SparkDeploySchedulerBackend: Registered 
> executor: 
> Actor[akka.tcp://sparkexecu...@ip-172-31-9-213.us-west-2.compute.internal:58901/user/Executor#1436065036]
>  with ID 0
> 14/08/08 18:54:22 ERROR actor.OneForOneStrategy: 1
> java.lang.ArrayIndexOutOfBoundsException: 1
>   at 
> org.apache.spark.scheduler.TaskSetManager.getAllowedLocalityLevel(TaskSetManager.scala:475)
>   at 
> org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:409)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7$$anonfun$apply$2.apply$mcVI$sp(TaskSchedulerImpl.scala:261)
>   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:257)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:254)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:254)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:254)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:254)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor.makeOffers(CoarseGrainedSchedulerBackend.scala:153)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:103)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}
> This causes the job to hang.
> I can deterministical

[jira] [Updated] (SPARK-2931) getAllowedLocalityLevel() throws ArrayIndexOutOfBoundsException

2014-08-09 Thread Mridul Muralidharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mridul Muralidharan updated SPARK-2931:
---

Attachment: test.patch

A patch to showcase the exception

> getAllowedLocalityLevel() throws ArrayIndexOutOfBoundsException
> ---
>
> Key: SPARK-2931
> URL: https://issues.apache.org/jira/browse/SPARK-2931
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
> Environment: Spark EC2, spark-1.1.0-snapshot1, sort-by-key spark-perf 
> benchmark
>Reporter: Josh Rosen
>Priority: Blocker
> Fix For: 1.1.0
>
> Attachments: scala-sort-by-key.err, test.patch
>
>
> When running Spark Perf's sort-by-key benchmark on EC2 with v1.1.0-snapshot, 
> I get the following errors (one per task):
> {code}
> 14/08/08 18:54:22 INFO scheduler.TaskSetManager: Starting task 39.0 in stage 
> 0.0 (TID 39, ip-172-31-14-30.us-west-2.compute.internal, PROCESS_LOCAL, 1003 
> bytes)
> 14/08/08 18:54:22 INFO cluster.SparkDeploySchedulerBackend: Registered 
> executor: 
> Actor[akka.tcp://sparkexecu...@ip-172-31-9-213.us-west-2.compute.internal:58901/user/Executor#1436065036]
>  with ID 0
> 14/08/08 18:54:22 ERROR actor.OneForOneStrategy: 1
> java.lang.ArrayIndexOutOfBoundsException: 1
>   at 
> org.apache.spark.scheduler.TaskSetManager.getAllowedLocalityLevel(TaskSetManager.scala:475)
>   at 
> org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:409)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7$$anonfun$apply$2.apply$mcVI$sp(TaskSchedulerImpl.scala:261)
>   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:257)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:254)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:254)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:254)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:254)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor.makeOffers(CoarseGrainedSchedulerBackend.scala:153)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:103)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}
> This causes the job to hang.
> I can deterministically reproduce this by re-running the test, either in 
> isolation or as part of the full performance testing suite.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2931) getAllowedLocalityLevel() throws ArrayIndexOutOfBoundsException

2014-08-09 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14091881#comment-14091881
 ] 

Mridul Muralidharan commented on SPARK-2931:


[~joshrosen] [~kayousterhout] Added a patch which deterministically showcases 
the bug - should be easy to fix it now I hope :-)

> getAllowedLocalityLevel() throws ArrayIndexOutOfBoundsException
> ---
>
> Key: SPARK-2931
> URL: https://issues.apache.org/jira/browse/SPARK-2931
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
> Environment: Spark EC2, spark-1.1.0-snapshot1, sort-by-key spark-perf 
> benchmark
>Reporter: Josh Rosen
>Priority: Blocker
> Fix For: 1.1.0
>
> Attachments: scala-sort-by-key.err, test.patch
>
>
> When running Spark Perf's sort-by-key benchmark on EC2 with v1.1.0-snapshot, 
> I get the following errors (one per task):
> {code}
> 14/08/08 18:54:22 INFO scheduler.TaskSetManager: Starting task 39.0 in stage 
> 0.0 (TID 39, ip-172-31-14-30.us-west-2.compute.internal, PROCESS_LOCAL, 1003 
> bytes)
> 14/08/08 18:54:22 INFO cluster.SparkDeploySchedulerBackend: Registered 
> executor: 
> Actor[akka.tcp://sparkexecu...@ip-172-31-9-213.us-west-2.compute.internal:58901/user/Executor#1436065036]
>  with ID 0
> 14/08/08 18:54:22 ERROR actor.OneForOneStrategy: 1
> java.lang.ArrayIndexOutOfBoundsException: 1
>   at 
> org.apache.spark.scheduler.TaskSetManager.getAllowedLocalityLevel(TaskSetManager.scala:475)
>   at 
> org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:409)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7$$anonfun$apply$2.apply$mcVI$sp(TaskSchedulerImpl.scala:261)
>   at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:257)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$7.apply(TaskSchedulerImpl.scala:254)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:254)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3.apply(TaskSchedulerImpl.scala:254)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.resourceOffers(TaskSchedulerImpl.scala:254)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor.makeOffers(CoarseGrainedSchedulerBackend.scala:153)
>   at 
> org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend$DriverActor$$anonfun$receive$1.applyOrElse(CoarseGrainedSchedulerBackend.scala:103)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}
> This causes the job to hang.
> I can deterministically reproduce this by re-running the test, either in 
> isolation or as part of the full performance testing suite.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2948) PySpark doesn't work on Python 2.6

2014-08-09 Thread Kousuke Saruta (JIRA)
Kousuke Saruta created SPARK-2948:
-

 Summary: PySpark doesn't work on Python 2.6
 Key: SPARK-2948
 URL: https://issues.apache.org/jira/browse/SPARK-2948
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.0.2
 Environment: CentOS 6.5 / Python 2.6.6
Reporter: Kousuke Saruta
Priority: Blocker


In serializers.py, collections.namedtuple is redefined as follows:

{code}
def namedtuple(name, fields, verbose=False, rename=False):
    cls = _old_namedtuple(name, fields, verbose, rename)
    return _hack_namedtuple(cls)
{code}

The redefined function takes 4 arguments, but namedtuple in Python 2.6 takes 
only 3, so there is a signature mismatch.
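
One possible direction, sketched below, is to forward the caller's arguments 
unchanged so the wrapper matches whatever signature the running interpreter's 
namedtuple has (3 positional arguments on 2.6, 4 on 2.7). The {{_hack_namedtuple}} 
stub here only keeps the snippet self-contained; the real helper lives in 
serializers.py:

{code}
import collections

_old_namedtuple = collections.namedtuple

def _hack_namedtuple(cls):
    # Stub standing in for PySpark's real helper, just to keep this runnable.
    return cls

def namedtuple(*args, **kwargs):
    # Pass the arguments through untouched instead of hard-coding a signature.
    cls = _old_namedtuple(*args, **kwargs)
    return _hack_namedtuple(cls)

Point = namedtuple("Point", "x y")
print Point(1, 2)  # Point(x=1, y=2) on both Python 2.6 and 2.7
{code}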



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2948) PySpark doesn't work on Python 2.6

2014-08-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14091907#comment-14091907
 ] 

Apache Spark commented on SPARK-2948:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/1868

> PySpark doesn't work on Python 2.6
> --
>
> Key: SPARK-2948
> URL: https://issues.apache.org/jira/browse/SPARK-2948
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.0.2
> Environment: CentOS 6.5 / Python 2.6.6
>Reporter: Kousuke Saruta
>Priority: Blocker
>
> In serializers.py, collections.namedtuple is redefined as follows:
> {code}
> def namedtuple(name, fields, verbose=False, rename=False):
>     cls = _old_namedtuple(name, fields, verbose, rename)
>     return _hack_namedtuple(cls)
> {code}
> The redefined function takes 4 arguments, but namedtuple in Python 2.6 takes 
> only 3, so there is a signature mismatch.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2949) SparkContext does not fate-share with ActorSystem

2014-08-09 Thread Aaron Davidson (JIRA)
Aaron Davidson created SPARK-2949:
-

 Summary: SparkContext does not fate-share with ActorSystem
 Key: SPARK-2949
 URL: https://issues.apache.org/jira/browse/SPARK-2949
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Aaron Davidson


It appears that an uncaught fatal error in Spark's Driver ActorSystem does not 
cause the SparkContext to terminate. We observed an issue in production that 
caused a PermGen error, but the ActorSystem just kept throwing this error:

{code}
14/08/09 15:07:24 ERROR ActorSystemImpl: Uncaught fatal error from thread 
[spark-akka.actor.default-dispatcher-26] shutting down ActorSystem [spark]
java.lang.OutOfMemoryError: PermGen space
{code}

We should probably do something similar to what we did in the DAGScheduler and 
ensure that we call SparkContext#stop() if the entire ActorSystem dies with a 
fatal error.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2948) PySpark doesn't work on Python 2.6

2014-08-09 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-2948:
--

Affects Version/s: (was: 1.0.2)
   1.1.0

> PySpark doesn't work on Python 2.6
> --
>
> Key: SPARK-2948
> URL: https://issues.apache.org/jira/browse/SPARK-2948
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.1.0
> Environment: CentOS 6.5 / Python 2.6.6
>Reporter: Kousuke Saruta
>Priority: Blocker
>
> In serializers.py, collections.namedtuple is redefined as follows:
> {code}
> def namedtuple(name, fields, verbose=False, rename=False):
>     cls = _old_namedtuple(name, fields, verbose, rename)
>     return _hack_namedtuple(cls)
> {code}
> The redefined function takes 4 arguments, but namedtuple in Python 2.6 takes 
> only 3, so there is a signature mismatch.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2923) Implement some basic linalg operations in MLlib

2014-08-09 Thread Michael Yannakopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14091937#comment-14091937
 ] 

Michael Yannakopoulos commented on SPARK-2923:
--

Hi Xiangrui,

I didn't have internet for the past few days. Can I help with this issue?

> Implement some basic linalg operations in MLlib
> ---
>
> Key: SPARK-2923
> URL: https://issues.apache.org/jira/browse/SPARK-2923
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> We use Breeze for linear algebra operations. Breeze operations are 
> user-friendly, but there are some concerns:
> 1. creating temp objects, e.g., `val z = a * x + b * y`
> 2. multimethod dispatch is not used in some operators, e.g., `axpy`; if we pass a 
> SparseVector in as a generic Vector, it will use activeIterator, which is slow
> 3. calling native BLAS if it is available, which might not be good for 
> level-1 methods
> Having some basic BLAS operations implemented in MLlib can help simplify the 
> current implementation and improve performance.
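
For readers unfamiliar with the level-1 operation named in point 2, here is a 
tiny NumPy illustration of the axpy update (y := a*x + y) and of the 
temporary-object concern in point 1; this is only a conceptual sketch, not 
MLlib code:

{code}
import numpy as np

def axpy(a, x, y):
    # Accumulate y := a*x + y in place; unlike `z = a * x + b * y`, no new
    # result vector is allocated (a true BLAS axpy also avoids the a*x temp).
    y += a * x
    return y

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
axpy(2.0, x, y)  # y is now [6.0, 9.0, 12.0]
{code}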



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2871) Missing API in PySpark

2014-08-09 Thread Michael Yannakopoulos (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14091940#comment-14091940
 ] 

Michael Yannakopoulos commented on SPARK-2871:
--

I can help with this issue. My understanding is that you would like to 
incorporate the above Scala functionality in Python. Is that correct?

> Missing API in PySpark
> --
>
> Key: SPARK-2871
> URL: https://issues.apache.org/jira/browse/SPARK-2871
> Project: Spark
>  Issue Type: Improvement
>Reporter: Davies Liu
>
> There are several APIs missing in PySpark:
> RDD.collectPartitions()
> RDD.histogram()
> RDD.zipWithIndex()
> RDD.zipWithUniqueId()
> RDD.min(comp)
> RDD.max(comp)
> A bunch of APIs related to approximate jobs.
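
As a concrete example of what one of these could look like, here is a hedged 
pure-Python sketch of a zipWithIndex-style helper built only from primitives 
PySpark already exposes; the eventual API shape and naming would be settled in 
the PR:

{code}
def zip_with_index(rdd):
    # First pass: count elements per partition; second pass: give each
    # element a global index starting from its partition's offset.
    counts = rdd.mapPartitions(lambda it: [sum(1 for _ in it)]).collect()
    offsets = [0]
    for c in counts[:-1]:
        offsets.append(offsets[-1] + c)

    def index_partition(split, iterator):
        for i, item in enumerate(iterator):
            yield (item, offsets[split] + i)

    return rdd.mapPartitionsWithIndex(index_partition)

# zip_with_index(sc.parallelize(["a", "b", "c"], 2)).collect()
# -> [('a', 0), ('b', 1), ('c', 2)]
{code}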



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2950) Add GC time and Shuffle Write time to JobLogger output

2014-08-09 Thread Shivaram Venkataraman (JIRA)
Shivaram Venkataraman created SPARK-2950:


 Summary: Add GC time and Shuffle Write time to JobLogger output
 Key: SPARK-2950
 URL: https://issues.apache.org/jira/browse/SPARK-2950
 Project: Spark
  Issue Type: Improvement
Reporter: Shivaram Venkataraman
Assignee: Shivaram Venkataraman
Priority: Minor


The JobLogger is very useful for performing offline performance profiling of 
Spark jobs. GC Time and Shuffle Write time are available in TaskMetrics but are 
currently missed from the JobLogger output. This change adds these two fields.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2951) SerDeUtils.pythonToPairRDD fails on RDDs of pickled array.arrays in Python 2.6

2014-08-09 Thread Josh Rosen (JIRA)
Josh Rosen created SPARK-2951:
-

 Summary: SerDeUtils.pythonToPairRDD fails on RDDs of pickled 
array.arrays in Python 2.6
 Key: SPARK-2951
 URL: https://issues.apache.org/jira/browse/SPARK-2951
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.1.0
Reporter: Josh Rosen


With Python 2.6, calling SerDeUtils.pythonToPairRDD() on an RDD of pickled 
Python array.arrays will fail with this exception:

{code}
java.lang.ClassCastException: java.lang.String cannot be cast to 
java.util.ArrayList

net.razorvine.pickle.objects.ArrayConstructor.construct(ArrayConstructor.java:33)
net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:617)
net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:170)
net.razorvine.pickle.Unpickler.load(Unpickler.java:84)
net.razorvine.pickle.Unpickler.loads(Unpickler.java:97)

org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToPairRDD$1$$anonfun$5.apply(SerDeUtil.scala:106)

org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToPairRDD$1$$anonfun$5.apply(SerDeUtil.scala:106)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:898)

org.apache.spark.rdd.PairRDDFunctions$$anonfun$12.apply(PairRDDFunctions.scala:880)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
org.apache.spark.scheduler.Task.run(Task.scala:54)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
{code}

I think this is due to a difference in how array.array is pickled in Python 2.6 
vs. Python 2.7.  To see this, run the following script:

{code}
from pickletools import dis, optimize
from pickle import dumps, loads, HIGHEST_PROTOCOL
from array import array

arr = array('d', [1, 2, 3])

# Toggle between protocol 0 and HIGHEST_PROTOCOL to compare the opcode streams.
#protocol = HIGHEST_PROTOCOL
protocol = 0

pickled = dumps(arr, protocol=protocol)
pickled = optimize(pickled)  # drop unused PUT opcodes to keep the output short
unpickled = loads(pickled)

print arr
print unpickled

print dis(pickled)  # disassemble the pickle opcode stream
{code}

In Python 2.7, this outputs

{code}
array('d', [1.0, 2.0, 3.0])
array('d', [1.0, 2.0, 3.0])
0: cGLOBAL 'array array'
   13: (MARK
   14: SSTRING 'd'
   19: (MARK
   20: lLIST   (MARK at 19)
   21: FFLOAT  1.0
   26: aAPPEND
   27: FFLOAT  2.0
   32: aAPPEND
   33: FFLOAT  3.0
   38: aAPPEND
   39: tTUPLE  (MARK at 13)
   40: RREDUCE
   41: .STOP
highest protocol among opcodes = 0
None
{code}

whereas 2.6 outputs

{code}
array('d', [1.0, 2.0, 3.0])
array('d', [1.0, 2.0, 3.0])
0: cGLOBAL 'array array'
   13: (MARK
   14: SSTRING 'd'
   19: SSTRING 
'\x00\x00\x00\x00\x00\x00\xf0?\x00\x00\x00\x00\x00\x00\x00@\x00\x00\x00\x00\x00\x00\x08@'
  110: tTUPLE  (MARK at 13)
  111: RREDUCE
  112: .STOP
highest protocol among opcodes = 0
None
{code}

I think the Java-side depickling library doesn't expect this pickled format, 
causing this failure.

I noticed this when running PySpark's unit tests on 2.6, because the 
TestOutputFormat.test_newhadoop test failed.

I think that this issue affects all of the methods that might need to depickle 
arrays in Java, including all of the Hadoop output format methods.

How should we try to fix this?  Require that users upgrade to 2.7 if they want 
to use code that requires this?  Open a bug with the depickling library 
maintainers?  Try to hack in our own pickling routines for arrays if we detect 
that we're using 2.6?
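
If we end up rolling our own pickling routine for arrays on 2.6, one hedged 
sketch would be to register a reducer with copy_reg so array.array pickles in 
the list-based form shown in the 2.7 output above, which the Java-side 
unpickler appears to handle:

{code}
import copy_reg  # stdlib; renamed copyreg in Python 3
from array import array

def _pickle_array(a):
    # Reduce to (typecode, [values]) so the pickle matches Python 2.7's
    # layout instead of 2.6's raw machine-dependent byte string.
    return array, (a.typecode, list(a))

# copy_reg's dispatch table is consulted before __reduce_ex__, so this
# overrides how array.array instances are pickled in this interpreter.
copy_reg.pickle(array, _pickle_array)
{code}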



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2947) DAGScheduler scheduling dead cycle

2014-08-09 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-2947:
---

Description: 
The stage is resubmitted more than 5 times.
This seems to be caused by {{FetchFailed.bmAddress}} being null.
I don't know how to reproduce it.

master log:
{noformat}
14/08/09 21:50:17 INFO scheduler.TaskSetManager: Starting task 1.189:276 as TID 
52334 on executor 82: sanshan (PROCESS_LOCAL)
14/08/09 21:50:17 INFO scheduler.TaskSetManager: Serialized task 1.189:276 as 
3060 bytes in 0 ms
14/08/09 21:50:17 INFO scheduler.TaskSetManager: Starting task 1.189:277 as TID 
52335 on executor 78: tuan231 (PROCESS_LOCAL)
14/08/09 21:50:17 INFO scheduler.TaskSetManager: Serialized task 1.189:277 as 
3060 bytes in 0 ms
14/08/09 21:50:17 WARN scheduler.TaskSetManager: Lost TID 52199 (task 1.189:141)
14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch failure 
from null
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
DealCF.scala:215) for resubmision due to a fetch failure
14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 
2 (flatMap at DealCF.scala:207); marking it for resubmission
14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch failure 
from null
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
DealCF.scala:215) for resubmision due to a fetch failure
14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 
2 (flatMap at DealCF.scala:207); marking it for resubmission

 -- 5 times ---

14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
DealCF.scala:215) for resubmision due to a fetch failure
14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from Stage 
2 (flatMap at DealCF.scala:207); marking it for resubmission
14/08/09 21:50:17 INFO cluster.YarnClientClusterScheduler: Removed TaskSet 
1.189, whose tasks have all completed, from pool 
14/08/09 21:50:17 INFO scheduler.TaskSetManager: Finished TID 1869 in 87398 ms 
on jilin (progress: 280/280)
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Completed ShuffleMapTask(2, 269)
14/08/09 21:50:17 INFO cluster.YarnClientClusterScheduler: Removed TaskSet 2.1, 
whose tasks have all completed, from pool 
14/08/09 21:50:17 INFO scheduler.DAGScheduler: Stage 2 (flatMap at 
DealCF.scala:207) finished in 129.544 s
{noformat}

worker log:
{noformat}
14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_57 not found, 
computing it
14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_191 not found, 
computing it
14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 
18017
14/08/09 21:49:41 INFO executor.Executor: Running task ID 18017
14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally
14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 
18151
14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally
14/08/09 21:49:41 INFO executor.Executor: Running task ID 18151
14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally
14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally
14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_86 not found, 
computing it
14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_220 not found, 
computing it
14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 
18285
14/08/09 21:49:41 INFO executor.Executor: Running task ID 18285
14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally
14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 
18419
14/08/09 21:49:41 INFO executor.Executor: Running task ID 18419
14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally
14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally
14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally
14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_86 not found, 
computing it
14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_220 not found, 
computing it
14/08/09 21:49:42 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 
18535
14/08/09 21:49:42 INFO executor.Executor: Running task ID 18535
14/08/09 21:49:42 INFO storage.BlockManager: Found block broadcast_1 locally
14/08/09 21:49:42 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 
18669
14/08/09 21:49:42 INFO executor.Executor: Running task ID 18669
14/08/09 21:49:4

[jira] [Commented] (SPARK-2871) Missing API in PySpark

2014-08-09 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14091986#comment-14091986
 ] 

Josh Rosen commented on SPARK-2871:
---

There's actually an open PR for this that's currently being reviewed (odd that 
it wasn't automatically linked):

https://github.com/apache/spark/pull/1791


> Missing API in PySpark
> --
>
> Key: SPARK-2871
> URL: https://issues.apache.org/jira/browse/SPARK-2871
> Project: Spark
>  Issue Type: Improvement
>Reporter: Davies Liu
>
> There are several APIs missing in PySpark:
> RDD.collectPartitions()
> RDD.histogram()
> RDD.zipWithIndex()
> RDD.zipWithUniqueId()
> RDD.min(comp)
> RDD.max(comp)
> A bunch of APIs related to approximate jobs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-1766) Move reduceByKey definitions next to each other in PairRDDFunctions

2014-08-09 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-1766.


  Resolution: Fixed
   Fix Version/s: 1.1.0
Assignee: Chris Cope
Target Version/s: 1.1.0

Resolved by: https://github.com/apache/spark/pull/1859

> Move reduceByKey definitions next to each other in PairRDDFunctions
> ---
>
> Key: SPARK-1766
> URL: https://issues.apache.org/jira/browse/SPARK-1766
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0
>Reporter: Sandy Ryza
>Assignee: Chris Cope
>Priority: Trivial
> Fix For: 1.1.0
>
>
> Sorry, I know this is pedantic, but I've been browsing the source multiple 
> times and gotten fooled into thinking reduceByKey always requires a 
> partitioner.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2894) spark-shell doesn't accept flags

2014-08-09 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2894.


Resolution: Duplicate

> spark-shell doesn't accept flags
> 
>
> Key: SPARK-2894
> URL: https://issues.apache.org/jira/browse/SPARK-2894
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Sandy Ryza
>Priority: Blocker
>
> {code}
> spark-shell --executor-memory 5G
> bad option '--executor-cores'
> {code}
> This is a regression.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2894) spark-shell doesn't accept flags

2014-08-09 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14091996#comment-14091996
 ] 

Patrick Wendell commented on SPARK-2894:


I closed this in favor of SPARK-2678 since it had a more thorough description.

> spark-shell doesn't accept flags
> 
>
> Key: SPARK-2894
> URL: https://issues.apache.org/jira/browse/SPARK-2894
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Sandy Ryza
>Priority: Blocker
>
> {code}
> spark-shell --executor-memory 5G
> bad option '--executor-cores'
> {code}
> This is a regression.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2678) `Spark-submit` overrides user application options

2014-08-09 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2678:
---

Assignee: Kousuke Saruta  (was: Cheng Lian)

> `Spark-submit` overrides user application options
> -
>
> Key: SPARK-2678
> URL: https://issues.apache.org/jira/browse/SPARK-2678
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.0.1, 1.0.2
>Reporter: Cheng Lian
>Assignee: Kousuke Saruta
>Priority: Blocker
> Fix For: 1.1.0
>
>
> Here is an example:
> {code}
> ./bin/spark-submit --class Foo some.jar --help
> {code}
> Since {{--help}} appears after the primary resource (i.e. {{some.jar}}), it 
> should be recognized as a user application option. But it's actually 
> overridden by {{spark-submit}}, and the {{spark-submit}} help message is 
> shown instead.
> When directly invoking {{spark-submit}}, the constraints here are:
> # Options before the primary resource should be recognized as {{spark-submit}} 
> options
> # Options after the primary resource should be recognized as user application 
> options
> The tricky part is how to handle scripts like {{spark-shell}} that delegate to 
> {{spark-submit}}. These scripts allow users to specify both {{spark-submit}} 
> options like {{--master}} and user-defined application options together. For 
> example, say we'd like to write a new script {{start-thriftserver.sh}} to 
> start the Hive Thrift server; basically we may do this:
> {code}
> $SPARK_HOME/bin/spark-submit --class 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 spark-internal $@
> {code}
> Then the user may call this script like:
> {code}
> ./sbin/start-thriftserver.sh --master spark://some-host:7077 --hiveconf 
> key=value
> {code}
> Notice that all options are captured by {{$@}}. If we put it before 
> {{spark-internal}}, they are all recognized as {{spark-submit}} options, so 
> {{--hiveconf}} won't be passed to {{HiveThriftServer2}}; if we put it after 
> {{spark-internal}}, they *should* all be recognized as options of 
> {{HiveThriftServer2}}, but because of this bug, {{--master}} is still 
> recognized as a {{spark-submit}} option and leads to the right behavior.
> Although currently all scripts using {{spark-submit}} work correctly, we 
> should still fix this bug, because it causes option name collisions between 
> {{spark-submit}} and the user application, and every time we add a new option 
> to {{spark-submit}}, some existing user applications may break. However, 
> fixing this bug may require some incompatible changes.
> The suggested solution is to use {{--}} as a separator between {{spark-submit}} 
> options and user application options. For the Hive Thrift server example 
> above, the user should call it in this way:
> {code}
> ./sbin/start-thriftserver.sh --master spark://some-host:7077 -- --hiveconf 
> key=value
> {code}
> {{SparkSubmitArguments}} would then be responsible for splitting the two sets 
> of options and passing them on correctly.
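
To make the proposed splitting concrete, here is a minimal, self-contained Scala sketch of dividing an argument list at the first {{--}}; the object and method names are illustrative, not the actual {{SparkSubmitArguments}} implementation:

{code}
// Hypothetical sketch of the proposed "--" separator handling; names are illustrative.
object ArgsSplitter {
  /** Everything before the first "--" is treated as spark-submit options,
   *  everything after it as user application options. */
  def split(args: Seq[String]): (Seq[String], Seq[String]) = args.indexOf("--") match {
    case -1  => (args, Seq.empty)
    case idx => (args.take(idx), args.drop(idx + 1))
  }
}

object SplitDemo extends App {
  // The start-thriftserver.sh invocation from the description above.
  val argv = Seq("--master", "spark://some-host:7077", "--", "--hiveconf", "key=value")
  val (submitOpts, appOpts) = ArgsSplitter.split(argv)
  println(submitOpts) // List(--master, spark://some-host:7077)
  println(appOpts)    // List(--hiveconf, key=value)
}
{code}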



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2678) `Spark-submit` overrides user application options

2014-08-09 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2678:
---

Fix Version/s: (was: 1.0.3)

> `Spark-submit` overrides user application options
> -
>
> Key: SPARK-2678
> URL: https://issues.apache.org/jira/browse/SPARK-2678
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.0.1, 1.0.2
>Reporter: Cheng Lian
>Assignee: Kousuke Saruta
>Priority: Blocker
> Fix For: 1.1.0
>
>
> Here is an example:
> {code}
> ./bin/spark-submit --class Foo some.jar --help
> {code}
> Since {{--help}} appears after the primary resource (i.e. {{some.jar}}), it 
> should be recognized as a user application option. But it's actually 
> overridden by {{spark-submit}}, and the {{spark-submit}} help message is 
> shown instead.
> When directly invoking {{spark-submit}}, the constraints here are:
> # Options before the primary resource should be recognized as {{spark-submit}} 
> options
> # Options after the primary resource should be recognized as user application 
> options
> The tricky part is how to handle scripts like {{spark-shell}} that delegate to 
> {{spark-submit}}. These scripts allow users to specify both {{spark-submit}} 
> options like {{--master}} and user-defined application options together. For 
> example, say we'd like to write a new script {{start-thriftserver.sh}} to 
> start the Hive Thrift server; basically we may do this:
> {code}
> $SPARK_HOME/bin/spark-submit --class 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 spark-internal $@
> {code}
> Then the user may call this script like:
> {code}
> ./sbin/start-thriftserver.sh --master spark://some-host:7077 --hiveconf 
> key=value
> {code}
> Notice that all options are captured by {{$@}}. If we put it before 
> {{spark-internal}}, they are all recognized as {{spark-submit}} options, so 
> {{--hiveconf}} won't be passed to {{HiveThriftServer2}}; if we put it after 
> {{spark-internal}}, they *should* all be recognized as options of 
> {{HiveThriftServer2}}, but because of this bug, {{--master}} is still 
> recognized as a {{spark-submit}} option and leads to the right behavior.
> Although currently all scripts using {{spark-submit}} work correctly, we 
> should still fix this bug, because it causes option name collisions between 
> {{spark-submit}} and the user application, and every time we add a new option 
> to {{spark-submit}}, some existing user applications may break. However, 
> fixing this bug may require some incompatible changes.
> The suggested solution is to use {{--}} as a separator between {{spark-submit}} 
> options and user application options. For the Hive Thrift server example 
> above, the user should call it in this way:
> {code}
> ./sbin/start-thriftserver.sh --master spark://some-host:7077 -- --hiveconf 
> key=value
> {code}
> {{SparkSubmitArguments}} would then be responsible for splitting the two sets 
> of options and passing them on correctly.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2678) `Spark-submit` overrides user application options

2014-08-09 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2678:
---

Target Version/s: 1.1.0  (was: 1.1.0, 1.0.3)

> `Spark-submit` overrides user application options
> -
>
> Key: SPARK-2678
> URL: https://issues.apache.org/jira/browse/SPARK-2678
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.0.1, 1.0.2
>Reporter: Cheng Lian
>Assignee: Kousuke Saruta
>Priority: Blocker
> Fix For: 1.1.0
>
>
> Here is an example:
> {code}
> ./bin/spark-submit --class Foo some.jar --help
> {code}
> Since {{--help}} appears after the primary resource (i.e. {{some.jar}}), it 
> should be recognized as a user application option. But it's actually 
> overridden by {{spark-submit}}, and the {{spark-submit}} help message is 
> shown instead.
> When directly invoking {{spark-submit}}, the constraints here are:
> # Options before the primary resource should be recognized as {{spark-submit}} 
> options
> # Options after the primary resource should be recognized as user application 
> options
> The tricky part is how to handle scripts like {{spark-shell}} that delegate to 
> {{spark-submit}}. These scripts allow users to specify both {{spark-submit}} 
> options like {{--master}} and user-defined application options together. For 
> example, say we'd like to write a new script {{start-thriftserver.sh}} to 
> start the Hive Thrift server; basically we may do this:
> {code}
> $SPARK_HOME/bin/spark-submit --class 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 spark-internal $@
> {code}
> Then the user may call this script like:
> {code}
> ./sbin/start-thriftserver.sh --master spark://some-host:7077 --hiveconf 
> key=value
> {code}
> Notice that all options are captured by {{$@}}. If we put it before 
> {{spark-internal}}, they are all recognized as {{spark-submit}} options, so 
> {{--hiveconf}} won't be passed to {{HiveThriftServer2}}; if we put it after 
> {{spark-internal}}, they *should* all be recognized as options of 
> {{HiveThriftServer2}}, but because of this bug, {{--master}} is still 
> recognized as a {{spark-submit}} option and leads to the right behavior.
> Although currently all scripts using {{spark-submit}} work correctly, we 
> should still fix this bug, because it causes option name collisions between 
> {{spark-submit}} and the user application, and every time we add a new option 
> to {{spark-submit}}, some existing user applications may break. However, 
> fixing this bug may require some incompatible changes.
> The suggested solution is to use {{--}} as a separator between {{spark-submit}} 
> options and user application options. For the Hive Thrift server example 
> above, the user should call it in this way:
> {code}
> ./sbin/start-thriftserver.sh --master spark://some-host:7077 -- --hiveconf 
> key=value
> {code}
> {{SparkSubmitArguments}} would then be responsible for splitting the two sets 
> of options and passing them on correctly.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2678) `Spark-submit` overrides user application options

2014-08-09 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14091997#comment-14091997
 ] 

Patrick Wendell commented on SPARK-2678:


This issue was fixed by:
https://github.com/apache/spark/pull/1825

> `Spark-submit` overrides user application options
> -
>
> Key: SPARK-2678
> URL: https://issues.apache.org/jira/browse/SPARK-2678
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.0.1, 1.0.2
>Reporter: Cheng Lian
>Assignee: Kousuke Saruta
>Priority: Blocker
> Fix For: 1.1.0
>
>
> Here is an example:
> {code}
> ./bin/spark-submit --class Foo some.jar --help
> {code}
> Since {{--help}} appears after the primary resource (i.e. {{some.jar}}), it 
> should be recognized as a user application option. But it's actually 
> overridden by {{spark-submit}}, and the {{spark-submit}} help message is 
> shown instead.
> When directly invoking {{spark-submit}}, the constraints here are:
> # Options before the primary resource should be recognized as {{spark-submit}} 
> options
> # Options after the primary resource should be recognized as user application 
> options
> The tricky part is how to handle scripts like {{spark-shell}} that delegate to 
> {{spark-submit}}. These scripts allow users to specify both {{spark-submit}} 
> options like {{--master}} and user-defined application options together. For 
> example, say we'd like to write a new script {{start-thriftserver.sh}} to 
> start the Hive Thrift server; basically we may do this:
> {code}
> $SPARK_HOME/bin/spark-submit --class 
> org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 spark-internal $@
> {code}
> Then the user may call this script like:
> {code}
> ./sbin/start-thriftserver.sh --master spark://some-host:7077 --hiveconf 
> key=value
> {code}
> Notice that all options are captured by {{$@}}. If we put it before 
> {{spark-internal}}, they are all recognized as {{spark-submit}} options, so 
> {{--hiveconf}} won't be passed to {{HiveThriftServer2}}; if we put it after 
> {{spark-internal}}, they *should* all be recognized as options of 
> {{HiveThriftServer2}}, but because of this bug, {{--master}} is still 
> recognized as a {{spark-submit}} option and leads to the right behavior.
> Although currently all scripts using {{spark-submit}} work correctly, we 
> should still fix this bug, because it causes option name collisions between 
> {{spark-submit}} and the user application, and every time we add a new option 
> to {{spark-submit}}, some existing user applications may break. However, 
> fixing this bug may require some incompatible changes.
> The suggested solution is to use {{--}} as a separator between {{spark-submit}} 
> options and user application options. For the Hive Thrift server example 
> above, the user should call it in this way:
> {code}
> ./sbin/start-thriftserver.sh --master spark://some-host:7077 -- --hiveconf 
> key=value
> {code}
> {{SparkSubmitArguments}} would then be responsible for splitting the two sets 
> of options and passing them on correctly.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2952) Enable logging actor messages at DEBUG level

2014-08-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-2952:
--

 Summary: Enable logging actor messages at DEBUG level
 Key: SPARK-2952
 URL: https://issues.apache.org/jira/browse/SPARK-2952
 Project: Spark
  Issue Type: Bug
Reporter: Reynold Xin
Assignee: Reynold Xin


Logging actor messages can be very useful for debugging what went wrong in a 
distributed setting. For example, yesterday we ran into a problem in which we 
had no idea what was going on with the actor messages.
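
As a rough illustration only (not the actual change; the trait and method names are hypothetical, assuming akka-actor and slf4j on the classpath), an actor's {{receive}} can be wrapped so every incoming message is logged at DEBUG before it is handled:

{code}
import akka.actor.Actor
import org.slf4j.LoggerFactory

// Hypothetical mix-in: log every incoming message at DEBUG, then delegate.
trait DebugLogReceive { this: Actor =>
  private val log = LoggerFactory.getLogger(getClass)

  // Actors mixing this in put their message handling here instead of in receive.
  protected def receiveWithLogging: Actor.Receive

  override def receive: Actor.Receive = {
    case msg =>
      if (log.isDebugEnabled) {
        log.debug(s"Received message ${msg.getClass.getSimpleName}: $msg")
      }
      if (receiveWithLogging.isDefinedAt(msg)) receiveWithLogging(msg) else unhandled(msg)
  }
}
{code}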





--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2952) Enable logging actor messages at DEBUG level

2014-08-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092012#comment-14092012
 ] 

Apache Spark commented on SPARK-2952:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/1870

> Enable logging actor messages at DEBUG level
> 
>
> Key: SPARK-2952
> URL: https://issues.apache.org/jira/browse/SPARK-2952
> Project: Spark
>  Issue Type: Bug
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Logging actor messages can be very useful for debugging what went wrong in a 
> distributed setting. For example, yesterday we ran into a problem in which we 
> had no idea what was going on with the actor messages.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2952) Enable logging actor messages at DEBUG level

2014-08-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2952:
---

Target Version/s: 1.1.0

> Enable logging actor messages at DEBUG level
> 
>
> Key: SPARK-2952
> URL: https://issues.apache.org/jira/browse/SPARK-2952
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Logging actor messages can be very useful for debugging what went wrong in a 
> distributed setting. For example, yesterday we ran into a problem in which we 
> had no idea what was going on with the actor messages.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2952) Enable logging actor messages at DEBUG level

2014-08-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2952:
---

Component/s: Spark Core

> Enable logging actor messages at DEBUG level
> 
>
> Key: SPARK-2952
> URL: https://issues.apache.org/jira/browse/SPARK-2952
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Logging actor messages can be very useful for debugging what went wrong in a 
> distributed setting. For example, yesterday we ran into a problem in which we 
> had no idea what was going on with the actor messages.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2907) Use mutable.HashMap to represent Model in Word2Vec

2014-08-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2907?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092014#comment-14092014
 ] 

Apache Spark commented on SPARK-2907:
-

User 'Ishiihara' has created a pull request for this issue:
https://github.com/apache/spark/pull/1871

> Use mutable.HashMap to represent Model in Word2Vec
> --
>
> Key: SPARK-2907
> URL: https://issues.apache.org/jira/browse/SPARK-2907
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Liquan Pei
>Assignee: Liquan Pei
>
> Use a mutable.HashMap to represent the Word2Vec model to reduce memory 
> footprint and shuffle size.
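
As a hedged sketch of the idea only (the class and method names below are made up, not MLlib's {{Word2VecModel}} API), keeping the learned vectors in a {{mutable.HashMap}} keyed by word could look roughly like this:

{code}
import scala.collection.mutable

// Illustrative sketch: word vectors stored in a mutable.HashMap keyed by word.
// Class and method names are hypothetical, not MLlib's actual implementation.
class WordVectorStore(vectorSize: Int) {
  private val vectors = mutable.HashMap.empty[String, Array[Float]]

  def put(word: String, vector: Array[Float]): Unit = {
    require(vector.length == vectorSize, s"expected $vectorSize dimensions")
    vectors(word) = vector
  }

  // Cosine similarity between two stored words; None if either word is unknown.
  def similarity(a: String, b: String): Option[Double] =
    for (va <- vectors.get(a); vb <- vectors.get(b)) yield {
      val dot = va.zip(vb).map { case (x, y) => x * y }.sum.toDouble
      val na  = math.sqrt(va.map(x => x.toDouble * x).sum)
      val nb  = math.sqrt(vb.map(x => x.toDouble * x).sum)
      if (na == 0.0 || nb == 0.0) 0.0 else dot / (na * nb)
    }
}
{code}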



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2953) Allow using short names for io compression codecs

2014-08-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2953:
---

Issue Type: Improvement  (was: Bug)

> Allow using short names for io compression codecs
> -
>
> Key: SPARK-2953
> URL: https://issues.apache.org/jira/browse/SPARK-2953
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Instead of requiring "org.apache.spark.io.LZ4CompressionCodec", it is easier 
> to just accept "lz4", "lzf", "snappy".



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2953) Allow using short names for io compression codecs

2014-08-09 Thread Reynold Xin (JIRA)
Reynold Xin created SPARK-2953:
--

 Summary: Allow using short names for io compression codecs
 Key: SPARK-2953
 URL: https://issues.apache.org/jira/browse/SPARK-2953
 Project: Spark
  Issue Type: Bug
Reporter: Reynold Xin
Assignee: Reynold Xin


Instead of requiring "org.apache.spark.io.LZ4CompressionCodec", it is easier to 
just accept "lz4", "lzf", "snappy".



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2953) Allow using short names for io compression codecs

2014-08-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2953:
---

Component/s: Spark Core

> Allow using short names for io compression codecs
> -
>
> Key: SPARK-2953
> URL: https://issues.apache.org/jira/browse/SPARK-2953
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Instead of requiring "org.apache.spark.io.LZ4CompressionCodec", it is easier 
> to just accept "lz4", "lzf", "snappy".



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2953) Allow using short names for io compression codecs

2014-08-09 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-2953:
---

Description: Instead of requiring 
"org.apache.spark.io.LZ4CompressionCodec", it is easier for users if Spark just 
accepts "lz4", "lzf", "snappy".  (was: Instead of requiring 
"org.apache.spark.io.LZ4CompressionCodec", it is easier to just accept "lz4", 
"lzf", "snappy".)

> Allow using short names for io compression codecs
> -
>
> Key: SPARK-2953
> URL: https://issues.apache.org/jira/browse/SPARK-2953
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Instead of requiring "org.apache.spark.io.LZ4CompressionCodec", it is easier 
> for users if Spark just accepts "lz4", "lzf", "snappy".
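
A minimal sketch of what such a short-name lookup could look like (the object, map, and helper below are assumptions for illustration, not the actual {{CompressionCodec}} change):

{code}
// Illustrative sketch of mapping short codec names to fully qualified class names.
// The object and helper are assumptions, not Spark's actual implementation.
object CodecShortNames {
  private val shortNames: Map[String, String] = Map(
    "lz4"    -> "org.apache.spark.io.LZ4CompressionCodec",
    "lzf"    -> "org.apache.spark.io.LZFCompressionCodec",
    "snappy" -> "org.apache.spark.io.SnappyCompressionCodec")

  /** Resolves a user-supplied codec setting: short names are expanded,
   *  anything else is assumed to already be a fully qualified class name. */
  def resolve(codecName: String): String =
    shortNames.getOrElse(codecName.toLowerCase, codecName)
}

// Example: both forms resolve to the same class name.
// CodecShortNames.resolve("snappy") == "org.apache.spark.io.SnappyCompressionCodec"
{code}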



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2953) Allow using short names for io compression codecs

2014-08-09 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14092034#comment-14092034
 ] 

Apache Spark commented on SPARK-2953:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/1873

> Allow using short names for io compression codecs
> -
>
> Key: SPARK-2953
> URL: https://issues.apache.org/jira/browse/SPARK-2953
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> Instead of requiring "org.apache.spark.io.LZ4CompressionCodec", it is easier 
> for users if Spark just accepts "lz4", "lzf", "snappy".



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2947) DAGScheduler scheduling infinite loop

2014-08-09 Thread Guoqiang Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guoqiang Li updated SPARK-2947:
---

Summary: DAGScheduler scheduling infinite loop  (was: DAGScheduler 
scheduling dead cycle)

> DAGScheduler scheduling infinite loop
> -
>
> Key: SPARK-2947
> URL: https://issues.apache.org/jira/browse/SPARK-2947
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.0.2
>Reporter: Guoqiang Li
>Priority: Blocker
> Fix For: 1.1.0, 1.0.3
>
>
> The stage is resubmitted more than 5 times.
> This seems to be caused by {{FetchFailed.bmAddress}} being null.
> I don't know how to reproduce it.
> master log:
> {noformat}
> 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Starting task 1.189:276 as 
> TID 52334 on executor 82: sanshan (PROCESS_LOCAL)
> 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Serialized task 1.189:276 as 
> 3060 bytes in 0 ms
> 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Starting task 1.189:277 as 
> TID 52335 on executor 78: tuan231 (PROCESS_LOCAL)
> 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Serialized task 1.189:277 as 
> 3060 bytes in 0 ms
> 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Lost TID 52199 (task 
> 1.189:141)
> 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch 
> failure from null
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
> DealCF.scala:215) for resubmision due to a fetch failure
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from 
> Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
> 14/08/09 21:50:17 WARN scheduler.TaskSetManager: Loss was due to fetch 
> failure from null
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
> DealCF.scala:215) for resubmision due to a fetch failure
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from 
> Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
>  -- 5 times ---
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Marking Stage 1 (distinct at 
> DealCF.scala:215) for resubmision due to a fetch failure
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: The failed fetch was from 
> Stage 2 (flatMap at DealCF.scala:207); marking it for resubmission
> 14/08/09 21:50:17 INFO cluster.YarnClientClusterScheduler: Removed TaskSet 
> 1.189, whose tasks have all completed, from pool 
> 14/08/09 21:50:17 INFO scheduler.TaskSetManager: Finished TID 1869 in 87398 
> ms on jilin (progress: 280/280)
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Completed ShuffleMapTask(2, 
> 269)
> 14/08/09 21:50:17 INFO cluster.YarnClientClusterScheduler: Removed TaskSet 
> 2.1, whose tasks have all completed, from pool 
> 14/08/09 21:50:17 INFO scheduler.DAGScheduler: Stage 2 (flatMap at 
> DealCF.scala:207) finished in 129.544 s
> {noformat}
> worker: log
> {noformat}
> /1408/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_57 not found, 
> computing it
> 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_191 not found, 
> computing it
> 14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
> task 18017
> 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18017
> 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
> task 18151
> 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18151
> 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_86 not found, 
> computing it
> 14/08/09 21:49:41 INFO spark.CacheManager: Partition rdd_23_220 not found, 
> computing it
> 14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
> task 18285
> 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18285
> 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally
> 14/08/09 21:49:41 INFO executor.CoarseGrainedExecutorBackend: Got assigned 
> task 18419
> 14/08/09 21:49:41 INFO executor.Executor: Running task ID 18419
> 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_0 locally
> 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_1 locally
> 14/08/09 21:49:41 INFO storage.BlockManager: Found block broadcast_2 locally
>