[jira] [Updated] (SPARK-3015) Removing broadcast in quick successions causes Akka timeout

2014-08-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3015:
---

Affects Version/s: (was: 1.0.2)
   1.1.0

> Removing broadcast in quick successions causes Akka timeout
> ---
>
> Key: SPARK-3015
> URL: https://issues.apache.org/jira/browse/SPARK-3015
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
> Environment: Standalone EC2 Spark shell
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.1.0
>
>
> This issue was originally reported in SPARK-2916 in the context of MLlib, but 
> we were able to reproduce it with a simple Spark shell command:
> {code}
> (1 to 1).foreach { i => sc.parallelize(1 to 1000, 48).sum }
> {code}
> We still do not have a full understanding of the issue, but we have gleaned 
> the following information so far. When the driver runs a GC, it attempts to 
> clean up all the broadcast blocks that go out of scope at once. This causes 
> the driver to send out many blocking RemoveBroadcast messages to the 
> executors, which in turn send out blocking UpdateBlockInfo messages back to 
> the driver. Both of these calls block until they receive the expected 
> responses. We suspect that the high frequency at which we send these blocking 
> messages is the cause of either dropped messages or internal deadlock 
> somewhere.
> Unfortunately, how readily this reproduces depends heavily on the environment: 
> we have been able to reproduce it on a 6-node cluster in us-west-2, for 
> instance, but not in us-west-1.
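As an illustration of the mechanism described above (not the exact reproduction
from the report), the following spark-shell sketch runs many small jobs and
periodically forces a GC so that the driver's ContextCleaner tries to remove a
batch of out-of-scope broadcast blocks at once. The loop bound and GC cadence
are arbitrary, and an existing SparkContext `sc` is assumed:

{code}
// Hypothetical illustration only: run many small jobs, then force GC so the
// driver's ContextCleaner issues a burst of RemoveBroadcast messages.
// Assumes a running spark-shell, i.e. an existing SparkContext `sc`.
(1 to 10000).foreach { i =>
  sc.parallelize(1 to 1000, 48).sum
  if (i % 1000 == 0) System.gc()  // encourage cleanup of out-of-scope broadcasts
}
{code}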






[jira] [Commented] (SPARK-3015) Removing broadcast in quick successions causes Akka timeout

2014-08-15 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099517#comment-14099517
 ] 

Patrick Wendell commented on SPARK-3015:


[~andrewor] I changed "Affects version/s" to 1.1.0 instead of 1.0.2 because I 
don't think this issue was ever seen in Spark 1.0.2. Is that correct?

> Removing broadcast in quick successions causes Akka timeout
> ---
>
> Key: SPARK-3015
> URL: https://issues.apache.org/jira/browse/SPARK-3015
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
> Environment: Standalone EC2 Spark shell
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.1.0
>
>
> This issue was originally reported in SPARK-2916 in the context of MLlib, but 
> we were able to reproduce it with a simple Spark shell command:
> {code}
> (1 to 1).foreach { i => sc.parallelize(1 to 1000, 48).sum }
> {code}
> We still do not have a full understanding of the issue, but we have gleaned 
> the following information so far. When the driver runs a GC, it attempts to 
> clean up all the broadcast blocks that go out of scope at once. This causes 
> the driver to send out many blocking RemoveBroadcast messages to the 
> executors, which in turn send out blocking UpdateBlockInfo messages back to 
> the driver. Both of these calls block until they receive the expected 
> responses. We suspect that the high frequency at which we send these blocking 
> messages is the cause of either dropped messages or internal deadlock 
> somewhere.
> Unfortunately, how readily this reproduces depends heavily on the environment: 
> we have been able to reproduce it on a 6-node cluster in us-west-2, for 
> instance, but not in us-west-1.






[jira] [Resolved] (SPARK-2916) [MLlib] While running regression tests with dense vectors of length greater than 1000, the treeAggregate blows up after several iterations

2014-08-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2916.


Resolution: Fixed

Fixed by virtue of SPARK-3015

> [MLlib] While running regression tests with dense vectors of length greater 
> than 1000, the treeAggregate blows up after several iterations
> --
>
> Key: SPARK-2916
> URL: https://issues.apache.org/jira/browse/SPARK-2916
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, Spark Core
>Reporter: Burak Yavuz
>Priority: Blocker
>
> While running any of the regression algorithms with gradient descent, the 
> treeAggregate blows up after several iterations.
> Observed on an EC2 cluster with 16 nodes and matrix dimensions of 1,000,000 x 5,000.
> To replicate the problem, call aggregate multiple times, perhaps 50-60 times.
> Testing led to a possible workaround: setting 
> `spark.cleaner.referenceTracking false`
> seems to help, so the problem is most probably related to the cleanup.
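A minimal sketch of applying the workaround quoted above when constructing the
context. The key `spark.cleaner.referenceTracking` is the one named in the
report; note the trade-off (an assumption worth stating): with reference
tracking disabled, out-of-scope broadcast and shuffle blocks are no longer
cleaned up automatically.

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Sketch of the reported workaround: disable reference tracking so the driver
// never sends the bursts of blocking RemoveBroadcast messages.
// Assumed trade-off: stale broadcast/shuffle blocks accumulate over time.
val conf = new SparkConf()
  .setAppName("regression-with-cleanup-workaround")
  .set("spark.cleaner.referenceTracking", "false")
val sc = new SparkContext(conf)
{code}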






[jira] [Resolved] (SPARK-3015) Removing broadcast in quick successions causes Akka timeout

2014-08-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3015.


Resolution: Fixed

Issue resolved by pull request 1931
[https://github.com/apache/spark/pull/1931]

> Removing broadcast in quick successions causes Akka timeout
> ---
>
> Key: SPARK-3015
> URL: https://issues.apache.org/jira/browse/SPARK-3015
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2
> Environment: Standalone EC2 Spark shell
>Reporter: Andrew Or
>Priority: Blocker
> Fix For: 1.1.0
>
>
> This issue was originally reported in SPARK-2916 in the context of MLlib, but 
> we were able to reproduce it with a simple Spark shell command:
> {code}
> (1 to 1).foreach { i => sc.parallelize(1 to 1000, 48).sum }
> {code}
> We still do not have a full understanding of the issue, but we have gleaned 
> the following information so far. When the driver runs a GC, it attempts to 
> clean up all the broadcast blocks that go out of scope at once. This causes 
> the driver to send out many blocking RemoveBroadcast messages to the 
> executors, which in turn send out blocking UpdateBlockInfo messages back to 
> the driver. Both of these calls block until they receive the expected 
> responses. We suspect that the high frequency at which we send these blocking 
> messages is the cause of either dropped messages or internal deadlock 
> somewhere.
> Unfortunately, how readily this reproduces depends heavily on the environment: 
> we have been able to reproduce it on a 6-node cluster in us-west-2, for 
> instance, but not in us-west-1.






[jira] [Updated] (SPARK-3015) Removing broadcast in quick successions causes Akka timeout

2014-08-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3015?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3015:
---

Assignee: Andrew Or

> Removing broadcast in quick successions causes Akka timeout
> ---
>
> Key: SPARK-3015
> URL: https://issues.apache.org/jira/browse/SPARK-3015
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2
> Environment: Standalone EC2 Spark shell
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Blocker
> Fix For: 1.1.0
>
>
> This issue was originally reported in SPARK-2916 in the context of MLlib, but 
> we were able to reproduce it with a simple Spark shell command:
> {code}
> (1 to 1).foreach { i => sc.parallelize(1 to 1000, 48).sum }
> {code}
> We still do not have a full understanding of the issue, but we have gleaned 
> the following information so far. When the driver runs a GC, it attempts to 
> clean up all the broadcast blocks that go out of scope at once. This causes 
> the driver to send out many blocking RemoveBroadcast messages to the 
> executors, which in turn send out blocking UpdateBlockInfo messages back to 
> the driver. Both of these calls block until they receive the expected 
> responses. We suspect that the high frequency at which we send these blocking 
> messages is the cause of either dropped messages or internal deadlock 
> somewhere.
> Unfortunately, how readily this reproduces depends heavily on the environment: 
> we have been able to reproduce it on a 6-node cluster in us-west-2, for 
> instance, but not in us-west-1.






[jira] [Commented] (SPARK-3077) ChiSqTest bugs

2014-08-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099498#comment-14099498
 ] 

Apache Spark commented on SPARK-3077:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/1982

> ChiSqTest bugs
> --
>
> Key: SPARK-3077
> URL: https://issues.apache.org/jira/browse/SPARK-3077
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: Doris Xin
>Assignee: Xiangrui Meng
>
> - promote nullHypothesis field in ChiSqTestResult to TestResult. Every test 
> should have a null hypothesis
> - Correct null hypothesis statement for independence test
> - line 59 in TestResult: 0.05 -> 0.5
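A hedged sketch of what the first item above might look like once
nullHypothesis is promoted from the concrete ChiSqTestResult into the shared
TestResult trait. The names follow MLlib's org.apache.spark.mllib.stat.test
package, but the exact signatures are illustrative, not the merged code:

{code}
// Illustrative only: shape of the proposed promotion of nullHypothesis.
trait TestResult[DF] {
  def pValue: Double
  def degreesOfFreedom: DF
  def statistic: Double
  def nullHypothesis: String   // promoted here so every test carries one
}

class ChiSqTestResult(
    override val pValue: Double,
    override val degreesOfFreedom: Int,
    override val statistic: Double,
    val method: String,
    override val nullHypothesis: String) extends TestResult[Int]
{code}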






[jira] [Updated] (SPARK-3077) ChiSqTest bugs

2014-08-15 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3077?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3077:
-

Assignee: Xiangrui Meng

> ChiSqTest bugs
> --
>
> Key: SPARK-3077
> URL: https://issues.apache.org/jira/browse/SPARK-3077
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Reporter: Doris Xin
>Assignee: Xiangrui Meng
>
> - promote nullHypothesis field in ChiSqTestResult to TestResult. Every test 
> should have a null hypothesis
> - Correct null hypothesis statement for independence test
> - line 59 in TestResult: 0.05 -> 0.5






[jira] [Resolved] (SPARK-3001) Improve Spearman's correlation

2014-08-15 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-3001.
--

   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 1917
[https://github.com/apache/spark/pull/1917]

> Improve Spearman's correlation
> --
>
> Key: SPARK-3001
> URL: https://issues.apache.org/jira/browse/SPARK-3001
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
> Fix For: 1.1.0
>
>
> The current implementation requires sorting individual columns, which could 
> be done with a global sort.
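To make the suggestion concrete, here is a hedged sketch of the global-sort
idea, assuming the input is an RDD of row vectors: tag each matrix entry with
its (column, row) position, sort all entries once by (column, value), and read
column-wise ranks off the sorted positions. It ignores ties and uses
illustrative names only; the merged implementation may differ.

{code}
import org.apache.spark.SparkContext._   // pair/ordered RDD functions in Spark 1.x
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// One global sort instead of one sort per column (ties ignored for brevity).
// Within a column the global positions differ from true ranks only by a
// constant offset, which Pearson correlation on the ranks is insensitive to.
def columnRanks(rows: RDD[Vector]): RDD[((Int, Long), Double)] = {
  val entries = rows.zipWithIndex().flatMap { case (v, rowId) =>
    v.toArray.zipWithIndex.map { case (value, colId) => ((colId, value), rowId) }
  }
  entries.sortByKey()      // single global sort over all (column, value) keys
    .zipWithIndex()        // position in the global order
    .map { case (((colId, _), rowId), pos) => ((colId, rowId), pos.toDouble) }
}
{code}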






[jira] [Resolved] (SPARK-3078) Make LRWithLBFGS API consistent with others

2014-08-15 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-3078.
--

   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 1973
[https://github.com/apache/spark/pull/1973]

> Make LRWithLBFGS API consistent with others
> ---
>
> Key: SPARK-3078
> URL: https://issues.apache.org/jira/browse/SPARK-3078
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
> Fix For: 1.1.0
>
>
> Should ask users to use optimizer to set parameters.
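For context, a minimal sketch of the consistent usage for the existing
LogisticRegressionWithLBFGS class, with parameters set through its `optimizer`
member rather than through dedicated setters. The setter names on the optimizer
are assumptions here; parameter values are arbitrary.

{code}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Sketch: configure L-BFGS through the exposed `optimizer`, as other MLlib
// algorithms do. Setter names assumed; check the LBFGS API for exact methods.
def trainModel(data: RDD[LabeledPoint]) = {
  val lr = new LogisticRegressionWithLBFGS()
  lr.optimizer
    .setNumIterations(100)    // arbitrary example values
    .setConvergenceTol(1e-4)
  lr.run(data)
}
{code}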






[jira] [Commented] (SPARK-3038) delete history server logs when there are too many logs

2014-08-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099485#comment-14099485
 ] 

Apache Spark commented on SPARK-3038:
-

User 'scwf' has created a pull request for this issue:
https://github.com/apache/spark/pull/1981

> delete history server logs when there are too many logs 
> 
>
> Key: SPARK-3038
> URL: https://issues.apache.org/jira/browse/SPARK-3038
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.1
>Reporter: wangfei
> Fix For: 1.1.0
>
>
> Enhance the history server to delete logs automatically:
> 1. use spark.history.deletelogs.enable to enable this function
> 2. when the number of stored app logs is greater than 
> spark.history.maxsavedapplication, delete the older logs
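A hedged sketch of the proposed policy, using the configuration names from this
description (which may not be the final names) and plain java.io file
operations; the real history server works against its configured log directory
and filesystem abstraction.

{code}
import java.io.File

// Illustrative sketch only: if automatic deletion is enabled and more
// application logs are stored than the configured maximum, drop the oldest.
def cleanOldLogs(logDir: File, deleteEnabled: Boolean, maxSavedApplications: Int): Unit = {
  if (!deleteEnabled) return
  val appLogs = Option(logDir.listFiles()).getOrElse(Array.empty[File]).filter(_.isDirectory)
  if (appLogs.length > maxSavedApplications) {
    appLogs.sortBy(_.lastModified())                  // oldest first
      .take(appLogs.length - maxSavedApplications)
      .foreach { dir =>
        Option(dir.listFiles()).getOrElse(Array.empty[File]).foreach(_.delete())
        dir.delete()
      }
  }
}
{code}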






[jira] [Commented] (SPARK-2750) Add Https support for Web UI

2014-08-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099477#comment-14099477
 ] 

Apache Spark commented on SPARK-2750:
-

User 'scwf' has created a pull request for this issue:
https://github.com/apache/spark/pull/1980

> Add Https support for Web UI
> 
>
> Key: SPARK-2750
> URL: https://issues.apache.org/jira/browse/SPARK-2750
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Reporter: WangTaoTheTonic
>  Labels: https, ssl, webui
> Fix For: 1.0.3
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> I am trying to add HTTPS support for the web UI using Jetty's SSL integration. 
> Below is the plan:
> 1. Web UI covers the Master UI, Worker UI, HistoryServer UI, and Spark UI. Users 
> can switch between HTTPS and HTTP by configuring "spark.http.policy" as a JVM 
> property for each process, with HTTP as the default.
> 2. The web ports of the Master and Workers are decided in order of launch 
> arguments, JVM property, system environment, and default port.
> 3. Other configuration items:
> spark.ssl.server.keystore.location  The file or URL of the SSL key store
> spark.ssl.server.keystore.password  The password for the key store
> spark.ssl.server.keystore.keypassword  The password (if any) for the specific 
> key within the key store
> spark.ssl.server.keystore.type  The type of the key store (default "JKS")
> spark.client.https.need-auth  True if SSL needs client authentication
> spark.ssl.server.truststore.location  The file name or URL of the trust store 
> location
> spark.ssl.server.truststore.password  The password for the trust store
> spark.ssl.server.truststore.type  The type of the trust store (default "JKS")
> Any feedback is welcome!
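A small hedged sketch of how a server component might consult the proposed
properties. The key names are copied from the list above and are the proposal's,
not necessarily what was eventually merged.

{code}
import org.apache.spark.SparkConf

// Illustrative only: read the proposed HTTPS-related settings.
val conf = new SparkConf()
val useHttps  = conf.get("spark.http.policy", "http").equalsIgnoreCase("https")
val keyStore  = conf.getOption("spark.ssl.server.keystore.location")
val storePass = conf.getOption("spark.ssl.server.keystore.password")
val storeType = conf.get("spark.ssl.server.keystore.type", "JKS")
{code}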






[jira] [Commented] (SPARK-1987) More memory-efficient graph construction

2014-08-15 Thread Larry Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099476#comment-14099476
 ] 

Larry Xiao commented on SPARK-1987:
---

OK, I understand. I'll try to implement it.

> More memory-efficient graph construction
> 
>
> Key: SPARK-1987
> URL: https://issues.apache.org/jira/browse/SPARK-1987
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>
> A graph's edges are usually the largest component of the graph. GraphX 
> currently stores edges in parallel primitive arrays, so each edge should only 
> take 20 bytes to store (srcId: Long, dstId: Long, attr: Int). However, the 
> current implementation in EdgePartitionBuilder uses an array of Edge objects 
> as an intermediate representation for sorting, so each edge additionally 
> takes about 40 bytes during graph construction (srcId (8) + dstId (8) + attr 
> (4) + uncompressed pointer (8) + object overhead (8) + padding (4)). This 
> unnecessarily increases GraphX's memory requirements by a factor of 3.
> To save memory, EdgePartitionBuilder should instead use a custom sort routine 
> that operates directly on the three parallel arrays.
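As a concrete illustration of the proposed fix, here is a hedged sketch of an
in-place sort over the three parallel arrays, keyed by (srcId, dstId), that
never materializes Edge objects. The actual EdgePartitionBuilder change may use
a different sorting strategy.

{code}
// Illustrative only: sort three parallel edge arrays together, in place.
def sortEdges(srcIds: Array[Long], dstIds: Array[Long], attrs: Array[Int]): Unit = {
  def swap(i: Int, j: Int): Unit = {
    val s = srcIds(i); srcIds(i) = srcIds(j); srcIds(j) = s
    val d = dstIds(i); dstIds(i) = dstIds(j); dstIds(j) = d
    val a = attrs(i);  attrs(i)  = attrs(j);  attrs(j)  = a
  }
  // Compare edge i to edge j by (srcId, dstId) without building Edge objects.
  def lessThan(i: Int, j: Int): Boolean =
    srcIds(i) < srcIds(j) || (srcIds(i) == srcIds(j) && dstIds(i) < dstIds(j))
  def quickSort(lo: Int, hi: Int): Unit = {
    if (lo >= hi) return
    swap(lo + (hi - lo) / 2, hi)        // move pivot to the end
    var store = lo
    var i = lo
    while (i < hi) {
      if (lessThan(i, hi)) { swap(i, store); store += 1 }
      i += 1
    }
    swap(store, hi)                     // pivot into its final position
    quickSort(lo, store - 1)
    quickSort(store + 1, hi)
  }
  quickSort(0, srcIds.length - 1)
}
{code}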






[jira] [Commented] (SPARK-3081) Rename RandomRDDGenerators to RandomRDDs

2014-08-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099449#comment-14099449
 ] 

Apache Spark commented on SPARK-3081:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/1979

> Rename RandomRDDGenerators to RandomRDDs
> 
>
> Key: SPARK-3081
> URL: https://issues.apache.org/jira/browse/SPARK-3081
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> `RandomRDDGenerators` suggests a factory for `RandomRDDGenerator`, but its 
> methods return RDDs, so a more fitting and shorter name would be `RandomRDDs`.






[jira] [Created] (SPARK-3081) Rename RandomRDDGenerators to RandomRDDs

2014-08-15 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-3081:


 Summary: Rename RandomRDDGenerators to RandomRDDs
 Key: SPARK-3081
 URL: https://issues.apache.org/jira/browse/SPARK-3081
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.1.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


`RandomRDDGenerators` suggests a factory for `RandomRDDGenerator`, but its 
methods return RDDs, so a more fitting and shorter name would be `RandomRDDs`.
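For illustration, usage after the proposed rename might read as follows (the
method name normalRDD is assumed to carry over unchanged, and an existing
SparkContext `sc` is assumed, e.g. in spark-shell):

{code}
import org.apache.spark.mllib.random.RandomRDDs

// The factory methods return RDDs, so the object name now matches what you get.
val samples = RandomRDDs.normalRDD(sc, 1000L)   // RDD[Double] of standard normal draws
{code}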






[jira] [Commented] (SPARK-1987) More memory-efficient graph construction

2014-08-15 Thread Ankur Dave (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099438#comment-14099438
 ] 

Ankur Dave commented on SPARK-1987:
---

[~larryxiao] I was thinking of sorting the 3 primitive arrays directly rather 
than first putting them into an array of Edge objects. Though each edge will 
then be spread across 3 arrays, I think it shouldn't hurt locality too much, 
since we already have to access 2 memory locations per edge (the pointer and 
the referenced Edge object). Also, it will be more compact and hopefully make 
better use of the cache.

> More memory-efficient graph construction
> 
>
> Key: SPARK-1987
> URL: https://issues.apache.org/jira/browse/SPARK-1987
> Project: Spark
>  Issue Type: Improvement
>  Components: GraphX
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>
> A graph's edges are usually the largest component of the graph. GraphX 
> currently stores edges in parallel primitive arrays, so each edge should only 
> take 20 bytes to store (srcId: Long, dstId: Long, attr: Int). However, the 
> current implementation in EdgePartitionBuilder uses an array of Edge objects 
> as an intermediate representation for sorting, so each edge additionally 
> takes about 40 bytes during graph construction (srcId (8) + dstId (8) + attr 
> (4) + uncompressed pointer (8) + object overhead (8) + padding (4)). This 
> unnecessarily increases GraphX's memory requirements by a factor of 3.
> To save memory, EdgePartitionBuilder should instead use a custom sort routine 
> that operates directly on the three parallel arrays.






[jira] [Updated] (SPARK-3080) ArrayIndexOutOfBoundsException in ALS for Large datasets

2014-08-15 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz updated SPARK-3080:
---

Description: 
The stack trace is below:

{quote}
java.lang.ArrayIndexOutOfBoundsException: 2716

org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543)
scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)

org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537)

org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505)

org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504)

org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)

org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138)

org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)

org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)

scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)

scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
{quote}
This happened after the dataset was sub-sampled. 
Dataset properties: ~12B ratings


  was:
The stack trace is below:

```
java.lang.ArrayIndexOutOfBoundsException: 2716

org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543)
scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)

org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537)

org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505)

org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504)

org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)

org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138)

org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)

org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)

scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)

scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)

[jira] [Updated] (SPARK-3080) ArrayIndexOutOfBoundsException in ALS for Large datasets

2014-08-15 Thread Burak Yavuz (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3080?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Burak Yavuz updated SPARK-3080:
---

Description: 
The stack trace is below:

{quote}
java.lang.ArrayIndexOutOfBoundsException: 2716

org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543)
scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)

org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537)

org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505)

org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504)

org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)

org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138)

org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)

org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)

scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)

scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
{quote}
This happened after the dataset was sub-sampled. 
Dataset properties: ~12B ratings
Setup: 55 r3.8xlarge ec2 instances

  was:
The stack trace is below:

{quote}
java.lang.ArrayIndexOutOfBoundsException: 2716

org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543)
scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)

org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537)

org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505)

org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504)

org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)

org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138)

org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)

org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)

scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)

scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

org.apache.spark.rdd.FlatMappedValuesRDD

[jira] [Created] (SPARK-3080) ArrayIndexOutOfBoundsException in ALS for Large datasets

2014-08-15 Thread Burak Yavuz (JIRA)
Burak Yavuz created SPARK-3080:
--

 Summary: ArrayIndexOutOfBoundsException in ALS for Large datasets
 Key: SPARK-3080
 URL: https://issues.apache.org/jira/browse/SPARK-3080
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Burak Yavuz


The stack trace is below:

```
java.lang.ArrayIndexOutOfBoundsException: 2716

org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:543)
scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)

org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:537)

org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:505)

org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:504)

org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)

org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
scala.collection.Iterator$$anon$11.next(Iterator.scala:328)

org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138)

org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)

org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)

scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)

scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
```
This happened after the dataset was sub-sampled. 
Dataset properties: ~12B ratings







[jira] [Resolved] (SPARK-3046) Set executor's class loader as the default serializer class loader

2014-08-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-3046.


   Resolution: Fixed
Fix Version/s: 1.1.0

> Set executor's class loader as the default serializer class loader
> --
>
> Key: SPARK-3046
> URL: https://issues.apache.org/jira/browse/SPARK-3046
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.1.0
>
>
> This is an attempt to fix the problem outlined in SPARK-2878.






[jira] [Updated] (SPARK-3046) Set executor's class loader as the default serializer class loader

2014-08-15 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3046?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-3046:
---

Component/s: Spark Core

> Set executor's class loader as the default serializer class loader
> --
>
> Key: SPARK-3046
> URL: https://issues.apache.org/jira/browse/SPARK-3046
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Fix For: 1.1.0
>
>
> This is an attempt to fix the problem outlined in SPARK-2878.






[jira] [Commented] (SPARK-3073) improve large sort (external sort) for PySpark

2014-08-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099400#comment-14099400
 ] 

Apache Spark commented on SPARK-3073:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/1978

> improve large sort (external sort) for PySpark
> --
>
> Key: SPARK-3073
> URL: https://issues.apache.org/jira/browse/SPARK-3073
> Project: Spark
>  Issue Type: Improvement
>Reporter: Davies Liu
>Assignee: Davies Liu
>







[jira] [Commented] (SPARK-3074) support groupByKey() with hot keys in PySpark

2014-08-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099399#comment-14099399
 ] 

Apache Spark commented on SPARK-3074:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/1977

> support groupByKey() with hot keys in PySpark
> -
>
> Key: SPARK-3074
> URL: https://issues.apache.org/jira/browse/SPARK-3074
> Project: Spark
>  Issue Type: Improvement
>Reporter: Davies Liu
>Assignee: Davies Liu
>







[jira] [Commented] (SPARK-2977) Fix handling of short shuffle manager names in ShuffleBlockManager

2014-08-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099365#comment-14099365
 ] 

Apache Spark commented on SPARK-2977:
-

User 'JoshRosen' has created a pull request for this issue:
https://github.com/apache/spark/pull/1976

> Fix handling of short shuffle manager names in ShuffleBlockManager
> --
>
> Key: SPARK-2977
> URL: https://issues.apache.org/jira/browse/SPARK-2977
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Josh Rosen
>Assignee: Josh Rosen
>Priority: Critical
>
> Since we allow short names for {{spark.shuffle.manager}}, all code that reads 
> that configuration property should be prepared to handle the short names.
> See my comment at 
> https://github.com/apache/spark/pull/1799#discussion_r16029607 (opening this 
> as a JIRA so we don't forget to fix it).
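A hedged sketch of the kind of normalization the fix needs: translate the
documented short names into fully qualified class names before instantiation,
falling back to the configured value for custom managers. The class names below
are the Spark 1.1 built-ins; the actual patch may structure this differently.

{code}
// Illustrative only: resolve short values of spark.shuffle.manager.
val shortShuffleNames = Map(
  "hash" -> "org.apache.spark.shuffle.hash.HashShuffleManager",
  "sort" -> "org.apache.spark.shuffle.sort.SortShuffleManager")

def resolveShuffleManagerClass(configured: String): String =
  shortShuffleNames.getOrElse(configured.toLowerCase, configured)
{code}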






[jira] [Commented] (SPARK-1476) 2GB limit in spark for blocks

2014-08-15 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099333#comment-14099333
 ] 

Reynold Xin commented on SPARK-1476:


Let's work together to get something for 1.2 or 1.3. 



> 2GB limit in spark for blocks
> -
>
> Key: SPARK-1476
> URL: https://issues.apache.org/jira/browse/SPARK-1476
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
> Environment: all
>Reporter: Mridul Muralidharan
>Assignee: Mridul Muralidharan
>Priority: Critical
> Attachments: 2g_fix_proposal.pdf
>
>
> The underlying abstraction for blocks in Spark is a ByteBuffer, which limits 
> the size of a block to 2GB.
> This has implications not just for managed blocks in use, but also for shuffle 
> blocks (memory-mapped blocks are limited to 2GB even though the API allows for 
> a long), ser/deser via byte-array-backed output streams (SPARK-1391), etc.
> This is a severe limitation for using Spark on non-trivial datasets.






[jira] [Comment Edited] (SPARK-1476) 2GB limit in spark for blocks

2014-08-15 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099333#comment-14099333
 ] 

Reynold Xin edited comment on SPARK-1476 at 8/15/14 11:12 PM:
--

Let's work together to get something for 1.2 or 1.3.  At the very least, I 
would like to have a buffer abstraction layer that can support this in the 
future.


was (Author: rxin):
Let's work together to get something for 1.2 or 1.3. 



> 2GB limit in spark for blocks
> -
>
> Key: SPARK-1476
> URL: https://issues.apache.org/jira/browse/SPARK-1476
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
> Environment: all
>Reporter: Mridul Muralidharan
>Assignee: Mridul Muralidharan
>Priority: Critical
> Attachments: 2g_fix_proposal.pdf
>
>
> The underlying abstraction for blocks in Spark is a ByteBuffer, which limits 
> the size of a block to 2GB.
> This has implications not just for managed blocks in use, but also for shuffle 
> blocks (memory-mapped blocks are limited to 2GB even though the API allows for 
> a long), ser/deser via byte-array-backed output streams (SPARK-1391), etc.
> This is a severe limitation for using Spark on non-trivial datasets.
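To make the limit concrete: a ByteBuffer is indexed by an Int, so a single
buffer cannot address more than 2GB. Below is a hedged sketch of the kind of
buffer abstraction layer mentioned in the comment above, backing one logical
block with several chunks. It is purely illustrative and not Spark's actual API.

{code}
import java.nio.ByteBuffer

// Illustrative chunked buffer: the logical size can exceed 2GB even though
// each backing ByteBuffer is limited to Int-sized capacity.
class ChunkedBuffer(chunks: Array[ByteBuffer]) {
  def size: Long = chunks.map(_.remaining().toLong).sum

  def get(pos: Long): Byte = {
    var offset = pos
    var i = 0
    while (offset >= chunks(i).remaining()) {   // find the chunk containing pos
      offset -= chunks(i).remaining()
      i += 1
    }
    chunks(i).get(chunks(i).position() + offset.toInt)
  }
}
{code}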






[jira] [Commented] (SPARK-2944) sc.makeRDD doesn't distribute partitions evenly

2014-08-15 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2944?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099317#comment-14099317
 ] 

Xiangrui Meng commented on SPARK-2944:
--

I changed the priority to Major because I couldn't reproduce the bug in a 
deterministic way, nor could I verify whether this is an issue introduced after 
v1.0. It seems to happen only when each task is very small.

> sc.makeRDD doesn't distribute partitions evenly
> ---
>
> Key: SPARK-2944
> URL: https://issues.apache.org/jira/browse/SPARK-2944
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> 16 nodes EC2 cluster:
> {code}
> val rdd = sc.makeRDD(0 until 1e9.toInt, 1000).cache()
> rdd.count()
> {code}
> Saw 156 partitions on one node while only 8 partitions on another.






[jira] [Updated] (SPARK-2944) sc.makeRDD doesn't distribute partitions evenly

2014-08-15 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-2944:
-

Priority: Major  (was: Blocker)

> sc.makeRDD doesn't distribute partitions evenly
> ---
>
> Key: SPARK-2944
> URL: https://issues.apache.org/jira/browse/SPARK-2944
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> 16 nodes EC2 cluster:
> {code}
> val rdd = sc.makeRDD(0 until 1e9.toInt, 1000).cache()
> rdd.count()
> {code}
> Saw 156 partitions on one node while only 8 partitions on another.






[jira] [Created] (SPARK-3079) Hive build should depend on parquet serdes

2014-08-15 Thread Patrick Wendell (JIRA)
Patrick Wendell created SPARK-3079:
--

 Summary: Hive build should depend on parquet serdes
 Key: SPARK-3079
 URL: https://issues.apache.org/jira/browse/SPARK-3079
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Patrick Wendell
Assignee: Patrick Wendell


This will allow people to read Parquet Hive tables out of the box. Also, I 
think there are no transitive dependencies to worry about (I need to audit 
this).






[jira] [Commented] (SPARK-3042) DecisionTree filtering is very inefficient

2014-08-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099303#comment-14099303
 ] 

Apache Spark commented on SPARK-3042:
-

User 'jkbradley' has created a pull request for this issue:
https://github.com/apache/spark/pull/1975

> DecisionTree filtering is very inefficient
> --
>
> Key: SPARK-3042
> URL: https://issues.apache.org/jira/browse/SPARK-3042
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> DecisionTree needs to match each example to a node at each iteration.  It 
> currently does this with a set of filters very inefficiently: For each 
> example, it examines each node at the current level and traces up to the root 
> to see if that example should be handled by that node.
> Proposed fix: Filter top-down using the partly built tree itself.
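A hedged sketch of the proposed top-down assignment: instead of testing an
example against every node at the current level and tracing up to the root,
walk the partially built tree from the root and follow the split at each
internal node. The node type here is a hypothetical stand-in, not MLlib's
internal Node/Split classes.

{code}
// Illustrative only: assign an example to the deepest existing node.
case class SimpleNode(
    featureIndex: Int,
    threshold: Double,
    left: Option[SimpleNode],
    right: Option[SimpleNode])

def assignToNode(root: SimpleNode, features: Array[Double]): SimpleNode = {
  // Follow the split: go left if the feature value is at or below the threshold.
  def child(n: SimpleNode): Option[SimpleNode] =
    if (features(n.featureIndex) <= n.threshold) n.left else n.right

  var node = root
  var next = child(node)
  while (next.isDefined) {
    node = next.get
    next = child(node)
  }
  node
}
{code}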






[jira] [Updated] (SPARK-2406) Partitioned Parquet Support

2014-08-15 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2406:


Priority: Blocker  (was: Critical)

> Partitioned Parquet Support
> ---
>
> Key: SPARK-2406
> URL: https://issues.apache.org/jira/browse/SPARK-2406
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Priority: Blocker
>







[jira] [Commented] (SPARK-3076) Gracefully report build timeouts in Jenkins

2014-08-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099289#comment-14099289
 ] 

Apache Spark commented on SPARK-3076:
-

User 'nchammas' has created a pull request for this issue:
https://github.com/apache/spark/pull/1974

> Gracefully report build timeouts in Jenkins
> ---
>
> Key: SPARK-3076
> URL: https://issues.apache.org/jira/browse/SPARK-3076
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Copy of dev list thread:
> {quote}
> Jenkins runs for this PR https://github.com/apache/spark/pull/1960 timed out 
> without notification. The relevant Jenkins logs are at
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18588/consoleFull
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18592/consoleFull
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18597/consoleFull
> On Fri, Aug 15, 2014 at 11:44 AM, Nicholas Chammas 
>  wrote:
> Shivaram,
> Can you point us to an example of that happening? The Jenkins console output, 
> that is.
> Nick
> On Fri, Aug 15, 2014 at 2:28 PM, Shivaram Venkataraman 
>  wrote:
> Also I think Jenkins doesn't post build timeouts to github. Is there anyway
> we can fix that ?
> On Aug 15, 2014 9:04 AM, "Patrick Wendell"  wrote:
> > Hi All,
> >
> > I noticed that all PR tests run overnight had failed due to timeouts. The
> > patch that updates the netty shuffle I believe somehow inflated to the
> > build time significantly. That patch had been tested, but one change was
> > made before it was merged that was not tested.
> >
> > I've reverted the patch for now to see if it brings the build times back
> > down.
> >
> > - Patrick
> >
> {quote}






[jira] [Updated] (SPARK-2406) Partitioned Parquet Support

2014-08-15 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2406:


Target Version/s: 1.1.0  (was: 1.2.0)

> Partitioned Parquet Support
> ---
>
> Key: SPARK-2406
> URL: https://issues.apache.org/jira/browse/SPARK-2406
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Michael Armbrust
>Priority: Critical
>







[jira] [Commented] (SPARK-2916) [MLlib] While running regression tests with dense vectors of length greater than 1000, the treeAggregate blows up after several iterations

2014-08-15 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099274#comment-14099274
 ] 

Patrick Wendell commented on SPARK-2916:


Just to document for posterity - this was narrowed down and is just a symptom 
of SPARK-3015.

> [MLlib] While running regression tests with dense vectors of length greater 
> than 1000, the treeAggregate blows up after several iterations
> --
>
> Key: SPARK-2916
> URL: https://issues.apache.org/jira/browse/SPARK-2916
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, Spark Core
>Reporter: Burak Yavuz
>Priority: Blocker
>
> While running any of the regression algorithms with gradient descent, the 
> treeAggregate blows up after several iterations.
> Observed on an EC2 cluster with 16 nodes and matrix dimensions of 1,000,000 x 5,000.
> To replicate the problem, call aggregate multiple times, perhaps 50-60 times.
> Testing led to a possible workaround: setting 
> `spark.cleaner.referenceTracking false`
> seems to help, so the problem is most probably related to the cleanup.






[jira] [Commented] (SPARK-2883) Spark Support for ORCFile format

2014-08-15 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099275#comment-14099275
 ] 

Zhan Zhang commented on SPARK-2883:
---

Spark with Hive 0.12 can operate on ORC tables through spark-hive, but it 
cannot operate on ORC files through RDDs because OrcStruct does not expose its 
API. Hive 0.13 is fine, but Spark does not currently support Hive 0.13. 
Eventually, we want ORC file support to provide the same level of functionality 
as Parquet.

> Spark Support for ORCFile format
> 
>
> Key: SPARK-2883
> URL: https://issues.apache.org/jira/browse/SPARK-2883
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output, SQL
>Reporter: Zhan Zhang
>
> Verify the support of OrcInputFormat in spark, fix issues if exists and add 
> documentation of its usage.






[jira] [Updated] (SPARK-3025) Allow JDBC clients to set a fair scheduler pool

2014-08-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3025:
---

Priority: Blocker  (was: Major)

> Allow JDBC clients to set a fair scheduler pool
> ---
>
> Key: SPARK-3025
> URL: https://issues.apache.org/jira/browse/SPARK-3025
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
>Priority: Blocker
>







[jira] [Commented] (SPARK-3078) Make LRWithLBFGS API consistent with others

2014-08-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099271#comment-14099271
 ] 

Apache Spark commented on SPARK-3078:
-

User 'mengxr' has created a pull request for this issue:
https://github.com/apache/spark/pull/1973

> Make LRWithLBFGS API consistent with others
> ---
>
> Key: SPARK-3078
> URL: https://issues.apache.org/jira/browse/SPARK-3078
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> Should ask users to use optimizer to set parameters.






[jira] [Created] (SPARK-3078) Make LRWithLBFGS API consistent with others

2014-08-15 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-3078:


 Summary: Make LRWithLBFGS API consistent with others
 Key: SPARK-3078
 URL: https://issues.apache.org/jira/browse/SPARK-3078
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Affects Versions: 1.1.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


Should ask users to use optimizer to set parameters.






[jira] [Commented] (SPARK-2546) Configuration object thread safety issue

2014-08-15 Thread Andrew Ash (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099251#comment-14099251
 ] 

Andrew Ash commented on SPARK-2546:
---

OK, I'll stay on the lookout for this bug and ping here again if we observe 
it. Luckily we haven't seen this particular issue since, but that's mostly 
because other things have been causing problems.

We are now hitting a few nondeterministic Spark bugs that cause jobs to fail 
or hang, but if we retry the job several times (and spark.speculation helps 
somewhat) we can usually get a job to complete eventually. I can share the 
list of what's highest on our minds right now if you're interested.





> Configuration object thread safety issue
> 
>
> Key: SPARK-2546
> URL: https://issues.apache.org/jira/browse/SPARK-2546
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.1
>Reporter: Andrew Ash
>Assignee: Josh Rosen
>Priority: Critical
>
> // observed in 0.9.1 but expected to exist in 1.0.1 as well
> This ticket is copy-pasted from a thread on the dev@ list:
> {quote}
> We discovered a very interesting bug in Spark at work last week in Spark 
> 0.9.1 — that the way Spark uses the Hadoop Configuration object is prone to 
> thread safety issues.  I believe it still applies in Spark 1.0.1 as well.  
> Let me explain:
> Observations
>  - Was running a relatively simple job (read from Avro files, do a map, do 
> another map, write back to Avro files)
>  - 412 of 413 tasks completed, but the last task was hung in RUNNING state
>  - The 412 successful tasks completed in median time 3.4s
>  - The last hung task didn't finish even in 20 hours
>  - The executor with the hung task was responsible for 100% of one core of 
> CPU usage
>  - Jstack of the executor attached (relevant thread pasted below)
> Diagnosis
> After doing some code spelunking, we determined the issue was concurrent use 
> of a Configuration object for each task on an executor.  In Hadoop each task 
> runs in its own JVM, but in Spark multiple tasks can run in the same JVM, so 
> the single-threaded access assumptions of the Configuration object no longer 
> hold in Spark.
> The specific issue is that the AvroRecordReader actually _modifies_ the 
> JobConf it's given when it's instantiated!  It adds a key for the RPC 
> protocol engine in the process of connecting to the Hadoop FileSystem.  When 
> many tasks start at the same time (like at the start of a job), many tasks 
> are adding this configuration item to the one Configuration object at once.  
> Internally Configuration uses a java.util.HashMap, which isn't threadsafe… 
> The below post is an excellent explanation of what happens in the situation 
> where multiple threads insert into a HashMap at the same time.
> http://mailinator.blogspot.com/2009/06/beautiful-race-condition.html
> The gist is that you have a thread following a cycle of linked list nodes 
> indefinitely.  This exactly matches our observations of the 100% CPU core and 
> also the final location in the stack trace.
> So it seems the way Spark shares a Configuration object between task threads 
> in an executor is incorrect.  We need some way to prevent concurrent access 
> to a single Configuration object.
> Proposed fix
> We can clone the JobConf object in HadoopRDD.getJobConf() so each task gets 
> its own JobConf object (and thus Configuration object).  The optimization of 
> broadcasting the Configuration object across the cluster can remain, but on 
> the other side I think it needs to be cloned for each task to allow for 
> concurrent access.  I'm not sure of the performance implications, but the 
> comments suggest that the Configuration object is ~10KB so I would expect a 
> clone on the object to be relatively speedy.
> Has this been observed before?  Does my suggested fix make sense?  I'd be 
> happy to file a Jira ticket and continue discussion there for the right way 
> to fix.
> Thanks!
> Andrew
> P.S.  For others seeing this issue, our temporary workaround is to enable 
> spark.speculation, which retries failed (or hung) tasks on other machines.
> {noformat}
> "Executor task launch worker-6" daemon prio=10 tid=0x7f91f01fe000 
> nid=0x54b1 runnable [0x7f92d74f1000]
>java.lang.Thread.State: RUNNABLE
> at java.util.HashMap.transfer(HashMap.java:601)
> at java.util.HashMap.resize(HashMap.java:581)
> at java.util.HashMap.addEntry(HashMap.java:879)
> at java.util.HashMap.put(HashMap.java:505)
> at org.apache.hadoop.conf.Configuration.set(Configuration.java:803)
> at org.apache.hadoop.conf.Configuration.set(Configuration.java:783)
> at org.apache.hadoop.conf.Configuration.setClass(Configuration.java:16

[jira] [Created] (SPARK-3077) ChiSqTest bugs

2014-08-15 Thread Doris Xin (JIRA)
Doris Xin created SPARK-3077:


 Summary: ChiSqTest bugs
 Key: SPARK-3077
 URL: https://issues.apache.org/jira/browse/SPARK-3077
 Project: Spark
  Issue Type: Bug
  Components: MLlib
Reporter: Doris Xin


- Promote the nullHypothesis field in ChiSqTestResult to TestResult; every test 
should have a null hypothesis.
- Correct the null hypothesis statement for the independence test.
- Line 59 in TestResult: 0.05 -> 0.5



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2914) spark.*.extraJavaOptions are evaluated too many times

2014-08-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2914:
---

Priority: Critical  (was: Blocker)

> spark.*.extraJavaOptions are evaluated too many times
> -
>
> Key: SPARK-2914
> URL: https://issues.apache.org/jira/browse/SPARK-2914
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2
>Reporter: Andrew Or
>Priority: Critical
> Fix For: 1.1.0
>
>
> If we pass the following to spark.executor.extraJavaOptions,
> {code}
> -Dthem.quotes="the \"best\" joke ever" -Dthem.backslashes=" \\ \\  "
> {code}
> These will first be escaped once when the SparkSubmit JVM is launched. This 
> becomes the following string.
> {code}
> scala> sc.getConf.get("spark.driver.extraJavaOptions")
> res0: String = -Dthem.quotes="the "best" joke ever" -Dthem.backslashes=" \ \ 
> \\ "
> {code}
> This will be split incorrectly by Utils.splitCommandString.
> Of course, this also affects spark.driver.extraJavaOptions.
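
For illustration only, here is a minimal sketch of a quote-aware splitter (this is not 
Spark's Utils.splitCommandString) showing why a value whose inner quotes have already 
been consumed by one round of escaping can no longer be tokenized as intended:

{code}
// Hypothetical sketch, not Spark's Utils.splitCommandString: a naive
// quote-aware splitter that treats every double quote as a delimiter toggle.
object SplitSketch {
  def split(s: String): Seq[String] = {
    val tokens = scala.collection.mutable.ArrayBuffer.empty[String]
    val cur = new StringBuilder
    var inQuote = false
    for (c <- s) c match {
      case '"' => inQuote = !inQuote                 // toggles on *every* quote
      case ' ' if !inQuote =>
        if (cur.nonEmpty) { tokens += cur.toString; cur.clear() }
      case other => cur += other
    }
    if (cur.nonEmpty) tokens += cur.toString
    tokens.toSeq
  }

  def main(args: Array[String]): Unit = {
    // The value after its backslash escapes have already been consumed once:
    val once = "-Dthem.quotes=\"the \"best\" joke ever\""
    // The inner quotes now toggle the quoted state, so the parsed option
    // comes back as -Dthem.quotes=the best joke ever -- the quotes the user
    // wanted inside the property value are lost.
    split(once).foreach(println)
  }
}
{code}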



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2914) spark.*.extraJavaOptions are evaluated too many times

2014-08-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2914:
---

Priority: Blocker  (was: Major)

> spark.*.extraJavaOptions are evaluated too many times
> -
>
> Key: SPARK-2914
> URL: https://issues.apache.org/jira/browse/SPARK-2914
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2
>Reporter: Andrew Or
>Priority: Blocker
> Fix For: 1.1.0
>
>
> If we pass the following to spark.executor.extraJavaOptions,
> {code}
> -Dthem.quotes="the \"best\" joke ever" -Dthem.backslashes=" \\ \\  "
> {code}
> These will first be escaped once when the SparkSubmit JVM is launched. This 
> becomes the following string.
> {code}
> scala> sc.getConf.get("spark.driver.extraJavaOptions")
> res0: String = -Dthem.quotes="the "best" joke ever" -Dthem.backslashes=" \ \ 
> \\ "
> {code}
> This will be split incorrectly by Utils.splitCommandString.
> Of course, this also affects spark.driver.extraJavaOptions.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2546) Configuration object thread safety issue

2014-08-15 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099239#comment-14099239
 ] 

Patrick Wendell commented on SPARK-2546:


Hey Andrew, I think that because we cut SPARK-2585 from this release, it will 
remain broken in Spark 1.1. We could look into a solution based on clone()'ing 
the conf for future patch releases in the 1.1 branch.
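
For reference, a minimal sketch, assuming a hypothetical helper rather than the 
actual HadoopRDD.getJobConf code, of what clone()'ing the conf per task could 
look like:

{code}
import org.apache.hadoop.mapred.JobConf

// Hypothetical sketch (not the actual HadoopRDD.getJobConf): give every task
// its own JobConf copy so concurrent tasks never mutate a shared Configuration.
object ConfCloneSketch {
  def jobConfForTask(broadcastConf: JobConf): JobConf =
    broadcastConf.synchronized {
      // JobConf(Configuration) copies the entries, so per-task mutation
      // (e.g. by AvroRecordReader) no longer races with other tasks.
      new JobConf(broadcastConf)
    }
}
{code}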

> Configuration object thread safety issue
> 
>
> Key: SPARK-2546
> URL: https://issues.apache.org/jira/browse/SPARK-2546
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.1
>Reporter: Andrew Ash
>Assignee: Josh Rosen
>Priority: Critical
>
> // observed in 0.9.1 but expected to exist in 1.0.1 as well
> This ticket is copy-pasted from a thread on the dev@ list:
> {quote}
> We discovered a very interesting bug in Spark at work last week in Spark 
> 0.9.1 — that the way Spark uses the Hadoop Configuration object is prone to 
> thread safety issues.  I believe it still applies in Spark 1.0.1 as well.  
> Let me explain:
> Observations
>  - Was running a relatively simple job (read from Avro files, do a map, do 
> another map, write back to Avro files)
>  - 412 of 413 tasks completed, but the last task was hung in RUNNING state
>  - The 412 successful tasks completed in median time 3.4s
>  - The last hung task didn't finish even in 20 hours
>  - The executor with the hung task was responsible for 100% of one core of 
> CPU usage
>  - Jstack of the executor attached (relevant thread pasted below)
> Diagnosis
> After doing some code spelunking, we determined the issue was concurrent use 
> of a Configuration object for each task on an executor.  In Hadoop each task 
> runs in its own JVM, but in Spark multiple tasks can run in the same JVM, so 
> the single-threaded access assumptions of the Configuration object no longer 
> hold in Spark.
> The specific issue is that the AvroRecordReader actually _modifies_ the 
> JobConf it's given when it's instantiated!  It adds a key for the RPC 
> protocol engine in the process of connecting to the Hadoop FileSystem.  When 
> many tasks start at the same time (like at the start of a job), many tasks 
> are adding this configuration item to the one Configuration object at once.  
> Internally Configuration uses a java.util.HashMap, which isn't threadsafe… 
> The below post is an excellent explanation of what happens in the situation 
> where multiple threads insert into a HashMap at the same time.
> http://mailinator.blogspot.com/2009/06/beautiful-race-condition.html
> The gist is that you have a thread following a cycle of linked list nodes 
> indefinitely.  This exactly matches our observations of the 100% CPU core and 
> also the final location in the stack trace.
> So it seems the way Spark shares a Configuration object between task threads 
> in an executor is incorrect.  We need some way to prevent concurrent access 
> to a single Configuration object.
> Proposed fix
> We can clone the JobConf object in HadoopRDD.getJobConf() so each task gets 
> its own JobConf object (and thus Configuration object).  The optimization of 
> broadcasting the Configuration object across the cluster can remain, but on 
> the other side I think it needs to be cloned for each task to allow for 
> concurrent access.  I'm not sure of the performance implications, but the 
> comments suggest that the Configuration object is ~10KB so I would expect a 
> clone on the object to be relatively speedy.
> Has this been observed before?  Does my suggested fix make sense?  I'd be 
> happy to file a Jira ticket and continue discussion there for the right way 
> to fix.
> Thanks!
> Andrew
> P.S.  For others seeing this issue, our temporary workaround is to enable 
> spark.speculation, which retries failed (or hung) tasks on other machines.
> {noformat}
> "Executor task launch worker-6" daemon prio=10 tid=0x7f91f01fe000 
> nid=0x54b1 runnable [0x7f92d74f1000]
>java.lang.Thread.State: RUNNABLE
> at java.util.HashMap.transfer(HashMap.java:601)
> at java.util.HashMap.resize(HashMap.java:581)
> at java.util.HashMap.addEntry(HashMap.java:879)
> at java.util.HashMap.put(HashMap.java:505)
> at org.apache.hadoop.conf.Configuration.set(Configuration.java:803)
> at org.apache.hadoop.conf.Configuration.set(Configuration.java:783)
> at org.apache.hadoop.conf.Configuration.setClass(Configuration.java:1662)
> at org.apache.hadoop.ipc.RPC.setProtocolEngine(RPC.java:193)
> at 
> org.apache.hadoop.hdfs.NameNodeProxies.createNNProxyWithClientProtocol(NameNodeProxies.java:343)
> at 
> org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:168)
> at 
> org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.j

[jira] [Commented] (SPARK-2585) Remove special handling of Hadoop JobConf

2014-08-15 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099234#comment-14099234
 ] 

Patrick Wendell commented on SPARK-2585:


Unfortunately after a lot of effort we still can't get the test times down on 
this one and it's still unclear whether it will cause performance regressions.

Since this isn't particularly critical from a user perspective (it's mostly 
about simplifying internals) I think it's best to punt this to 1.2. One 
unfortunate thing is that it means SPARK-2546 will remain broken in 1.1.

> Remove special handling of Hadoop JobConf
> -
>
> Key: SPARK-2585
> URL: https://issues.apache.org/jira/browse/SPARK-2585
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Josh Rosen
>Priority: Critical
>
> This is a follow up to SPARK-2521 and should close SPARK-2546 (provided the 
> implementation does not use shared conf objects). We no longer need to 
> specially broadcast the Hadoop configuration since we are broadcasting RDD 
> data anyways.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2546) Configuration object thread safety issue

2014-08-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2546:
---

Target Version/s: 1.2.0  (was: 1.1.0)

> Configuration object thread safety issue
> 
>
> Key: SPARK-2546
> URL: https://issues.apache.org/jira/browse/SPARK-2546
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.1
>Reporter: Andrew Ash
>Assignee: Josh Rosen
>Priority: Critical
>
> // observed in 0.9.1 but expected to exist in 1.0.1 as well
> This ticket is copy-pasted from a thread on the dev@ list:
> {quote}
> We discovered a very interesting bug in Spark at work last week in Spark 
> 0.9.1 — that the way Spark uses the Hadoop Configuration object is prone to 
> thread safety issues.  I believe it still applies in Spark 1.0.1 as well.  
> Let me explain:
> Observations
>  - Was running a relatively simple job (read from Avro files, do a map, do 
> another map, write back to Avro files)
>  - 412 of 413 tasks completed, but the last task was hung in RUNNING state
>  - The 412 successful tasks completed in median time 3.4s
>  - The last hung task didn't finish even in 20 hours
>  - The executor with the hung task was responsible for 100% of one core of 
> CPU usage
>  - Jstack of the executor attached (relevant thread pasted below)
> Diagnosis
> After doing some code spelunking, we determined the issue was concurrent use 
> of a Configuration object for each task on an executor.  In Hadoop each task 
> runs in its own JVM, but in Spark multiple tasks can run in the same JVM, so 
> the single-threaded access assumptions of the Configuration object no longer 
> hold in Spark.
> The specific issue is that the AvroRecordReader actually _modifies_ the 
> JobConf it's given when it's instantiated!  It adds a key for the RPC 
> protocol engine in the process of connecting to the Hadoop FileSystem.  When 
> many tasks start at the same time (like at the start of a job), many tasks 
> are adding this configuration item to the one Configuration object at once.  
> Internally Configuration uses a java.util.HashMap, which isn't threadsafe… 
> The below post is an excellent explanation of what happens in the situation 
> where multiple threads insert into a HashMap at the same time.
> http://mailinator.blogspot.com/2009/06/beautiful-race-condition.html
> The gist is that you have a thread following a cycle of linked list nodes 
> indefinitely.  This exactly matches our observations of the 100% CPU core and 
> also the final location in the stack trace.
> So it seems the way Spark shares a Configuration object between task threads 
> in an executor is incorrect.  We need some way to prevent concurrent access 
> to a single Configuration object.
> Proposed fix
> We can clone the JobConf object in HadoopRDD.getJobConf() so each task gets 
> its own JobConf object (and thus Configuration object).  The optimization of 
> broadcasting the Configuration object across the cluster can remain, but on 
> the other side I think it needs to be cloned for each task to allow for 
> concurrent access.  I'm not sure of the performance implications, but the 
> comments suggest that the Configuration object is ~10KB so I would expect a 
> clone on the object to be relatively speedy.
> Has this been observed before?  Does my suggested fix make sense?  I'd be 
> happy to file a Jira ticket and continue discussion there for the right way 
> to fix.
> Thanks!
> Andrew
> P.S.  For others seeing this issue, our temporary workaround is to enable 
> spark.speculation, which retries failed (or hung) tasks on other machines.
> {noformat}
> "Executor task launch worker-6" daemon prio=10 tid=0x7f91f01fe000 
> nid=0x54b1 runnable [0x7f92d74f1000]
>java.lang.Thread.State: RUNNABLE
> at java.util.HashMap.transfer(HashMap.java:601)
> at java.util.HashMap.resize(HashMap.java:581)
> at java.util.HashMap.addEntry(HashMap.java:879)
> at java.util.HashMap.put(HashMap.java:505)
> at org.apache.hadoop.conf.Configuration.set(Configuration.java:803)
> at org.apache.hadoop.conf.Configuration.set(Configuration.java:783)
> at org.apache.hadoop.conf.Configuration.setClass(Configuration.java:1662)
> at org.apache.hadoop.ipc.RPC.setProtocolEngine(RPC.java:193)
> at 
> org.apache.hadoop.hdfs.NameNodeProxies.createNNProxyWithClientProtocol(NameNodeProxies.java:343)
> at 
> org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:168)
> at 
> org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:129)
> at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:436)
> at org.apache.hadoop.hdfs.DFSClient.(DFSClient.java:403)
> at 
> org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem

[jira] [Updated] (SPARK-2585) Remove special handling of Hadoop JobConf

2014-08-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2585:
---

Target Version/s: 1.2.0  (was: 1.1.0)

> Remove special handling of Hadoop JobConf
> -
>
> Key: SPARK-2585
> URL: https://issues.apache.org/jira/browse/SPARK-2585
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Josh Rosen
>Priority: Critical
>
> This is a follow up to SPARK-2521 and should close SPARK-2546 (provided the 
> implementation does not use shared conf objects). We no longer need to 
> specially broadcast the Hadoop configuration since we are broadcasting RDD 
> data anyways.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3041) DecisionTree: isSampleValid indexing incorrect

2014-08-15 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-3041.
--

   Resolution: Fixed
Fix Version/s: 1.1.0

> DecisionTree: isSampleValid indexing incorrect
> --
>
> Key: SPARK-3041
> URL: https://issues.apache.org/jira/browse/SPARK-3041
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 1.1.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
> Fix For: 1.1.0
>
>
> In DecisionTree, isSampleValid treats unordered categorical features 
> incorrectly: it treats the bins as if they were indexed by feature values, 
> rather than by subsets of values/categories.
> This bug is exhibited for unordered features (multi-class classification with 
> categorical features of low arity).
> Proposed fix: Index bins correctly for unordered categorical features.
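
As a rough illustration only (hypothetical types, not the MLlib code), the 
difference between indexing by raw feature value and indexing by category 
subset:

{code}
// Hypothetical sketch: bins for an unordered categorical feature correspond
// to subsets of categories, so membership must be tested per bin rather than
// using the raw feature value as a bin index.
object UnorderedBinsSketch {
  final case class Bin(categories: Set[Int])

  def sampleInBin(featureValue: Int, bin: Bin): Boolean =
    bin.categories.contains(featureValue)

  def main(args: Array[String]): Unit = {
    // Feature with categories {0, 1, 2}; candidate bins are category subsets.
    val bins = Seq(Bin(Set(0)), Bin(Set(1)), Bin(Set(0, 1)))
    val featureValue = 2
    val buggy = bins(featureValue)                           // raw value as index: wrong bin
    val correct = bins.filter(sampleInBin(featureValue, _))  // subset membership
    println(s"buggy pick: $buggy, correct matches: $correct")
  }
}
{code}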



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3022) FindBinsForLevel in decision tree should call findBin only once for each feature

2014-08-15 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3022:
-

Target Version/s: 1.1.0  (was: 1.0.2)

> FindBinsForLevel in decision tree should call findBin only once for each 
> feature
> 
>
> Key: SPARK-3022
> URL: https://issues.apache.org/jira/browse/SPARK-3022
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.0.2
>Reporter: Qiping Li
>Assignee: Qiping Li
> Fix For: 1.1.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> `findBinsForLevel` is applied to every `LabeledPoint` to find bins for all 
> nodes at a given level. Given a specific `LabeledPoint` and a specific 
> feature, the bin for this labeled point should always be the same. But in the 
> current implementation, `findBin` is called on a (labeledpoint, feature) pair 
> for every node at a given level, which is a waste of computation. I propose 
> to call `findBin` only once; if a `LabeledPoint` is valid on a node, the 
> result can be reused.
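
A minimal sketch of the proposed reuse, with hypothetical names (this is not 
the MLlib code):

{code}
// Hypothetical sketch: compute the bin of each feature once per point and
// reuse that array for every node at the level, instead of calling findBin
// once per (node, feature).
object FindBinOnceSketch {
  def binsForLevel(point: Array[Double], numNodesAtLevel: Int,
                   findBin: (Int, Double) => Int): Array[Array[Int]] = {
    // One findBin call per feature for this labeled point...
    val binPerFeature = Array.tabulate(point.length)(f => findBin(f, point(f)))
    // ...shared (read-only) by all nodes at the current level.
    Array.fill(numNodesAtLevel)(binPerFeature)
  }
}
{code}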



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3022) FindBinsForLevel in decision tree should call findBin only once for each feature

2014-08-15 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-3022:
-

Assignee: Qiping Li

> FindBinsForLevel in decision tree should call findBin only once for each 
> feature
> 
>
> Key: SPARK-3022
> URL: https://issues.apache.org/jira/browse/SPARK-3022
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.0.2
>Reporter: Qiping Li
>Assignee: Qiping Li
> Fix For: 1.1.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> `findBinsForLevel` is applied to every `LabeledPoint` to find bins for all 
> nodes at a given level. Given a specific `LabeledPoint` and a specific 
> feature, the bin for this labeled point should always be the same. But in the 
> current implementation, `findBin` is called on a (labeledpoint, feature) pair 
> for every node at a given level, which is a waste of computation. I propose 
> to call `findBin` only once; if a `LabeledPoint` is valid on a node, the 
> result can be reused.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3022) FindBinsForLevel in decision tree should call findBin only once for each feature

2014-08-15 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-3022.
--

   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 1950
[https://github.com/apache/spark/pull/1950]

> FindBinsForLevel in decision tree should call findBin only once for each 
> feature
> 
>
> Key: SPARK-3022
> URL: https://issues.apache.org/jira/browse/SPARK-3022
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.0.2
>Reporter: Qiping Li
> Fix For: 1.1.0
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> `findBinsForLevel` is applied to every `LabeledPoint` to find bins for all 
> nodes at a given level. Given a specific `LabeledPoint` and a specific 
> feature, the bin for this labeled point should always be the same. But in the 
> current implementation, `findBin` is called on a (labeledpoint, feature) pair 
> for every node at a given level, which is a waste of computation. I propose 
> to call `findBin` only once; if a `LabeledPoint` is valid on a node, the 
> result can be reused.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2044) Pluggable interface for shuffles

2014-08-15 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099136#comment-14099136
 ] 

Patrick Wendell commented on SPARK-2044:


A lot of this has been fixed in 1.1 so I moved target version to 1.2. [~matei] 
we can also close this with fixVersion=1.1.0 if you consider the initial issue 
fixed.

> Pluggable interface for shuffles
> 
>
> Key: SPARK-2044
> URL: https://issues.apache.org/jira/browse/SPARK-2044
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Reporter: Matei Zaharia
>Assignee: Matei Zaharia
> Attachments: Pluggableshuffleproposal.pdf
>
>
> Given that a lot of the current activity in Spark Core is in shuffles, I 
> wanted to propose factoring out shuffle implementations in a way that will 
> make experimentation easier. Ideally we will converge on one implementation, 
> but for a while, this could also be used to have several implementations 
> coexist. I'm suggesting this because I am aware of at least three efforts to 
> look at shuffle (from Yahoo!, Intel and Databricks). Some of the things 
> people are investigating are:
> * Push-based shuffle where data moves directly from mappers to reducers
> * Sorting-based instead of hash-based shuffle, to create fewer files (helps a 
> lot with file handles and memory usage on large shuffles)
> * External spilling within a key
> * Changing the level of parallelism or even algorithm for downstream stages 
> at runtime based on statistics of the map output (this is a thing we had 
> prototyped in the Shark research project but never merged in core)
> I've attached a design doc with a proposed interface. It's not too crazy 
> because the interface between shuffles and the rest of the code is already 
> pretty narrow (just some iterators for reading data and a writer interface 
> for writing it). Bigger changes will be needed in the interaction with 
> DAGScheduler and BlockManager for some of the ideas above, but we can handle 
> those separately, and this interface will allow us to experiment with some 
> short-term stuff sooner.
> If things go well I'd also like to send a sort-based shuffle implementation 
> for 1.1, but we'll see how the timing on that works out.
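
As a sketch only, with hypothetical trait names that are not taken from the 
attached design doc, the kind of narrow reader/writer contract described above 
might look like:

{code}
// Hypothetical sketch of a narrow pluggable shuffle interface: a writer for
// map output and an iterator-based reader for reduce input.
trait ShuffleWriterSketch[K, V] {
  def write(records: Iterator[(K, V)]): Unit
  def stop(success: Boolean): Unit
}

trait ShuffleReaderSketch[K, C] {
  def read(): Iterator[(K, C)]
}

trait ShuffleManagerSketch {
  def getWriter[K, V](shuffleId: Int, mapId: Int): ShuffleWriterSketch[K, V]
  def getReader[K, C](shuffleId: Int, startPartition: Int,
                      endPartition: Int): ShuffleReaderSketch[K, C]
}
{code}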



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3010) fix redundant conditional

2014-08-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3010.


Resolution: Won't Fix

PR was closed by the user.

> fix redundant conditional
> -
>
> Key: SPARK-3010
> URL: https://issues.apache.org/jira/browse/SPARK-3010
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.2
>Reporter: wangfei
> Fix For: 1.1.0
>
>
> there are some redundant conditionals in Spark, such as: 
> 1.
> private[spark] def codegenEnabled: Boolean =
>   if (getConf(CODEGEN_ENABLED, "false") == "true") true else false
> 2.
> x => if (x == 2) true else false
> ... etc
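
For example (with getConf and CODEGEN_ENABLED stubbed in, since only the quoted 
snippets are given), the boolean expression can be used directly:

{code}
// Sketch of the simplification: a boolean expression is already the result,
// so wrapping it in if/else is redundant. getConf and CODEGEN_ENABLED are
// stand-ins for the identifiers quoted above.
object RedundantConditionalSketch {
  val CODEGEN_ENABLED = "spark.sql.codegen"                    // illustrative key
  def getConf(key: String, default: String): String = default  // stub

  // Before: if (getConf(CODEGEN_ENABLED, "false") == "true") true else false
  val codegenEnabled: Boolean = getConf(CODEGEN_ENABLED, "false") == "true"

  // Before: x => if (x == 2) true else false
  val isTwo: Int => Boolean = x => x == 2
}
{code}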



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2044) Pluggable interface for shuffles

2014-08-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2044:
---

Target Version/s: 1.2.0  (was: 1.1.0)

> Pluggable interface for shuffles
> 
>
> Key: SPARK-2044
> URL: https://issues.apache.org/jira/browse/SPARK-2044
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Reporter: Matei Zaharia
>Assignee: Matei Zaharia
> Attachments: Pluggableshuffleproposal.pdf
>
>
> Given that a lot of the current activity in Spark Core is in shuffles, I 
> wanted to propose factoring out shuffle implementations in a way that will 
> make experimentation easier. Ideally we will converge on one implementation, 
> but for a while, this could also be used to have several implementations 
> coexist. I'm suggesting this because I am aware of at least three efforts to 
> look at shuffle (from Yahoo!, Intel and Databricks). Some of the things 
> people are investigating are:
> * Push-based shuffle where data moves directly from mappers to reducers
> * Sorting-based instead of hash-based shuffle, to create fewer files (helps a 
> lot with file handles and memory usage on large shuffles)
> * External spilling within a key
> * Changing the level of parallelism or even algorithm for downstream stages 
> at runtime based on statistics of the map output (this is a thing we had 
> prototyped in the Shark research project but never merged in core)
> I've attached a design doc with a proposed interface. It's not too crazy 
> because the interface between shuffles and the rest of the code is already 
> pretty narrow (just some iterators for reading data and a writer interface 
> for writing it). Bigger changes will be needed in the interaction with 
> DAGScheduler and BlockManager for some of the ideas above, but we can handle 
> those separately, and this interface will allow us to experiment with some 
> short-term stuff sooner.
> If things go well I'd also like to send a sort-based shuffle implementation 
> for 1.1, but we'll see how the timing on that works out.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2297) Make task attempt and speculation more explicit in UI

2014-08-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2297.


Resolution: Fixed

I believe this was fixed in:
https://github.com/apache/spark/pull/1236

> Make task attempt and speculation more explicit in UI
> -
>
> Key: SPARK-2297
> URL: https://issues.apache.org/jira/browse/SPARK-2297
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 1.0.0
>Reporter: Reynold Xin
>Assignee: Reynold Xin
> Attachments: Screen Shot 2014-06-26 at 1.43.52 PM.png
>
>
> It is fairly hard to tell why a task was launched (was it speculation, or 
> retry for failed tasks).



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2977) Fix handling of short shuffle manager names in ShuffleBlockManager

2014-08-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2977?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2977:
---

Priority: Critical  (was: Major)

> Fix handling of short shuffle manager names in ShuffleBlockManager
> --
>
> Key: SPARK-2977
> URL: https://issues.apache.org/jira/browse/SPARK-2977
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
>Reporter: Josh Rosen
>Priority: Critical
>
> Since we allow short names for {{spark.shuffle.manager}}, all code that reads 
> that configuration property should be prepared to handle the short names.
> See my comment at 
> https://github.com/apache/spark/pull/1799#discussion_r16029607 (opening this 
> as a JIRA so we don't forget to fix it).
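
One possible shape of the fix, shown only as a sketch with a hypothetical 
helper (the alias map mirrors the short names currently accepted):

{code}
// Hypothetical sketch: resolve short spark.shuffle.manager names to the
// fully-qualified class names before any comparison or instantiation.
object ShuffleManagerNames {
  private val aliases = Map(
    "hash" -> "org.apache.spark.shuffle.hash.HashShuffleManager",
    "sort" -> "org.apache.spark.shuffle.sort.SortShuffleManager")

  def resolve(configured: String): String =
    aliases.getOrElse(configured.toLowerCase, configured)
}
{code}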



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2532) Fix issues with consolidated shuffle

2014-08-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-2532:
---

Target Version/s: 1.2.0  (was: 1.1.0)

> Fix issues with consolidated shuffle
> 
>
> Key: SPARK-2532
> URL: https://issues.apache.org/jira/browse/SPARK-2532
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.1.0
> Environment: All
>Reporter: Mridul Muralidharan
>Assignee: Mridul Muralidharan
>Priority: Critical
>
> Will file a PR with the changes as soon as the merge is done (the earlier merge 
> became outdated in 2 weeks, unfortunately :) ).
> Consolidated shuffle is broken in multiple ways in Spark:
> a) Task failure(s) can cause the state to become inconsistent.
> b) Multiple reverts, or a combination of close/revert/close, can cause the state 
> to become inconsistent (as part of exception/error handling).
> c) Some of the APIs in the block writer cause implementation issues - for 
> example, a revert is always followed by a close, but the implementation tries to 
> keep them separate, leaving more surface for errors.
> d) Fetching data from consolidated shuffle files can go badly wrong if the 
> file is being actively written to: the length is computed by subtracting the next 
> offset from the current offset (or the file length if this is the last offset) - 
> the latter fails when a fetch happens in parallel with a write.
> Note that this happens even if there are no task failures of any kind!
> This usually results in stream corruption or decompression errors.
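
A minimal sketch (hypothetical, not the shuffle code) of the offset arithmetic 
in (d), which is only safe once the consolidated file has stopped growing:

{code}
// Hypothetical sketch: segment i's length is derived from its neighbour's
// offset (or the file length for the last segment). If a writer is still
// appending while a fetch reads these values, the computed length is wrong,
// which shows up as stream corruption or decompression errors.
object SegmentLengthSketch {
  def segmentLength(offsets: IndexedSeq[Long], fileLength: Long, i: Int): Long =
    if (i == offsets.length - 1) fileLength - offsets(i)  // last segment
    else offsets(i + 1) - offsets(i)
}
{code}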



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2089) With YARN, preferredNodeLocalityData isn't honored

2014-08-15 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2089?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099115#comment-14099115
 ] 

Mridul Muralidharan commented on SPARK-2089:



In the general case, won't InputFormats have customizations for creation and/or 
initialization before they can be used to get splits (other than file names, I 
mean)?


> With YARN, preferredNodeLocalityData isn't honored 
> ---
>
> Key: SPARK-2089
> URL: https://issues.apache.org/jira/browse/SPARK-2089
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.0.0
>Reporter: Sandy Ryza
>Assignee: Sandy Ryza
>Priority: Critical
>
> When running in YARN cluster mode, apps can pass preferred locality data when 
> constructing a Spark context that will dictate where to request executor 
> containers.
> This is currently broken because of a race condition.  The Spark-YARN code 
> runs the user class and waits for it to start up a SparkContext.  During its 
> initialization, the SparkContext will create a YarnClusterScheduler, which 
> notifies a monitor in the Spark-YARN code that the context is ready.  The 
> Spark-YARN code then immediately fetches the preferredNodeLocationData from 
> the SparkContext and uses it to start requesting containers.
> But in the SparkContext constructor that takes the preferredNodeLocationData, 
> setting preferredNodeLocationData comes after the rest of the initialization, 
> so, if the Spark-YARN code comes around quickly enough after being notified, 
> the data that's fetched is the empty unset version.  This occurred during all 
> of my runs.
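
The ordering problem can be sketched as follows (hypothetical names, not the 
Spark/YARN code): the reader is released before the field is assigned, so it 
can observe the empty default.

{code}
// Hypothetical sketch of the initialization-order race described above.
object PreferredLocalityRaceSketch {
  @volatile var preferredNodeLocationData: Map[String, Set[String]] = Map.empty

  def main(args: Array[String]): Unit = {
    val reader = new Thread(new Runnable {
      // Plays the role of the Spark-YARN code reacting to the notification:
      // it may read the field before it has been assigned.
      def run(): Unit = println(s"observed: $preferredNodeLocationData")
    })
    reader.start()              // "notify" before the field is actually set
    Thread.sleep(5)             // stand-in for the rest of the initialization
    preferredNodeLocationData = Map("host1" -> Set("rack1"))
    reader.join()
  }
}
{code}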



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3075) Expose a way for users to parse event logs

2014-08-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3075:
---

Target Version/s: 1.2.0
   Fix Version/s: (was: 1.2.0)

> Expose a way for users to parse event logs
> --
>
> Key: SPARK-3075
> URL: https://issues.apache.org/jira/browse/SPARK-3075
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.2
>Reporter: Andrew Or
>
> Both ReplayListenerBus and util.JsonProtocol are private[spark], so if the user 
> wants to parse the event logs themselves for analytics, they will have to 
> write their own JSON deserializers (or do some crazy reflection to access 
> these methods). We should expose an easy way for them to do this.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2858) Default log4j configuration no longer seems to work

2014-08-15 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14098148#comment-14098148
 ] 

Patrick Wendell edited comment on SPARK-2858 at 8/15/14 8:48 PM:
-

What I mean is that when I don't include any log4j.properties file, Spark fails 
to use its own default configuration.


was (Author: pwendell):
What I mean is that when I don't include any log4j.properties file, Spark fails 
to use it's own default configuration.

> Default log4j configuration no longer seems to work
> ---
>
> Key: SPARK-2858
> URL: https://issues.apache.org/jira/browse/SPARK-2858
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>
> For reasons unknown this doesn't seem to be working anymore. I deleted my 
> log4j.properties file, did a fresh build, and noticed it still gave me a 
> verbose stack trace when port 4040 was contended (which is a log we silence 
> in the conf). I actually think this was an issue even before [~sowen]'s 
> changes, so I'm not sure what's up.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-922) Update Spark AMI to Python 2.7

2014-08-15 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099089#comment-14099089
 ] 

Josh Rosen commented on SPARK-922:
--

Yeah, you still need to set PYSPARK_PYTHON since this doesn't overwrite the 
system Python.  I was updating this to brain-dump the script I'm using for a 
Python 2.6 vs Python 2.7 benchmark.

> Update Spark AMI to Python 2.7
> --
>
> Key: SPARK-922
> URL: https://issues.apache.org/jira/browse/SPARK-922
> Project: Spark
>  Issue Type: Task
>  Components: EC2, PySpark
>Affects Versions: 0.9.0, 0.9.1, 1.0.0
>Reporter: Josh Rosen
> Fix For: 1.1.0
>
>
> Many Python libraries only support Python 2.7+, so we should make Python 2.7 
> the default Python on the Spark AMIs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1476) 2GB limit in spark for blocks

2014-08-15 Thread Mridul Muralidharan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099072#comment-14099072
 ] 

Mridul Muralidharan commented on SPARK-1476:


Based on discussions we had with others, apparently 1.1 was not a good vehicle 
for this proposal.
Further, since there was no interest in this JIRA or comments on the proposal, 
we put the effort on the back burner.

We plan to push at least some of the bugs fixed as part of this effort - 
consolidated shuffle did get resolved in 1.1, and probably a few more might be 
contributed back in 1.2 time permitting (disk-backed map output tracking, for 
example, looks like a good candidate).
But the bulk of the change is pervasive, at times a bit invasive, and at odds 
with some of the other changes (for example, zero-copy); shepherding it might 
be a bit time consuming for me given other deliverables.

If there is renewed interest in getting this integrated into a Spark release, 
I can try to push for it to be resurrected and submitted.

> 2GB limit in spark for blocks
> -
>
> Key: SPARK-1476
> URL: https://issues.apache.org/jira/browse/SPARK-1476
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
> Environment: all
>Reporter: Mridul Muralidharan
>Assignee: Mridul Muralidharan
>Priority: Critical
> Attachments: 2g_fix_proposal.pdf
>
>
> The underlying abstraction for blocks in Spark is a ByteBuffer, which limits 
> the size of a block to 2GB.
> This has implications not just for managed blocks in use, but also for shuffle 
> blocks (memory-mapped blocks are limited to 2GB, even though the API allows 
> for a long), ser/deser via byte-array-backed output streams (SPARK-1391), etc.
> This is a severe limitation when Spark is used on non-trivial 
> datasets.
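
For context, a ByteBuffer's capacity is an Int, so any single buffer tops out 
near 2GB. One way around that, sketched here under that assumption and not 
taken from the attached proposal, is to chunk a large block into several 
buffers:

{code}
import java.nio.ByteBuffer

// Hypothetical sketch: represent a >2GB block as a sequence of ByteBuffers,
// since a single ByteBuffer's capacity is limited to Integer.MAX_VALUE bytes.
object ChunkedBlockSketch {
  def allocate(totalSize: Long, chunkSize: Int = 64 * 1024 * 1024): Seq[ByteBuffer] = {
    require(totalSize >= 0 && chunkSize > 0)
    val numChunks = ((totalSize + chunkSize - 1) / chunkSize).toInt
    (0 until numChunks).map { i =>
      val remaining = totalSize - i.toLong * chunkSize
      ByteBuffer.allocate(math.min(chunkSize.toLong, remaining).toInt)
    }
  }
}
{code}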



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3076) Gracefully report build timeouts in Jenkins

2014-08-15 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-3076:
---

 Summary: Gracefully report build timeouts in Jenkins
 Key: SPARK-3076
 URL: https://issues.apache.org/jira/browse/SPARK-3076
 Project: Spark
  Issue Type: Sub-task
  Components: Build
Reporter: Nicholas Chammas
Priority: Minor


Copy of dev list thread:

{quote}
Jenkins runs for this PR https://github.com/apache/spark/pull/1960 timed out 
without notification. The relevant Jenkins logs are at

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18588/consoleFull
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18592/consoleFull
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18597/consoleFull




On Fri, Aug 15, 2014 at 11:44 AM, Nicholas Chammas  
wrote:
Shivaram,

Can you point us to an example of that happening? The Jenkins console output, 
that is.

Nick


On Fri, Aug 15, 2014 at 2:28 PM, Shivaram Venkataraman 
 wrote:
Also, I think Jenkins doesn't post build timeouts to GitHub. Is there any way
we can fix that?
On Aug 15, 2014 9:04 AM, "Patrick Wendell"  wrote:

> Hi All,
>
> I noticed that all PR tests run overnight had failed due to timeouts. The
> patch that updates the netty shuffle, I believe, somehow inflated the
> build time significantly. That patch had been tested, but one change was
> made before it was merged that was not tested.
>
> I've reverted the patch for now to see if it brings the build times back
> down.
>
> - Patrick
>
{quote}




--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-922) Update Spark AMI to Python 2.7

2014-08-15 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14099049#comment-14099049
 ] 

Nicholas Chammas commented on SPARK-922:


Josh, at the end of your updated script do we still also need the step to edit 
{{spark-env.sh}}?

> Update Spark AMI to Python 2.7
> --
>
> Key: SPARK-922
> URL: https://issues.apache.org/jira/browse/SPARK-922
> Project: Spark
>  Issue Type: Task
>  Components: EC2, PySpark
>Affects Versions: 0.9.0, 0.9.1, 1.0.0
>Reporter: Josh Rosen
> Fix For: 1.1.0
>
>
> Many Python libraries only support Python 2.7+, so we should make Python 2.7 
> the default Python on the Spark AMIs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-1477) Add the lifecycle interface

2014-08-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1477?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-1477:
--

Target Version/s: 1.2.0  (was: 1.1.0)

Retargeting this to 1.2.0.

> Add the lifecycle interface
> ---
>
> Key: SPARK-1477
> URL: https://issues.apache.org/jira/browse/SPARK-1477
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.0, 1.0.1
>Reporter: Guoqiang Li
>Assignee: Guoqiang Li
>
> In the current Spark code, there are many interfaces or classes that define 
> start and stop 
> methods, e.g. [SchedulerBackend|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/SchedulerBackend.scala], [HttpServer|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/HttpServer.scala], [ContextCleaner|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ContextCleaner.scala].
>  We should use a lifecycle interface to improve the code.
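
A minimal sketch of the kind of lifecycle trait being proposed (the name is 
illustrative, not taken from a patch):

{code}
// Hypothetical sketch: a common lifecycle trait that components such as
// SchedulerBackend, HttpServer, and ContextCleaner could implement.
trait Lifecycle {
  def start(): Unit
  def stop(): Unit
}
{code}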



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3033) [Hive] java.math.BigDecimal cannot be cast to org.apache.hadoop.hive.common.type.HiveDecimal

2014-08-15 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14098986#comment-14098986
 ] 

Michael Armbrust commented on SPARK-3033:
-

Can you provide the query?

> [Hive] java.math.BigDecimal cannot be cast to 
> org.apache.hadoop.hive.common.type.HiveDecimal
> 
>
> Key: SPARK-3033
> URL: https://issues.apache.org/jira/browse/SPARK-3033
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.0.2
>Reporter: pengyanhong
>Priority: Blocker
>
> run a complex HiveQL via yarn-cluster, got error as below:
> {quote}
> 14/08/14 15:05:24 WARN 
> org.apache.spark.Logging$class.logWarning(Logging.scala:70): Loss was due to 
> java.lang.ClassCastException
> java.lang.ClassCastException: java.math.BigDecimal cannot be cast to 
> org.apache.hadoop.hive.common.type.HiveDecimal
>   at 
> org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaHiveDecimalObjectInspector.getPrimitiveJavaObject(JavaHiveDecimalObjectInspector.java:51)
>   at 
> org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorUtils.getHiveDecimal(PrimitiveObjectInspectorUtils.java:1022)
>   at 
> org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorConverter$HiveDecimalConverter.convert(PrimitiveObjectInspectorConverter.java:306)
>   at 
> org.apache.hadoop.hive.ql.udf.generic.GenericUDFUtils$ReturnObjectInspectorResolver.convertIfNecessary(GenericUDFUtils.java:179)
>   at 
> org.apache.hadoop.hive.ql.udf.generic.GenericUDFIf.evaluate(GenericUDFIf.java:82)
>   at org.apache.spark.sql.hive.HiveGenericUdf.eval(hiveUdfs.scala:276)
>   at 
> org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:84)
>   at 
> org.apache.spark.sql.catalyst.expressions.MutableProjection.apply(Projection.scala:62)
>   at 
> org.apache.spark.sql.catalyst.expressions.MutableProjection.apply(Projection.scala:51)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>   at 
> org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$4.apply(joins.scala:309)
>   at 
> org.apache.spark.sql.execution.BroadcastNestedLoopJoin$$anonfun$4.apply(joins.scala:303)
>   at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:571)
>   at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:571)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
>   at org.apache.spark.scheduler.Task.run(Task.scala:51)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>   at java.lang.Thread.run(Thread.java:662)
> {quote}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3034) [HIve] java.sql.Date cannot be cast to java.sql.Timestamp

2014-08-15 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14098985#comment-14098985
 ] 

Michael Armbrust commented on SPARK-3034:
-

Can you provide the query?

> [HIve] java.sql.Date cannot be cast to java.sql.Timestamp
> -
>
> Key: SPARK-3034
> URL: https://issues.apache.org/jira/browse/SPARK-3034
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, SQL
>Affects Versions: 1.0.2
>Reporter: pengyanhong
>Priority: Blocker
>
> run a simple HiveQL via yarn-cluster, got error as below:
> {quote}
> Exception in thread "Thread-2" java.lang.reflect.InvocationTargetException
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>   at java.lang.reflect.Method.invoke(Method.java:597)
>   at 
> org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:199)
> Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
> Task 0.0:127 failed 3 times, most recent failure: Exception failure in TID 
> 141 on host A01-R06-I147-41.jd.local: java.lang.ClassCastException: 
> java.sql.Date cannot be cast to java.sql.Timestamp
> 
> org.apache.hadoop.hive.serde2.objectinspector.primitive.JavaTimestampObjectInspector.getPrimitiveWritableObject(JavaTimestampObjectInspector.java:33)
> 
> org.apache.hadoop.hive.serde2.lazy.LazyUtils.writePrimitiveUTF8(LazyUtils.java:251)
> 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:486)
> 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serializeField(LazySimpleSerDe.java:439)
> 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.serialize(LazySimpleSerDe.java:423)
> 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$3$$anonfun$apply$1.apply(InsertIntoHiveTable.scala:200)
> 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$3$$anonfun$apply$1.apply(InsertIntoHiveTable.scala:192)
> scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
> 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable.org$apache$spark$sql$hive$execution$InsertIntoHiveTable$$writeToFile$1(InsertIntoHiveTable.scala:149)
> 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
> 
> org.apache.spark.sql.hive.execution.InsertIntoHiveTable$$anonfun$saveAsHiveFile$1.apply(InsertIntoHiveTable.scala:158)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:111)
> org.apache.spark.scheduler.Task.run(Task.scala:51)
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> java.lang.Thread.run(Thread.java:662)
> Driver stacktrace:
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1049)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1033)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1031)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1031)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:635)
>   at scala.Option.foreach(Option.scala:236)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:635)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1234)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>   at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at 
> scala.concurrent.forkjoin.ForkJoinPool

[jira] [Resolved] (SPARK-2717) BasicBlockFetchIterator#next should log when it gets stuck

2014-08-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-2717.
---

Resolution: Won't Fix

This is subsumed by the patch that adds timeouts to BasicBlockFetchIterator.

> BasicBlockFetchIterator#next should log when it gets stuck
> --
>
> Key: SPARK-2717
> URL: https://issues.apache.org/jira/browse/SPARK-2717
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: Josh Rosen
>Priority: Critical
>
> If this is stuck for a long time waiting for blocks, we should log what nodes 
> it is waiting for to help debugging. One way to do this is to call take() 
> with a timeout (e.g. 60 seconds) and when the timeout expires log a message 
> for the blocks it is still waiting for. This could all happen in a loop so 
> that the wait just restarts after the message is logged.
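
A minimal sketch of the poll-with-timeout loop described above, assuming a 
hypothetical pendingBlocks() callback rather than the real iterator internals:

{code}
import java.util.concurrent.{BlockingQueue, TimeUnit}

// Hypothetical sketch: wait in a loop, logging which blocks are still pending
// each time the timeout expires, then resume waiting.
object FetchWaitSketch {
  def nextWithLogging[T <: AnyRef](results: BlockingQueue[T],
                                   pendingBlocks: () => Seq[String],
                                   timeoutSeconds: Long = 60): T = {
    var result: T = null.asInstanceOf[T]
    while (result == null) {
      result = results.poll(timeoutSeconds, TimeUnit.SECONDS)
      if (result == null) {
        println(s"Still waiting for blocks: ${pendingBlocks().mkString(", ")}")
      }
    }
    result
  }
}
{code}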



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2911) provide rdd.parent[T](j) to obtain jth parent of rdd

2014-08-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2911?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-2911.
---

   Resolution: Fixed
Fix Version/s: 1.2.0
 Assignee: Erik Erlandson

Marking as 'fixed' in 1.2.0 since these pull requests were merged into master 
but not into branch-1.1.

> provide rdd.parent[T](j) to obtain jth parent of rdd
> 
>
> Key: SPARK-2911
> URL: https://issues.apache.org/jira/browse/SPARK-2911
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Erik Erlandson
>Assignee: Erik Erlandson
>Priority: Minor
>  Labels: easyfix, easytest
> Fix For: 1.2.0
>
>
> For writing RDD subclasses that involve more than a single parent dependency, 
> it would be convenient (and more readable) to say:
> rdd.parent[T](j)
> instead of:
> rdd.dependencies(j).rdd.asInstanceOf[RDD[T]]
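
A minimal sketch of such a helper written as an extension (hypothetical, not 
necessarily how the merged pull requests implement it):

{code}
import org.apache.spark.rdd.RDD

// Hypothetical sketch: expose the jth parent with the expected element type.
object RDDParentSketch {
  implicit class RichRDD[U](val rdd: RDD[U]) extends AnyVal {
    def parent[T](j: Int): RDD[T] = rdd.dependencies(j).rdd.asInstanceOf[RDD[T]]
  }
}
{code}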



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2110) Misleading help displayed for interactive mode pyspark --help

2014-08-15 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-2110.
---

   Resolution: Fixed
Fix Version/s: 1.1.0

I think this was fixed by SPARK-2678: these options now take effect when 
launching pyspark shells.

> Misleading help displayed for interactive mode pyspark --help
> -
>
> Key: SPARK-2110
> URL: https://issues.apache.org/jira/browse/SPARK-2110
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, PySpark
>Affects Versions: 1.1.0
>Reporter: Prashant Sharma
>Assignee: Patrick Wendell
>Priority: Blocker
> Fix For: 1.1.0
>
>
> The help displayed by the command pyspark --help is not relevant for 
> interactive mode.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3028) sparkEventToJson should support SparkListenerExecutorMetricsUpdate

2014-08-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-3028.


   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 1961
[https://github.com/apache/spark/pull/1961]

> sparkEventToJson should support SparkListenerExecutorMetricsUpdate
> --
>
> Key: SPARK-3028
> URL: https://issues.apache.org/jira/browse/SPARK-3028
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Reynold Xin
>Priority: Blocker
> Fix For: 1.1.0
>
>
> SparkListenerExecutorMetricsUpdate was added without updating 
> org.apache.spark.util.JsonProtocol.sparkEventToJson.
> This can crash the listener.
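
A hypothetical sketch (not Spark's JsonProtocol) of how an un-updated 
serializer match can crash the listener when a new event type arrives:

{code}
// Hypothetical sketch: adding a new event type without extending the
// serializer's match leads to a scala.MatchError at runtime.
object EventJsonSketch {
  sealed trait SparkEventSketch
  case class StageCompletedSketch(stageId: Int) extends SparkEventSketch
  case class ExecutorMetricsUpdateSketch(execId: String) extends SparkEventSketch // newly added

  def sparkEventToJsonSketch(event: SparkEventSketch): String = event match {
    case StageCompletedSketch(id) => s"""{"Event":"StageCompleted","Stage ID":$id}"""
    // No case for ExecutorMetricsUpdateSketch: callers hit scala.MatchError.
  }
}
{code}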



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3028) sparkEventToJson should support SparkListenerExecutorMetricsUpdate

2014-08-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-3028:
---

Assignee: Sandy Ryza

> sparkEventToJson should support SparkListenerExecutorMetricsUpdate
> --
>
> Key: SPARK-3028
> URL: https://issues.apache.org/jira/browse/SPARK-3028
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Reynold Xin
>Assignee: Sandy Ryza
>Priority: Blocker
> Fix For: 1.1.0
>
>
> SparkListenerExecutorMetricsUpdate was added without updating 
> org.apache.spark.util.JsonProtocol.sparkEventToJson.
> This can crash the listener.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3075) Expose a way for users to parse event logs

2014-08-15 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3075:
-

Fix Version/s: 1.2.0

> Expose a way for users to parse event logs
> --
>
> Key: SPARK-3075
> URL: https://issues.apache.org/jira/browse/SPARK-3075
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.2
>Reporter: Andrew Or
> Fix For: 1.2.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3075) Expose a way for users to parse event logs

2014-08-15 Thread Andrew Or (JIRA)
Andrew Or created SPARK-3075:


 Summary: Expose a way for users to parse event logs
 Key: SPARK-3075
 URL: https://issues.apache.org/jira/browse/SPARK-3075
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: Andrew Or






--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3075) Expose a way for users to parse event logs

2014-08-15 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-3075:
-

Description: Both ReplayListenerBus and util.JsonProtocol are 
private[spark], so if users want to parse the event logs themselves for 
analytics they will have to write their own JSON deserializers (or do some 
crazy reflection to access these methods). We should expose an easy way for 
them to do this.
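As a hedged sketch of what users currently have to hand-roll, assuming the log keeps its current one-JSON-object-per-line layout with an "Event" field, and using json4s, which Spark already depends on:

{code}
import scala.io.Source
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// Count events by type without going through the private[spark] JsonProtocol.
object EventLogHistogram {
  def main(args: Array[String]): Unit = {
    implicit val formats = DefaultFormats
    val counts = Source.fromFile(args(0)).getLines()
      .map(line => (parse(line) \ "Event").extract[String])
      .toList
      .groupBy(identity)
      .mapValues(_.size)
    counts.foreach { case (event, n) => println(s"$event: $n") }
  }
}
{code}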

> Expose a way for users to parse event logs
> --
>
> Key: SPARK-3075
> URL: https://issues.apache.org/jira/browse/SPARK-3075
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.0.2
>Reporter: Andrew Or
> Fix For: 1.2.0
>
>
> Both ReplayListenerBus and util.JsonProtocol are private[spark], so if users 
> want to parse the event logs themselves for analytics they will have to 
> write their own JSON deserializers (or do some crazy reflection to access 
> these methods). We should expose an easy way for them to do this.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-922) Update Spark AMI to Python 2.7

2014-08-15 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14098824#comment-14098824
 ] 

Josh Rosen edited comment on SPARK-922 at 8/15/14 6:10 PM:
---

Updated script, which also updates numpy:

{code}
yum install -y pssh
yum install -y python27 python27-devel
pssh -h /root/spark-ec2/slaves yum install -y python27 python27-devel
wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | 
python27
pssh -h /root/spark-ec2/slaves "wget 
https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27"
easy_install-2.7 pip
pssh -h /root/spark-ec2/slaves easy_install-2.7 pip
pip2.7 install numpy
pssh -h /root/spark-ec2/slaves pip2.7 -t0 install numpy
{code}

And to check that numpy is successfully installed:

{code}
pssh -h /root/spark-ec2/slaves --inline-stdout 'python2.7 -c "import numpy; 
print numpy"'
{code}


was (Author: joshrosen):
Updated script, which also updates numpy:

{code}
yum install -y pssh
yum install -y python27 python27-devel
pssh -h /root/spark-ec2/slaves yum install -y python27 python27-devel
wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | 
python27
pssh -h /root/spark-ec2/slaves "wget 
https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27"
easy_install-2.7 pip
pssh -h /root/spark-ec2/slaves easy_install-2.7 pip
pip2.7 install numpy
pssh -h /root/spark-ec2/slaves pip2.7 -t0 install numpy
{code}

> Update Spark AMI to Python 2.7
> --
>
> Key: SPARK-922
> URL: https://issues.apache.org/jira/browse/SPARK-922
> Project: Spark
>  Issue Type: Task
>  Components: EC2, PySpark
>Affects Versions: 0.9.0, 0.9.1, 1.0.0
>Reporter: Josh Rosen
> Fix For: 1.1.0
>
>
> Many Python libraries only support Python 2.7+, so we should make Python 2.7 
> the default Python on the Spark AMIs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-922) Update Spark AMI to Python 2.7

2014-08-15 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14098824#comment-14098824
 ] 

Josh Rosen edited comment on SPARK-922 at 8/15/14 6:05 PM:
---

Updated script, which also updates numpy:

{code}
yum install -y pssh
yum install -y python27 python27-devel
pssh -h /root/spark-ec2/slaves yum install -y python27 python27-devel
wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | 
python27
pssh -h /root/spark-ec2/slaves "wget 
https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27"
easy_install-2.7 pip
pssh -h /root/spark-ec2/slaves easy_install-2.7 pip
pip2.7 install numpy
pssh -h /root/spark-ec2/slaves pip2.7 -t0 install numpy
{code}


was (Author: joshrosen):
Updated script, which also updates numpy:

{code}
yum install -y pssh
yum install -y python27 python27-devel
pssh -h /root/spark-ec2/slaves yum install -y python27 python27-devel
wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | 
python27
pssh -h /root/spark-ec2/slaves "wget 
https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27"
easy_install-2.7 pip
pssh -h /root/spark-ec2/slaves easy_install-2.7 pip
pip2.7 install numpy
pssh -h /root/spark-ec2/slaves pip2.7 install numpy
{code}

> Update Spark AMI to Python 2.7
> --
>
> Key: SPARK-922
> URL: https://issues.apache.org/jira/browse/SPARK-922
> Project: Spark
>  Issue Type: Task
>  Components: EC2, PySpark
>Affects Versions: 0.9.0, 0.9.1, 1.0.0
>Reporter: Josh Rosen
> Fix For: 1.1.0
>
>
> Many Python libraries only support Python 2.7+, so we should make Python 2.7 
> the default Python on the Spark AMIs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-922) Update Spark AMI to Python 2.7

2014-08-15 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14098824#comment-14098824
 ] 

Josh Rosen commented on SPARK-922:
--

Updated script, which also updates numpy:

{code}
yum install -y pssh
yum install -y python27 python27-devel
pssh -h /root/spark-ec2/slaves yum install -y python27 python27-devel
wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | 
python27
pssh -h /root/spark-ec2/slaves "wget 
https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27"
easy_install-2.7 pip
pssh -h /root/spark-ec2/slaves easy_install-2.7 pip
pip2.7 install numpy
pssh -h /root/spark-ec2/slaves pip2.7 install numpy
{code}

> Update Spark AMI to Python 2.7
> --
>
> Key: SPARK-922
> URL: https://issues.apache.org/jira/browse/SPARK-922
> Project: Spark
>  Issue Type: Task
>  Components: EC2, PySpark
>Affects Versions: 0.9.0, 0.9.1, 1.0.0
>Reporter: Josh Rosen
> Fix For: 1.1.0
>
>
> Many Python libraries only support Python 2.7+, so we should make Python 2.7 
> the default Python on the Spark AMIs.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3046) Set executor's class loader as the default serializer class loader

2014-08-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14098823#comment-14098823
 ] 

Apache Spark commented on SPARK-3046:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/1972

> Set executor's class loader as the default serializer class loader
> --
>
> Key: SPARK-3046
> URL: https://issues.apache.org/jira/browse/SPARK-3046
> Project: Spark
>  Issue Type: Bug
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> This is an attempt to fix the problem outlined in SPARK-2878.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2468) Netty-based block server / client module

2014-08-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14098805#comment-14098805
 ] 

Apache Spark commented on SPARK-2468:
-

User 'rxin' has created a pull request for this issue:
https://github.com/apache/spark/pull/1971

> Netty-based block server / client module
> 
>
> Key: SPARK-2468
> URL: https://issues.apache.org/jira/browse/SPARK-2468
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, Spark Core
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>Priority: Critical
>
> Right now shuffle send goes through the block manager. This is inefficient 
> because it requires loading a block from disk into a kernel buffer, then into 
> a user space buffer, and then back to a kernel send buffer before it reaches 
> the NIC. It does multiple copies of the data and context switching between 
> kernel/user. It also creates unnecessary buffers in the JVM that increase GC 
> pressure. Instead, we should use FileChannel.transferTo, which handles this in the 
> kernel space with zero-copy. See 
> http://www.ibm.com/developerworks/library/j-zerocopy/
> One potential solution is to use Netty.  Spark already has a Netty based 
> network module implemented (org.apache.spark.network.netty). However, it 
> lacks some functionality and is turned off by default. 
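For reference, the zero-copy primitive mentioned above is FileChannel.transferTo; a minimal, Spark-independent sketch of pushing a file region straight to a destination channel:

{code}
import java.io.FileInputStream
import java.nio.channels.{FileChannel, WritableByteChannel}

// Transfer an entire file to a destination channel without copying the bytes
// through user-space buffers. transferTo may move fewer bytes than requested,
// so loop until the whole region has been sent.
def sendFile(path: String, dest: WritableByteChannel): Long = {
  val channel: FileChannel = new FileInputStream(path).getChannel
  try {
    val size = channel.size()
    var position = 0L
    while (position < size) {
      position += channel.transferTo(position, size - position, dest)
    }
    position
  } finally {
    channel.close()
  }
}
{code}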



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3074) support groupByKey() with hot keys in PySpark

2014-08-15 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-3074:
--

Summary: support groupByKey() with hot keys in PySpark  (was: support 
groupByKey() with hot keys)

> support groupByKey() with hot keys in PySpark
> -
>
> Key: SPARK-3074
> URL: https://issues.apache.org/jira/browse/SPARK-3074
> Project: Spark
>  Issue Type: Improvement
>Reporter: Davies Liu
>Assignee: Davies Liu
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3073) improve large sort (external sort) for PySpark

2014-08-15 Thread Davies Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14098785#comment-14098785
 ] 

Davies Liu commented on SPARK-3073:
---

This is for PySpark; currently we do not support large data sets in the reduce 
stage during sortBy() or sortByKey().

This will also be useful for groupByKey() with hot keys (when memory cannot 
hold all the values of a single hot key).

> improve large sort (external sort) for PySpark
> --
>
> Key: SPARK-3073
> URL: https://issues.apache.org/jira/browse/SPARK-3073
> Project: Spark
>  Issue Type: Improvement
>Reporter: Davies Liu
>Assignee: Davies Liu
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3073) improve large sort (external sort) for PySpark

2014-08-15 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-3073:
--

Summary: improve large sort (external sort) for PySpark  (was: improve 
large sort (external sort))

> improve large sort (external sort) for PySpark
> --
>
> Key: SPARK-3073
> URL: https://issues.apache.org/jira/browse/SPARK-3073
> Project: Spark
>  Issue Type: Improvement
>Reporter: Davies Liu
>Assignee: Davies Liu
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3062) ShutdownHookManager is only available in Hadoop 2.x

2014-08-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14098776#comment-14098776
 ] 

Apache Spark commented on SPARK-3062:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/1970

> ShutdownHookManager is only available in Hadoop 2.x
> ---
>
> Key: SPARK-3062
> URL: https://issues.apache.org/jira/browse/SPARK-3062
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.2
>Reporter: Cheng Lian
>Priority: Blocker
>
> PR [#1891|https://github.com/apache/spark/pull/1891] leverages 
> {{ShutdownHookManager}} to avoid {{IOException}} when {{EventLogging}} is 
> enabled. But unfortunately {{ShutdownHookManager}} is only available in 
> Hadoop 2.x. Compilation fails when building Spark with Hadoop 1.
> {code}
> $ ./sbt/sbt -Phive-thriftserver
> ...
> [ERROR] 
> /home/spark/software/source/compile/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:30:
>  object ShutdownHookManager is not a member of package org.apache.hadoop.util
> [ERROR] import org.apache.hadoop.util.ShutdownHookManager
> [ERROR]^
> [ERROR] 
> /home/spark/software/source/compile/spark/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkSQLCLIDriver.scala:125:
>  not found: value ShutdownHookManager
> [ERROR] ShutdownHookManager.get.addShutdownHook(
> [ERROR] ^‍
> [WARNING] one warning found
> [ERROR] two errors found‍
> {code}
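One possible Hadoop-1-compatible fallback (a sketch only, not necessarily what the linked pull request does) is to look the class up reflectively and fall back to a plain JVM shutdown hook when it is missing:

{code}
// Register a shutdown hook via Hadoop's ShutdownHookManager when it exists
// (Hadoop 2.x), otherwise via Runtime.addShutdownHook (Hadoop 1.x).
def addShutdownHook(priority: Int)(body: => Unit): Unit = {
  val hook = new Runnable { override def run(): Unit = body }
  try {
    val clazz = Class.forName("org.apache.hadoop.util.ShutdownHookManager")
    val manager = clazz.getMethod("get").invoke(null)
    clazz.getMethod("addShutdownHook", classOf[Runnable], classOf[Int])
      .invoke(manager, hook, Int.box(priority))
  } catch {
    case _: ClassNotFoundException =>
      Runtime.getRuntime.addShutdownHook(new Thread(hook))
  }
}
{code}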



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2970) spark-sql script ends with IOException when EventLogging is enabled

2014-08-15 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14098777#comment-14098777
 ] 

Apache Spark commented on SPARK-2970:
-

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/1970

> spark-sql script ends with IOException when EventLogging is enabled
> ---
>
> Key: SPARK-2970
> URL: https://issues.apache.org/jira/browse/SPARK-2970
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.1.0
> Environment: CDH5.1.0 (Hadoop 2.3.0)
>Reporter: Kousuke Saruta
> Fix For: 1.1.0
>
>
> When spark-sql script run with spark.eventLog.enabled set true, it ends with 
> IOException because FileLogger can not create APPLICATION_COMPLETE file in 
> HDFS.
> It's is because a shutdown hook of SparkSQLCLIDriver is executed after a 
> shutdown hook of org.apache.hadoop.fs.FileSystem is executed.
> When spark.eventLog.enabled is true, the hook of SparkSQLCLIDriver finally 
> try to create a file to mark the application finished but the hook of 
> FileSystem try to close FileSystem.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1828) Created forked version of hive-exec that doesn't bundle other dependencies

2014-08-15 Thread Maxim Ivanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14098774#comment-14098774
 ] 

Maxim Ivanov commented on SPARK-1828:
-

I don't have a pull request at hand if you are asking that ;) But IMHO the 
proper solution is to tinker with the maven shade plugin to drop the classes 
pulled in by the hive dependency in favor of those specified in the Spark POM. 

If it is done that way, then it would be possible to specify the hive version 
using a "-D" param in the same way we can specify the hadoop version, and be 
sure (to some extent of course :) ) that if it builds, it works.

> Created forked version of hive-exec that doesn't bundle other dependencies
> --
>
> Key: SPARK-1828
> URL: https://issues.apache.org/jira/browse/SPARK-1828
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
>Priority: Blocker
> Fix For: 1.0.0
>
>
> The hive-exec jar includes a bunch of Hive's dependencies in addition to hive 
> itself (protobuf, guava, etc). See HIVE-5733. This breaks any attempt in 
> Spark to manage those dependencies.
> The only solution to this problem is to publish our own version of hive-exec 
> 0.12.0 that behaves correctly. While we are doing this, we might as well 
> re-write the protobuf dependency to use the shaded version of protobuf 2.4.1 
> that we already have for Akka.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-975) Spark Replay Debugger

2014-08-15 Thread Phuoc Do (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14098758#comment-14098758
 ] 

Phuoc Do commented on SPARK-975:


Cheng Lian, I saw that the latest UI displays a stack trace for each stage. Is 
there a way to filter out function calls that we don't want to display in the 
debugger? There seem to be a lot of native code calls in there. See the stack 
below.

I did some work with d3 force layout. See here:

https://github.com/dnprock/spark-debugger

Stack:

org.apache.spark.rdd.RDD.count(RDD.scala:904)
$line9.$read$$iwC$$iwC$$iwC$$iwC.(:15)
$line9.$read$$iwC$$iwC$$iwC.(:20)
$line9.$read$$iwC$$iwC.(:22)
$line9.$read$$iwC.(:24)
$line9.$read.(:26)
$line9.$read$.(:30)
$line9.$read$.()
$line9.$eval$.(:7)
$line9.$eval$.()
$line9.$eval.$print()
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:483)
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:789)
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1062)
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:615)
org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:646)
org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:610)

> Spark Replay Debugger
> -
>
> Key: SPARK-975
> URL: https://issues.apache.org/jira/browse/SPARK-975
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Affects Versions: 0.9.0
>Reporter: Cheng Lian
>  Labels: arthur, debugger
> Attachments: IMG_20140722_184149.jpg, RDD DAG.png
>
>
> The Spark debugger was first mentioned as {{rddbg}} in the [RDD technical 
> report|http://www.cs.berkeley.edu/~matei/papers/2011/tr_spark.pdf].
> [Arthur|https://github.com/mesos/spark/tree/arthur], authored by [Ankur 
> Dave|https://github.com/ankurdave], is an old implementation of the Spark 
> debugger, which demonstrated both the elegance and power behind the RDD 
> abstraction.  Unfortunately, the corresponding GitHub branch was not merged 
> into the master branch and had stopped 2 years ago.  For more information 
> about Arthur, please refer to [the Spark Debugger Wiki 
> page|https://github.com/mesos/spark/wiki/Spark-Debugger] in the old GitHub 
> repository.
> As a useful tool for Spark application debugging and analysis, it would be 
> nice to have a complete Spark debugger.  In 
> [PR-224|https://github.com/apache/incubator-spark/pull/224], I propose a new 
> implementation of the Spark debugger, the Spark Replay Debugger (SRD).
> [PR-224|https://github.com/apache/incubator-spark/pull/224] is only a preview 
> for discussion.  In the current version, I only implemented features that can 
> illustrate the basic mechanisms.  There are still features appeared in Arthur 
> but missing in SRD, such as checksum based nondeterminsm detection and single 
> task debugging with conventional debugger (like {{jdb}}).  However, these 
> features can be easily built upon current SRD framework.  To minimize code 
> review effort, I didn't include them into the current version intentionally.
> Attached is the visualization of the MLlib ALS application (with 1 iteration) 
> generated by SRD.  For more information, please refer to [the SRD overview 
> document|http://spark-replay-debugger-overview.readthedocs.org/en/latest/].



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1828) Created forked version of hive-exec that doesn't bundle other dependencies

2014-08-15 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14098752#comment-14098752
 ] 

Patrick Wendell commented on SPARK-1828:


Maxim - I think what you are pointing out is unrelated to this exact issue. 
Spark hard-codes a specific version of Hive in our build. This is true whether 
we are pointing to a slightly modified version of Hive 0.12 or to the actual 
Hive 0.12.

The issue is that Hive does not have stable API's so we can't provide a version 
of Spark that is cross-compatible with different versions of Hive. We are 
trying to simplify our dependency on Hive to fix this.

Are you proposing a specific change here?

> Created forked version of hive-exec that doesn't bundle other dependencies
> --
>
> Key: SPARK-1828
> URL: https://issues.apache.org/jira/browse/SPARK-1828
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
>Priority: Blocker
> Fix For: 1.0.0
>
>
> The hive-exec jar includes a bunch of Hive's dependencies in addition to hive 
> itself (protobuf, guava, etc). See HIVE-5733. This breaks any attempt in 
> Spark to manage those dependencies.
> The only solution to this problem is to publish our own version of hive-exec 
> 0.12.0 that behaves correctly. While we are doing this, we might as well 
> re-write the protobuf dependency to use the shaded version of protobuf 2.4.1 
> that we already have for Akka.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3073) improve large sort (external sort)

2014-08-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3073?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14098727#comment-14098727
 ] 

Sean Owen commented on SPARK-3073:
--

What does this refer to, and is it not the same as 
https://issues.apache.org/jira/browse/SPARK-2926 ?

> improve large sort (external sort)
> --
>
> Key: SPARK-3073
> URL: https://issues.apache.org/jira/browse/SPARK-3073
> Project: Spark
>  Issue Type: Improvement
>Reporter: Davies Liu
>Assignee: Davies Liu
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3074) support groupByKey() with hot keys

2014-08-15 Thread Davies Liu (JIRA)
Davies Liu created SPARK-3074:
-

 Summary: support groupByKey() with hot keys
 Key: SPARK-3074
 URL: https://issues.apache.org/jira/browse/SPARK-3074
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu






--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3073) improve large sort (external sort)

2014-08-15 Thread Davies Liu (JIRA)
Davies Liu created SPARK-3073:
-

 Summary: improve large sort (external sort)
 Key: SPARK-3073
 URL: https://issues.apache.org/jira/browse/SPARK-3073
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu






--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3072) Yarn AM not always properly exiting after unregistering from RM

2014-08-15 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14098705#comment-14098705
 ] 

Thomas Graves commented on SPARK-3072:
--

Note that in yarn-cluster mode the client side does quit, but the application 
master is still running on the cluster.

> Yarn AM not always properly exiting after unregistering from RM
> ---
>
> Key: SPARK-3072
> URL: https://issues.apache.org/jira/browse/SPARK-3072
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.0.2
>Reporter: Thomas Graves
>Assignee: Thomas Graves
>Priority: Critical
>
> The yarn application master doesn't always exit properly after unregistering 
> from the RM.  
> One way to reproduce is to ask for large containers (> 4g) but use jdk32 so 
> that all of them fail.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2924) Remove use of default arguments where disallowed by 2.11

2014-08-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2924.


   Resolution: Fixed
Fix Version/s: 1.1.0

Issue resolved by pull request 1704
[https://github.com/apache/spark/pull/1704]

> Remove use of default arguments where disallowed by 2.11
> 
>
> Key: SPARK-2924
> URL: https://issues.apache.org/jira/browse/SPARK-2924
> Project: Spark
>  Issue Type: Sub-task
>  Components: Streaming
>Reporter: Patrick Wendell
>Assignee: Anand Avati
>Priority: Blocker
> Fix For: 1.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-2865) Potential deadlock: tasks could hang forever waiting to fetch a remote block even though most tasks finish

2014-08-15 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell resolved SPARK-2865.


Resolution: Fixed

I believe this has been resolved by virtue of other patches to the connection 
manager and other components.

> Potential deadlock: tasks could hang forever waiting to fetch a remote block 
> even though most tasks finish
> --
>
> Key: SPARK-2865
> URL: https://issues.apache.org/jira/browse/SPARK-2865
> Project: Spark
>  Issue Type: Bug
>  Components: Shuffle, Spark Core
>Affects Versions: 1.0.1, 1.1.0
> Environment: 16-node EC2 r3.2xlarge cluster
>Reporter: Zongheng Yang
>Priority: Blocker
>
> In the application I tested, most of the tasks out of 128 tasks could finish, 
> but sometimes (pretty deterministically) either 1 or 3 tasks would just hang 
> forever (> 5 hrs with no progress at all) with the following stack trace. 
> There were no apparent failures from the UI, also the nodes where the stuck 
> tasks were running had no apparent memory/CPU/disk pressures.
> {noformat}
> "Executor task launch worker-0" daemon prio=10 tid=0x7f32ec003800 
> nid=0xaac waiting on condition [0x7f33f4428000]
>java.lang.Thread.State: WAITING (parking)
> at sun.misc.Unsafe.park(Native Method)
> - parking to wait for  <0x7f3e0d7198e8> (a 
> scala.concurrent.impl.Promise$CompletionLatch)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
> at 
> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
> at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
> at 
> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
> at scala.concurrent.Await$.result(package.scala:107)
> at 
> org.apache.spark.network.ConnectionManager.sendMessageReliablySync(ConnectionManager.scala:832)
> at 
> org.apache.spark.storage.BlockManagerWorker$.syncGetBlock(BlockManagerWorker.scala:122)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:497)
> at 
> org.apache.spark.storage.BlockManager$$anonfun$doGetRemote$2.apply(BlockManager.scala:495)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> at 
> org.apache.spark.storage.BlockManager.doGetRemote(BlockManager.scala:495)
> at 
> org.apache.spark.storage.BlockManager.getRemote(BlockManager.scala:481)
> at org.apache.spark.storage.BlockManager.get(BlockManager.scala:524)
> at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:44)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> at org.apache.spark.scheduler.Task.run(Task.scala:54)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> This behavior does *not* appear on 1.0 (reusing the same cluster), but 
> appears on the master branch as of Aug 4, 2014 *and* 

[jira] [Created] (SPARK-3072) Yarn AM not always properly exiting after unregistering from RM

2014-08-15 Thread Thomas Graves (JIRA)
Thomas Graves created SPARK-3072:


 Summary: Yarn AM not always properly exiting after unregistering 
from RM
 Key: SPARK-3072
 URL: https://issues.apache.org/jira/browse/SPARK-3072
 Project: Spark
  Issue Type: Bug
  Components: YARN
Affects Versions: 1.0.2
Reporter: Thomas Graves
Assignee: Thomas Graves
Priority: Critical


The yarn application master doesn't always exit properly after unregistering 
from the RM.  

One way to reproduce is to ask for large containers (> 4g) but use jdk32 so 
that all of them fail.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-3071) Increase default driver memory

2014-08-15 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-3071:


 Summary: Increase default driver memory
 Key: SPARK-3071
 URL: https://issues.apache.org/jira/browse/SPARK-3071
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Xiangrui Meng


The current default is 512M, which is usually too small because the user also 
uses the driver to do some computation. In local mode, the executor memory 
setting is ignored and only driver memory is used, which provides more 
incentive to increase the default driver memory.

I suggest:

1. 2GB in local mode, and warn users if executor memory is set to a bigger value
2. the same as worker memory on an EC2 standalone server
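This proposal is about the default value; for illustration, the setting users can already raise themselves looks like the sketch below. The 2g figure mirrors the suggestion above and is not a current default.

{code}
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative only: 2g matches the suggestion above, not a current default.
// spark.driver.memory must be known before the driver JVM is launched, so in
// practice it is supplied via `--driver-memory 2g` on spark-submit or via
// spark-defaults.conf; setting it in SparkConf only helps when the conf is
// read by the process that launches the driver.
val conf = new SparkConf()
  .setAppName("example")
  .set("spark.driver.memory", "2g")
val sc = new SparkContext(conf)
{code}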



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3070) Kryo deserialization without using the custom registrator

2014-08-15 Thread Daniel Darabos (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Darabos updated SPARK-3070:
--

Summary: Kryo deserialization without using the custom registrator  (was: 
Kry deserialization without using the custom registrator)

> Kryo deserialization without using the custom registrator
> -
>
> Key: SPARK-3070
> URL: https://issues.apache.org/jira/browse/SPARK-3070
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.0.2
>Reporter: Andras Nemeth
>
> If an RDD partition is cached on executor1 and used by a task on executor2 
> then the partition needs to be serialized and sent over. For this particular 
> serialization/deserialization use case, when using Kryo, it appears that the 
> custom registrator will not be used on the deserialization side. This of 
> course results in some totally misleading Kryo deserialization errors.
> The cause for this behavior seems to be that the thread running this 
> deserialization has a classloader which does not have the jars specified in 
> the SparkConf on its classpath. So it fails to load the Registrator with a 
> ClassNotFoundException, but it catches the exception and happily continues 
> without a registrator. (A bug in its own right, in my opinion.)
> To reproduce, have two rdds partitioned the same way (as in with the same 
> partitioner) but corresponding partitions cached on different machines, then 
> join them. See below a somewhat convoluted way to achieve this. If you run 
> the below program on a spark cluster with two workers, each with one core, 
> you will be able to trigger the bug. Basically it runs two counts in 
> parallel, which ensures that the two RDDs will be computed in parallel, and 
> as a consequence on different executors.
> {code:java}
> import com.esotericsoftware.kryo.Kryo
> import org.apache.spark.HashPartitioner
> import org.apache.spark.SparkConf
> import org.apache.spark.SparkContext
> import org.apache.spark.SparkContext._
> import org.apache.spark.rdd.RDD
> import org.apache.spark.serializer.KryoRegistrator
> import scala.actors.Actor
> case class MyClass(a: Int)
> class MyKryoRegistrator extends KryoRegistrator {
>   override def registerClasses(kryo: Kryo) {
> kryo.register(classOf[MyClass])
>   }
> }
> class CountActor(rdd: RDD[_]) extends Actor {
>   def act() {
> println("Start count")
> println(rdd.count)
> println("Stop count")
>   }
> }
> object KryBugExample  {
>   def main(args: Array[String]) {
> val sparkConf = new SparkConf()
>   .setMaster(args(0))
>   .setAppName("KryBugExample")
>   .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>   .set("spark.kryo.registrator", "MyKryoRegistrator")
>   .setJars(Seq("target/scala-2.10/krybugexample_2.10-0.1-SNAPSHOT.jar"))
> val sc = new SparkContext(sparkConf)
> val partitioner = new HashPartitioner(1)
> val rdd1 = sc
>   .parallelize((0 until 10).map(i => (i, MyClass(i))), 1)
>   .partitionBy(partitioner).cache
> val rdd2 = sc
>   .parallelize((0 until 10).map(i => (i, MyClass(i * 2))), 1)
>   .partitionBy(partitioner).cache
> new CountActor(rdd1).start
> new CountActor(rdd2).start
> println(rdd1.join(rdd2).count)
> while (true) {}
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3070) Kry deserialization without using the custom registrator

2014-08-15 Thread Andras Nemeth (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andras Nemeth updated SPARK-3070:
-

Description: 
If an RDD partition is cached on executor1 and used by a task on executor2 then 
the partition needs to be serialized and sent over. For this particular 
serialization/deserialization use case, when using Kryo, it appears that the 
custom registrator will not be used on the deserialization side. This of course 
results in some totally misleading Kryo deserialization errors.

The cause for this behavior seems to be that the thread running this 
deserialization has a classloader which does not have the jars specified in the 
SparkConf on its classpath. So it fails to load the Registrator with a 
ClassNotFoundException, but it catches the exception and happily continues 
without a registrator. (A bug in its own right, in my opinion.)

To reproduce, have two rdds partitioned the same way (as in with the same 
partitioner) but corresponding partitions cached on different machines, then 
join them. See below a somewhat convoluted way to achieve this. If you run the 
below program on a spark cluster with two workers, each with one core, you will 
be able to trigger the bug. Basically it runs two counts in parallel, which 
ensures that the two RDDs will be computed in parallel, and as a consequence on 
different executors.

{code:java}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD
import org.apache.spark.serializer.KryoRegistrator
import scala.actors.Actor

case class MyClass(a: Int)

class MyKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
kryo.register(classOf[MyClass])
  }
}

class CountActor(rdd: RDD[_]) extends Actor {
  def act() {
println("Start count")
println(rdd.count)
println("Stop count")
  }
}

object KryBugExample  {
  def main(args: Array[String]) {
val sparkConf = new SparkConf()
  .setMaster(args(0))
  .setAppName("KryBugExample")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyKryoRegistrator")
  .setJars(Seq("target/scala-2.10/krybugexample_2.10-0.1-SNAPSHOT.jar"))
val sc = new SparkContext(sparkConf)

val partitioner = new HashPartitioner(1)
val rdd1 = sc
  .parallelize((0 until 10).map(i => (i, MyClass(i))), 1)
  .partitionBy(partitioner).cache
val rdd2 = sc
  .parallelize((0 until 10).map(i => (i, MyClass(i * 2))), 1)
  .partitionBy(partitioner).cache
new CountActor(rdd1).start
new CountActor(rdd2).start
println(rdd1.join(rdd2).count)
while (true) {}
  }
}
{code}

  was:
If an RDD partition is cached on executor1 and used by a task on executor2 then 
the partition needs to be serialized and sent over. For this particular 
serialization/deserialization usecase, when using kry, it appears that the 
custom registrator will not be used on the deserialization side. This of course 
results in some totally misleading kry deserialization errors.

The cause for this behavior seems to be that the thread running this 
deserialization has a classloader which does not have the jars specified in the 
SparkConf on its classpath. So it fails to load the Registrator with a 
ClassNotFoundException, but it catches the exception and happily continues 
without a registrator. (A bug on its own right in my opinion.)

To reproduce, have two rdds partitioned the same way (as in with the same 
partitioner) but corresponding partitions cached on different machines, then 
join them. See below a somewhat convoluted way to achieve this. If you run the 
below program on a spark cluster with two workers, each with one core, you will 
be able to trigger the bug. Basically it runs two counts in parallel, which 
ensures that the two RDDs will be computed in parallel, and as a consequence on 
different executors.

{code:scala}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD
import org.apache.spark.serializer.KryoRegistrator
import scala.actors.Actor

case class MyClass(a: Int)

class MyKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
kryo.register(classOf[MyClass])
  }
}

class CountActor(rdd: RDD[_]) extends Actor {
  def act() {
println("Start count")
println(rdd.count)
println("Stop count")
  }
}

object KryBugExample  {
  def main(args: Array[String]) {
val sparkConf = new SparkConf()
  .setMaster(args(0))
  .setAppName("KryBugExample")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .se

[jira] [Created] (SPARK-3070) Kry deserialization without using the custom registrator

2014-08-15 Thread Andras Nemeth (JIRA)
Andras Nemeth created SPARK-3070:


 Summary: Kry deserialization without using the custom registrator
 Key: SPARK-3070
 URL: https://issues.apache.org/jira/browse/SPARK-3070
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.2
Reporter: Andras Nemeth


If an RDD partition is cached on executor1 and used by a task on executor2 then 
the partition needs to be serialized and sent over. For this particular 
serialization/deserialization usecase, when using kry, it appears that the 
custom registrator will not be used on the deserialization side. This of course 
results in some totally misleading kry deserialization errors.

The cause for this behavior seems to be that the thread running this 
deserialization has a classloader which does not have the jars specified in the 
SparkConf on its classpath. So it fails to load the Registrator with a 
ClassNotFoundException, but it catches the exception and happily continues 
without a registrator. (A bug on its own right in my opinion.)

To reproduce, have two rdds partitioned the same way (as in with the same 
partitioner) but corresponding partitions cached on different machines, then 
join them. See below a somewhat convoluted way to achieve this. If you run the 
below program on a spark cluster with two workers, each with one core, you will 
be able to trigger the bug. Basically it runs two counts in parallel, which 
ensures that the two RDDs will be computed in parallel, and as a consequence on 
different executors.

{code:scala}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD
import org.apache.spark.serializer.KryoRegistrator
import scala.actors.Actor

case class MyClass(a: Int)

class MyKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
kryo.register(classOf[MyClass])
  }
}

class CountActor(rdd: RDD[_]) extends Actor {
  def act() {
println("Start count")
println(rdd.count)
println("Stop count")
  }
}

object KryBugExample  {
  def main(args: Array[String]) {
val sparkConf = new SparkConf()
  .setMaster(args(0))
  .setAppName("KryBugExample")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyKryoRegistrator")
  .setJars(Seq("target/scala-2.10/krybugexample_2.10-0.1-SNAPSHOT.jar"))
val sc = new SparkContext(sparkConf)

val partitioner = new HashPartitioner(1)
val rdd1 = sc
  .parallelize((0 until 10).map(i => (i, MyClass(i))), 1)
  .partitionBy(partitioner).cache
val rdd2 = sc
  .parallelize((0 until 10).map(i => (i, MyClass(i * 2))), 1)
  .partitionBy(partitioner).cache
new CountActor(rdd1).start
new CountActor(rdd2).start
println(rdd1.join(rdd2).count)
while (true) {}
  }
}
{code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1828) Created forked version of hive-exec that doesn't bundle other dependencies

2014-08-15 Thread Maxim Ivanov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14098656#comment-14098656
 ] 

Maxim Ivanov commented on SPARK-1828:
-

Because of this change, any incompatibilities with Hive in Hadoop distros are 
hidden until you run a job on an actual cluster, unless you are willing to keep 
your fork up to date with every major Hadoop distro of course. 

Right now we see an incompatibility with CDH5.0.2 Hive, but I'd rather have it 
fail to compile than see problems at runtime.

> Created forked version of hive-exec that doesn't bundle other dependencies
> --
>
> Key: SPARK-1828
> URL: https://issues.apache.org/jira/browse/SPARK-1828
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.0.0
>Reporter: Patrick Wendell
>Assignee: Patrick Wendell
>Priority: Blocker
> Fix For: 1.0.0
>
>
> The hive-exec jar includes a bunch of Hive's dependencies in addition to hive 
> itself (protobuf, guava, etc). See HIVE-5733. This breaks any attempt in 
> Spark to manage those dependencies.
> The only solution to this problem is to publish our own version of hive-exec 
> 0.12.0 that behaves correctly. While we are doing this, we might as well 
> re-write the protobuf dependency to use the shaded version of protobuf 2.4.1 
> that we already have for Akka.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2927) Add a conf to configure if we always read Binary columns stored in Parquet as String columns

2014-08-15 Thread Teng Qiu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14098573#comment-14098573
 ] 

Teng Qiu commented on SPARK-2927:
-

SPARK-2699 could also be closed; these two tickets are almost the same. Only 
one comment about auto-detection: 
https://github.com/apache/spark/pull/1855#discussion-diff-16294353

> Add a conf to configure if we always read Binary columns stored in Parquet as 
> String columns
> 
>
> Key: SPARK-2927
> URL: https://issues.apache.org/jira/browse/SPARK-2927
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Yin Huai
> Fix For: 1.1.0
>
>
> Based on Parquet spec (https://github.com/Parquet/parquet-format), "strings 
> are stored as byte arrays (binary) with a UTF8 annotation". However, if the 
> data generator does not follow it, we will only read binary values back 
> instead of string values.
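The ticket does not spell out the final option name; assuming it lands as a SQLConf key along the lines of spark.sql.parquet.binaryAsString, usage would look roughly like this sketch (the key name, the path, and the exact SQLContext calls are assumptions):

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("parquet-binary-as-string"))
val sqlContext = new SQLContext(sc)

// Assumed key: ask the Parquet reader to treat un-annotated binary columns
// as UTF-8 strings instead of raw byte arrays.
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")

// Illustrative path only.
val data = sqlContext.parquetFile("/path/to/data.parquet")
data.registerTempTable("t")
sqlContext.sql("SELECT * FROM t LIMIT 10").collect().foreach(println)
{code}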



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1861) ArrayIndexOutOfBoundsException when reading bzip2 files

2014-08-15 Thread sam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14098477#comment-14098477
 ] 

sam commented on SPARK-1861:


OK, so what I need to do is ask my DevOps to upgrade our cluster to 5.1.0.

> ArrayIndexOutOfBoundsException when reading bzip2 files
> ---
>
> Key: SPARK-1861
> URL: https://issues.apache.org/jira/browse/SPARK-1861
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 0.9.0, 1.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> Hadoop uses CBZip2InputStream to decode bzip2 files. However, the 
> implementation is not threadsafe and Spark may run multiple tasks in the same 
> JVM, which leads to this error. This is not a problem for Hadoop MapReduce 
> because Hadoop runs each task in a separate JVM.
> A workaround is to set `SPARK_WORKER_CORES=1` in spark-env.sh for a 
> standalone cluster.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


