[jira] [Commented] (SPARK-3714) Spark workflow scheduler
[ https://issues.apache.org/jira/browse/SPARK-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151387#comment-14151387 ] Egor Pakhomov commented on SPARK-3714: -- Yes, I tried; please see the [Design doc | https://docs.google.com/document/d/1q2Q8Ux-6uAkH7wtLJpc3jz-GfrDEjlbWlXtf20hvguk/edit?usp=sharing] for an explanation of why Oozie is not good enough. Spark workflow scheduler Key: SPARK-3714 URL: https://issues.apache.org/jira/browse/SPARK-3714 Project: Spark Issue Type: New Feature Components: Project Infra Reporter: Egor Pakhomov Priority: Minor [Design doc | https://docs.google.com/document/d/1q2Q8Ux-6uAkH7wtLJpc3jz-GfrDEjlbWlXtf20hvguk/edit?usp=sharing] The Spark stack is currently hard to use in production processes due to the lack of the following features: * Scheduling Spark jobs * Retrying a failed Spark job in a big pipeline * Sharing a context among jobs in a pipeline * Queueing jobs A typical use case for such a platform: wait for new data, process the new data, learn ML models on the new data, compare each model with the previous one, and in case of success overwrite the current production model in its HDFS directory with the new one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3717) DecisionTree, RandomForest: Partition by feature
Joseph K. Bradley created SPARK-3717: Summary: DecisionTree, RandomForest: Partition by feature Key: SPARK-3717 URL: https://issues.apache.org/jira/browse/SPARK-3717 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley h1. Summary Currently, data are partitioned by row/instance for DecisionTree and RandomForest. This JIRA argues for partitioning by feature for training deep trees. This is especially relevant for random forests, which are often trained to be deeper than single decision trees. h1. Details Dataset dimensions and the depth of the tree to be trained are the main problem parameters determining whether it is better to partition features or instances. For random forests (training many deep trees), partitioning features could be much better. Notation: * P = # workers * N = # instances * M = # features * D = depth of tree h2. Partitioning Features Algorithm sketch: * Train one level at a time. * Invariants: ** Each worker stores a mapping: instance → node in current level * On each iteration: ** Each worker: For each node in level, compute (best feature to split, info gain). ** Reduce (P x M) values to M values to find best split for each node. ** Workers who have features used in best splits communicate left/right for relevant instances. Gather total of N bits to master, then broadcast. * Total communication: ** Depth D iterations ** On each iteration, reduce to M values (~8 bytes each), broadcast N values (1 bit each). ** Estimate: D * (M * 8 + N) h2. Partitioning Instances Algorithm sketch: * Train one group of nodes at a time. * Invariants: ** Each worker stores a mapping: instance → node * On each iteration: ** Each worker: For each instance, add to aggregate statistics. ** Aggregate is of size (# nodes in group) x M x (# bins) x (# classes) *** (“# classes” is for classification; 3 for regression) ** Reduce aggregate. ** Master chooses best split for each node in group and broadcasts. * Local training: Once all instances for a node fit on one machine, it can be best to shuffle the data and train subtrees locally. This can mean shuffling the entire dataset for each tree trained. * Summing over all iterations, reduce to total of: ** (# nodes in tree) x M x (# bins B) x (# classes C) values (~8 bytes each) ** Estimate: 2^D * M * B * C * 8 h2. Comparing Partitioning Methods The cost of partitioning features is less than the cost of partitioning instances when: * D * (M * 8 + N) < 2^D * M * B * C * 8 * D * N < 2^D * M * B * C * 8 (assuming D * M * 8 is small compared to the RHS) * N < [ 2^D * M * B * C * 8 ] / D Example: many instances: * 2 million instances, 3500 features, 100 bins, 5 classes, 6 levels (depth = 5) * Partitioning features: 6 * ( 3500 * 8 + 2*10^6 ) =~ 1.2 * 10^7 * Partitioning instances: 32 * 3500 * 100 * 5 * 8 =~ 4.5 * 10^8 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
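As a quick sanity check of the two communication estimates, a few lines of Scala reproduce the example's arithmetic (the formulas come from the description above; the variable names are ours):
{code}
// Example: N = 2e6 instances, M = 3500 features, B = 100 bins, C = 5 classes,
// 6 levels (depth D = 5).
val (n, m, b, c) = (2.0e6, 3500.0, 100.0, 5.0)
val levels = 6                                    // one iteration per level
val byFeature  = levels * (m * 8 + n)             // reduce M values + broadcast N bits per level
val byInstance = math.pow(2, 5) * m * b * c * 8   // ~2^D nodes, each with an M x B x C aggregate
println(f"by feature:  $byFeature%.2e")           // ~1.22e7
println(f"by instance: $byInstance%.2e")          // ~4.48e8
{code}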
[jira] [Updated] (SPARK-3717) DecisionTree, RandomForest: Partition by feature
[ https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-3717: - Description: h1. Summary Currently, data are partitioned by row/instance for DecisionTree and RandomForest. This JIRA argues for partitioning by feature for training deep trees. This is especially relevant for random forests, which are often trained to be deeper than single decision trees. h1. Details Dataset dimensions and the depth of the tree to be trained are the main problem parameters determining whether it is better to partition features or instances. For random forests (training many deep trees), partitioning features could be much better. Notation: * P = # workers * N = # instances * M = # features * D = depth of tree h2. Partitioning Features Algorithm sketch: * Train one level at a time. * Invariants: ** Each worker stores a mapping: instance → node in current level * On each iteration: ** Each worker: For each node in level, compute (best feature to split, info gain). ** Reduce (P x M) values to M values to find best split for each node. ** Workers who have features used in best splits communicate left/right for relevant instances. Gather total of N bits to master, then broadcast. * Total communication: ** Depth D iterations ** On each iteration, reduce to M values (~8 bytes each), broadcast N values (1 bit each). ** Estimate: D * (M * 8 + N) h2. Partitioning Instances Algorithm sketch: * Train one group of nodes at a time. * Invariants: ** Each worker stores a mapping: instance → node * On each iteration: ** Each worker: For each instance, add to aggregate statistics. ** Aggregate is of size (# nodes in group) x M x (# bins) x (# classes) *** (“# classes” is for classification; 3 for regression) ** Reduce aggregate. ** Master chooses best split for each node in group and broadcasts. * Local training: Once all instances for a node fit on one machine, it can be best to shuffle the data and train subtrees locally. This can mean shuffling the entire dataset for each tree trained. * Summing over all iterations, reduce to total of: ** (# nodes in tree) x M x (# bins B) x (# classes C) values (~8 bytes each) ** Estimate: 2^D * M * B * C * 8 h2. Comparing Partitioning Methods The cost of partitioning features is less than the cost of partitioning instances when: * D * (M * 8 + N) < 2^D * M * B * C * 8 * D * N < 2^D * M * B * C * 8 (assuming D * M * 8 is small compared to the right hand side) * N < [ 2^D * M * B * C * 8 ] / D Example: many instances: * 2 million instances, 3500 features, 100 bins, 5 classes, 6 levels (depth = 5) * Partitioning features: 6 * ( 3500 * 8 + 2*10^6 ) =~ 1.2 * 10^7 * Partitioning instances: 32 * 3500 * 100 * 5 * 8 =~ 4.5 * 10^8 was: h1. Summary Currently, data are partitioned by row/instance for DecisionTree and RandomForest. This JIRA argues for partitioning by feature for training deep trees. This is especially relevant for random forests, which are often trained to be deeper than single decision trees. h1. Details Dataset dimensions and the depth of the tree to be trained are the main problem parameters determining whether it is better to partition features or instances. For random forests (training many deep trees), partitioning features could be much better. Notation: * P = # workers * N = # instances * M = # features * D = depth of tree h2. Partitioning Features Algorithm sketch: * Train one level at a time. * Invariants: ** Each worker stores a mapping: instance → node in current level * On each iteration: ** Each worker: For each node in level, compute (best feature to split, info gain). ** Reduce (P x M) values to M values to find best split for each node. ** Workers who have features used in best splits communicate left/right for relevant instances. Gather total of N bits to master, then broadcast. * Total communication: ** Depth D iterations ** On each iteration, reduce to M values (~8 bytes each), broadcast N values (1 bit each). ** Estimate: D * (M * 8 + N) h2. Partitioning Instances Algorithm sketch: * Train one group of nodes at a time. * Invariants: ** Each worker stores a mapping: instance → node * On each iteration: ** Each worker: For each instance, add to aggregate statistics. ** Aggregate is of size (# nodes in group) x M x (# bins) x (# classes) *** (“# classes” is for classification; 3 for regression) ** Reduce aggregate. ** Master chooses best split for each node in group and broadcasts. * Local training: Once all instances for a node fit on one machine, it can be best to shuffle the data and train subtrees locally. This can mean shuffling the entire dataset for each tree trained. * Summing over all iterations, reduce to total of: ** (# nodes in tree) x M x (# bins B) x (# classes C) values (~8 bytes each) ** Estimate: 2^D * M * B * C * 8 h2. Comparing Partitioning Methods The cost of partitioning features is less than the cost of partitioning
[jira] [Created] (SPARK-3718) FsHistoryProvider should consider spark.eventLog.dir not only spark.history.fs.logDirectory
Kousuke Saruta created SPARK-3718: - Summary: FsHistoryProvider should consider spark.eventLog.dir not only spark.history.fs.logDirectory Key: SPARK-3718 URL: https://issues.apache.org/jira/browse/SPARK-3718 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: Kousuke Saruta Priority: Minor This is a minor improvement. FsHistoryProvider reads event logs from the directory given by spark.history.fs.logDirectory, but that directory is usually the same as the one given by spark.eventLog.dir, so we should consider spark.eventLog.dir too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
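For context, the two properties typically point at the same location; a minimal spark-defaults.conf sketch of the setup the issue describes (the HDFS path is a placeholder):
{code}
# Applications write event logs here ...
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///shared/spark-logs
# ... and the history server reads from the same directory.
spark.history.fs.logDirectory    hdfs:///shared/spark-logs
{code}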
[jira] [Closed] (SPARK-3567) appId field in SparkDeploySchedulerBackend should be volatile
[ https://issues.apache.org/jira/browse/SPARK-3567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta closed SPARK-3567. - Resolution: Fixed appId field in SparkDeploySchedulerBackend should be volatile - Key: SPARK-3567 URL: https://issues.apache.org/jira/browse/SPARK-3567 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta Fix For: 1.2.0 The appId field in SparkDeploySchedulerBackend is set by AppClient.ClientActor#receiveWithLogging and read via SparkDeploySchedulerBackend#applicationId. The thread running AppClient.ClientActor and a thread invoking SparkDeploySchedulerBackend#applicationId can be different threads, so appId should be volatile. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
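The fix is the standard Scala idiom for cross-thread visibility of a field; a minimal sketch (field and method names follow the issue; the surrounding class is omitted):
{code}
// Written from the AppClient.ClientActor thread, read from other threads.
// Without @volatile, a reader may never observe the writer's update.
@volatile private var appId: String = _

def applicationId(): String = appId
{code}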
[jira] [Commented] (SPARK-2516) Bootstrapping
[ https://issues.apache.org/jira/browse/SPARK-2516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151416#comment-14151416 ] Yu Ishikawa commented on SPARK-2516: Thank you for assigning me this issue and sharing the paper with me. I understand that we need to abstract the record type and the estimator interface. Bootstrapping - Key: SPARK-2516 URL: https://issues.apache.org/jira/browse/SPARK-2516 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Yu Ishikawa Support re-sampling and bootstrap estimators in MLlib. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
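A bootstrap over an RDD amounts to repeated sampling with replacement; a minimal sketch, where the estimator signature is our assumption rather than the interface this issue will settle on:
{code}
import org.apache.spark.rdd.RDD

// Draw numResamples bootstrap resamples (with replacement, same expected size
// as the input) and apply an arbitrary estimator to each one.
def bootstrap[T](data: RDD[T], numResamples: Int)(estimator: RDD[T] => Double): Seq[Double] =
  (0 until numResamples).map { seed =>
    estimator(data.sample(withReplacement = true, fraction = 1.0, seed = seed.toLong))
  }
{code}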
[jira] [Commented] (SPARK-3718) FsHistoryProvider should consider spark.eventLog.dir not only spark.history.fs.logDirectory
[ https://issues.apache.org/jira/browse/SPARK-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151418#comment-14151418 ] Apache Spark commented on SPARK-3718: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/2573 FsHistoryProvider should consider spark.eventLog.dir not only spark.history.fs.logDirectory --- Key: SPARK-3718 URL: https://issues.apache.org/jira/browse/SPARK-3718 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: Kousuke Saruta Priority: Minor This is a minor improvement. FsHistoryProvider reads event logs from the directory given by spark.history.fs.logDirectory, but that directory is usually the same as the one given by spark.eventLog.dir, so we should consider spark.eventLog.dir too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3714) Spark workflow scheduler
[ https://issues.apache.org/jira/browse/SPARK-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151422#comment-14151422 ] Sean Owen commented on SPARK-3714: -- Another meta-question for everyone: at what point should a project like this simply be a separate add-on project? For example, Oozie is a stand-alone project. Not everything needs to happen directly under the Spark umbrella, which is already broad. One upside to including it in Spark is that it perhaps gets more attention. Spark is then forced to maintain it and keep it compatible, which is also a downside, I suppose. There is also the effect that you create an official workflow engine and discourage others. I am more asking the question than suggesting an answer, but my reaction was that this could live outside Spark just fine. Spark workflow scheduler Key: SPARK-3714 URL: https://issues.apache.org/jira/browse/SPARK-3714 Project: Spark Issue Type: New Feature Components: Project Infra Reporter: Egor Pakhomov Priority: Minor [Design doc | https://docs.google.com/document/d/1q2Q8Ux-6uAkH7wtLJpc3jz-GfrDEjlbWlXtf20hvguk/edit?usp=sharing] The Spark stack is currently hard to use in production processes due to the lack of the following features: * Scheduling Spark jobs * Retrying a failed Spark job in a big pipeline * Sharing a context among jobs in a pipeline * Queueing jobs A typical use case for such a platform: wait for new data, process the new data, learn ML models on the new data, compare each model with the previous one, and in case of success overwrite the current production model in its HDFS directory with the new one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3274) Spark Streaming Java API reports java.lang.ClassCastException when calling collectAsMap on JavaPairDStream
[ https://issues.apache.org/jira/browse/SPARK-3274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151432#comment-14151432 ] Sean Owen commented on SPARK-3274: -- I don't think that's the same thing. It is just saying you are reading a {{SequenceFile}} of {{Text}} and then pretending they are Strings. Are you sure the first {{return}} statement works? They will both work as expected if you just call {{.toString()}} on the {{Text}} objects you are actually operating on. Spark Streaming Java API reports java.lang.ClassCastException when calling collectAsMap on JavaPairDStream -- Key: SPARK-3274 URL: https://issues.apache.org/jira/browse/SPARK-3274 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.2 Reporter: Jack Hu Reproduce code:
{code}
scontext
  .socketTextStream("localhost", 1)
  .mapToPair(new PairFunction<String, String, String>() {
    public Tuple2<String, String> call(String arg0) throws Exception {
      return new Tuple2<String, String>("1", arg0);
    }
  })
  .foreachRDD(new Function2<JavaPairRDD<String, String>, Time, Void>() {
    public Void call(JavaPairRDD<String, String> v1, Time v2) throws Exception {
      System.out.println(v2.toString() + ": " + v1.collectAsMap().toString());
      return null;
    }
  });
{code}
Exception:
{code}
java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Lscala.Tuple2;
  at org.apache.spark.rdd.PairRDDFunctions.collectAsMap(PairRDDFunctions.scala:447)
  at org.apache.spark.api.java.JavaPairRDD.collectAsMap(JavaPairRDD.scala:464)
  at tuk.usecase.failedcall.FailedCall$1.call(FailedCall.java:90)
  at tuk.usecase.failedcall.FailedCall$1.call(FailedCall.java:88)
  at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$2.apply(JavaDStreamLike.scala:282)
  at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$2.apply(JavaDStreamLike.scala:282)
  at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
  at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
  at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
  at scala.util.Try$.apply(Try.scala:161)
  at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
  at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobS
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
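What Sean suggests amounts to keeping the Hadoop {{Text}} types when reading and converting explicitly; a minimal Scala sketch (the path is a placeholder, and {{sc}} is assumed to be a SparkContext):
{code}
import org.apache.hadoop.io.Text

// Read the SequenceFile with its real key/value types, then convert to
// String immediately; Hadoop reuses Text instances, so don't hold on to them.
val pairs = sc.sequenceFile("hdfs:///path/to/seqfile", classOf[Text], classOf[Text])
  .map { case (k, v) => (k.toString, v.toString) }
{code}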
[jira] [Commented] (SPARK-3274) Spark Streaming Java API reports java.lang.ClassCastException when calling collectAsMap on JavaPairDStream
[ https://issues.apache.org/jira/browse/SPARK-3274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151447#comment-14151447 ] Pulkit Bhuwalka commented on SPARK-3274: [~sowen] - you are right. I was making the mistake of reading the sequence file as String instead of Text. Adding toString fixed the problem. Thanks a lot for your help. Spark Streaming Java API reports java.lang.ClassCastException when calling collectAsMap on JavaPairDStream -- Key: SPARK-3274 URL: https://issues.apache.org/jira/browse/SPARK-3274 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.2 Reporter: Jack Hu Reproduce code:
{code}
scontext
  .socketTextStream("localhost", 1)
  .mapToPair(new PairFunction<String, String, String>() {
    public Tuple2<String, String> call(String arg0) throws Exception {
      return new Tuple2<String, String>("1", arg0);
    }
  })
  .foreachRDD(new Function2<JavaPairRDD<String, String>, Time, Void>() {
    public Void call(JavaPairRDD<String, String> v1, Time v2) throws Exception {
      System.out.println(v2.toString() + ": " + v1.collectAsMap().toString());
      return null;
    }
  });
{code}
Exception:
{code}
java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Lscala.Tuple2;
  at org.apache.spark.rdd.PairRDDFunctions.collectAsMap(PairRDDFunctions.scala:447)
  at org.apache.spark.api.java.JavaPairRDD.collectAsMap(JavaPairRDD.scala:464)
  at tuk.usecase.failedcall.FailedCall$1.call(FailedCall.java:90)
  at tuk.usecase.failedcall.FailedCall$1.call(FailedCall.java:88)
  at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$2.apply(JavaDStreamLike.scala:282)
  at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$2.apply(JavaDStreamLike.scala:282)
  at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
  at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
  at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
  at scala.util.Try$.apply(Try.scala:161)
  at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
  at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobS
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3274) Spark Streaming Java API reports java.lang.ClassCastException when calling collectAsMap on JavaPairDStream
[ https://issues.apache.org/jira/browse/SPARK-3274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151447#comment-14151447 ] Pulkit Bhuwalka edited comment on SPARK-3274 at 9/29/14 7:27 AM: - [~sowen] - you are right. I was making the mistake of reading the sequence file as String instead of Text. Adding toString fixed the problem. Thanks a lot for your help. was (Author: pulkit.bosc...@gmail.com): [~sowen] - you are right. I was making the mistake of reading the sequence file as String instead of text. Addind toString fixed the problem. Thanks a lot for your help. Spark Streaming Java API reports java.lang.ClassCastException when calling collectAsMap on JavaPairDStream -- Key: SPARK-3274 URL: https://issues.apache.org/jira/browse/SPARK-3274 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.2 Reporter: Jack Hu Reproduce code:
{code}
scontext
  .socketTextStream("localhost", 1)
  .mapToPair(new PairFunction<String, String, String>() {
    public Tuple2<String, String> call(String arg0) throws Exception {
      return new Tuple2<String, String>("1", arg0);
    }
  })
  .foreachRDD(new Function2<JavaPairRDD<String, String>, Time, Void>() {
    public Void call(JavaPairRDD<String, String> v1, Time v2) throws Exception {
      System.out.println(v2.toString() + ": " + v1.collectAsMap().toString());
      return null;
    }
  });
{code}
Exception:
{code}
java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Lscala.Tuple2;
  at org.apache.spark.rdd.PairRDDFunctions.collectAsMap(PairRDDFunctions.scala:447)
  at org.apache.spark.api.java.JavaPairRDD.collectAsMap(JavaPairRDD.scala:464)
  at tuk.usecase.failedcall.FailedCall$1.call(FailedCall.java:90)
  at tuk.usecase.failedcall.FailedCall$1.call(FailedCall.java:88)
  at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$2.apply(JavaDStreamLike.scala:282)
  at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$2.apply(JavaDStreamLike.scala:282)
  at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
  at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
  at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
  at scala.util.Try$.apply(Try.scala:161)
  at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
  at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobS
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2159) Spark shell exit() does not stop SparkContext
[ https://issues.apache.org/jira/browse/SPARK-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2159. -- Resolution: Won't Fix Fix Version/s: (was: 1.2.0) The discussion in the PR suggests this is WontFix. https://github.com/apache/spark/pull/1230#issuecomment-54045637 Spark shell exit() does not stop SparkContext - Key: SPARK-2159 URL: https://issues.apache.org/jira/browse/SPARK-2159 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Or Priority: Minor If you type exit() in spark shell, it is equivalent to a Ctrl+C and does not stop the SparkContext. This is used very commonly to exit a shell, and it would be good if it is equivalent to Ctrl+D instead, which does stop the SparkContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3714) Spark workflow scheduler
[ https://issues.apache.org/jira/browse/SPARK-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151458#comment-14151458 ] Egor Pakhomov commented on SPARK-3714: -- I agree with your concerns - it should be a project separate from the Spark codebase. But it can't be a general-purpose workflow engine like Oozie: there are too many essential Spark-specific things in the new engine, like supporting a Spark context and providing the capability to write simple jobs in Scala from HUE. I think the best approach for such an engine is the Ooyala job server model - a separate, but Spark-oriented, project. Spark workflow scheduler Key: SPARK-3714 URL: https://issues.apache.org/jira/browse/SPARK-3714 Project: Spark Issue Type: New Feature Components: Project Infra Reporter: Egor Pakhomov Priority: Minor [Design doc | https://docs.google.com/document/d/1q2Q8Ux-6uAkH7wtLJpc3jz-GfrDEjlbWlXtf20hvguk/edit?usp=sharing] The Spark stack is currently hard to use in production processes due to the lack of the following features: * Scheduling Spark jobs * Retrying a failed Spark job in a big pipeline * Sharing a context among jobs in a pipeline * Queueing jobs A typical use case for such a platform: wait for new data, process the new data, learn ML models on the new data, compare each model with the previous one, and in case of success overwrite the current production model in its HDFS directory with the new one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2643) Stages web ui has ERROR when pool name is None
[ https://issues.apache.org/jira/browse/SPARK-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2643. -- Resolution: Fixed Discussion suggests this was fixed by a related change: https://github.com/apache/spark/pull/1854#issuecomment-55061571 Stages web ui has ERROR when pool name is None -- Key: SPARK-2643 URL: https://issues.apache.org/jira/browse/SPARK-2643 Project: Spark Issue Type: Bug Components: Web UI Reporter: YanTang Zhai Priority: Minor 14/07/23 16:01:44 WARN servlet.ServletHandler: /stages/ java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:313) at scala.None$.get(Option.scala:311) at org.apache.spark.ui.jobs.StageTableBase.stageRow(StageTable.scala:132) at org.apache.spark.ui.jobs.StageTableBase.org$apache$spark$ui$jobs$StageTableBase$$renderStageRow(StageTable.scala:150) at org.apache.spark.ui.jobs.StageTableBase$$anonfun$toNodeSeq$1.apply(StageTable.scala:52) at org.apache.spark.ui.jobs.StageTableBase$$anonfun$toNodeSeq$1.apply(StageTable.scala:52) at org.apache.spark.ui.jobs.StageTableBase$$anonfun$stageTable$1.apply(StageTable.scala:61) at org.apache.spark.ui.jobs.StageTableBase$$anonfun$stageTable$1.apply(StageTable.scala:61) at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376) at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376) at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085) at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077) at scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980) at scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980) at scala.collection.immutable.StreamIterator$LazyCell.v$lzycompute(Stream.scala:969) at scala.collection.immutable.StreamIterator$LazyCell.v(Stream.scala:969) at scala.collection.immutable.StreamIterator.hasNext(Stream.scala:974) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.xml.NodeBuffer.$amp$plus(NodeBuffer.scala:38) at scala.xml.NodeBuffer.$amp$plus(NodeBuffer.scala:40) at org.apache.spark.ui.jobs.StageTableBase.stageTable(StageTable.scala:60) at org.apache.spark.ui.jobs.StageTableBase.toNodeSeq(StageTable.scala:52) at org.apache.spark.ui.jobs.JobProgressPage.render(JobProgressPage.scala:91) at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:65) at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:65) at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:70) at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) at org.eclipse.jetty.server.Server.handle(Server.java:370) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494) at 
org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971) at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82) at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667) at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) at
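The failure above is a bare {{None.get}} on a stage's missing pool name. Whatever the actual fix in the linked PR was, the usual defensive pattern for this class of bug looks like the following sketch (the map and field names are hypothetical):
{code}
// Fall back to a default instead of calling .get on a possibly-missing value.
val poolName = stageIdToPoolName.get(stage.stageId).getOrElse("default")
{code}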
[jira] [Resolved] (SPARK-1208) after some hours of working the :4040 monitoring UI stops working.
[ https://issues.apache.org/jira/browse/SPARK-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1208. -- Resolution: Fixed This appears to be a similar, if not the same issue, as in SPARK-2643. The discussion in the PR indicates this was resolved by a subsequent change: https://github.com/apache/spark/pull/1854#issuecomment-55061571 after some hours of working the :4040 monitoring UI stops working. -- Key: SPARK-1208 URL: https://issues.apache.org/jira/browse/SPARK-1208 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 0.9.0 Reporter: Tal Sliwowicz This issue is inconsistent, but it did not exist in prior versions. The Driver app otherwise works normally. The log file below is from the driver. 2014-03-09 07:24:55,837 WARN [qtp1187052686-17453] AbstractHttpConnection - /stages/ java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:313) at scala.None$.get(Option.scala:311) at org.apache.spark.ui.jobs.StageTable.org$apache$spark$ui$jobs$StageTable$$stageRow(StageTable.scala:114) at org.apache.spark.ui.jobs.StageTable$$anonfun$toNodeSeq$1.apply(StageTable.scala:39) at org.apache.spark.ui.jobs.StageTable$$anonfun$toNodeSeq$1.apply(StageTable.scala:39) at org.apache.spark.ui.jobs.StageTable$$anonfun$stageTable$1.apply(StageTable.scala:57) at org.apache.spark.ui.jobs.StageTable$$anonfun$stageTable$1.apply(StageTable.scala:57) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.ui.jobs.StageTable.stageTable(StageTable.scala:57) at org.apache.spark.ui.jobs.StageTable.toNodeSeq(StageTable.scala:39) at org.apache.spark.ui.jobs.IndexPage.render(IndexPage.scala:81) at org.apache.spark.ui.jobs.JobProgressUI$$anonfun$getHandlers$3.apply(JobProgressUI.scala:59) at org.apache.spark.ui.jobs.JobProgressUI$$anonfun$getHandlers$3.apply(JobProgressUI.scala:59) at org.apache.spark.ui.JettyUtils$$anon$1.handle(JettyUtils.scala:61) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1040) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:976) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) at org.eclipse.jetty.server.Server.handle(Server.java:363) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:483) at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:920) at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:982) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:635) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82) at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:628) at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52) at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) at java.lang.Thread.run(Thread.java:662) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3203) ClassNotFoundException in spark-shell with Cassandra
[ https://issues.apache.org/jira/browse/SPARK-3203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3203: - Summary: ClassNotFoundException in spark-shell with Cassandra (was: ClassNotFound Exception) ClassNotFoundException in spark-shell with Cassandra Key: SPARK-3203 URL: https://issues.apache.org/jira/browse/SPARK-3203 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Environment: Ubuntu 12.04, openjdk 64 bit 7u65 Reporter: Rohit Kumar I am using Spark as a processing engine over Cassandra. I have only one master and one worker node. I am executing the following code in spark-shell:
{code}
sc.stop
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import com.datastax.spark.connector._

val conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext("spark://L-BXP44Z1:7077", "Cassandra Connector Test", conf)
val rdd = sc.cassandraTable("test", "kv")
println(rdd.map(_.getInt("value")).sum)
{code}
I am getting the following error: 14/08/25 18:47:17 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address 14/08/25 18:49:39 INFO CoarseGrainedExecutorBackend: Got assigned task 0 14/08/25 18:49:39 INFO Executor: Running task ID 0 14/08/25 18:49:39 ERROR Executor: Exception in task ID 0 java.lang.ClassNotFoundException: $line29.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1 at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:270) at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:60) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at scala.collection.immutable.$colon$colon.readObject(List.scala:362) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63) at org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61) at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141) at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at
[jira] [Updated] (SPARK-1381) Spark to Shark direct streaming
[ https://issues.apache.org/jira/browse/SPARK-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-1381: - Priority: Major (was: Blocker) It sounds like this is WontFix at this point, if there was a problem to begin with, as Shark is deprecated. Spark to Shark direct streaming --- Key: SPARK-1381 URL: https://issues.apache.org/jira/browse/SPARK-1381 Project: Spark Issue Type: Question Components: Documentation, Examples, Input/Output, Java API, Spark Core Affects Versions: 0.8.1 Reporter: Abhishek Tripathi Labels: performance Hi, I'm trying to push data coming from Spark Streaming to a Shark cache table. I thought of using the JDBC API, but Shark (0.8.1) does not support a direct insert statement, i.e. insert into emp values(2, 'Apia'). I don't want to store the Spark Streaming output in HDFS and then copy that data to the Shark table. Can somebody please help: 1. How can I directly point Spark Streaming data to a Shark table/cached table? Alternatively, how can Shark pick up data directly from Spark Streaming? 2. Does Shark 0.8.1 have a direct insert statement without referring to another table? This is really stopping us from using Spark any further; we need your assistance urgently. Thanks in advance. Abhishek -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1381) Spark to Shark direct streaming
[ https://issues.apache.org/jira/browse/SPARK-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1381. -- Resolution: Won't Fix Spark to Shark direct streaming --- Key: SPARK-1381 URL: https://issues.apache.org/jira/browse/SPARK-1381 Project: Spark Issue Type: Question Components: Documentation, Examples, Input/Output, Java API, Spark Core Affects Versions: 0.8.1 Reporter: Abhishek Tripathi Labels: performance Hi, I'm trying to push data coming from Spark Streaming to a Shark cache table. I thought of using the JDBC API, but Shark (0.8.1) does not support a direct insert statement, i.e. insert into emp values(2, 'Apia'). I don't want to store the Spark Streaming output in HDFS and then copy that data to the Shark table. Can somebody please help: 1. How can I directly point Spark Streaming data to a Shark table/cached table? Alternatively, how can Shark pick up data directly from Spark Streaming? 2. Does Shark 0.8.1 have a direct insert statement without referring to another table? This is really stopping us from using Spark any further; we need your assistance urgently. Thanks in advance. Abhishek -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3714) Spark workflow scheduler
[ https://issues.apache.org/jira/browse/SPARK-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151486#comment-14151486 ] Mridul Muralidharan commented on SPARK-3714: Most of the drawbacks mentioned are not severe IMO - at best, they reflect unfamiliarity with the Oozie platform (points 2, 3, 4, 5). Point 1 is interesting (sharing a Spark context) - though from a fault-tolerance point of view, supporting it is challenging; of course, Oozie was probably not designed with something like Spark in mind - so there might be changes to Oozie which would benefit Spark; we could engage with Oozie dev for that. But discarding it to reinvent something when Oozie already does everything mentioned in the requirements section seems counterintuitive. I have seen multiple attempts to 'simplify' workflow management, and at production scale almost everything ends up being similar ... Note that most production jobs have to depend on a variety of jobs - not just Spark or MR - so you will end up converging on a variant of Oozie anyway :-) Having said that, if you want to take a crack at solving this with Spark-specific idioms in mind, it would be interesting to see the result - I don't want to dissuade you from doing so! We might end up with something quite interesting. Spark workflow scheduler Key: SPARK-3714 URL: https://issues.apache.org/jira/browse/SPARK-3714 Project: Spark Issue Type: New Feature Components: Project Infra Reporter: Egor Pakhomov Priority: Minor [Design doc | https://docs.google.com/document/d/1q2Q8Ux-6uAkH7wtLJpc3jz-GfrDEjlbWlXtf20hvguk/edit?usp=sharing] The Spark stack is currently hard to use in production processes due to the lack of the following features: * Scheduling Spark jobs * Retrying a failed Spark job in a big pipeline * Sharing a context among jobs in a pipeline * Queueing jobs A typical use case for such a platform: wait for new data, process the new data, learn ML models on the new data, compare each model with the previous one, and in case of success overwrite the current production model in its HDFS directory with the new one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3719) Spark UI: complete/failed stages is better to show the total number of stages
uncleGen created SPARK-3719: --- Summary: Spark UI: complete/failed stages is better to show the total number of stages Key: SPARK-3719 URL: https://issues.apache.org/jira/browse/SPARK-3719 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: uncleGen Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3719) Spark UI: complete/failed stages is better to show the total number of stages
[ https://issues.apache.org/jira/browse/SPARK-3719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151490#comment-14151490 ] Apache Spark commented on SPARK-3719: - User 'uncleGen' has created a pull request for this issue: https://github.com/apache/spark/pull/2574 Spark UI: complete/failed stages is better to show the total number of stages Key: SPARK-3719 URL: https://issues.apache.org/jira/browse/SPARK-3719 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: uncleGen Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3582) Spark SQL having issue with existing Hive UDFs which take Map as a parameter
[ https://issues.apache.org/jira/browse/SPARK-3582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151511#comment-14151511 ] Saurabh Santhosh commented on SPARK-3582: - Issue resolved by pull request: https://github.com/apache/spark/pull/2506 Spark SQL having issue with existing Hive UDFs which take Map as a parameter Key: SPARK-3582 URL: https://issues.apache.org/jira/browse/SPARK-3582 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Saurabh Santhosh Assignee: Adrian Wang Fix For: 1.2.0 I have a UDF with the following evaluate method: public Text evaluate(Text argument, Map<Text, Text> params) And when I tried invoking this UDF, I got the following error: scala.MatchError: interface java.util.Map (of class java.lang.Class) at org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:35) at org.apache.spark.sql.hive.HiveFunctionRegistry.javaClassToDataType(hiveUdfs.scala:37) I had a look at HiveInspectors.scala and was not able to see any resolver for java.util.Map. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3601) Kryo NPE for output operations on Avro complex Objects even after registering.
[ https://issues.apache.org/jira/browse/SPARK-3601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mohan gaddam updated SPARK-3601: Shepherd: Reynold Xin Kryo NPE for output operations on Avro complex Objects even after registering. -- Key: SPARK-3601 URL: https://issues.apache.org/jira/browse/SPARK-3601 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: local, standalone cluster Reporter: mohan gaddam The Kryo serializer works well when Avro objects have simple data, but when the same Avro object has complex data (like unions/arrays), Kryo fails during output operations; map operations are fine. Note that I have registered all the Avro-generated classes with Kryo. I'm using Java as the programming language. When a complex message is used, it throws an NPE; stack trace as follows: == ERROR scheduler.JobScheduler: Error running job streaming job 1411043845000 ms.0 org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException Serialization trace: value (xyz.Datum) data (xyz.ResMsg) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) == In the above exception, Datum and ResMsg are project-specific classes generated by Avro from the following avdl snippet:
{code}
record KeyValueObject {
  union {boolean, int, long, float, double, bytes, string} name;
  union {boolean, int, long, float, double, bytes, string,
         array<union{boolean, int, long, float, double, bytes, string, KeyValueObject}>,
         KeyValueObject} value;
}
record Datum {
  union {boolean, int, long, float, double, bytes, string,
         array<union{boolean, int, long, float, double, bytes, string, KeyValueObject}>,
         KeyValueObject} value;
}
record ResMsg {
  string version;
  string sequence;
  string resourceGUID;
  string GWID;
  string GWTimestamp;
  union {Datum, array<Datum>} data;
}
{code}
Avro message samples are as follows:
{code}
1) {"version": "01", "sequence": "1", "resourceGUID": "001", "GWID": "002", "GWTimestamp": "1409823150737", "data": {"value": 30}}
2) {"version": "01", "sequence": "1", "resource": "sensor-001", "controller": "002", "controllerTimestamp": "1411038710358", "data": {"value": [{"name": "Temperature", "value": 30}, {"name": "Speed", "value": 60}, {"name": "Location", "value": ["+401213.1", "-0750015.1"]}, {"name": "Timestamp", "value": "2014-09-09T08:15:25-05:00"}]}}
{code}
Both 1 and 2 adhere to the Avro schema, so the decoder is able to convert them into Avro objects in the Spark Streaming API. BTW, the messages were pulled from a Kafka source and decoded using a Kafka decoder. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
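For reference, the registration the reporter describes is typically done through a KryoRegistrator; a minimal sketch using the record names from the avdl above (the registrator's package and name are hypothetical):
{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// Register the Avro-generated classes so Kryo doesn't fall back to
// default field serialization of unregistered types.
class AvroKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[KeyValueObject])
    kryo.register(classOf[Datum])
    kryo.register(classOf[ResMsg])
  }
}

// Driver side:
// conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
//     .set("spark.kryo.registrator", "my.pkg.AvroKryoRegistrator")
{code}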
[jira] [Commented] (SPARK-2626) Stop SparkContext in all examples
[ https://issues.apache.org/jira/browse/SPARK-2626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151575#comment-14151575 ] Apache Spark commented on SPARK-2626: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/2575 Stop SparkContext in all examples - Key: SPARK-2626 URL: https://issues.apache.org/jira/browse/SPARK-2626 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Andrew Or Fix For: 1.2.0 Event logs rely on sc.stop() to close the file. If this is never closed, the history server will not be able to find the logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
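Whether or not the linked PR wraps example bodies in try/finally, the point of the fix is that every example must call sc.stop(); a minimal sketch of the pattern:
{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("Example"))
try {
  // ... example body ...
} finally {
  sc.stop()  // closes the event log so the history server can find it
}
{code}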
[jira] [Commented] (SPARK-2805) update akka to version 2.3
[ https://issues.apache.org/jira/browse/SPARK-2805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151646#comment-14151646 ] Quinton Anderson commented on SPARK-2805: - Any progress on this front? update akka to version 2.3 -- Key: SPARK-2805 URL: https://issues.apache.org/jira/browse/SPARK-2805 Project: Spark Issue Type: Sub-task Components: Build, Spark Core Reporter: Anand Avati akka-2.3 is the lowest version available for Scala 2.11. akka-2.3 depends on protobuf 2.5, while Hadoop-1 requires protobuf 2.4.1. In order to reconcile the conflicting dependencies, we need to release an akka-2.3.x-shaded-protobuf artifact which has protobuf 2.5 within. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1313) Shark- JDBC driver
[ https://issues.apache.org/jira/browse/SPARK-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-1313: - Priority: Minor (was: Blocker) Issue Type: Question (was: Task) Shark- JDBC driver --- Key: SPARK-1313 URL: https://issues.apache.org/jira/browse/SPARK-1313 Project: Spark Issue Type: Question Components: Documentation, Examples, Java API Reporter: Abhishek Tripathi Priority: Minor Labels: Hive, JDBC, Shark Hi, I'm trying to get a JDBC (or any) driver that can connect to Shark from Java and execute Shark/Hive queries. Can you please advise whether such a connector/driver is available? Thanks Abhishek -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1313) Shark- JDBC driver
[ https://issues.apache.org/jira/browse/SPARK-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1313. -- Resolution: Not a Problem This looks like it was a question more than anything, and was answered. Shark- JDBC driver --- Key: SPARK-1313 URL: https://issues.apache.org/jira/browse/SPARK-1313 Project: Spark Issue Type: Question Components: Documentation, Examples, Java API Reporter: Abhishek Tripathi Priority: Minor Labels: Hive, JDBC, Shark Hi, I'm trying to get a JDBC (or any) driver that can connect to Shark from Java and execute Shark/Hive queries. Can you please advise whether such a connector/driver is available? Thanks Abhishek -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1884) Shark failed to start
[ https://issues.apache.org/jira/browse/SPARK-1884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1884. -- Resolution: Won't Fix This appears to be a protobuf version mismatch, which suggests Shark is being used with an unsupported version of Hadoop. As Shark is deprecated and unlikely to take steps to support anything else -- and because there is a sort of clear path to a workaround here if one cared to -- I think this is a WontFix too. Shark failed to start - Key: SPARK-1884 URL: https://issues.apache.org/jira/browse/SPARK-1884 Project: Spark Issue Type: Bug Affects Versions: 0.9.1 Environment: ubuntu 14.04, spark 0.9.1, hive 0.13.0, hadoop 2.4.0 (stand alone), scala 2.11.0 Reporter: Wei Cui Priority: Blocker Hadoop, Hive, and Spark work fine. When starting Shark, it fails with the following messages: Starting the Shark Command Line Client 14/05/19 16:47:21 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive 14/05/19 16:47:21 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize 14/05/19 16:47:21 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize 14/05/19 16:47:21 INFO Configuration.deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack 14/05/19 16:47:21 INFO Configuration.deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node 14/05/19 16:47:21 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces 14/05/19 16:47:21 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative 14/05/19 16:47:22 WARN conf.Configuration: org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@48c724c:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring. 14/05/19 16:47:22 WARN conf.Configuration: org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@48c724c:an attempt to override final parameter: mapreduce.cluster.local.dir; Ignoring. 14/05/19 16:47:22 WARN conf.Configuration: org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@48c724c:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring. 14/05/19 16:47:22 WARN conf.Configuration: org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@48c724c:an attempt to override final parameter: mapreduce.cluster.temp.dir; Ignoring. Logging initialized using configuration in jar:file:/usr/local/shark/lib_managed/jars/edu.berkeley.cs.shark/hive-common/hive-common-0.11.0-shark-0.9.1.jar!/hive-log4j.properties Hive history file=/tmp/root/hive_job_log_root_14857@ubuntu_201405191647_897494215.txt 6.004: [GC 279616K->18440K(1013632K), 0.0438980 secs] 6.445: [Full GC 59125K->7949K(1013632K), 0.0685160 secs] Reloading cached RDDs from previous Shark sessions... 
(use -skipRddReload flag to skip reloading) 7.535: [Full GC 104136K->13059K(1013632K), 0.0885820 secs] 8.459: [Full GC 61237K->18031K(1013632K), 0.0820400 secs] 8.662: [Full GC 29832K->8958K(1013632K), 0.0869700 secs] 8.751: [Full GC 13433K->8998K(1013632K), 0.0856520 secs] 10.435: [Full GC 72246K->14140K(1013632K), 0.1797530 secs] Exception in thread "main" org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1072) at shark.memstore2.TableRecovery$.reloadRdds(TableRecovery.scala:49) at shark.SharkCliDriver.<init>(SharkCliDriver.scala:283) at shark.SharkCliDriver$.main(SharkCliDriver.scala:162) at shark.SharkCliDriver.main(SharkCliDriver.scala) Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1139) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:51) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:61) at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2288) at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2299) at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1070) ... 4 more Caused by:
[jira] [Commented] (SPARK-3687) Spark hang while processing more than 100 sequence files
[ https://issues.apache.org/jira/browse/SPARK-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151683#comment-14151683 ] Yi Tian commented on SPARK-3687: The stack you printed is from the worker process. I think you should print the stack of the hung executor. Spark hang while processing more than 100 sequence files Key: SPARK-3687 URL: https://issues.apache.org/jira/browse/SPARK-3687 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Ziv Huang In my application, I read more than 100 sequence files into a JavaPairRDD, perform a flatMap to get another JavaRDD, and then use takeOrdered to get the result. Quite often (but not always), the job hangs while executing somewhere around the 120th-150th task. In 1.0.2, the job can hang for several hours, maybe forever (I can't wait for it to complete). When the job hangs, I can't kill it from the web UI. In 1.1.0, the job hangs for a couple of minutes (about 3 minutes, actually), and then the Spark master's web UI shows the job as finished with state FAILED. In addition, the job's stage web UI still hangs, and the execution duration keeps accumulating. For both 1.0.2 and 1.1.0, the job hangs with no error messages anywhere. The current workaround is to use coalesce to reduce the number of partitions to be processed. I never see a job hang if the number of partitions to be processed is no greater than 100. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
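For anyone hitting this: the executor runs as its own JVM (CoarseGrainedExecutorBackend), so the stack must be taken from that process on the worker node, not from the worker daemon. A small illustrative Python helper, assuming the JDK's jstack is on the PATH; the helper name and output path are made up for the example:
{noformat}
# Illustrative helper: dump the stack of a hung executor JVM.
# The executor process is CoarseGrainedExecutorBackend, not the Worker;
# find its pid with jps or ps, then run jstack against it.
import subprocess

def dump_executor_stack(pid, out_path='executor_stack.txt'):
    stack = subprocess.check_output(['jstack', str(pid)])
    with open(out_path, 'w') as f:
        f.write(stack)
{noformat}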
[jira] [Commented] (SPARK-3685) Spark's local dir should accept only local paths
[ https://issues.apache.org/jira/browse/SPARK-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151691#comment-14151691 ] Matthew Farrellee commented on SPARK-3685: -- [~andrewor] thanks for the info. afaik the executor is also in charge of the shuffle-file life-cycle, and breaking that would be complicated. it's probably a cleaner implementation to allow executors to remain and use a policy to prune unused/little-used executors, where "unused/little-used" factors in the amount of data they are holding as well as the CPU they use. you could also go down the path of aging out executors - let their resources go back to the node's pool for reallocation, but don't kill off the process. however, approaches like that become very complex and push implementation details of the workload, which often don't generalize, into the scheduling system. [~andrewor] btw, it should be a warning case (hey, you might have messed up - I see you used hdfs:/ in your file name) instead of an error case. Spark's local dir should accept only local paths Key: SPARK-3685 URL: https://issues.apache.org/jira/browse/SPARK-3685 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.1.0 Reporter: Andrew Or When you try to set local dirs to hdfs:/tmp/foo it doesn't work. What it will try to do is create a folder called hdfs: and put tmp inside it. This is because Util#getOrCreateLocalRootDirs uses java.io.File instead of Hadoop's file system to parse this path. We also need to resolve the path appropriately. This may not have an urgent use case, but it fails silently and does what is least expected. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
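A sketch of the suggested warning behavior (not Spark's actual implementation): parse each configured local dir and warn when it carries a non-local scheme. Function name and messages are illustrative:
{noformat}
# Sketch of the suggested check (not Spark's actual code): warn when a
# configured local dir looks like a non-local URI such as hdfs:/tmp/foo.
from urlparse import urlparse  # Python 2; urllib.parse in Python 3

def check_local_dirs(local_dirs):
    for d in local_dirs.split(','):
        scheme = urlparse(d.strip()).scheme
        if scheme and scheme != 'file':
            print "WARN: spark.local.dir '%s' uses scheme '%s:'; " \
                  "local dirs must be local filesystem paths" % (d, scheme)

check_local_dirs('/tmp/spark,hdfs:/tmp/foo')
{noformat}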
[jira] [Created] (SPARK-3720) support ORC in spark sql
wangfei created SPARK-3720: -- Summary: support ORC in spark sql Key: SPARK-3720 URL: https://issues.apache.org/jira/browse/SPARK-3720 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.1.0 Reporter: wangfei The Optimized Row Columnar (ORC) file format provides a highly efficient way to store data on HDFS. The ORC file format has many advantages, such as: 1. a single file as the output of each task, which reduces the NameNode's load; 2. Hive type support including datetime, decimal, and the complex types (struct, list, map, and union); 3. light-weight indexes stored within the file, making it possible to skip row groups that don't pass predicate filtering and to seek to a given row; 4. block-mode compression based on data type, with run-length encoding for integer columns and dictionary encoding for string columns; 5. concurrent reads of the same file using separate RecordReaders; 6. the ability to split files without scanning for markers; 7. bounded memory requirements for reading or writing; 8. metadata stored using Protocol Buffers, which allows addition and removal of fields. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
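Until native support lands, ORC is already reachable from Spark SQL through HiveQL in a HiveContext. A minimal sketch (table names are illustrative); native support would presumably add a direct API analogous to the existing Parquet one:
{noformat}
# Sketch: using ORC through HiveQL today (table names are illustrative);
# native Spark SQL support would add a direct API like the Parquet one.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName='orc_sketch')
hc = HiveContext(sc)
hc.sql("CREATE TABLE logs_orc (ts STRING, msg STRING) STORED AS ORC")
hc.sql("INSERT OVERWRITE TABLE logs_orc SELECT ts, msg FROM logs_text")
print hc.sql("SELECT count(*) FROM logs_orc").collect()
{noformat}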
[jira] [Updated] (SPARK-3720) support ORC in spark sql
[ https://issues.apache.org/jira/browse/SPARK-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangfei updated SPARK-3720: --- Description: The Optimized Row Columnar (ORC) file format provides a highly efficient way to store data on HDFS. The ORC file format has many advantages, such as: 1. a single file as the output of each task, which reduces the NameNode's load; 2. Hive type support including datetime, decimal, and the complex types (struct, list, map, and union); 3. light-weight indexes stored within the file, making it possible to skip row groups that don't pass predicate filtering and to seek to a given row; 4. block-mode compression based on data type, with run-length encoding for integer columns and dictionary encoding for string columns; 5. concurrent reads of the same file using separate RecordReaders; 6. the ability to split files without scanning for markers; 7. bounded memory requirements for reading or writing; 8. metadata stored using Protocol Buffers, which allows addition and removal of fields. Spark SQL currently supports Parquet; supporting ORC as well would give users more options. was: The Optimized Row Columnar (ORC) file format provides a highly efficient way to store data on HDFS. The ORC file format has many advantages, such as: 1. a single file as the output of each task, which reduces the NameNode's load; 2. Hive type support including datetime, decimal, and the complex types (struct, list, map, and union); 3. light-weight indexes stored within the file, making it possible to skip row groups that don't pass predicate filtering and to seek to a given row; 4. block-mode compression based on data type, with run-length encoding for integer columns and dictionary encoding for string columns; 5. concurrent reads of the same file using separate RecordReaders; 6. the ability to split files without scanning for markers; 7. bounded memory requirements for reading or writing; 8. metadata stored using Protocol Buffers, which allows addition and removal of fields. support ORC in spark sql Key: SPARK-3720 URL: https://issues.apache.org/jira/browse/SPARK-3720 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.1.0 Reporter: wangfei The Optimized Row Columnar (ORC) file format provides a highly efficient way to store data on HDFS. The ORC file format has many advantages, such as: 1. a single file as the output of each task, which reduces the NameNode's load; 2. Hive type support including datetime, decimal, and the complex types (struct, list, map, and union); 3. light-weight indexes stored within the file, making it possible to skip row groups that don't pass predicate filtering and to seek to a given row; 4. block-mode compression based on data type, with run-length encoding for integer columns and dictionary encoding for string columns; 5. concurrent reads of the same file using separate RecordReaders; 6. the ability to split files without scanning for markers; 7. bounded memory requirements for reading or writing; 8. metadata stored using Protocol Buffers, which allows addition and removal of fields. Spark SQL currently supports Parquet; supporting ORC as well would give users more options. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3720) support ORC in spark sql
[ https://issues.apache.org/jira/browse/SPARK-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151727#comment-14151727 ] Apache Spark commented on SPARK-3720: - User 'scwf' has created a pull request for this issue: https://github.com/apache/spark/pull/2576 support ORC in spark sql Key: SPARK-3720 URL: https://issues.apache.org/jira/browse/SPARK-3720 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.1.0 Reporter: wangfei The Optimized Row Columnar (ORC) file format provides a highly efficient way to store data on HDFS. The ORC file format has many advantages, such as: 1. a single file as the output of each task, which reduces the NameNode's load; 2. Hive type support including datetime, decimal, and the complex types (struct, list, map, and union); 3. light-weight indexes stored within the file, making it possible to skip row groups that don't pass predicate filtering and to seek to a given row; 4. block-mode compression based on data type, with run-length encoding for integer columns and dictionary encoding for string columns; 5. concurrent reads of the same file using separate RecordReaders; 6. the ability to split files without scanning for markers; 7. bounded memory requirements for reading or writing; 8. metadata stored using Protocol Buffers, which allows addition and removal of fields. Spark SQL currently supports Parquet; supporting ORC as well would give users more options. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3522) Make spark-ec2 verbosity configurable
[ https://issues.apache.org/jira/browse/SPARK-3522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151787#comment-14151787 ] Nicholas Chammas commented on SPARK-3522: - Always logging to a file sounds like a good idea. As for changing the shell scripts, doesn't their output get to the user's screen by way of [this {{check_call()}}|https://github.com/apache/spark/blob/657bdff41a27568a981b3e342ad380fe92aa08a0/ec2/spark_ec2.py#L784]? It should be possible to suppress or redirect that output in Python without modifying the shell scripts (here's [an example|https://github.com/apache/spark/pull/2339/files#diff-ada66bbeb2f1327b508232ef6c3805a5R613]). Make spark-ec2 verbosity configurable - Key: SPARK-3522 URL: https://issues.apache.org/jira/browse/SPARK-3522 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor When launching a cluster, {{spark-ec2}} spits out a lot of stuff that feels like debug output. It would be better for the user if {{spark-ec2}} did the following: * default to info output level * allow option to increase verbosity and include debug output This will require converting most of the {{print}} statements in the script to use Python's {{logging}} module and setting output levels ({{INFO}}, {{WARN}}, {{DEBUG}}) for each statement. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
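A sketch of what the proposed conversion could look like (handler setup and file path are illustrative, not the actual patch): print statements become logging calls, the console handler's level is driven by a verbosity flag, and a file handler always captures everything:
{noformat}
# Sketch of the proposed change (names and paths are illustrative):
# console verbosity is configurable, while a file always gets full output.
import logging

def setup_logging(verbose=False):
    root = logging.getLogger()
    root.setLevel(logging.DEBUG)
    console = logging.StreamHandler()
    console.setLevel(logging.DEBUG if verbose else logging.INFO)
    root.addHandler(console)
    logfile = logging.FileHandler('spark-ec2.log')  # always log everything
    logfile.setLevel(logging.DEBUG)
    root.addHandler(logfile)

setup_logging(verbose=False)
logging.info('Launching cluster...')   # was: print 'Launching cluster...'
logging.debug('Raw SSH command: ...')  # hidden unless --verbose is passed
{noformat}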
[jira] [Commented] (SPARK-3627) spark on yarn reports success even though job fails
[ https://issues.apache.org/jira/browse/SPARK-3627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151784#comment-14151784 ] Apache Spark commented on SPARK-3627: - User 'tgravescs' has created a pull request for this issue: https://github.com/apache/spark/pull/2577 spark on yarn reports success even though job fails --- Key: SPARK-3627 URL: https://issues.apache.org/jira/browse/SPARK-3627 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Thomas Graves Assignee: Thomas Graves Priority: Critical I was running a wordcount job and saving the output to HDFS. If the output directory already exists, YARN reports success even though the job fails, since the job requires that the output directory not exist. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
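A minimal PySpark repro sketch of the failure mode described (paths are illustrative): the save raises because the directory already exists, yet the YARN application status still shows SUCCEEDED:
{noformat}
# Minimal repro sketch (paths are illustrative): saveAsTextFile fails with
# FileAlreadyExistsException when the output directory exists, but YARN
# still reports the application as SUCCEEDED.
from pyspark import SparkContext

sc = SparkContext(appName='wordcount_repro')
counts = (sc.textFile('hdfs:///tmp/input.txt')
            .flatMap(lambda line: line.split())
            .map(lambda w: (w, 1))
            .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile('hdfs:///tmp/existing_output')
{noformat}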
[jira] [Created] (SPARK-3721) Broadcast Variables above 2GB break in PySpark
Brad Miller created SPARK-3721: -- Summary: Broadcast Variables above 2GB break in PySpark Key: SPARK-3721 URL: https://issues.apache.org/jira/browse/SPARK-3721 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Reporter: Brad Miller Attempting to reproduce the bug in isolation in an IPython notebook, I've observed the following 3 unique failure modes, all of which seem to be related to broadcast variable size. Note that I'm running Python 2.7.3 on all machines and using the Spark 1.1.0 binaries.
**BLOCK 1** [no problem]
{noformat}
import cPickle
from pyspark import SparkContext

def check_pre_serialized(size):
    msg = cPickle.dumps(range(2 ** size))
    print 'serialized length:', len(msg)
    bvar = sc.broadcast(msg)
    print 'length recovered from broadcast variable:', len(bvar.value)
    print 'correct value recovered:', msg == bvar.value
    bvar.unpersist()

def check_unserialized(size):
    msg = range(2 ** size)
    bvar = sc.broadcast(msg)
    print 'correct value recovered:', msg == bvar.value
    bvar.unpersist()

SparkContext.setSystemProperty('spark.executor.memory', '15g')
SparkContext.setSystemProperty('spark.cores.max', '5')
sc = SparkContext('spark://crosby.research.intel-research.net:7077', 'broadcast_bug')
{noformat}
**BLOCK 2** [no problem]
{noformat}
check_pre_serialized(20)
serialized length: 9374656
length recovered from broadcast variable: 9374656
correct value recovered: True
{noformat}
**BLOCK 3** [no problem]
{noformat}
check_unserialized(20)
correct value recovered: True
{noformat}
**BLOCK 4** [no problem]
{noformat}
check_pre_serialized(27)
serialized length: 1499501632
length recovered from broadcast variable: 1499501632
correct value recovered: True
{noformat}
**BLOCK 5** [no problem]
{noformat}
check_unserialized(27)
correct value recovered: True
{noformat}
**BLOCK 6** [ERROR 1: unhandled error from cPickle.dumps inside sc.broadcast]
{noformat}
check_pre_serialized(28)
.
/home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj)
    354
    355     def dumps(self, obj):
--> 356         return cPickle.dumps(obj, 2)
    357
    358     loads = cPickle.loads
SystemError: error return without exception set
{noformat}
**BLOCK 7** [no problem]
{noformat}
check_unserialized(28)
correct value recovered: True
{noformat}
**BLOCK 8** [ERROR 2: no error occurs and *incorrect result* is returned]
{noformat}
check_pre_serialized(29)
serialized length: 6331339840
length recovered from broadcast variable: 2036372544
correct value recovered: False
{noformat}
**BLOCK 9** [ERROR 3: unhandled error from zlib.compress inside sc.broadcast]
{noformat}
check_unserialized(29)
..
/home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj)
    418
    419     def dumps(self, obj):
--> 420         return zlib.compress(self.serializer.dumps(obj), 1)
    421
    422     def loads(self, obj):
OverflowError: size does not fit in an int
{noformat}
**BLOCK 10** [ERROR 1]
{noformat}
check_pre_serialized(30)
...same as above...
{noformat}
**BLOCK 11** [ERROR 3]
{noformat}
check_unserialized(30)
...same as above...
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
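A possible workaround sketch while the underlying limit stands (helper names are hypothetical, not part of the reporter's code): split the pickled payload into chunks below 2GB and broadcast each chunk separately:
{noformat}
# Hypothetical workaround: keep every individual broadcast payload well
# under the 2GB limits observed above by chunking the pickled bytes.
import cPickle

CHUNK = 2 ** 30  # 1GB per chunk

def broadcast_large(sc, obj):
    blob = cPickle.dumps(obj, 2)
    return [sc.broadcast(blob[i:i + CHUNK])
            for i in xrange(0, len(blob), CHUNK)]

def value_of(bvars):
    return cPickle.loads(''.join(b.value for b in bvars))
{noformat}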
[jira] [Updated] (SPARK-3721) Broadcast Variables above 2GB break in PySpark
[ https://issues.apache.org/jira/browse/SPARK-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brad Miller updated SPARK-3721: --- Description: Attempting to reproduce the bug in isolation in iPython notebook I've observed the following 3 unique failure modes, all of which seem to be related to broadcast variable size. Note that I'm running python 2.7.3 on all machines and using the Spark 1.1.0 binaries. **BLOCK 1** [no problem] import cPickle from pyspark import SparkContext def check_pre_serialized(size): msg = cPickle.dumps(range(2 ** size)) print 'serialized length:', len(msg) bvar = sc.broadcast(msg) print 'length recovered from broadcast variable:', len(bvar.value) print 'correct value recovered:', msg == bvar.value bvar.unpersist() def check_unserialized(size): msg = range(2 ** size) bvar = sc.broadcast(msg) print 'correct value recovered:', msg == bvar.value bvar.unpersist() SparkContext.setSystemProperty('spark.executor.memory', '15g') SparkContext.setSystemProperty('spark.cores.max', '5') sc = SparkContext('spark://crosby.research.intel-research.net:7077', 'broadcast_bug') **BLOCK 2** [no problem] check_pre_serialized(20) serialized length: 9374656 length recovered from broadcast variable: 9374656 correct value recovered: True **BLOCK 3** [no problem] check_unserialized(20) correct value recovered: True **BLOCK 4** [no problem] check_pre_serialized(27) serialized length: 1499501632 length recovered from broadcast variable: 1499501632 correct value recovered: True **BLOCK 5** [no problem] check_unserialized(27) correct value recovered: True **BLOCK 6** **[ERROR 1: unhandled error from cPickle.dumps inside sc.broadcast]** check_pre_serialized(28) . /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 354 355 def dumps(self, obj): -- 356 return cPickle.dumps(obj, 2) 357 358 loads = cPickle.loads SystemError: error return without exception set **BLOCK 7** [no problem] check_unserialized(28) correct value recovered: True **BLOCK 8** **[ERROR 2: no error occurs and *incorrect result* is returned]** check_pre_serialized(29) serialized length: 6331339840 length recovered from broadcast variable: 2036372544 correct value recovered: False **BLOCK 9** **[ERROR 3: unhandled error from zlib.compress inside sc.broadcast]** check_unserialized(29) .. /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 418 419 def dumps(self, obj): -- 420 return zlib.compress(self.serializer.dumps(obj), 1) 421 422 def loads(self, obj): OverflowError: size does not fit in an int **BLOCK 10** [ERROR 1] check_pre_serialized(30) ...same as above... **BLOCK 11** [ERROR 3] check_unserialized(30) ...same as above... was: Attempting to reproduce the bug in isolation in iPython notebook I've observed the following 3 unique failure modes, all of which seem to be related to broadcast variable size. Note that I'm running python 2.7.3 on all machines and using the Spark 1.1.0 binaries. 
**BLOCK 1** [no problem] import cPickle from pyspark import SparkContext def check_pre_serialized(size): msg = cPickle.dumps(range(2 ** size)) print 'serialized length:', len(msg) bvar = sc.broadcast(msg) print 'length recovered from broadcast variable:', len(bvar.value) print 'correct value recovered:', msg == bvar.value bvar.unpersist() def check_unserialized(size): msg = range(2 ** size) bvar = sc.broadcast(msg) print 'correct value recovered:', msg == bvar.value bvar.unpersist() SparkContext.setSystemProperty('spark.executor.memory', '15g') SparkContext.setSystemProperty('spark.cores.max', '5') sc = SparkContext('spark://crosby.research.intel-research.net:7077', 'broadcast_bug') **BLOCK 2** [no problem] check_pre_serialized(20) serialized length: 9374656 length recovered from broadcast variable: 9374656 correct value recovered: True **BLOCK 3** [no problem] check_unserialized(20) correct value recovered: True **BLOCK 4** [no problem] check_pre_serialized(27) serialized length: 1499501632 length recovered from broadcast variable: 1499501632 correct value recovered: True **BLOCK 5** [no problem] check_unserialized(27) correct value recovered: True **BLOCK 6** **[ERROR 1: unhandled error from cPickle.dumps inside sc.broadcast]** check_pre_serialized(28) . /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 354 355 def dumps(self, obj): -- 356 return cPickle.dumps(obj, 2) 357 358 loads = cPickle.loads SystemError: error return without exception set **BLOCK 7** [no problem] check_unserialized(28) correct value recovered: True **BLOCK 8** **[ERROR 2: no error occurs and *incorrect result* is returned]** check_pre_serialized(29) serialized
[jira] [Updated] (SPARK-3721) Broadcast Variables above 2GB break in PySpark
[ https://issues.apache.org/jira/browse/SPARK-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brad Miller updated SPARK-3721: --- Description: Attempting to reproduce the bug in isolation in iPython notebook I've observed the following 3 unique failure modes, all of which seem to be related to broadcast variable size. Note that I'm running python 2.7.3 on all machines and using the Spark 1.1.0 binaries. **BLOCK 1** [no problem] {{import cPickle from pyspark import SparkContext def check_pre_serialized(size): msg = cPickle.dumps(range(2 ** size)) print 'serialized length:', len(msg) bvar = sc.broadcast(msg) print 'length recovered from broadcast variable:', len(bvar.value) print 'correct value recovered:', msg == bvar.value bvar.unpersist() def check_unserialized(size): msg = range(2 ** size) bvar = sc.broadcast(msg) print 'correct value recovered:', msg == bvar.value bvar.unpersist() SparkContext.setSystemProperty('spark.executor.memory', '15g') SparkContext.setSystemProperty('spark.cores.max', '5') sc = SparkContext('spark://crosby.research.intel-research.net:7077', 'broadcast_bug')}} **BLOCK 2** [no problem] check_pre_serialized(20) serialized length: 9374656 length recovered from broadcast variable: 9374656 correct value recovered: True **BLOCK 3** [no problem] check_unserialized(20) correct value recovered: True **BLOCK 4** [no problem] check_pre_serialized(27) serialized length: 1499501632 length recovered from broadcast variable: 1499501632 correct value recovered: True **BLOCK 5** [no problem] check_unserialized(27) correct value recovered: True **BLOCK 6** **[ERROR 1: unhandled error from cPickle.dumps inside sc.broadcast]** check_pre_serialized(28) . /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 354 355 def dumps(self, obj): -- 356 return cPickle.dumps(obj, 2) 357 358 loads = cPickle.loads SystemError: error return without exception set **BLOCK 7** [no problem] check_unserialized(28) correct value recovered: True **BLOCK 8** **[ERROR 2: no error occurs and *incorrect result* is returned]** check_pre_serialized(29) serialized length: 6331339840 length recovered from broadcast variable: 2036372544 correct value recovered: False **BLOCK 9** **[ERROR 3: unhandled error from zlib.compress inside sc.broadcast]** check_unserialized(29) .. /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 418 419 def dumps(self, obj): -- 420 return zlib.compress(self.serializer.dumps(obj), 1) 421 422 def loads(self, obj): OverflowError: size does not fit in an int **BLOCK 10** [ERROR 1] check_pre_serialized(30) ...same as above... **BLOCK 11** [ERROR 3] check_unserialized(30) ...same as above... was: Attempting to reproduce the bug in isolation in iPython notebook I've observed the following 3 unique failure modes, all of which seem to be related to broadcast variable size. Note that I'm running python 2.7.3 on all machines and using the Spark 1.1.0 binaries. 
**BLOCK 1** [no problem] import cPickle from pyspark import SparkContext def check_pre_serialized(size): msg = cPickle.dumps(range(2 ** size)) print 'serialized length:', len(msg) bvar = sc.broadcast(msg) print 'length recovered from broadcast variable:', len(bvar.value) print 'correct value recovered:', msg == bvar.value bvar.unpersist() def check_unserialized(size): msg = range(2 ** size) bvar = sc.broadcast(msg) print 'correct value recovered:', msg == bvar.value bvar.unpersist() SparkContext.setSystemProperty('spark.executor.memory', '15g') SparkContext.setSystemProperty('spark.cores.max', '5') sc = SparkContext('spark://crosby.research.intel-research.net:7077', 'broadcast_bug') **BLOCK 2** [no problem] check_pre_serialized(20) serialized length: 9374656 length recovered from broadcast variable: 9374656 correct value recovered: True **BLOCK 3** [no problem] check_unserialized(20) correct value recovered: True **BLOCK 4** [no problem] check_pre_serialized(27) serialized length: 1499501632 length recovered from broadcast variable: 1499501632 correct value recovered: True **BLOCK 5** [no problem] check_unserialized(27) correct value recovered: True **BLOCK 6** **[ERROR 1: unhandled error from cPickle.dumps inside sc.broadcast]** check_pre_serialized(28) . /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 354 355 def dumps(self, obj): -- 356 return cPickle.dumps(obj, 2) 357 358 loads = cPickle.loads SystemError: error return without exception set **BLOCK 7** [no problem] check_unserialized(28) correct value recovered: True **BLOCK 8** **[ERROR 2: no error occurs and *incorrect result* is returned]** check_pre_serialized(29)
[jira] [Updated] (SPARK-3721) Broadcast Variables above 2GB break in PySpark
[ https://issues.apache.org/jira/browse/SPARK-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brad Miller updated SPARK-3721: --- Description: Attempting to reproduce the bug in isolation in iPython notebook I've observed the following 3 unique failure modes, all of which seem to be related to broadcast variable size. Note that I'm running python 2.7.3 on all machines and using the Spark 1.1.0 binaries. **BLOCK 1** [no problem] {{import cPickle}} from pyspark import SparkContext def check_pre_serialized(size): msg = cPickle.dumps(range(2 ** size)) print 'serialized length:', len(msg) bvar = sc.broadcast(msg) print 'length recovered from broadcast variable:', len(bvar.value) print 'correct value recovered:', msg == bvar.value bvar.unpersist() def check_unserialized(size): msg = range(2 ** size) bvar = sc.broadcast(msg) print 'correct value recovered:', msg == bvar.value bvar.unpersist() SparkContext.setSystemProperty('spark.executor.memory', '15g') SparkContext.setSystemProperty('spark.cores.max', '5') sc = SparkContext('spark://crosby.research.intel-research.net:7077', 'broadcast_bug')}} **BLOCK 2** [no problem] check_pre_serialized(20) serialized length: 9374656 length recovered from broadcast variable: 9374656 correct value recovered: True **BLOCK 3** [no problem] check_unserialized(20) correct value recovered: True **BLOCK 4** [no problem] check_pre_serialized(27) serialized length: 1499501632 length recovered from broadcast variable: 1499501632 correct value recovered: True **BLOCK 5** [no problem] check_unserialized(27) correct value recovered: True **BLOCK 6** **[ERROR 1: unhandled error from cPickle.dumps inside sc.broadcast]** check_pre_serialized(28) . /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 354 355 def dumps(self, obj): -- 356 return cPickle.dumps(obj, 2) 357 358 loads = cPickle.loads SystemError: error return without exception set **BLOCK 7** [no problem] check_unserialized(28) correct value recovered: True **BLOCK 8** **[ERROR 2: no error occurs and *incorrect result* is returned]** check_pre_serialized(29) serialized length: 6331339840 length recovered from broadcast variable: 2036372544 correct value recovered: False **BLOCK 9** **[ERROR 3: unhandled error from zlib.compress inside sc.broadcast]** check_unserialized(29) .. /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 418 419 def dumps(self, obj): -- 420 return zlib.compress(self.serializer.dumps(obj), 1) 421 422 def loads(self, obj): OverflowError: size does not fit in an int **BLOCK 10** [ERROR 1] check_pre_serialized(30) ...same as above... **BLOCK 11** [ERROR 3] check_unserialized(30) ...same as above... was: Attempting to reproduce the bug in isolation in iPython notebook I've observed the following 3 unique failure modes, all of which seem to be related to broadcast variable size. Note that I'm running python 2.7.3 on all machines and using the Spark 1.1.0 binaries. 
**BLOCK 1** [no problem] {{import cPickle from pyspark import SparkContext def check_pre_serialized(size): msg = cPickle.dumps(range(2 ** size)) print 'serialized length:', len(msg) bvar = sc.broadcast(msg) print 'length recovered from broadcast variable:', len(bvar.value) print 'correct value recovered:', msg == bvar.value bvar.unpersist() def check_unserialized(size): msg = range(2 ** size) bvar = sc.broadcast(msg) print 'correct value recovered:', msg == bvar.value bvar.unpersist() SparkContext.setSystemProperty('spark.executor.memory', '15g') SparkContext.setSystemProperty('spark.cores.max', '5') sc = SparkContext('spark://crosby.research.intel-research.net:7077', 'broadcast_bug')}} **BLOCK 2** [no problem] check_pre_serialized(20) serialized length: 9374656 length recovered from broadcast variable: 9374656 correct value recovered: True **BLOCK 3** [no problem] check_unserialized(20) correct value recovered: True **BLOCK 4** [no problem] check_pre_serialized(27) serialized length: 1499501632 length recovered from broadcast variable: 1499501632 correct value recovered: True **BLOCK 5** [no problem] check_unserialized(27) correct value recovered: True **BLOCK 6** **[ERROR 1: unhandled error from cPickle.dumps inside sc.broadcast]** check_pre_serialized(28) . /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 354 355 def dumps(self, obj): -- 356 return cPickle.dumps(obj, 2) 357 358 loads = cPickle.loads SystemError: error return without exception set **BLOCK 7** [no problem] check_unserialized(28) correct value recovered: True **BLOCK 8** **[ERROR 2: no error occurs and *incorrect result* is returned]** check_pre_serialized(29)
[jira] [Updated] (SPARK-3721) Broadcast Variables above 2GB break in PySpark
[ https://issues.apache.org/jira/browse/SPARK-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brad Miller updated SPARK-3721: --- Description: Attempting to reproduce the bug in isolation in iPython notebook I've observed the following 3 unique failure modes, all of which seem to be related to broadcast variable size. Note that I'm running python 2.7.3 on all machines and using the Spark 1.1.0 binaries. **BLOCK 1** [no problem] {noformat} import cPickle from pyspark import SparkContext def check_pre_serialized(size): msg = cPickle.dumps(range(2 ** size)) print 'serialized length:', len(msg) bvar = sc.broadcast(msg) print 'length recovered from broadcast variable:', len(bvar.value) print 'correct value recovered:', msg == bvar.value bvar.unpersist() def check_unserialized(size): msg = range(2 ** size) bvar = sc.broadcast(msg) print 'correct value recovered:', msg == bvar.value bvar.unpersist() SparkContext.setSystemProperty('spark.executor.memory', '15g') SparkContext.setSystemProperty('spark.cores.max', '5') sc = SparkContext('spark://crosby.research.intel-research.net:7077', 'broadcast_bug') {noformat} **BLOCK 2** [no problem] {noformat} check_pre_serialized(20) serialized length: 9374656 length recovered from broadcast variable: 9374656 correct value recovered: True {noformat} **BLOCK 3** [no problem] {noformat} check_unserialized(20) correct value recovered: True {noformat} **BLOCK 4** [no problem] {noformat} check_pre_serialized(27) serialized length: 1499501632 length recovered from broadcast variable: 1499501632 correct value recovered: True {noformat} **BLOCK 5** [no problem] {noformat} check_unserialized(27) correct value recovered: True {noformat} **BLOCK 6** **[ERROR 1: unhandled error from cPickle.dumps inside {noformat} sc.broadcast]** check_pre_serialized(28) . /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 354 355 def dumps(self, obj): -- 356 return cPickle.dumps(obj, 2) 357 358 loads = cPickle.loads SystemError: error return without exception set {noformat} **BLOCK 7** [no problem] {noformat} check_unserialized(28) correct value recovered: True {noformat} **BLOCK 8** **[ERROR 2: no error occurs and *incorrect result* is returned]** {noformat} check_pre_serialized(29) serialized length: 6331339840 length recovered from broadcast variable: 2036372544 correct value recovered: False {noformat} **BLOCK 9** **[ERROR 3: unhandled error from zlib.compress inside sc.broadcast]** {noformat} check_unserialized(29) .. /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 418 419 def dumps(self, obj): -- 420 return zlib.compress(self.serializer.dumps(obj), 1) 421 422 def loads(self, obj): OverflowError: size does not fit in an int {noformat} **BLOCK 10** [ERROR 1] {noformat} check_pre_serialized(30) ...same as above... {noformat} **BLOCK 11** [ERROR 3] {noformat} check_unserialized(30) ...same as above... {noformat} was: Attempting to reproduce the bug in isolation in iPython notebook I've observed the following 3 unique failure modes, all of which seem to be related to broadcast variable size. Note that I'm running python 2.7.3 on all machines and using the Spark 1.1.0 binaries. 
**BLOCK 1** [no problem] {noformat} import cPickle from pyspark import SparkContext def check_pre_serialized(size): msg = cPickle.dumps(range(2 ** size)) print 'serialized length:', len(msg) bvar = sc.broadcast(msg) print 'length recovered from broadcast variable:', len(bvar.value) print 'correct value recovered:', msg == bvar.value bvar.unpersist() def check_unserialized(size): msg = range(2 ** size) bvar = sc.broadcast(msg) print 'correct value recovered:', msg == bvar.value bvar.unpersist() SparkContext.setSystemProperty('spark.executor.memory', '15g') SparkContext.setSystemProperty('spark.cores.max', '5') sc = SparkContext('spark://crosby.research.intel-research.net:7077', 'broadcast_bug') {noformat} **BLOCK 2** [no problem] check_pre_serialized(20) serialized length: 9374656 length recovered from broadcast variable: 9374656 correct value recovered: True **BLOCK 3** [no problem] check_unserialized(20) correct value recovered: True **BLOCK 4** [no problem] check_pre_serialized(27) serialized length: 1499501632 length recovered from broadcast variable: 1499501632 correct value recovered: True **BLOCK 5** [no problem] check_unserialized(27) correct value recovered: True **BLOCK 6** **[ERROR 1: unhandled error from cPickle.dumps inside sc.broadcast]** check_pre_serialized(28) . /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 354 355 def dumps(self, obj): -- 356 return cPickle.dumps(obj, 2) 357 358 loads =
[jira] [Updated] (SPARK-3721) Broadcast Variables above 2GB break in PySpark
[ https://issues.apache.org/jira/browse/SPARK-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brad Miller updated SPARK-3721: --- Description: Attempting to reproduce the bug in isolation in iPython notebook I've observed the following 3 unique failure modes, all of which seem to be related to broadcast variable size. Note that I'm running python 2.7.3 on all machines and using the Spark 1.1.0 binaries. **BLOCK 1** [no problem] {noformat} import cPickle from pyspark import SparkContext def check_pre_serialized(size): msg = cPickle.dumps(range(2 ** size)) print 'serialized length:', len(msg) bvar = sc.broadcast(msg) print 'length recovered from broadcast variable:', len(bvar.value) print 'correct value recovered:', msg == bvar.value bvar.unpersist() def check_unserialized(size): msg = range(2 ** size) bvar = sc.broadcast(msg) print 'correct value recovered:', msg == bvar.value bvar.unpersist() SparkContext.setSystemProperty('spark.executor.memory', '15g') SparkContext.setSystemProperty('spark.cores.max', '5') sc = SparkContext('spark://crosby.research.intel-research.net:7077', 'broadcast_bug') {noformat} **BLOCK 2** [no problem] {noformat} check_pre_serialized(20) serialized length: 9374656 length recovered from broadcast variable: 9374656 correct value recovered: True {noformat} **BLOCK 3** [no problem] {noformat} check_unserialized(20) correct value recovered: True {noformat} **BLOCK 4** [no problem] {noformat} check_pre_serialized(27) serialized length: 1499501632 length recovered from broadcast variable: 1499501632 correct value recovered: True {noformat} **BLOCK 5** [no problem] {noformat} check_unserialized(27) correct value recovered: True {noformat} **BLOCK 6** **[ERROR 1: unhandled error from cPickle.dumps inside sc.broadcast]** {noformat} check_pre_serialized(28) . /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 354 355 def dumps(self, obj): -- 356 return cPickle.dumps(obj, 2) 357 358 loads = cPickle.loads SystemError: error return without exception set {noformat} **BLOCK 7** [no problem] {noformat} check_unserialized(28) correct value recovered: True {noformat} **BLOCK 8** **[ERROR 2: no error occurs and *incorrect result* is returned]** {noformat} check_pre_serialized(29) serialized length: 6331339840 length recovered from broadcast variable: 2036372544 correct value recovered: False {noformat} **BLOCK 9** **[ERROR 3: unhandled error from zlib.compress inside sc.broadcast]** {noformat} check_unserialized(29) .. /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 418 419 def dumps(self, obj): -- 420 return zlib.compress(self.serializer.dumps(obj), 1) 421 422 def loads(self, obj): OverflowError: size does not fit in an int {noformat} **BLOCK 10** [ERROR 1] {noformat} check_pre_serialized(30) ...same as above... {noformat} **BLOCK 11** [ERROR 3] {noformat} check_unserialized(30) ...same as above... {noformat} was: Attempting to reproduce the bug in isolation in iPython notebook I've observed the following 3 unique failure modes, all of which seem to be related to broadcast variable size. Note that I'm running python 2.7.3 on all machines and using the Spark 1.1.0 binaries. 
**BLOCK 1** [no problem] {noformat} import cPickle from pyspark import SparkContext def check_pre_serialized(size): msg = cPickle.dumps(range(2 ** size)) print 'serialized length:', len(msg) bvar = sc.broadcast(msg) print 'length recovered from broadcast variable:', len(bvar.value) print 'correct value recovered:', msg == bvar.value bvar.unpersist() def check_unserialized(size): msg = range(2 ** size) bvar = sc.broadcast(msg) print 'correct value recovered:', msg == bvar.value bvar.unpersist() SparkContext.setSystemProperty('spark.executor.memory', '15g') SparkContext.setSystemProperty('spark.cores.max', '5') sc = SparkContext('spark://crosby.research.intel-research.net:7077', 'broadcast_bug') {noformat} **BLOCK 2** [no problem] {noformat} check_pre_serialized(20) serialized length: 9374656 length recovered from broadcast variable: 9374656 correct value recovered: True {noformat} **BLOCK 3** [no problem] {noformat} check_unserialized(20) correct value recovered: True {noformat} **BLOCK 4** [no problem] {noformat} check_pre_serialized(27) serialized length: 1499501632 length recovered from broadcast variable: 1499501632 correct value recovered: True {noformat} **BLOCK 5** [no problem] {noformat} check_unserialized(27) correct value recovered: True {noformat} **BLOCK 6** **[ERROR 1: unhandled error from cPickle.dumps inside {noformat} sc.broadcast]** check_pre_serialized(28) . /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 354 355 def
[jira] [Updated] (SPARK-3721) Broadcast Variables above 2GB break in PySpark
[ https://issues.apache.org/jira/browse/SPARK-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brad Miller updated SPARK-3721: --- Description: Attempting to reproduce the bug in isolation in iPython notebook I've observed the following 3 unique failure modes, all of which seem to be related to broadcast variable size. Note that I'm running python 2.7.3 on all machines and using the Spark 1.1.0 binaries. **BLOCK 1** [no problem] {noformat} import cPickle from pyspark import SparkContext def check_pre_serialized(size): msg = cPickle.dumps(range(2 ** size)) print 'serialized length:', len(msg) bvar = sc.broadcast(msg) print 'length recovered from broadcast variable:', len(bvar.value) print 'correct value recovered:', msg == bvar.value bvar.unpersist() def check_unserialized(size): msg = range(2 ** size) bvar = sc.broadcast(msg) print 'correct value recovered:', msg == bvar.value bvar.unpersist() SparkContext.setSystemProperty('spark.executor.memory', '15g') SparkContext.setSystemProperty('spark.cores.max', '5') sc = SparkContext('spark://crosby.research.intel-research.net:7077', 'broadcast_bug') {noformat} **BLOCK 2** [no problem] check_pre_serialized(20) serialized length: 9374656 length recovered from broadcast variable: 9374656 correct value recovered: True **BLOCK 3** [no problem] check_unserialized(20) correct value recovered: True **BLOCK 4** [no problem] check_pre_serialized(27) serialized length: 1499501632 length recovered from broadcast variable: 1499501632 correct value recovered: True **BLOCK 5** [no problem] check_unserialized(27) correct value recovered: True **BLOCK 6** **[ERROR 1: unhandled error from cPickle.dumps inside sc.broadcast]** check_pre_serialized(28) . /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 354 355 def dumps(self, obj): -- 356 return cPickle.dumps(obj, 2) 357 358 loads = cPickle.loads SystemError: error return without exception set **BLOCK 7** [no problem] check_unserialized(28) correct value recovered: True **BLOCK 8** **[ERROR 2: no error occurs and *incorrect result* is returned]** check_pre_serialized(29) serialized length: 6331339840 length recovered from broadcast variable: 2036372544 correct value recovered: False **BLOCK 9** **[ERROR 3: unhandled error from zlib.compress inside sc.broadcast]** check_unserialized(29) .. /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 418 419 def dumps(self, obj): -- 420 return zlib.compress(self.serializer.dumps(obj), 1) 421 422 def loads(self, obj): OverflowError: size does not fit in an int **BLOCK 10** [ERROR 1] check_pre_serialized(30) ...same as above... **BLOCK 11** [ERROR 3] check_unserialized(30) ...same as above... was: Attempting to reproduce the bug in isolation in iPython notebook I've observed the following 3 unique failure modes, all of which seem to be related to broadcast variable size. Note that I'm running python 2.7.3 on all machines and using the Spark 1.1.0 binaries. 
**BLOCK 1** [no problem] {quote} import cPickle from pyspark import SparkContext def check_pre_serialized(size): msg = cPickle.dumps(range(2 ** size)) print 'serialized length:', len(msg) bvar = sc.broadcast(msg) print 'length recovered from broadcast variable:', len(bvar.value) print 'correct value recovered:', msg == bvar.value bvar.unpersist() def check_unserialized(size): msg = range(2 ** size) bvar = sc.broadcast(msg) print 'correct value recovered:', msg == bvar.value bvar.unpersist() SparkContext.setSystemProperty('spark.executor.memory', '15g') SparkContext.setSystemProperty('spark.cores.max', '5') sc = SparkContext('spark://crosby.research.intel-research.net:7077', 'broadcast_bug') {quote} **BLOCK 2** [no problem] check_pre_serialized(20) serialized length: 9374656 length recovered from broadcast variable: 9374656 correct value recovered: True **BLOCK 3** [no problem] check_unserialized(20) correct value recovered: True **BLOCK 4** [no problem] check_pre_serialized(27) serialized length: 1499501632 length recovered from broadcast variable: 1499501632 correct value recovered: True **BLOCK 5** [no problem] check_unserialized(27) correct value recovered: True **BLOCK 6** **[ERROR 1: unhandled error from cPickle.dumps inside sc.broadcast]** check_pre_serialized(28) . /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 354 355 def dumps(self, obj): -- 356 return cPickle.dumps(obj, 2) 357 358 loads = cPickle.loads SystemError: error return without exception set **BLOCK 7** [no problem] check_unserialized(28) correct value recovered: True **BLOCK 8** **[ERROR 2: no error occurs and *incorrect result* is returned]**
[jira] [Updated] (SPARK-3721) Broadcast Variables above 2GB break in PySpark
[ https://issues.apache.org/jira/browse/SPARK-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brad Miller updated SPARK-3721: --- Description: Attempting to reproduce the bug in isolation in iPython notebook I've observed the following 3 unique failure modes, all of which seem to be related to broadcast variable size. Note that I'm running python 2.7.3 on all machines and using the Spark 1.1.0 binaries. **BLOCK 1** [no problem] {{import cPickle \ from pyspark import SparkContext }} def check_pre_serialized(size): msg = cPickle.dumps(range(2 ** size)) print 'serialized length:', len(msg) bvar = sc.broadcast(msg) print 'length recovered from broadcast variable:', len(bvar.value) print 'correct value recovered:', msg == bvar.value bvar.unpersist() def check_unserialized(size): msg = range(2 ** size) bvar = sc.broadcast(msg) print 'correct value recovered:', msg == bvar.value bvar.unpersist() SparkContext.setSystemProperty('spark.executor.memory', '15g') SparkContext.setSystemProperty('spark.cores.max', '5') sc = SparkContext('spark://crosby.research.intel-research.net:7077', 'broadcast_bug')}} **BLOCK 2** [no problem] check_pre_serialized(20) serialized length: 9374656 length recovered from broadcast variable: 9374656 correct value recovered: True **BLOCK 3** [no problem] check_unserialized(20) correct value recovered: True **BLOCK 4** [no problem] check_pre_serialized(27) serialized length: 1499501632 length recovered from broadcast variable: 1499501632 correct value recovered: True **BLOCK 5** [no problem] check_unserialized(27) correct value recovered: True **BLOCK 6** **[ERROR 1: unhandled error from cPickle.dumps inside sc.broadcast]** check_pre_serialized(28) . /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 354 355 def dumps(self, obj): -- 356 return cPickle.dumps(obj, 2) 357 358 loads = cPickle.loads SystemError: error return without exception set **BLOCK 7** [no problem] check_unserialized(28) correct value recovered: True **BLOCK 8** **[ERROR 2: no error occurs and *incorrect result* is returned]** check_pre_serialized(29) serialized length: 6331339840 length recovered from broadcast variable: 2036372544 correct value recovered: False **BLOCK 9** **[ERROR 3: unhandled error from zlib.compress inside sc.broadcast]** check_unserialized(29) .. /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 418 419 def dumps(self, obj): -- 420 return zlib.compress(self.serializer.dumps(obj), 1) 421 422 def loads(self, obj): OverflowError: size does not fit in an int **BLOCK 10** [ERROR 1] check_pre_serialized(30) ...same as above... **BLOCK 11** [ERROR 3] check_unserialized(30) ...same as above... was: Attempting to reproduce the bug in isolation in iPython notebook I've observed the following 3 unique failure modes, all of which seem to be related to broadcast variable size. Note that I'm running python 2.7.3 on all machines and using the Spark 1.1.0 binaries. 
**BLOCK 1** [no problem] {{import cPickle}} from pyspark import SparkContext def check_pre_serialized(size): msg = cPickle.dumps(range(2 ** size)) print 'serialized length:', len(msg) bvar = sc.broadcast(msg) print 'length recovered from broadcast variable:', len(bvar.value) print 'correct value recovered:', msg == bvar.value bvar.unpersist() def check_unserialized(size): msg = range(2 ** size) bvar = sc.broadcast(msg) print 'correct value recovered:', msg == bvar.value bvar.unpersist() SparkContext.setSystemProperty('spark.executor.memory', '15g') SparkContext.setSystemProperty('spark.cores.max', '5') sc = SparkContext('spark://crosby.research.intel-research.net:7077', 'broadcast_bug')}} **BLOCK 2** [no problem] check_pre_serialized(20) serialized length: 9374656 length recovered from broadcast variable: 9374656 correct value recovered: True **BLOCK 3** [no problem] check_unserialized(20) correct value recovered: True **BLOCK 4** [no problem] check_pre_serialized(27) serialized length: 1499501632 length recovered from broadcast variable: 1499501632 correct value recovered: True **BLOCK 5** [no problem] check_unserialized(27) correct value recovered: True **BLOCK 6** **[ERROR 1: unhandled error from cPickle.dumps inside sc.broadcast]** check_pre_serialized(28) . /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 354 355 def dumps(self, obj): -- 356 return cPickle.dumps(obj, 2) 357 358 loads = cPickle.loads SystemError: error return without exception set **BLOCK 7** [no problem] check_unserialized(28) correct value recovered: True **BLOCK 8** **[ERROR 2: no error occurs and *incorrect result* is returned]** check_pre_serialized(29)
[jira] [Updated] (SPARK-3721) Broadcast Variables above 2GB break in PySpark
[ https://issues.apache.org/jira/browse/SPARK-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brad Miller updated SPARK-3721: --- Description: The bug displays 3 unique failure modes in PySpark, all of which seem to be related to broadcast variable size. Note that the tests below ran python 2.7.3 on all machines and used the Spark 1.1.0 binaries. **BLOCK 1** [no problem] {noformat} import cPickle from pyspark import SparkContext def check_pre_serialized(size): msg = cPickle.dumps(range(2 ** size)) print 'serialized length:', len(msg) bvar = sc.broadcast(msg) print 'length recovered from broadcast variable:', len(bvar.value) print 'correct value recovered:', msg == bvar.value bvar.unpersist() def check_unserialized(size): msg = range(2 ** size) bvar = sc.broadcast(msg) print 'correct value recovered:', msg == bvar.value bvar.unpersist() SparkContext.setSystemProperty('spark.executor.memory', '15g') SparkContext.setSystemProperty('spark.cores.max', '5') sc = SparkContext('spark://crosby.research.intel-research.net:7077', 'broadcast_bug') {noformat} **BLOCK 2** [no problem] {noformat} check_pre_serialized(20) serialized length: 9374656 length recovered from broadcast variable: 9374656 correct value recovered: True {noformat} **BLOCK 3** [no problem] {noformat} check_unserialized(20) correct value recovered: True {noformat} **BLOCK 4** [no problem] {noformat} check_pre_serialized(27) serialized length: 1499501632 length recovered from broadcast variable: 1499501632 correct value recovered: True {noformat} **BLOCK 5** [no problem] {noformat} check_unserialized(27) correct value recovered: True {noformat} **BLOCK 6** **[ERROR 1: unhandled error from cPickle.dumps inside sc.broadcast]** {noformat} check_pre_serialized(28) . /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 354 355 def dumps(self, obj): -- 356 return cPickle.dumps(obj, 2) 357 358 loads = cPickle.loads SystemError: error return without exception set {noformat} **BLOCK 7** [no problem] {noformat} check_unserialized(28) correct value recovered: True {noformat} **BLOCK 8** **[ERROR 2: no error occurs and *incorrect result* is returned]** {noformat} check_pre_serialized(29) serialized length: 6331339840 length recovered from broadcast variable: 2036372544 correct value recovered: False {noformat} **BLOCK 9** **[ERROR 3: unhandled error from zlib.compress inside sc.broadcast]** {noformat} check_unserialized(29) .. /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 418 419 def dumps(self, obj): -- 420 return zlib.compress(self.serializer.dumps(obj), 1) 421 422 def loads(self, obj): OverflowError: size does not fit in an int {noformat} **BLOCK 10** [ERROR 1] {noformat} check_pre_serialized(30) ...same as above... {noformat} **BLOCK 11** [ERROR 3] {noformat} check_unserialized(30) ...same as above... {noformat} was: Attempting to reproduce the bug in isolation in iPython notebook I've observed the following 3 unique failure modes, all of which seem to be related to broadcast variable size. Note that I'm running python 2.7.3 on all machines and using the Spark 1.1.0 binaries. 
**BLOCK 1** [no problem] {noformat} import cPickle from pyspark import SparkContext def check_pre_serialized(size): msg = cPickle.dumps(range(2 ** size)) print 'serialized length:', len(msg) bvar = sc.broadcast(msg) print 'length recovered from broadcast variable:', len(bvar.value) print 'correct value recovered:', msg == bvar.value bvar.unpersist() def check_unserialized(size): msg = range(2 ** size) bvar = sc.broadcast(msg) print 'correct value recovered:', msg == bvar.value bvar.unpersist() SparkContext.setSystemProperty('spark.executor.memory', '15g') SparkContext.setSystemProperty('spark.cores.max', '5') sc = SparkContext('spark://crosby.research.intel-research.net:7077', 'broadcast_bug') {noformat} **BLOCK 2** [no problem] {noformat} check_pre_serialized(20) serialized length: 9374656 length recovered from broadcast variable: 9374656 correct value recovered: True {noformat} **BLOCK 3** [no problem] {noformat} check_unserialized(20) correct value recovered: True {noformat} **BLOCK 4** [no problem] {noformat} check_pre_serialized(27) serialized length: 1499501632 length recovered from broadcast variable: 1499501632 correct value recovered: True {noformat} **BLOCK 5** [no problem] {noformat} check_unserialized(27) correct value recovered: True {noformat} **BLOCK 6** **[ERROR 1: unhandled error from cPickle.dumps inside sc.broadcast]** {noformat} check_pre_serialized(28) . /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 354 355 def dumps(self, obj): -- 356 return
[jira] [Updated] (SPARK-3721) Broadcast Variables above 2GB break in PySpark
[ https://issues.apache.org/jira/browse/SPARK-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brad Miller updated SPARK-3721: --- Description: Attempting to reproduce the bug in isolation in iPython notebook I've observed the following 3 unique failure modes, all of which seem to be related to broadcast variable size. Note that I'm running python 2.7.3 on all machines and using the Spark 1.1.0 binaries. **BLOCK 1** [no problem] {quote} import cPickle from pyspark import SparkContext def check_pre_serialized(size): msg = cPickle.dumps(range(2 ** size)) print 'serialized length:', len(msg) bvar = sc.broadcast(msg) print 'length recovered from broadcast variable:', len(bvar.value) print 'correct value recovered:', msg == bvar.value bvar.unpersist() def check_unserialized(size): msg = range(2 ** size) bvar = sc.broadcast(msg) print 'correct value recovered:', msg == bvar.value bvar.unpersist() SparkContext.setSystemProperty('spark.executor.memory', '15g') SparkContext.setSystemProperty('spark.cores.max', '5') sc = SparkContext('spark://crosby.research.intel-research.net:7077', 'broadcast_bug') {quote} **BLOCK 2** [no problem] check_pre_serialized(20) serialized length: 9374656 length recovered from broadcast variable: 9374656 correct value recovered: True **BLOCK 3** [no problem] check_unserialized(20) correct value recovered: True **BLOCK 4** [no problem] check_pre_serialized(27) serialized length: 1499501632 length recovered from broadcast variable: 1499501632 correct value recovered: True **BLOCK 5** [no problem] check_unserialized(27) correct value recovered: True **BLOCK 6** **[ERROR 1: unhandled error from cPickle.dumps inside sc.broadcast]** check_pre_serialized(28) . /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 354 355 def dumps(self, obj): -- 356 return cPickle.dumps(obj, 2) 357 358 loads = cPickle.loads SystemError: error return without exception set **BLOCK 7** [no problem] check_unserialized(28) correct value recovered: True **BLOCK 8** **[ERROR 2: no error occurs and *incorrect result* is returned]** check_pre_serialized(29) serialized length: 6331339840 length recovered from broadcast variable: 2036372544 correct value recovered: False **BLOCK 9** **[ERROR 3: unhandled error from zlib.compress inside sc.broadcast]** check_unserialized(29) .. /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 418 419 def dumps(self, obj): -- 420 return zlib.compress(self.serializer.dumps(obj), 1) 421 422 def loads(self, obj): OverflowError: size does not fit in an int **BLOCK 10** [ERROR 1] check_pre_serialized(30) ...same as above... **BLOCK 11** [ERROR 3] check_unserialized(30) ...same as above... was: Attempting to reproduce the bug in isolation in iPython notebook I've observed the following 3 unique failure modes, all of which seem to be related to broadcast variable size. Note that I'm running python 2.7.3 on all machines and using the Spark 1.1.0 binaries. 
**BLOCK 1** [no problem] {{import cPickle \ from pyspark import SparkContext }} def check_pre_serialized(size): msg = cPickle.dumps(range(2 ** size)) print 'serialized length:', len(msg) bvar = sc.broadcast(msg) print 'length recovered from broadcast variable:', len(bvar.value) print 'correct value recovered:', msg == bvar.value bvar.unpersist() def check_unserialized(size): msg = range(2 ** size) bvar = sc.broadcast(msg) print 'correct value recovered:', msg == bvar.value bvar.unpersist() SparkContext.setSystemProperty('spark.executor.memory', '15g') SparkContext.setSystemProperty('spark.cores.max', '5') sc = SparkContext('spark://crosby.research.intel-research.net:7077', 'broadcast_bug')}} **BLOCK 2** [no problem] check_pre_serialized(20) serialized length: 9374656 length recovered from broadcast variable: 9374656 correct value recovered: True **BLOCK 3** [no problem] check_unserialized(20) correct value recovered: True **BLOCK 4** [no problem] check_pre_serialized(27) serialized length: 1499501632 length recovered from broadcast variable: 1499501632 correct value recovered: True **BLOCK 5** [no problem] check_unserialized(27) correct value recovered: True **BLOCK 6** **[ERROR 1: unhandled error from cPickle.dumps inside sc.broadcast]** check_pre_serialized(28) . /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 354 355 def dumps(self, obj): --> 356 return cPickle.dumps(obj, 2) 357 358 loads = cPickle.loads SystemError: error return without exception set **BLOCK 7** [no problem] check_unserialized(28) correct value recovered: True **BLOCK 8** **[ERROR 2: no error occurs and *incorrect result* is returned]** check_pre_serialized(29) serialized length: 6331339840 length recovered from broadcast variable: 2036372544 correct value recovered: False **BLOCK 9** **[ERROR 3: unhandled error from zlib.compress inside sc.broadcast]** check_unserialized(29) .. /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 418 419 def dumps(self, obj): --> 420 return zlib.compress(self.serializer.dumps(obj), 1) 421 422 def loads(self, obj): OverflowError: size does not fit in an int **BLOCK 10** [ERROR 1] check_pre_serialized(30) ...same as above... **BLOCK 11** [ERROR 3] check_unserialized(30) ...same as above... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
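Given the three failure modes reported above, one possible stopgap (not from the original report; names are illustrative) is to keep every individual broadcast under 2^31 bytes by splitting the pickled payload into chunks and broadcasting each chunk as its own variable:

{code}
# Hypothetical workaround sketch for Python 2 / Spark 1.1: broadcast a large
# object as several sub-2GB chunks so that no single pickle or zlib.compress
# call inside sc.broadcast ever sees more than 2**31 bytes.
import cPickle

CHUNK = 1 << 30  # 1 GiB per chunk, comfortably below the 2**31 limit

def broadcast_large(sc, obj):
    """Pickle obj, split the bytes, and broadcast one variable per chunk."""
    blob = cPickle.dumps(obj, 2)
    return [sc.broadcast(blob[i:i + CHUNK]) for i in xrange(0, len(blob), CHUNK)]

def read_large(bvars):
    """Reassemble the chunks on the worker and unpickle."""
    return cPickle.loads(''.join(bvar.value for bvar in bvars))
{code}

Whether the extra bookkeeping is acceptable depends on the workload, but it sidesteps all three failure modes because each broadcast stays well below 2 GB.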
[jira] [Created] (SPARK-3722) Spark on yarn docs work
WangTaoTheTonic created SPARK-3722: -- Summary: Spark on yarn docs work Key: SPARK-3722 URL: https://issues.apache.org/jira/browse/SPARK-3722 Project: Spark Issue Type: Improvement Components: Documentation Reporter: WangTaoTheTonic Priority: Minor Adding another way to obtain containers' logs. Fixing an outdated link and a typo. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3722) Spark on yarn docs work
[ https://issues.apache.org/jira/browse/SPARK-3722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151869#comment-14151869 ] Apache Spark commented on SPARK-3722: - User 'WangTaoTheTonic' has created a pull request for this issue: https://github.com/apache/spark/pull/2579 Spark on yarn docs work --- Key: SPARK-3722 URL: https://issues.apache.org/jira/browse/SPARK-3722 Project: Spark Issue Type: Improvement Components: Documentation Reporter: WangTaoTheTonic Priority: Minor Adding another way to obtain containers' logs. Fixing an outdated link and a typo. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3366) Compute best splits distributively in decision tree
[ https://issues.apache.org/jira/browse/SPARK-3366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3366: - Assignee: Qiping Li Compute best splits distributively in decision tree --- Key: SPARK-3366 URL: https://issues.apache.org/jira/browse/SPARK-3366 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Qiping Li The current implementation computes all best splits locally on the driver, which makes the driver a bottleneck for both communication and computation. It would be nice if we could compute the best splits distributively. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
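To give a sense of the shape such a change could take, here is a hedged PySpark sketch (toy data and hypothetical field names; the real MLlib code is Scala) of reducing candidate splits to one winner per node on the cluster, so only a handful of small records ever reach the driver:

{code}
from pyspark import SparkContext

sc = SparkContext('local', 'distributed_best_splits_sketch')

# Toy candidates: (node_id, (feature_index, info_gain)) pairs, standing in for
# the per-partition split statistics each worker would emit.
candidates = sc.parallelize([
    (0, (3, 0.12)), (0, (7, 0.30)),
    (1, (2, 0.25)), (1, (5, 0.19)),
])

# Pick the highest-gain split per node distributively instead of on the driver.
best = candidates.reduceByKey(lambda a, b: a if a[1] >= b[1] else b)
print best.collectAsMap()  # {0: (7, 0.3), 1: (2, 0.25)}
{code}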
[jira] [Updated] (SPARK-3677) Scalastyle is never applied to the sources under yarn/common
[ https://issues.apache.org/jira/browse/SPARK-3677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3677: -- Description: When we run sbt -Pyarn scalastyle or mvn package, scalastyle is not applied to the sources under yarn/common. I think this is caused by the directory structure. was:When we run sbt -Pyarn scalastyle or mvn package, scalastyle is not applied to the sources under yarn/common. Scalastyle is never applied to the sources under yarn/common Key: SPARK-3677 URL: https://issues.apache.org/jira/browse/SPARK-3677 Project: Spark Issue Type: Bug Components: Build, YARN Affects Versions: 1.2.0 Reporter: Kousuke Saruta When we run sbt -Pyarn scalastyle or mvn package, scalastyle is not applied to the sources under yarn/common. I think this is caused by the directory structure. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2626) Stop SparkContext in all examples
[ https://issues.apache.org/jira/browse/SPARK-2626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2626: --- Labels: starter (was: ) Stop SparkContext in all examples - Key: SPARK-2626 URL: https://issues.apache.org/jira/browse/SPARK-2626 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Andrew Or Labels: starter Fix For: 1.2.0 Event logs rely on sc.stop() to close the file. If this is never closed, the history server will not be able to find the logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2805) update akka to version 2.3
[ https://issues.apache.org/jira/browse/SPARK-2805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151976#comment-14151976 ] Patrick Wendell commented on SPARK-2805: I'm working on publishing akka today. update akka to version 2.3 -- Key: SPARK-2805 URL: https://issues.apache.org/jira/browse/SPARK-2805 Project: Spark Issue Type: Sub-task Components: Build, Spark Core Reporter: Anand Avati akka-2.3 is the lowest version available in Scala 2.11 akka-2.3 depends on protobuf 2.5. Hadoop-1 requires protobuf 2.4.1. In order to reconcile the conflicting dependencies, need to release akka-2.3.x-shaded-protobuf artifact which has protobuf 2.5 within. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3479) Have Jenkins show which tests failed in his GitHub messages
[ https://issues.apache.org/jira/browse/SPARK-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151980#comment-14151980 ] Patrick Wendell commented on SPARK-3479: I think this is more than minor - it would be super useful. Have Jenkins show which tests failed in his GitHub messages --- Key: SPARK-3479 URL: https://issues.apache.org/jira/browse/SPARK-3479 Project: Spark Issue Type: Sub-task Components: Build Reporter: Nicholas Chammas A nice thing to do for starters would be to report which category of tests failed (e.g. Python style vs. Spark unit tests), and then later work on showing the specific failures. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3479) Have Jenkins show which tests failed in his GitHub messages
[ https://issues.apache.org/jira/browse/SPARK-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3479: --- Priority: Major (was: Minor) Have Jenkins show which tests failed in his GitHub messages --- Key: SPARK-3479 URL: https://issues.apache.org/jira/browse/SPARK-3479 Project: Spark Issue Type: Sub-task Components: Build Reporter: Nicholas Chammas A nice thing to do for starters would be to report which category of tests failed (e.g. Python style vs. Spark unit tests), and then later work on showing the specific failures. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3479) Have Jenkins show which tests failed in his GitHub messages
[ https://issues.apache.org/jira/browse/SPARK-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3479: --- Issue Type: Improvement (was: Sub-task) Parent: (was: SPARK-2230) Have Jenkins show which tests failed in his GitHub messages --- Key: SPARK-3479 URL: https://issues.apache.org/jira/browse/SPARK-3479 Project: Spark Issue Type: Improvement Components: Build Reporter: Nicholas Chammas A nice thing to do for starters would be to report which category of tests failed (e.g. Python style vs. Spark unit tests), and then later work on showing the specific failures. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2885) All-pairs similarity via DIMSUM
[ https://issues.apache.org/jira/browse/SPARK-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-2885. -- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 1778 [https://github.com/apache/spark/pull/1778] All-pairs similarity via DIMSUM --- Key: SPARK-2885 URL: https://issues.apache.org/jira/browse/SPARK-2885 Project: Spark Issue Type: New Feature Reporter: Reza Zadeh Assignee: Reza Zadeh Fix For: 1.2.0 Attachments: SimilarItemsSmallTest.java Build all-pairs similarity algorithm via DIMSUM. Given a dataset of sparse vector data, the all-pairs similarity problem is to find all similar vector pairs according to a similarity function such as cosine similarity, and a given similarity score threshold. Sometimes, this problem is called a “similarity join”. The brute force approach of considering all pairs quickly breaks, since it scales quadratically. For example, for a million vectors, it is not feasible to check all roughly trillion pairs to see if they are above the similarity threshold. Having said that, there exist clever sampling techniques to focus the computational effort on those pairs that are above the similarity threshold, which makes the problem feasible. DIMSUM has a single parameter (called gamma) to tradeoff computation time vs accuracy. Setting gamma from 1 to the largest magnitude allows tradeoff of computation vs accuracy from low computation to high accuracy. For a very large gamma, all cosine similarities are computed exactly with no sampling. Current PR: https://github.com/apache/spark/pull/1778 Justification for adding to MLlib: - All-pairs similarity is missing from MLlib and has been requested several times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see https://github.com/apache/spark/pull/1778#issuecomment-51300825) - Algorithm is used in large-scale production at Twitter. e.g. see https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco . Twitter also open-sourced their version in scalding: https://github.com/twitter/scalding/pull/833 - When used with the gamma parameter set high, this algorithm becomes the normalized gramian matrix, which is useful in RowMatrix alongside the computeGramianMatrix method already in RowMatrix More details about usage at Twitter: https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum For correctness proof, see Theorem 4.3 in http://stanford.edu/~rezab/papers/dimsum.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
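To make the quadratic blow-up described above concrete, here is a minimal Python sketch (illustrative only, not the DIMSUM algorithm itself) of the brute-force similarity join that breaks at scale:

{code}
# Brute-force all-pairs cosine similarity join: O(n^2) comparisons.
import itertools
import math

def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def similarity_join(vectors, threshold):
    """All pairs (i, j, sim) with sim >= threshold."""
    return [(i, j, cosine(u, v))
            for (i, u), (j, v) in itertools.combinations(enumerate(vectors), 2)
            if cosine(u, v) >= threshold]

print similarity_join([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]], 0.5)
{code}

For a million vectors this is the "roughly trillion pairs" the description mentions, which is exactly the cost DIMSUM's gamma-tuned sampling avoids.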
[jira] [Resolved] (SPARK-2230) Improvements to Jenkins QA Harness
[ https://issues.apache.org/jira/browse/SPARK-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2230. Resolution: Fixed This was tracking an earlier initiative to clean up this harness. Since we'll have ongoing work on this, I'm going to close the umbrella JIRA for now. In the future we can just track issues as Project Infra that are related to Jenkins. Improvements to Jenkins QA Harness -- Key: SPARK-2230 URL: https://issues.apache.org/jira/browse/SPARK-2230 Project: Spark Issue Type: Umbrella Components: Project Infra Reporter: Patrick Wendell An umbrella for some improvements I'd like to do. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3032) Potential bug when running sort-based shuffle with sorting using TimSort
[ https://issues.apache.org/jira/browse/SPARK-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-3032. -- Resolution: Fixed Fix Version/s: 1.2.0 1.1.1 Potential bug when running sort-based shuffle with sorting using TimSort Key: SPARK-3032 URL: https://issues.apache.org/jira/browse/SPARK-3032 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.1.0 Reporter: Saisai Shao Assignee: Saisai Shao Priority: Blocker Fix For: 1.1.1, 1.2.0 When using SparkPerf's aggregate-by-key workload to test sort-based shuffle, where the data type for key and value is (String, String), I always hit this issue: {noformat} java.lang.IllegalArgumentException: Comparison method violates its general contract! at org.apache.spark.util.collection.Sorter$SortState.mergeLo(Sorter.java:755) at org.apache.spark.util.collection.Sorter$SortState.mergeAt(Sorter.java:493) at org.apache.spark.util.collection.Sorter$SortState.mergeCollapse(Sorter.java:420) at org.apache.spark.util.collection.Sorter$SortState.access$200(Sorter.java:294) at org.apache.spark.util.collection.Sorter.sort(Sorter.java:128) at org.apache.spark.util.collection.SizeTrackingPairBuffer.destructiveSortedIterator(SizeTrackingPairBuffer.scala:83) at org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:323) at org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:271) at org.apache.spark.util.collection.ExternalSorter.maybeSpill(ExternalSorter.scala:249) at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:220) at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:85) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) {noformat} It seems the current partitionKeyComparator, which uses the hashcode of String as the key comparator, breaks some sorting contracts. I also tested using Int as the key data type, and the test passes, since the hashcode of an Int is itself. So I think partitionDiff + the hashcode of String may potentially break the sorting contracts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3032) Potential bug when running sort-based shuffle with sorting using TimSort
[ https://issues.apache.org/jira/browse/SPARK-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152014#comment-14152014 ] Matei Zaharia commented on SPARK-3032: -- Yup, this will appear in 1.1.1. I've merged it into branch-1.1 already if you want to try it. Potential bug when running sort-based shuffle with sorting using TimSort Key: SPARK-3032 URL: https://issues.apache.org/jira/browse/SPARK-3032 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.1.0 Reporter: Saisai Shao Assignee: Saisai Shao Priority: Blocker Fix For: 1.1.1, 1.2.0 When using SparkPerf's aggregate-by-key workload to test sort-based shuffle, where the data type for key and value is (String, String), I always hit this issue: {noformat} java.lang.IllegalArgumentException: Comparison method violates its general contract! at org.apache.spark.util.collection.Sorter$SortState.mergeLo(Sorter.java:755) at org.apache.spark.util.collection.Sorter$SortState.mergeAt(Sorter.java:493) at org.apache.spark.util.collection.Sorter$SortState.mergeCollapse(Sorter.java:420) at org.apache.spark.util.collection.Sorter$SortState.access$200(Sorter.java:294) at org.apache.spark.util.collection.Sorter.sort(Sorter.java:128) at org.apache.spark.util.collection.SizeTrackingPairBuffer.destructiveSortedIterator(SizeTrackingPairBuffer.scala:83) at org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:323) at org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:271) at org.apache.spark.util.collection.ExternalSorter.maybeSpill(ExternalSorter.scala:249) at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:220) at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:85) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) {noformat} It seems the current partitionKeyComparator, which uses the hashcode of String as the key comparator, breaks some sorting contracts. I also tested using Int as the key data type, and the test passes, since the hashcode of an Int is itself. So I think partitionDiff + the hashcode of String may potentially break the sorting contracts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2331) SparkContext.emptyRDD has wrong return type
[ https://issues.apache.org/jira/browse/SPARK-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152018#comment-14152018 ] Patrick Wendell commented on SPARK-2331: Yeah we could have made this a wider type in the public signature. However, it is not possible to do that while maintaining compatibility (others may be relying on this returning an EmptyRDD). For now though you can safely cast it to work around this: {code} scala> sc.emptyRDD[String].asInstanceOf[RDD[String]] res7: org.apache.spark.rdd.RDD[String] = EmptyRDD[3] at emptyRDD at <console>:14 {code} SparkContext.emptyRDD has wrong return type --- Key: SPARK-2331 URL: https://issues.apache.org/jira/browse/SPARK-2331 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Ian Hummel The return type for SparkContext.emptyRDD is EmptyRDD[T]. It should be RDD[T]. That means you have to add extra type annotations on code like the below (which creates a union of RDDs over some subset of paths in a folder) val rdds = Seq(a, b, c).foldLeft[RDD[String]](sc.emptyRDD[String]) { (rdd, path) ⇒ rdd.union(sc.textFile(path)) } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2331) SparkContext.emptyRDD has wrong return type
[ https://issues.apache.org/jira/browse/SPARK-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2331. Resolution: Won't Fix SparkContext.emptyRDD has wrong return type --- Key: SPARK-2331 URL: https://issues.apache.org/jira/browse/SPARK-2331 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Ian Hummel The return type for SparkContext.emptyRDD is EmptyRDD[T]. It should be RDD[T]. That means you have to add extra type annotations on code like the below (which creates a union of RDDs over some subset of paths in a folder) val rdds = Seq(a, b, c).foldLeft[RDD[String]](sc.emptyRDD[String]) { (rdd, path) ⇒ rdd.union(sc.textFile(path)) } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2331) SparkContext.emptyRDD has wrong return type
[ https://issues.apache.org/jira/browse/SPARK-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152018#comment-14152018 ] Patrick Wendell edited comment on SPARK-2331 at 9/29/14 6:34 PM: - Yeah we could have made this a wider type in the public signature. However, it is not possible to do that while maintaining compatibility (others may be relying on this returning an EmptyRDD). This issue is not related to covariance because here the type parameter is always String. So here the compiler actually does understand that EmptyRDD[String] is a subtype of RDD[String]. The issue with the original example is that the Scala compiler will always try to infer the narrowest type it can. So in a foldLeft expression it will by default assume the resulting type is EmptyRDD unless you upcast it to a more general type like you are doing. And the union operation requires an exact type match on the two RDDs, including the type parameter. was (Author: pwendell): Yeah we could have made this a wider type in the public signature. However, it is not possible to do that while maintaining compatibility (others may be relying on this returning an EmptyRDD). For now though you can safely cast it to work around this: {code} scala> sc.emptyRDD[String].asInstanceOf[RDD[String]] res7: org.apache.spark.rdd.RDD[String] = EmptyRDD[3] at emptyRDD at <console>:14 {code} SparkContext.emptyRDD has wrong return type --- Key: SPARK-2331 URL: https://issues.apache.org/jira/browse/SPARK-2331 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Ian Hummel The return type for SparkContext.emptyRDD is EmptyRDD[T]. It should be RDD[T]. That means you have to add extra type annotations on code like the below (which creates a union of RDDs over some subset of paths in a folder) val rdds = Seq(a, b, c).foldLeft[RDD[String]](sc.emptyRDD[String]) { (rdd, path) ⇒ rdd.union(sc.textFile(path)) } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3725) Link to building spark returns a 404
Anant Daksh Asthana created SPARK-3725: -- Summary: Link to building spark returns a 404 Key: SPARK-3725 URL: https://issues.apache.org/jira/browse/SPARK-3725 Project: Spark Issue Type: Documentation Reporter: Anant Daksh Asthana Priority: Minor The README.md link to Building Spark returns a 404 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3726) RandomForest: Support for bootstrap options
Joseph K. Bradley created SPARK-3726: Summary: RandomForest: Support for bootstrap options Key: SPARK-3726 URL: https://issues.apache.org/jira/browse/SPARK-3726 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Priority: Minor RandomForest uses BaggedPoint to simulate bootstrapped samples of the data. The expected size of each sample is the same as the original data (sampling rate = 1.0), and sampling is done with replacement. Adding support for other sampling rates and for sampling without replacement would be useful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
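A hedged sketch of what those options could look like (illustrative NumPy code; MLlib's actual BaggedPoint is Scala and currently fixes the rate at 1.0 with replacement):

{code}
import numpy as np

def bagged_weights(n_instances, n_trees, subsample_rate=1.0, replacement=True):
    """Weight of each instance in each tree's simulated bootstrap sample."""
    if replacement:
        # Sampling with replacement at a given rate is commonly simulated by
        # drawing per-instance counts from Poisson(subsample_rate).
        return np.random.poisson(lam=subsample_rate, size=(n_instances, n_trees))
    # Without replacement, each instance appears at most once per tree.
    return (np.random.rand(n_instances, n_trees) < subsample_rate).astype(int)

print bagged_weights(5, 3)                                         # rate 1.0, with replacement
print bagged_weights(5, 3, subsample_rate=0.6, replacement=False)  # subsampled, without replacement
{code}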
[jira] [Created] (SPARK-3727) DecisionTree, RandomForest: More prediction functionality
Joseph K. Bradley created SPARK-3727: Summary: DecisionTree, RandomForest: More prediction functionality Key: SPARK-3727 URL: https://issues.apache.org/jira/browse/SPARK-3727 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley DecisionTree and RandomForest currently predict the most likely label for classification and the mean for regression. Other info about predictions would be useful. For classification: estimated probability of each possible label. For regression: variance of the estimate. RandomForest could also create aggregate predictions in multiple ways: * Predict mean or median value for regression. * Compute variance of estimates (across all trees) for both classification and regression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3725) Link to building spark returns a 404
[ https://issues.apache.org/jira/browse/SPARK-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152047#comment-14152047 ] Anant Daksh Asthana commented on SPARK-3725: Would it make sense to add a building-Spark document in the repo? That would make it easier to find documentation, and anyone who has the source would have the docs for it as well. Link to building spark returns a 404 Key: SPARK-3725 URL: https://issues.apache.org/jira/browse/SPARK-3725 Project: Spark Issue Type: Documentation Reporter: Anant Daksh Asthana Priority: Minor Original Estimate: 1m Remaining Estimate: 1m The README.md link to Building Spark returns a 404 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3725) Link to building spark returns a 404
[ https://issues.apache.org/jira/browse/SPARK-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152050#comment-14152050 ] Sean Owen commented on SPARK-3725: -- Yes of course, it's already in the repo and has been for a while. It was just renamed with a redirect from the old URL. But that update hasn't hit the public site yet. Link to building spark returns a 404 Key: SPARK-3725 URL: https://issues.apache.org/jira/browse/SPARK-3725 Project: Spark Issue Type: Documentation Reporter: Anant Daksh Asthana Priority: Minor Original Estimate: 1m Remaining Estimate: 1m The README.md link to Building Spark returns a 404 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-922) Update Spark AMI to Python 2.7
[ https://issues.apache.org/jira/browse/SPARK-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150695#comment-14150695 ] Andrew Davidson edited comment on SPARK-922 at 9/29/14 7:05 PM: here is how I am launching IPython notebook. I am running as the ec2-user: IPYTHON_OPTS="notebook --pylab inline --no-browser --port=7000" $SPARK_HOME/bin/pyspark Below are all the upgrade commands I ran. I ran into a small problem: the IPython magic %matplotlib inline creates an error; you can work around this by commenting it out. Andy yum install -y pssh yum install -y python27 python27-devel pssh -h /root/spark-ec2/slaves yum install -y python27 python27-devel wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27 pssh -h /root/spark-ec2/slaves "wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27" easy_install-2.7 pip pssh -h /root/spark-ec2/slaves easy_install-2.7 pip pip2.7 install numpy pssh -t0 -h /root/spark-ec2/slaves pip2.7 install numpy pip2.7 install ipython[all] printf "\n# Set Spark Python version\nexport PYSPARK_PYTHON=/usr/bin/python2.7\n" >> /root/spark/conf/spark-env.sh source /root/spark/conf/spark-env.sh was (Author: aedwip): here is how I am launching IPython notebook. I am running as the ec2-user: IPYTHON_OPTS="notebook --pylab inline --no-browser --port=7000" $SPARK_HOME/bin/pyspark Below are all the upgrade commands I ran. Any idea what I missed? Andy yum install -y pssh yum install -y python27 python27-devel pssh -h /root/spark-ec2/slaves yum install -y python27 python27-devel wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27 pssh -h /root/spark-ec2/slaves "wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27" easy_install-2.7 pip pssh -h /root/spark-ec2/slaves easy_install-2.7 pip pip2.7 install numpy pssh -t0 -h /root/spark-ec2/slaves pip2.7 install numpy pip2.7 install ipython[all] printf "\n# Set Spark Python version\nexport PYSPARK_PYTHON=/usr/bin/python2.7\n" >> /root/spark/conf/spark-env.sh source /root/spark/conf/spark-env.sh Update Spark AMI to Python 2.7 -- Key: SPARK-922 URL: https://issues.apache.org/jira/browse/SPARK-922 Project: Spark Issue Type: Task Components: EC2, PySpark Affects Versions: 0.9.0, 0.9.1, 1.0.0 Reporter: Josh Rosen Fix For: 1.2.0 Many Python libraries only support Python 2.7+, so we should make Python 2.7 the default Python on the Spark AMIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3729) Null-pointer when constructing a HiveContext when settings are present
Michael Armbrust created SPARK-3729: --- Summary: Null-pointer when constructing a HiveContext when settings are present Key: SPARK-3729 URL: https://issues.apache.org/jira/browse/SPARK-3729 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Priority: Blocker {code} java.lang.NullPointerException at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:307) at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:270) at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:242) at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79) at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:78) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at org.apache.spark.sql.SQLContext.init(SQLContext.scala:78) at org.apache.spark.sql.hive.HiveContext.init(HiveContext.scala:76) ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-3729) Null-pointer when constructing a HiveContext when settings are present
[ https://issues.apache.org/jira/browse/SPARK-3729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-3729: --- Assignee: Michael Armbrust Null-pointer when constructing a HiveContext when settings are present -- Key: SPARK-3729 URL: https://issues.apache.org/jira/browse/SPARK-3729 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker {code} java.lang.NullPointerException at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:307) at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:270) at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:242) at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79) at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:78) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at org.apache.spark.sql.SQLContext.init(SQLContext.scala:78) at org.apache.spark.sql.hive.HiveContext.init(HiveContext.scala:76) ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1547) Add gradient boosting algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152101#comment-14152101 ] Joseph K. Bradley commented on SPARK-1547: -- This will be great to have! The WIP code and the list of to-do items look good to me. Small comment: For the losses, it would be good to rename residual to either pseudoresidual (following Friedman's paper) or to lossGradient (which is more literal/accurate). It would also be nice to have the loss classes compute the loss itself, so that we can compute that at the end (and later track it along the way). Add gradient boosting algorithm to MLlib Key: SPARK-1547 URL: https://issues.apache.org/jira/browse/SPARK-1547 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.0.0 Reporter: Manish Amde Assignee: Manish Amde This task requires adding the gradient boosting algorithm to Spark MLlib. The implementation needs to adapt the gradient boosting algorithm to the scalable tree implementation. The tasks involves: - Comparing the various tradeoffs and finalizing the algorithm before implementation - Code implementation - Unit tests - Functional tests - Performance tests - Documentation [Ensembles design document (Google doc) | https://docs.google.com/document/d/1J0Q6OP2Ggx0SOtlPgRUkwLASrAkUJw6m6EK12jRDSNg/] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
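A toy Python rendering of that naming suggestion (hypothetical class names; the WIP code under discussion is Scala), with the loss value and its gradient side by side:

{code}
class SquaredErrorLoss(object):
    """Example loss class exposing both the loss value and its gradient."""

    def loss(self, prediction, label):
        return (prediction - label) ** 2

    def lossGradient(self, prediction, label):
        # Gradient of the loss w.r.t. the prediction; the pseudoresidual that
        # the next boosting iteration fits is the negative of this value.
        return 2.0 * (prediction - label)

l = SquaredErrorLoss()
print l.loss(0.8, 1.0), l.lossGradient(0.8, 1.0)  # 0.04 -0.4
{code}

Having loss() available separately is what would let the training loop report (and later track) the objective at each iteration, as the comment suggests.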
[jira] [Commented] (SPARK-3729) Null-pointer when constructing a HiveContext when settings are present
[ https://issues.apache.org/jira/browse/SPARK-3729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152113#comment-14152113 ] Apache Spark commented on SPARK-3729: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/2583 Null-pointer when constructing a HiveContext when settings are present -- Key: SPARK-3729 URL: https://issues.apache.org/jira/browse/SPARK-3729 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker {code} java.lang.NullPointerException at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:307) at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:270) at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:242) at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79) at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:78) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at org.apache.spark.sql.SQLContext.init(SQLContext.scala:78) at org.apache.spark.sql.hive.HiveContext.init(HiveContext.scala:76) ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3725) Link to building spark returns a 404
[ https://issues.apache.org/jira/browse/SPARK-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152138#comment-14152138 ] Sean Owen commented on SPARK-3725: -- No, that links to the raw markdown. Truly, the fix is to rebuild the site. The source is fine. Link to building spark returns a 404 Key: SPARK-3725 URL: https://issues.apache.org/jira/browse/SPARK-3725 Project: Spark Issue Type: Documentation Reporter: Anant Daksh Asthana Priority: Minor Original Estimate: 1m Remaining Estimate: 1m The README.md link to Building Spark returns a 404 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3730) Any one else having building spark recently
[ https://issues.apache.org/jira/browse/SPARK-3730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152140#comment-14152140 ] Sean Owen commented on SPARK-3730: -- (The profile is hadoop-2.3 but that's not the issue.) I have seen this too and it's a {{scalac}} bug as far as I can tell, as you can see from the stack trace. It's not a Spark issue. Any one else having building spark recently --- Key: SPARK-3730 URL: https://issues.apache.org/jira/browse/SPARK-3730 Project: Spark Issue Type: Question Reporter: Anant Daksh Asthana Priority: Minor I get an assertion error in spark/core/src/main/scala/org/apache/spark/HttpServer.scala while trying to build. I am building using mvn -Pyarn -PHadoop-2.3 -DskipTests -Phive clean package Here is the error I get http://pastebin.com/Shi43r53 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3685) Spark's local dir should accept only local paths
[ https://issues.apache.org/jira/browse/SPARK-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152152#comment-14152152 ] Matthew Farrellee commented on SPARK-3685: -- The root of the resource problem is how resources are handed out. YARN is giving you a whole CPU, some amount of memory, some amount of network and some amount of disk to work with. Your executor (like any program) uses different amounts of resources throughout its execution. At points in the execution the resource profile changes; call the demarcated regions phases. So an executor may transition from a high-resource phase to a low-resource phase. In a low-resource phase, you may want to free up resources for other executors, but maintain enough to do basic operations (say: serve a shuffle file). This is a problem that should be solved by the resource manager. In my opinion, a solution w/i Spark that isn't facilitated by the RN is a workaround/hack and should be avoided. An example of an RN-facilitated solution might be a message the executor can send to YARN to indicate its resources can be freed, except for some minimum amount. Spark's local dir should accept only local paths Key: SPARK-3685 URL: https://issues.apache.org/jira/browse/SPARK-3685 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.1.0 Reporter: Andrew Or When you try to set local dirs to hdfs:/tmp/foo it doesn't work. What it will try to do is create a folder called hdfs: and put tmp inside it. This is because in Util#getOrCreateLocalRootDirs we use java.io.File instead of Hadoop's file system to parse this path. We also need to resolve the path appropriately. This may not have an urgent use case, but it fails silently and does what is least expected. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2693) Support for UDAF Hive Aggregates like PERCENTILE
[ https://issues.apache.org/jira/browse/SPARK-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2693: Priority: Critical (was: Major) Support for UDAF Hive Aggregates like PERCENTILE Key: SPARK-2693 URL: https://issues.apache.org/jira/browse/SPARK-2693 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Ravindra Pesala Priority: Critical {code} SELECT MIN(field1), MAX(field2), AVG(field3), PERCENTILE(field4), year,month,day FROM raw_data_table GROUP BY year, month, day MIN, MAX and AVG functions work fine for me, but with PERCENTILE, I get an error as shown below. Exception in thread main java.lang.RuntimeException: No handler for udf class org.apache.hadoop.hive.ql.udf.UDAFPercentile at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.hive.HiveFunctionRegistry$.lookupFunction(hiveUdfs.scala:69) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$4$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:115) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$4$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:113) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165) {code} This aggregate extends UDAF, which we don't yet have a wrapper for. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2693) Support for UDAF Hive Aggregates like PERCENTILE
[ https://issues.apache.org/jira/browse/SPARK-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2693: Assignee: Ravindra Pesala Support for UDAF Hive Aggregates like PERCENTILE Key: SPARK-2693 URL: https://issues.apache.org/jira/browse/SPARK-2693 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Ravindra Pesala {code} SELECT MIN(field1), MAX(field2), AVG(field3), PERCENTILE(field4), year,month,day FROM raw_data_table GROUP BY year, month, day MIN, MAX and AVG functions work fine for me, but with PERCENTILE, I get an error as shown below. Exception in thread main java.lang.RuntimeException: No handler for udf class org.apache.hadoop.hive.ql.udf.UDAFPercentile at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.hive.HiveFunctionRegistry$.lookupFunction(hiveUdfs.scala:69) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$4$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:115) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$4$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:113) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165) {code} This aggregate extends UDAF, which we don't yet have a wrapper for. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3731) RDD caching stops working in pyspark after some time
Milan Straka created SPARK-3731: --- Summary: RDD caching stops working in pyspark after some time Key: SPARK-3731 URL: https://issues.apache.org/jira/browse/SPARK-3731 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 1.1.0 Environment: Linux, 32bit, standalone mode Reporter: Milan Straka Consider a file F which when loaded with sc.textFile and cached takes up slightly more than half of free memory for RDD cache. When in PySpark the following is executed: 1) a = sc.textFile(F) 2) a.cache().count() 3) b = sc.textFile(F) 4) b.cache().count() and then the following is repeated (for example 10 times): a) a.unpersist().cache().count() b) b.unpersist().cache().count() after some time, there are no RDDs cached in memory. Also, since that time, no other RDD ever gets cached (the worker always reports something like "WARN CacheManager: Not enough space to cache partition rdd_23_5 in memory! Free memory is 277478190 bytes.", even if rdd_23_5 is ~50MB). The Executors tab of the Application Detail UI shows that all executors have 0MB memory used (which is consistent with the CacheManager warning). When doing the same in Scala, everything works perfectly. I understand that this is a vague description, but I do not know how to describe the problem better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
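For reference, the reported steps condense into the following runnable PySpark script (the file path and master URL are placeholders to be adapted to the cluster where the bug reproduces):

{code}
from pyspark import SparkContext

sc = SparkContext('local', 'caching_bug_repro')
F = '/path/to/file/F'  # cached size slightly more than half the RDD cache

a = sc.textFile(F)
a.cache().count()
b = sc.textFile(F)
b.cache().count()

for _ in xrange(10):
    # After some iterations nothing stays cached, and the CacheManager
    # "Not enough space to cache partition" warning starts appearing.
    a.unpersist().cache().count()
    b.unpersist().cache().count()
{code}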
[jira] [Updated] (SPARK-3731) RDD caching stops working in pyspark after some time
[ https://issues.apache.org/jira/browse/SPARK-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Milan Straka updated SPARK-3731: Attachment: worker.log Sample worker.log showing the problem. For example, consider rdd_1_1. It has size 46.3MB. At the beginning, the caching works, but then it stops -- the last time rdd_1_1 does not fit into the cache, the following is reported: {{14/09/29 21:53:10 WARN CacheManager: Not enough space to cache partition rdd_1_1 in memory! Free memory is 148908945 bytes.}} RDD caching stops working in pyspark after some time Key: SPARK-3731 URL: https://issues.apache.org/jira/browse/SPARK-3731 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 1.1.0 Environment: Linux, 32bit, standalone mode Reporter: Milan Straka Attachments: worker.log Consider a file F which when loaded with sc.textFile and cached takes up slightly more than half of free memory for RDD cache. When in PySpark the following is executed: 1) a = sc.textFile(F) 2) a.cache().count() 3) b = sc.textFile(F) 4) b.cache().count() and then the following is repeated (for example 10 times): a) a.unpersist().cache().count() b) b.unpersist().cache().count() after some time, there are no RDDs cached in memory. Also, since that time, no other RDD ever gets cached (the worker always reports something like "WARN CacheManager: Not enough space to cache partition rdd_23_5 in memory! Free memory is 277478190 bytes.", even if rdd_23_5 is ~50MB). The Executors tab of the Application Detail UI shows that all executors have 0MB memory used (which is consistent with the CacheManager warning). When doing the same in Scala, everything works perfectly. I understand that this is a vague description, but I do not know how to describe the problem better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3685) Spark's local dir should accept only local paths
[ https://issues.apache.org/jira/browse/SPARK-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152228#comment-14152228 ] Andrew Or commented on SPARK-3685: -- Not sure if I fully understand what you mean. If I'm running an executor and I request 30G from the beginning, my application uses all of it to do computation and all is good. After I decommission the executor, I would like to keep 1G just to serve the shuffle files, but this can't be done easily because we need to start a smaller JVM and a smaller container. (Yarn currently doesn't support scaling the size of a container while it's still running.) Either way we need to transfer some state from the bigger JVM to the smaller JVM, and that adds some complexity to the design. The simplest alternative then would be just to write whatever state to an external location and terminate the executor JVM / container without starting a smaller one, and then have a long-running external service to serve these files. One proposal here then is to write these shuffle files to a special location and have the Yarn NM shuffle service serve the files. This is an alternative to DFS shuffle that is, however, highly specific to Yarn. I am doing some initial prototyping of this (the Yarn shuffle) approach to see how this will pan out. Spark's local dir should accept only local paths Key: SPARK-3685 URL: https://issues.apache.org/jira/browse/SPARK-3685 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.1.0 Reporter: Andrew Or When you try to set local dirs to hdfs:/tmp/foo it doesn't work. What it will try to do is create a folder called hdfs: and put tmp inside it. This is because in Util#getOrCreateLocalRootDirs we use java.io.File instead of Hadoop's file system to parse this path. We also need to resolve the path appropriately. This may not have an urgent use case, but it fails silently and does what is least expected. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3709) BroadcastSuite.Unpersisting org.apache.spark.broadcast.BroadcastSuite.Unpersisting TorrentBroadcast is flaky
[ https://issues.apache.org/jira/browse/SPARK-3709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152238#comment-14152238 ] Reynold Xin commented on SPARK-3709: Adding stack trace: {code} [info] - Unpersisting TorrentBroadcast on executors only in distributed mode *** FAILED *** [info] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 17, localhost): java.io.IOException: sendMessageReliably failed with ACK that signalled a remote error [info] org.apache.spark.network.nio.ConnectionManager$$anonfun$14.apply(ConnectionManager.scala:864) [info] org.apache.spark.network.nio.ConnectionManager$$anonfun$14.apply(ConnectionManager.scala:856) [info] org.apache.spark.network.nio.ConnectionManager$MessageStatus.markDone(ConnectionManager.scala:61) [info] org.apache.spark.network.nio.ConnectionManager.org$apache$spark$network$nio$ConnectionManager$$handleMessage(ConnectionManager.scala:655) [info] org.apache.spark.network.nio.ConnectionManager$$anon$10.run(ConnectionManager.scala:515) [info] java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [info] java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [info] java.lang.Thread.run(Thread.java:745) [info] Driver stacktrace: [info] at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1192) [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1181) [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1180) [info] at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) [info] at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) [info] at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1180) [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:695) [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:695) [info] at scala.Option.foreach(Option.scala:236) [info] at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:695) [info] ... {code} BroadcastSuite.Unpersisting org.apache.spark.broadcast.BroadcastSuite.Unpersisting TorrentBroadcast is flaky Key: SPARK-3709 URL: https://issues.apache.org/jira/browse/SPARK-3709 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Wendell Assignee: Cheng Lian Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()
Sotos Matzanas created SPARK-3732: - Summary: Yarn Client: Add option to NOT System.exit() at end of main() Key: SPARK-3732 URL: https://issues.apache.org/jira/browse/SPARK-3732 Project: Spark Issue Type: Improvement Affects Versions: 1.1.0 Reporter: Sotos Matzanas We would like to add the ability to create and submit Spark jobs programmatically via Scala/Java. We have found a way to hack this and submit jobs via Yarn, but since org.apache.spark.deploy.yarn.Client.main() exits with either 0 or 1 at the end, it terminates our own program as well. We would like to add an optional Spark conf param to NOT exit at the end of main(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
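A minimal sketch of the requested behavior (both the conf key name and the runClient helper here are hypothetical, not the actual patch):
{code}
// Hypothetical sketch of Client.main() guarding its System.exit() call.
def main(args: Array[String]): Unit = {
  val exitCode = runClient(args)  // hypothetical wrapper around the existing submission logic
  // Default stays true so spark-submit's exit codes keep working for existing callers.
  if (sparkConf.getBoolean("spark.yarn.client.exitOnCompletion", true)) {
    System.exit(exitCode)
  }
}
{code}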
[jira] [Commented] (SPARK-1547) Add gradient boosting algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152251#comment-14152251 ] Manish Amde commented on SPARK-1547: Sure. I like your naming suggestion. I will rebase from the latest master now that the RF PR has been accepted. I will create a WIP PR soon after (with tests and docs) so that we can discuss the code in greater detail. Add gradient boosting algorithm to MLlib Key: SPARK-1547 URL: https://issues.apache.org/jira/browse/SPARK-1547 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.0.0 Reporter: Manish Amde Assignee: Manish Amde This task requires adding the gradient boosting algorithm to Spark MLlib. The implementation needs to adapt the gradient boosting algorithm to the scalable tree implementation. The task involves: - Comparing the various tradeoffs and finalizing the algorithm before implementation - Code implementation - Unit tests - Functional tests - Performance tests - Documentation [Ensembles design document (Google doc) | https://docs.google.com/document/d/1J0Q6OP2Ggx0SOtlPgRUkwLASrAkUJw6m6EK12jRDSNg/] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
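For readers unfamiliar with the technique, here is a conceptual sketch of gradient boosting with squared-error loss on top of MLlib's existing tree trainer; it is illustrative only and not the API proposed in the design doc:
{code}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.rdd.RDD

// Fit each new tree to the residuals left by the ensemble built so far.
def boost(data: RDD[LabeledPoint], numIterations: Int, learningRate: Double): Seq[DecisionTreeModel] = {
  var residuals = data
  (0 until numIterations).map { _ =>
    val tree = DecisionTree.trainRegressor(residuals, Map[Int, Int](), "variance", 3, 32)
    // For squared error, the negative gradient is exactly the residual.
    residuals = residuals.map(p =>
      LabeledPoint(p.label - learningRate * tree.predict(p.features), p.features))
    tree
  }
}
{code}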
[jira] [Commented] (SPARK-3730) Anyone else having trouble building Spark recently
[ https://issues.apache.org/jira/browse/SPARK-3730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152269#comment-14152269 ] Anant Daksh Asthana commented on SPARK-3730: Definitely not a Spark issue. Just thought someone on here might know a solution. Anyone else having trouble building Spark recently --- Key: SPARK-3730 URL: https://issues.apache.org/jira/browse/SPARK-3730 Project: Spark Issue Type: Question Reporter: Anant Daksh Asthana Priority: Minor I get an assertion error in spark/core/src/main/scala/org/apache/spark/HttpServer.scala while trying to build. I am building with mvn -Pyarn -Phadoop-2.3 -DskipTests -Phive clean package Here is the error I get: http://pastebin.com/Shi43r53 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3717) DecisionTree, RandomForest: Partition by feature
[ https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-3717: - Description: h1. Summary Currently, data are partitioned by row/instance for DecisionTree and RandomForest. This JIRA argues for partitioning by feature for training deep trees. This is especially relevant for random forests, which are often trained to be deeper than single decision trees. h1. Details Dataset dimensions and the depth of the tree to be trained are the main problem parameters determining whether it is better to partition features or instances. For random forests (training many deep trees), partitioning features could be much better. Notation: * P = # workers * N = # instances * M = # features * D = depth of tree h2. Partitioning Features Algorithm sketch: * Each worker stores: ** a subset of columns (i.e., a subset of features). If a worker stores feature j, then the worker stores the feature value for all instances (i.e., the whole column). ** all labels * Train one level at a time. * Invariants: ** Each worker stores a mapping: instance → node in current level * On each iteration: ** Each worker: For each node in level, compute (best feature to split, info gain). ** Reduce (P x M) values to M values to find best split for each node. ** Workers who have features used in best splits communicate left/right for relevant instances. Gather total of N bits to master, then broadcast. * Total communication: ** Depth D iterations ** On each iteration, reduce to M values (~8 bytes each), broadcast N values (1 bit each). ** Estimate: D * (M * 8 + N) h2. Partitioning Instances Algorithm sketch: * Train one group of nodes at a time. * Invariants: ** Each worker stores a mapping: instance → node * On each iteration: ** Each worker: For each instance, add to aggregate statistics. ** Aggregate is of size (# nodes in group) x M x (# bins) x (# classes) *** (“# classes” is for classification; 3 for regression) ** Reduce aggregate. ** Master chooses best split for each node in group and broadcasts. * Local training: Once all instances for a node fit on one machine, it can be best to shuffle data and train subtrees locally. This can mean shuffling the entire dataset for each tree trained. * Summing over all iterations, reduce to total of: ** (# nodes in tree) x M x (# bins B) x (# classes C) values (~8 bytes each) ** Estimate: 2^D * M * B * C * 8 h2. Comparing Partitioning Methods Partitioning features costs less than partitioning instances when: * D * (M * 8 + N) < 2^D * M * B * C * 8 * D * N < 2^D * M * B * C * 8 (assuming D * M * 8 is small compared to the right-hand side) * N < [ 2^D * M * B * C * 8 ] / D Example: many instances: * 2 million instances, 3500 features, 100 bins, 5 classes, 6 levels (depth = 5) * Partitioning features: 6 * ( 3500 * 8 + 2*10^6 ) =~ 1.2 * 10^7 * Partitioning instances: 32 * 3500 * 100 * 5 * 8 =~ 4.5 * 10^8
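To make the comparison concrete, the two estimates can be written as small Scala helpers; the numbers below reproduce the worked example (a sketch of the back-of-envelope arithmetic above, not measured costs):
{code}
// Rough communication-cost estimates from the analysis above, in bytes.
def featurePartitionCost(levels: Int, m: Long, n: Long): Double =
  levels * (m * 8.0 + n)                  // D iterations of reduce-M plus broadcast-N

def instancePartitionCost(depth: Int, m: Long, b: Long, c: Long): Double =
  math.pow(2, depth) * m * b * c * 8.0    // ~2^D nodes, each with an M x B x C aggregate

featurePartitionCost(6, 3500, 2000000L)   // ≈ 1.2e7, matching the example
instancePartitionCost(5, 3500, 100, 5)    // ≈ 4.5e8, matching the example
{code}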
[jira] [Commented] (SPARK-3434) Distributed block matrix
[ https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152275#comment-14152275 ] Reza Zadeh commented on SPARK-3434: --- It looks like Shivaram Venkataraman from the AMPlab has started work on this. I will be meeting with him to see if we can reuse some of his work. Distributed block matrix Key: SPARK-3434 URL: https://issues.apache.org/jira/browse/SPARK-3434 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng This JIRA is for discussing distributed matrices stored as block sub-matrices. The main challenge is choosing a partitioning scheme that allows adding linear algebra operations in the future, e.g.: 1. matrix multiplication 2. matrix factorization (QR, LU, ...) Let's discuss the partitioning and storage and how they fit into the above use cases. Questions: 1. Should it be backed by a single RDD that contains all of the sub-matrices, or by many RDDs, each containing only one sub-matrix? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
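To picture the first option, a single RDD keyed by block coordinates might look like the sketch below (the names are illustrative, not a proposed API). A grid partitioner on the (blockRow, blockCol) key would then let operations such as multiplication ship only the blocks each output block needs.
{code}
import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.rdd.RDD

// Option 1: one RDD holding every sub-matrix, keyed by its grid coordinate.
case class BlockMatrix(
    blocks: RDD[((Int, Int), Matrix)],  // (blockRow, blockCol) -> sub-matrix
    rowsPerBlock: Int,
    colsPerBlock: Int)
{code}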
[jira] [Commented] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()
[ https://issues.apache.org/jira/browse/SPARK-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152291#comment-14152291 ] Apache Spark commented on SPARK-3732: - User 'smatzana' has created a pull request for this issue: https://github.com/apache/spark/pull/2584 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()
[ https://issues.apache.org/jira/browse/SPARK-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152294#comment-14152294 ] Marcelo Vanzin commented on SPARK-3732: --- I think that explicit System.exit() could just be removed. Exposing an option for this sounds like overkill. [~tgraves] had some comments about that call in the past, though. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()
[ https://issues.apache.org/jira/browse/SPARK-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152302#comment-14152302 ] Sotos Matzanas commented on SPARK-3732: --- We added the option as insurance for backward compatibility; removing the System.exit() call will obviously work unless somebody is explicitly checking the exit code from spark-submit. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()
[ https://issues.apache.org/jira/browse/SPARK-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152308#comment-14152308 ] Thomas Graves commented on SPARK-3732: -- I think you should just change the name of this jira to add support for programmatically calling the Spark Yarn Client. As you have found, this isn't currently supported, and I wouldn't want to just remove the exit and say it's supported without thinking about other implications. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()
[ https://issues.apache.org/jira/browse/SPARK-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152309#comment-14152309 ] Marcelo Vanzin commented on SPARK-3732: --- Removing the call should work regardless; it's redundant, since the code will just exit normally anyway after that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()
[ https://issues.apache.org/jira/browse/SPARK-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152318#comment-14152318 ] Marcelo Vanzin commented on SPARK-3732: --- BTW, if the call is removed, it should be possible to do what you want more generically by calling {{SparkSubmit.main}} directly. That's still a little fishy, but it's a lot less fishy than calling Yarn-specific code directly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
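As a sketch of that suggestion, the in-process equivalent of the spark-submit script would look roughly like this (the master, class, and jar values are placeholders):
{code}
import org.apache.spark.deploy.SparkSubmit

// Calls spark-submit's entry point directly instead of the Yarn-specific Client.
SparkSubmit.main(Array(
  "--master", "yarn-cluster",
  "--class", "com.example.MyJob",
  "/path/to/my-job.jar"))
{code}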
[jira] [Commented] (SPARK-3708) Backticks aren't handled correctly in aliases
[ https://issues.apache.org/jira/browse/SPARK-3708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152321#comment-14152321 ] Ravindra Pesala commented on SPARK-3708: I guess you mean HiveContext here, since SqlContext has no backtick support. I will work on this issue. Thank you. Backticks aren't handled correctly in aliases - Key: SPARK-3708 URL: https://issues.apache.org/jira/browse/SPARK-3708 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Michael Armbrust Here's a failing test case: {code} sql("SELECT k FROM (SELECT `key` AS `k` FROM src) a") {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()
[ https://issues.apache.org/jira/browse/SPARK-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152323#comment-14152323 ] Sotos Matzanas commented on SPARK-3732: --- [~tgraves] this jira is the first step for us to move forward with our own (very limited for now) version of programmatic job submission. I can add another jira to address the bigger issue, but we would like to see this one resolved first. Once we are confident with our solution, we can contribute to the second jira. Let us know what you think. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3733) Support for programmatically submitting Spark jobs
Sotos Matzanas created SPARK-3733: - Summary: Support for programmatically submitting Spark jobs Key: SPARK-3733 URL: https://issues.apache.org/jira/browse/SPARK-3733 Project: Spark Issue Type: New Feature Affects Versions: 1.1.0 Reporter: Sotos Matzanas Right now it's impossible to submit Spark jobs programmatically via a Scala (or Java) API. We would like to see that in a future version of Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3717) DecisionTree, RandomForest: Partition by feature
[ https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152347#comment-14152347 ] Sung Chung commented on SPARK-3717: --- I think that this would be great as an alternative option. 1. Partitioning by rows (as currently implemented) scales in # of rows. 2. Partitioning by features scales in # of features. With good modularization, I think a lot of tree logic (splitting, building trees) could be shared among the different partitioning schemes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3630) Identify cause of Kryo+Snappy PARSING_ERROR
[ https://issues.apache.org/jira/browse/SPARK-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152351#comment-14152351 ] Arun Ahuja commented on SPARK-3630: --- We have seen this issue as well: {code} com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to uncompress the chunk: PARSING_ERROR(2) at com.esotericsoftware.kryo.io.Input.fill(Input.java:142) at com.esotericsoftware.kryo.io.Input.require(Input.java:155) at com.esotericsoftware.kryo.io.Input.readInt(Input.java:337) at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:109) at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721) at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133) at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) ... Caused by: java.io.IOException: failed to uncompress the chunk: PARSING_ERROR(2) at org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:362) at org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:159) at org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:142) at com.esotericsoftware.kryo.io.Input.fill(Input.java:140) {code} This is with Spark 1.1, running on a Yarn cluster. The issue seems to be fairly frequent but does not happen on every run. Identify cause of Kryo+Snappy PARSING_ERROR --- Key: SPARK-3630 URL: https://issues.apache.org/jira/browse/SPARK-3630 Project: Spark Issue Type: Task Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Ash Assignee: Ankur Dave A recent GraphX commit caused non-deterministic exceptions in unit tests, so it was reverted (see SPARK-3400). Separately, [~aash] observed the same exception stacktrace in an application-specific Kryo registrator: {noformat} com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to uncompress the chunk: PARSING_ERROR(2) com.esotericsoftware.kryo.io.Input.fill(Input.java:142) com.esotericsoftware.kryo.io.Input.require(Input.java:169) com.esotericsoftware.kryo.io.Input.readInt(Input.java:325) com.esotericsoftware.kryo.io.Input.readFloat(Input.java:624) com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:127) com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:117) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109) com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) ... {noformat} This ticket is to identify the cause of the exception in the GraphX commit so the faulty commit can be fixed and merged back into master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
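For reference, an application-specific registrator like the one mentioned above is wired up roughly as follows (the class names are placeholders):
{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// Registers application classes with Kryo; referenced via spark.kryo.registrator.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[Array[Float]])
  }
}
// conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// conf.set("spark.kryo.registrator", "com.example.MyRegistrator")
{code}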
[jira] [Updated] (SPARK-3709) org.apache.spark.broadcast.BroadcastSuite.Unpersisting TorrentBroadcast is flaky
[ https://issues.apache.org/jira/browse/SPARK-3709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-3709: --- Assignee: Reynold Xin (was: Cheng Lian) org.apache.spark.broadcast.BroadcastSuite.Unpersisting TorrentBroadcast is flaky Key: SPARK-3709 URL: https://issues.apache.org/jira/browse/SPARK-3709 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Wendell Assignee: Reynold Xin Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3709) org.apache.spark.broadcast.BroadcastSuite.Unpersisting TorrentBroadcast is flaky
[ https://issues.apache.org/jira/browse/SPARK-3709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152377#comment-14152377 ] Apache Spark commented on SPARK-3709: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/2585 org.apache.spark.broadcast.BroadcastSuite.Unpersisting TorrentBroadcast is flaky Key: SPARK-3709 URL: https://issues.apache.org/jira/browse/SPARK-3709 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Wendell Assignee: Reynold Xin Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()
[ https://issues.apache.org/jira/browse/SPARK-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152409#comment-14152409 ] Thomas Graves commented on SPARK-3732: -- I understand your use case and the need for it, but at this point I don't think we want to say we support this without properly addressing the bigger picture. The only publicly supported, non-deprecated API is the spark-submit script, which means we won't guarantee backwards compatibility on the Client. Note that we specifically discussed whether the Client should be public on another PR, and it was decided that it isn't officially supported. The only reason the object was left public was backwards compatibility with how you used to start Spark on Yarn with the spark-class script. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org