[jira] [Commented] (SPARK-3714) Spark workflow scheduler
[ https://issues.apache.org/jira/browse/SPARK-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151387#comment-14151387 ] Egor Pakhomov commented on SPARK-3714: -- Yes, I tried; please see the [Design doc | https://docs.google.com/document/d/1q2Q8Ux-6uAkH7wtLJpc3jz-GfrDEjlbWlXtf20hvguk/edit?usp=sharing] for an explanation of why Oozie is not good enough. Spark workflow scheduler Key: SPARK-3714 URL: https://issues.apache.org/jira/browse/SPARK-3714 Project: Spark Issue Type: New Feature Components: Project Infra Reporter: Egor Pakhomov Priority: Minor [Design doc | https://docs.google.com/document/d/1q2Q8Ux-6uAkH7wtLJpc3jz-GfrDEjlbWlXtf20hvguk/edit?usp=sharing] The Spark stack is currently hard to use in production processes due to the lack of the following features: * Scheduling Spark jobs * Retrying a failed Spark job in a big pipeline * Sharing a context among jobs in a pipeline * Queueing jobs A typical use case for such a platform: wait for new data, process the new data, learn ML models on the new data, compare each model with the previous one, and in case of success overwrite the current production model in its HDFS directory with the new one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3717) DecisionTree, RandomForest: Partition by feature
Joseph K. Bradley created SPARK-3717: Summary: DecisionTree, RandomForest: Partition by feature Key: SPARK-3717 URL: https://issues.apache.org/jira/browse/SPARK-3717 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley h1. Summary Currently, data are partitioned by row/instance for DecisionTree and RandomForest. This JIRA argues for partitioning by feature for training deep trees. This is especially relevant for random forests, which are often trained to be deeper than single decision trees. h1. Details Dataset dimensions and the depth of the tree to be trained are the main problem parameters determining whether it is better to partition features or instances. For random forests (training many deep trees), partitioning features could be much better. Notation: * P = # workers * N = # instances * M = # features * D = depth of tree h2. Partitioning Features Algorithm sketch: * Train one level at a time. * Invariants: ** Each worker stores a mapping: instance → node in current level * On each iteration: ** Each worker: For each node in level, compute (best feature to split, info gain). ** Reduce (P x M) values to M values to find best split for each node. ** Workers who have features used in best splits communicate left/right for relevant instances. Gather total of N bits to master, then broadcast. * Total communication: ** Depth D iterations ** On each iteration, reduce to M values (~8 bytes each), broadcast N values (1 bit each). ** Estimate: D * (M * 8 + N) h2. Partitioning Instances Algorithm sketch: * Train one group of nodes at a time. * Invariants: ** Each worker stores a mapping: instance → node * On each iteration: ** Each worker: For each instance, add to aggregate statistics. ** Aggregate is of size (# nodes in group) x M x (# bins) x (# classes) *** (“# classes” is for classification; 3 for regression) ** Reduce aggregate. ** Master chooses best split for each node in group and broadcasts. * Local training: Once all instances for a node fit on one machine, it can be best to shuffle the data and train subtrees locally. This can mean shuffling the entire dataset for each tree trained. * Summing over all iterations, reduce to total of: ** (# nodes in tree) x M x (# bins B) x (# classes C) values (~8 bytes each) ** Estimate: 2^D * M * B * C * 8 h2. Comparing Partitioning Methods The cost of partitioning features is less than the cost of partitioning instances when: * D * (M * 8 + N) < 2^D * M * B * C * 8 * D * N < 2^D * M * B * C * 8 (assuming D * M * 8 is small compared to the RHS) * N < [ 2^D * M * B * C * 8 ] / D Example: many instances: * 2 million instances, 3500 features, 100 bins, 5 classes, 6 levels (depth = 5) * Partitioning features: 6 * ( 3500 * 8 + 2*10^6 ) =~ 1.2 * 10^7 * Partitioning instances: 32 * 3500 * 100 * 5 * 8 =~ 4.5 * 10^8 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
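As a quick sanity check of the two communication estimates, a few lines of Scala reproduce the example's arithmetic (the formulas come from the description above; the variable names are ours):
{code}
// Example: N = 2e6 instances, M = 3500 features, B = 100 bins, C = 5 classes,
// 6 levels (depth D = 5).
val (n, m, b, c) = (2.0e6, 3500.0, 100.0, 5.0)
val levels = 6                                    // one iteration per level
val byFeature  = levels * (m * 8 + n)             // reduce M values + broadcast N bits per level
val byInstance = math.pow(2, 5) * m * b * c * 8   // ~2^D nodes, each with an M x B x C aggregate
println(f"by feature:  $byFeature%.2e")           // ~1.22e7
println(f"by instance: $byInstance%.2e")          // ~4.48e8
{code}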
[jira] [Updated] (SPARK-3717) DecisionTree, RandomForest: Partition by feature
[ https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-3717: - Description: h1. Summary Currently, data are partitioned by row/instance for DecisionTree and RandomForest. This JIRA argues for partitioning by feature for training deep trees. This is especially relevant for random forests, which are often trained to be deeper than single decision trees. h1. Details Dataset dimensions and the depth of the tree to be trained are the main problem parameters determining whether it is better to partition features or instances. For random forests (training many deep trees), partitioning features could be much better. Notation: * P = # workers * N = # instances * M = # features * D = depth of tree h2. Partitioning Features Algorithm sketch: * Train one level at a time. * Invariants: ** Each worker stores a mapping: instance → node in current level * On each iteration: ** Each worker: For each node in level, compute (best feature to split, info gain). ** Reduce (P x M) values to M values to find best split for each node. ** Workers who have features used in best splits communicate left/right for relevant instances. Gather total of N bits to master, then broadcast. * Total communication: ** Depth D iterations ** On each iteration, reduce to M values (~8 bytes each), broadcast N values (1 bit each). ** Estimate: D * (M * 8 + N) h2. Partitioning Instances Algorithm sketch: * Train one group of nodes at a time. * Invariants: ** Each worker stores a mapping: instance → node * On each iteration: ** Each worker: For each instance, add to aggregate statistics. ** Aggregate is of size (# nodes in group) x M x (# bins) x (# classes) *** (“# classes” is for classification; 3 for regression) ** Reduce aggregate. ** Master chooses best split for each node in group and broadcasts. * Local training: Once all instances for a node fit on one machine, it can be best to shuffle the data and train subtrees locally. This can mean shuffling the entire dataset for each tree trained. * Summing over all iterations, reduce to total of: ** (# nodes in tree) x M x (# bins B) x (# classes C) values (~8 bytes each) ** Estimate: 2^D * M * B * C * 8 h2. Comparing Partitioning Methods The cost of partitioning features is less than the cost of partitioning instances when: * D * (M * 8 + N) < 2^D * M * B * C * 8 * D * N < 2^D * M * B * C * 8 (assuming D * M * 8 is small compared to the right hand side) * N < [ 2^D * M * B * C * 8 ] / D Example: many instances: * 2 million instances, 3500 features, 100 bins, 5 classes, 6 levels (depth = 5) * Partitioning features: 6 * ( 3500 * 8 + 2*10^6 ) =~ 1.2 * 10^7 * Partitioning instances: 32 * 3500 * 100 * 5 * 8 =~ 4.5 * 10^8 was: h1. Summary Currently, data are partitioned by row/instance for DecisionTree and RandomForest. This JIRA argues for partitioning by feature for training deep trees. This is especially relevant for random forests, which are often trained to be deeper than single decision trees. h1. Details Dataset dimensions and the depth of the tree to be trained are the main problem parameters determining whether it is better to partition features or instances. For random forests (training many deep trees), partitioning features could be much better. Notation: * P = # workers * N = # instances * M = # features * D = depth of tree h2. Partitioning Features Algorithm sketch: * Train one level at a time. * Invariants: ** Each worker stores a mapping: instance → node in current level * On each iteration: ** Each worker: For each node in level, compute (best feature to split, info gain). ** Reduce (P x M) values to M values to find best split for each node. ** Workers who have features used in best splits communicate left/right for relevant instances. Gather total of N bits to master, then broadcast. * Total communication: ** Depth D iterations ** On each iteration, reduce to M values (~8 bytes each), broadcast N values (1 bit each). ** Estimate: D * (M * 8 + N) h2. Partitioning Instances Algorithm sketch: * Train one group of nodes at a time. * Invariants: ** Each worker stores a mapping: instance → node * On each iteration: ** Each worker: For each instance, add to aggregate statistics. ** Aggregate is of size (# nodes in group) x M x (# bins) x (# classes) *** (“# classes” is for classification; 3 for regression) ** Reduce aggregate. ** Master chooses best split for each node in group and broadcasts. * Local training: Once all instances for a node fit on one machine, it can be best to shuffle the data and train subtrees locally. This can mean shuffling the entire dataset for each tree trained. * Summing over all iterations, reduce to total of: ** (# nodes in tree) x M x (# bins B) x (# classes C) values (~8 bytes each) ** Estimate: 2^D * M * B * C * 8 h2. Comparing Partitioning Methods The cost of partitioning features is less than the cost of partitioning
[jira] [Created] (SPARK-3718) FsHistoryProvider should consider spark.eventLog.dir not only spark.history.fs.logDirectory
Kousuke Saruta created SPARK-3718: - Summary: FsHistoryProvider should consider spark.eventLog.dir not only spark.history.fs.logDirectory Key: SPARK-3718 URL: https://issues.apache.org/jira/browse/SPARK-3718 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: Kousuke Saruta Priority: Minor This is a minor improvement. FsHistoryProvider reads event logs from the directory given by spark.history.fs.logDirectory, but that directory is usually the same as the one given by spark.eventLog.dir, so we should consider spark.eventLog.dir too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
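For context, the two properties typically point at the same location; a minimal spark-defaults.conf sketch of the setup the issue describes (the HDFS path is a placeholder):
{code}
# Applications write event logs here ...
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///shared/spark-logs
# ... and the history server reads from the same directory.
spark.history.fs.logDirectory    hdfs:///shared/spark-logs
{code}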
[jira] [Closed] (SPARK-3567) appId field in SparkDeploySchedulerBackend should be volatile
[ https://issues.apache.org/jira/browse/SPARK-3567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta closed SPARK-3567. - Resolution: Fixed appId field in SparkDeploySchedulerBackend should be volatile - Key: SPARK-3567 URL: https://issues.apache.org/jira/browse/SPARK-3567 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.2.0 Reporter: Kousuke Saruta Assignee: Kousuke Saruta Fix For: 1.2.0 The appId field in SparkDeploySchedulerBackend is set by AppClient.ClientActor#receiveWithLogging and read via SparkDeploySchedulerBackend#applicationId. The thread running AppClient.ClientActor and a thread invoking SparkDeploySchedulerBackend#applicationId can be different threads, so appId should be volatile. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
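The fix is the standard Scala idiom for cross-thread visibility of a field; a minimal sketch (field and method names follow the issue; the surrounding class is omitted):
{code}
// Written from the AppClient.ClientActor thread, read from other threads.
// Without @volatile, a reader may never observe the writer's update.
@volatile private var appId: String = _

def applicationId(): String = appId
{code}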
[jira] [Commented] (SPARK-2516) Bootstrapping
[ https://issues.apache.org/jira/browse/SPARK-2516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151416#comment-14151416 ] Yu Ishikawa commented on SPARK-2516: Thank you for assigning me this issue and sharing the paper with me. I understand that we need to abstract the record type and the estimator interface. Bootstrapping - Key: SPARK-2516 URL: https://issues.apache.org/jira/browse/SPARK-2516 Project: Spark Issue Type: Sub-task Components: MLlib Reporter: Xiangrui Meng Assignee: Yu Ishikawa Support re-sampling and bootstrap estimators in MLlib. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
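A bootstrap over an RDD amounts to repeated sampling with replacement; a minimal sketch, where the estimator signature is our assumption rather than the interface this issue will settle on:
{code}
import org.apache.spark.rdd.RDD

// Draw numResamples bootstrap resamples (with replacement, same expected size
// as the input) and apply an arbitrary estimator to each one.
def bootstrap[T](data: RDD[T], numResamples: Int)(estimator: RDD[T] => Double): Seq[Double] =
  (0 until numResamples).map { seed =>
    estimator(data.sample(withReplacement = true, fraction = 1.0, seed = seed.toLong))
  }
{code}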
[jira] [Commented] (SPARK-3718) FsHistoryProvider should consider spark.eventLog.dir not only spark.history.fs.logDirectory
[ https://issues.apache.org/jira/browse/SPARK-3718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151418#comment-14151418 ] Apache Spark commented on SPARK-3718: - User 'sarutak' has created a pull request for this issue: https://github.com/apache/spark/pull/2573 FsHistoryProvider should consider spark.eventLog.dir not only spark.history.fs.logDirectory --- Key: SPARK-3718 URL: https://issues.apache.org/jira/browse/SPARK-3718 Project: Spark Issue Type: Improvement Components: Web UI Affects Versions: 1.2.0 Reporter: Kousuke Saruta Priority: Minor This is a minor improvement. FsHistoryProvider reads event logs from the directory given by spark.history.fs.logDirectory, but that directory is usually the same as the one given by spark.eventLog.dir, so we should consider spark.eventLog.dir too. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3714) Spark workflow scheduler
[ https://issues.apache.org/jira/browse/SPARK-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151422#comment-14151422 ] Sean Owen commented on SPARK-3714: -- Another meta-question for everyone: at what point should a project like this simply be a separate add-on project? For example, Oozie is a stand-alone project. Not everything needs to happen directly under the Spark umbrella, which is already broad. One upside to including it in Spark is that it perhaps gets more attention. Spark is then forced to maintain it and keep it compatible, which is also a downside, I suppose. There is also the effect that you create an official workflow engine and discourage others. I am more asking the question than suggesting an answer, but my reaction was that this could live outside Spark just fine. Spark workflow scheduler Key: SPARK-3714 URL: https://issues.apache.org/jira/browse/SPARK-3714 Project: Spark Issue Type: New Feature Components: Project Infra Reporter: Egor Pakhomov Priority: Minor [Design doc | https://docs.google.com/document/d/1q2Q8Ux-6uAkH7wtLJpc3jz-GfrDEjlbWlXtf20hvguk/edit?usp=sharing] The Spark stack is currently hard to use in production processes due to the lack of the following features: * Scheduling Spark jobs * Retrying a failed Spark job in a big pipeline * Sharing a context among jobs in a pipeline * Queueing jobs A typical use case for such a platform: wait for new data, process the new data, learn ML models on the new data, compare each model with the previous one, and in case of success overwrite the current production model in its HDFS directory with the new one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3274) Spark Streaming Java API reports java.lang.ClassCastException when calling collectAsMap on JavaPairDStream
[ https://issues.apache.org/jira/browse/SPARK-3274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151432#comment-14151432 ] Sean Owen commented on SPARK-3274: -- I don't think that's the same thing. It is just saying you are reading a {{SequenceFile}} of {{Text}} and then pretending they are Strings. Are you sure the first {{return}} statement works? They will both work as expected if you just call {{.toString()}} on the {{Text}} objects you are actually operating on. Spark Streaming Java API reports java.lang.ClassCastException when calling collectAsMap on JavaPairDStream -- Key: SPARK-3274 URL: https://issues.apache.org/jira/browse/SPARK-3274 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.2 Reporter: Jack Hu Reproduce code:
{code}
scontext
  .socketTextStream("localhost", 1)
  .mapToPair(new PairFunction<String, String, String>() {
    public Tuple2<String, String> call(String arg0) throws Exception {
      return new Tuple2<String, String>("1", arg0);
    }
  })
  .foreachRDD(new Function2<JavaPairRDD<String, String>, Time, Void>() {
    public Void call(JavaPairRDD<String, String> v1, Time v2) throws Exception {
      System.out.println(v2.toString() + ": " + v1.collectAsMap().toString());
      return null;
    }
  });
{code}
Exception:
{code}
java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Lscala.Tuple2;
  at org.apache.spark.rdd.PairRDDFunctions.collectAsMap(PairRDDFunctions.scala:447)
  at org.apache.spark.api.java.JavaPairRDD.collectAsMap(JavaPairRDD.scala:464)
  at tuk.usecase.failedcall.FailedCall$1.call(FailedCall.java:90)
  at tuk.usecase.failedcall.FailedCall$1.call(FailedCall.java:88)
  at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$2.apply(JavaDStreamLike.scala:282)
  at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$2.apply(JavaDStreamLike.scala:282)
  at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
  at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
  at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
  at scala.util.Try$.apply(Try.scala:161)
  at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
  at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobS
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
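What Sean suggests amounts to keeping the Hadoop {{Text}} types when reading and converting explicitly; a minimal Scala sketch (the path is a placeholder, and {{sc}} is assumed to be a SparkContext):
{code}
import org.apache.hadoop.io.Text

// Read the SequenceFile with its real key/value types, then convert to
// String immediately; Hadoop reuses Text instances, so don't hold on to them.
val pairs = sc.sequenceFile("hdfs:///path/to/seqfile", classOf[Text], classOf[Text])
  .map { case (k, v) => (k.toString, v.toString) }
{code}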
[jira] [Commented] (SPARK-3274) Spark Streaming Java API reports java.lang.ClassCastException when calling collectAsMap on JavaPairDStream
[ https://issues.apache.org/jira/browse/SPARK-3274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151447#comment-14151447 ] Pulkit Bhuwalka commented on SPARK-3274: [~sowen] - you are right. I was making the mistake of reading the sequence file as String instead of Text. Adding toString fixed the problem. Thanks a lot for your help. Spark Streaming Java API reports java.lang.ClassCastException when calling collectAsMap on JavaPairDStream -- Key: SPARK-3274 URL: https://issues.apache.org/jira/browse/SPARK-3274 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.2 Reporter: Jack Hu Reproduce code:
{code}
scontext
  .socketTextStream("localhost", 1)
  .mapToPair(new PairFunction<String, String, String>() {
    public Tuple2<String, String> call(String arg0) throws Exception {
      return new Tuple2<String, String>("1", arg0);
    }
  })
  .foreachRDD(new Function2<JavaPairRDD<String, String>, Time, Void>() {
    public Void call(JavaPairRDD<String, String> v1, Time v2) throws Exception {
      System.out.println(v2.toString() + ": " + v1.collectAsMap().toString());
      return null;
    }
  });
{code}
Exception:
{code}
java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Lscala.Tuple2;
  at org.apache.spark.rdd.PairRDDFunctions.collectAsMap(PairRDDFunctions.scala:447)
  at org.apache.spark.api.java.JavaPairRDD.collectAsMap(JavaPairRDD.scala:464)
  at tuk.usecase.failedcall.FailedCall$1.call(FailedCall.java:90)
  at tuk.usecase.failedcall.FailedCall$1.call(FailedCall.java:88)
  at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$2.apply(JavaDStreamLike.scala:282)
  at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$2.apply(JavaDStreamLike.scala:282)
  at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
  at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
  at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
  at scala.util.Try$.apply(Try.scala:161)
  at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
  at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobS
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3274) Spark Streaming Java API reports java.lang.ClassCastException when calling collectAsMap on JavaPairDStream
[ https://issues.apache.org/jira/browse/SPARK-3274?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151447#comment-14151447 ] Pulkit Bhuwalka edited comment on SPARK-3274 at 9/29/14 7:27 AM: - [~sowen] - you are right. I was making the mistake of reading the sequence file as String instead of Text. Adding toString fixed the problem. Thanks a lot for your help. was (Author: pulkit.bosc...@gmail.com): [~sowen] - you are right. I was making the mistake of reading the sequence file as String instead of text. Addind toString fixed the problem. Thanks a lot for your help. Spark Streaming Java API reports java.lang.ClassCastException when calling collectAsMap on JavaPairDStream -- Key: SPARK-3274 URL: https://issues.apache.org/jira/browse/SPARK-3274 Project: Spark Issue Type: Bug Components: Java API Affects Versions: 1.0.2 Reporter: Jack Hu Reproduce code:
{code}
scontext
  .socketTextStream("localhost", 1)
  .mapToPair(new PairFunction<String, String, String>() {
    public Tuple2<String, String> call(String arg0) throws Exception {
      return new Tuple2<String, String>("1", arg0);
    }
  })
  .foreachRDD(new Function2<JavaPairRDD<String, String>, Time, Void>() {
    public Void call(JavaPairRDD<String, String> v1, Time v2) throws Exception {
      System.out.println(v2.toString() + ": " + v1.collectAsMap().toString());
      return null;
    }
  });
{code}
Exception:
{code}
java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Lscala.Tuple2;
  at org.apache.spark.rdd.PairRDDFunctions.collectAsMap(PairRDDFunctions.scala:447)
  at org.apache.spark.api.java.JavaPairRDD.collectAsMap(JavaPairRDD.scala:464)
  at tuk.usecase.failedcall.FailedCall$1.call(FailedCall.java:90)
  at tuk.usecase.failedcall.FailedCall$1.call(FailedCall.java:88)
  at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$2.apply(JavaDStreamLike.scala:282)
  at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$2.apply(JavaDStreamLike.scala:282)
  at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
  at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
  at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
  at scala.util.Try$.apply(Try.scala:161)
  at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
  at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobS
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2159) Spark shell exit() does not stop SparkContext
[ https://issues.apache.org/jira/browse/SPARK-2159?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2159. -- Resolution: Won't Fix Fix Version/s: (was: 1.2.0) The discussion in the PR suggests this is WontFix. https://github.com/apache/spark/pull/1230#issuecomment-54045637 Spark shell exit() does not stop SparkContext - Key: SPARK-2159 URL: https://issues.apache.org/jira/browse/SPARK-2159 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Or Priority: Minor If you type exit() in spark shell, it is equivalent to a Ctrl+C and does not stop the SparkContext. This is used very commonly to exit a shell, and it would be good if it is equivalent to Ctrl+D instead, which does stop the SparkContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3714) Spark workflow scheduler
[ https://issues.apache.org/jira/browse/SPARK-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151458#comment-14151458 ] Egor Pakhomov commented on SPARK-3714: -- I agree with your concerns - it should be a project separate from the Spark codebase. But it can't be a general-purpose workflow engine like Oozie: there are too many essential Spark-specific things in the new engine, like supporting a Spark context and providing the capability to write simple jobs in Scala from HUE. I think the best approach for such an engine is the Ooyala job server model - a separate, but Spark-oriented, project. Spark workflow scheduler Key: SPARK-3714 URL: https://issues.apache.org/jira/browse/SPARK-3714 Project: Spark Issue Type: New Feature Components: Project Infra Reporter: Egor Pakhomov Priority: Minor [Design doc | https://docs.google.com/document/d/1q2Q8Ux-6uAkH7wtLJpc3jz-GfrDEjlbWlXtf20hvguk/edit?usp=sharing] The Spark stack is currently hard to use in production processes due to the lack of the following features: * Scheduling Spark jobs * Retrying a failed Spark job in a big pipeline * Sharing a context among jobs in a pipeline * Queueing jobs A typical use case for such a platform: wait for new data, process the new data, learn ML models on the new data, compare each model with the previous one, and in case of success overwrite the current production model in its HDFS directory with the new one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2643) Stages web ui has ERROR when pool name is None
[ https://issues.apache.org/jira/browse/SPARK-2643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-2643. -- Resolution: Fixed Discussion suggests this was fixed by a related change: https://github.com/apache/spark/pull/1854#issuecomment-55061571 Stages web ui has ERROR when pool name is None -- Key: SPARK-2643 URL: https://issues.apache.org/jira/browse/SPARK-2643 Project: Spark Issue Type: Bug Components: Web UI Reporter: YanTang Zhai Priority: Minor 14/07/23 16:01:44 WARN servlet.ServletHandler: /stages/ java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:313) at scala.None$.get(Option.scala:311) at org.apache.spark.ui.jobs.StageTableBase.stageRow(StageTable.scala:132) at org.apache.spark.ui.jobs.StageTableBase.org$apache$spark$ui$jobs$StageTableBase$$renderStageRow(StageTable.scala:150) at org.apache.spark.ui.jobs.StageTableBase$$anonfun$toNodeSeq$1.apply(StageTable.scala:52) at org.apache.spark.ui.jobs.StageTableBase$$anonfun$toNodeSeq$1.apply(StageTable.scala:52) at org.apache.spark.ui.jobs.StageTableBase$$anonfun$stageTable$1.apply(StageTable.scala:61) at org.apache.spark.ui.jobs.StageTableBase$$anonfun$stageTable$1.apply(StageTable.scala:61) at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376) at scala.collection.immutable.Stream$$anonfun$map$1.apply(Stream.scala:376) at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085) at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077) at scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980) at scala.collection.immutable.StreamIterator$$anonfun$next$1.apply(Stream.scala:980) at scala.collection.immutable.StreamIterator$LazyCell.v$lzycompute(Stream.scala:969) at scala.collection.immutable.StreamIterator$LazyCell.v(Stream.scala:969) at scala.collection.immutable.StreamIterator.hasNext(Stream.scala:974) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at scala.xml.NodeBuffer.$amp$plus(NodeBuffer.scala:38) at scala.xml.NodeBuffer.$amp$plus(NodeBuffer.scala:40) at org.apache.spark.ui.jobs.StageTableBase.stageTable(StageTable.scala:60) at org.apache.spark.ui.jobs.StageTableBase.toNodeSeq(StageTable.scala:52) at org.apache.spark.ui.jobs.JobProgressPage.render(JobProgressPage.scala:91) at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:65) at org.apache.spark.ui.WebUI$$anonfun$attachPage$1.apply(WebUI.scala:65) at org.apache.spark.ui.JettyUtils$$anon$1.doGet(JettyUtils.scala:70) at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:684) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:501) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1086) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:428) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1020) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) at org.eclipse.jetty.server.Server.handle(Server.java:370) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:494) at 
org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:971) at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1033) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:644) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82) at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:667) at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) at
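The failure above is a bare {{None.get}} on a stage's missing pool name. Whatever the actual fix in the linked PR was, the usual defensive pattern for this class of bug looks like the following sketch (the map and field names are hypothetical):
{code}
// Fall back to a default instead of calling .get on a possibly-missing value.
val poolName = stageIdToPoolName.get(stage.stageId).getOrElse("default")
{code}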
[jira] [Resolved] (SPARK-1208) after some hours of working the :4040 monitoring UI stops working.
[ https://issues.apache.org/jira/browse/SPARK-1208?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1208. -- Resolution: Fixed This appears to be a similar, if not the same issue, as in SPARK-2643. The discussion in the PR indicates this was resolved by a subsequent change: https://github.com/apache/spark/pull/1854#issuecomment-55061571 after some hours of working the :4040 monitoring UI stops working. -- Key: SPARK-1208 URL: https://issues.apache.org/jira/browse/SPARK-1208 Project: Spark Issue Type: Bug Components: Web UI Affects Versions: 0.9.0 Reporter: Tal Sliwowicz This issue is inconsistent, but it did not exist in prior versions. The Driver app otherwise works normally. The log file below is from the driver. 2014-03-09 07:24:55,837 WARN [qtp1187052686-17453] AbstractHttpConnection - /stages/ java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:313) at scala.None$.get(Option.scala:311) at org.apache.spark.ui.jobs.StageTable.org$apache$spark$ui$jobs$StageTable$$stageRow(StageTable.scala:114) at org.apache.spark.ui.jobs.StageTable$$anonfun$toNodeSeq$1.apply(StageTable.scala:39) at org.apache.spark.ui.jobs.StageTable$$anonfun$toNodeSeq$1.apply(StageTable.scala:39) at org.apache.spark.ui.jobs.StageTable$$anonfun$stageTable$1.apply(StageTable.scala:57) at org.apache.spark.ui.jobs.StageTable$$anonfun$stageTable$1.apply(StageTable.scala:57) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at scala.collection.TraversableLike$class.map(TraversableLike.scala:244) at scala.collection.AbstractTraversable.map(Traversable.scala:105) at org.apache.spark.ui.jobs.StageTable.stageTable(StageTable.scala:57) at org.apache.spark.ui.jobs.StageTable.toNodeSeq(StageTable.scala:39) at org.apache.spark.ui.jobs.IndexPage.render(IndexPage.scala:81) at org.apache.spark.ui.jobs.JobProgressUI$$anonfun$getHandlers$3.apply(JobProgressUI.scala:59) at org.apache.spark.ui.jobs.JobProgressUI$$anonfun$getHandlers$3.apply(JobProgressUI.scala:59) at org.apache.spark.ui.JettyUtils$$anon$1.handle(JettyUtils.scala:61) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1040) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:976) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135) at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116) at org.eclipse.jetty.server.Server.handle(Server.java:363) at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:483) at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:920) at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:982) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:635) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235) at org.eclipse.jetty.server.AsyncHttpConnection.handle(AsyncHttpConnection.java:82) at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:628) at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:52) at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543) at java.lang.Thread.run(Thread.java:662) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3203) ClassNotFoundException in spark-shell with Cassandra
[ https://issues.apache.org/jira/browse/SPARK-3203?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-3203: - Summary: ClassNotFoundException in spark-shell with Cassandra (was: ClassNotFound Exception) ClassNotFoundException in spark-shell with Cassandra Key: SPARK-3203 URL: https://issues.apache.org/jira/browse/SPARK-3203 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2 Environment: Ubuntu 12.04, openjdk 64 bit 7u65 Reporter: Rohit Kumar I am using Spark as a processing engine over Cassandra. I have only one master and one worker node. I am executing the following code in spark-shell:
{code}
sc.stop
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import com.datastax.spark.connector._

val conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext("spark://L-BXP44Z1:7077", "Cassandra Connector Test", conf)
val rdd = sc.cassandraTable("test", "kv")
println(rdd.map(_.getInt("value")).sum)
{code}
I am getting the following error: 14/08/25 18:47:17 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address 14/08/25 18:49:39 INFO CoarseGrainedExecutorBackend: Got assigned task 0 14/08/25 18:49:39 INFO Executor: Running task ID 0 14/08/25 18:49:39 ERROR Executor: Exception in task ID 0 java.lang.ClassNotFoundException: $line29.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1 at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:354) at java.lang.ClassLoader.loadClass(ClassLoader.java:425) at java.lang.ClassLoader.loadClass(ClassLoader.java:358) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:270) at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:60) at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612) at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at scala.collection.immutable.$colon$colon.readObject(List.scala:362) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990) at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63) at org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61) at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141) at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1837) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1796) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350) at
[jira] [Updated] (SPARK-1381) Spark to Shark direct streaming
[ https://issues.apache.org/jira/browse/SPARK-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-1381: - Priority: Major (was: Blocker) It sounds like this is WontFix at this point, if there was a problem to begin with, as Shark is deprecated. Spark to Shark direct streaming --- Key: SPARK-1381 URL: https://issues.apache.org/jira/browse/SPARK-1381 Project: Spark Issue Type: Question Components: Documentation, Examples, Input/Output, Java API, Spark Core Affects Versions: 0.8.1 Reporter: Abhishek Tripathi Labels: performance Hi, I'm trying to push data coming from Spark Streaming to a Shark cache table. I thought of using the JDBC API, but Shark (0.8.1) does not support a direct insert statement, i.e. insert into emp values(2, 'Apia'). I don't want to store the Spark Streaming output in HDFS and then copy that data to the Shark table. Can somebody please help: 1. How can I directly point Spark Streaming data to a Shark table/cached table? Alternatively, how can Shark pick up data directly from Spark Streaming? 2. Does Shark 0.8.1 have a direct insert statement without referring to another table? This is really stopping us from using Spark any further; we need your assistance urgently. Thanks in advance. Abhishek -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1381) Spark to Shark direct streaming
[ https://issues.apache.org/jira/browse/SPARK-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1381. -- Resolution: Won't Fix Spark to Shark direct streaming --- Key: SPARK-1381 URL: https://issues.apache.org/jira/browse/SPARK-1381 Project: Spark Issue Type: Question Components: Documentation, Examples, Input/Output, Java API, Spark Core Affects Versions: 0.8.1 Reporter: Abhishek Tripathi Labels: performance Hi, I'm trying to push data coming from Spark Streaming to a Shark cache table. I thought of using the JDBC API, but Shark (0.8.1) does not support a direct insert statement, i.e. insert into emp values(2, 'Apia'). I don't want to store the Spark Streaming output in HDFS and then copy that data to the Shark table. Can somebody please help: 1. How can I directly point Spark Streaming data to a Shark table/cached table? Alternatively, how can Shark pick up data directly from Spark Streaming? 2. Does Shark 0.8.1 have a direct insert statement without referring to another table? This is really stopping us from using Spark any further; we need your assistance urgently. Thanks in advance. Abhishek -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3714) Spark workflow scheduler
[ https://issues.apache.org/jira/browse/SPARK-3714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151486#comment-14151486 ] Mridul Muralidharan commented on SPARK-3714: Most of the drawbacks mentioned are not severe IMO - at best, they reflect unfamiliarity with the Oozie platform (points 2, 3, 4, 5). Point 1 is interesting (sharing a Spark context) - though from a fault-tolerance point of view, supporting it is challenging; of course, Oozie was probably not designed with something like Spark in mind - so there might be changes to Oozie which would benefit Spark; we could engage with Oozie dev for that. But discarding it to reinvent something when Oozie already does everything mentioned in the requirements section seems counterintuitive. I have seen multiple attempts to 'simplify' workflow management, and at production scale almost everything ends up being similar ... Note that most production jobs have to depend on a variety of jobs - not just Spark or MR - so you will end up converging on a variant of Oozie anyway :-) Having said that, if you want to take a crack at solving this with Spark-specific idioms in mind, it would be interesting to see the result - I don't want to dissuade you from doing so! We might end up with something quite interesting. Spark workflow scheduler Key: SPARK-3714 URL: https://issues.apache.org/jira/browse/SPARK-3714 Project: Spark Issue Type: New Feature Components: Project Infra Reporter: Egor Pakhomov Priority: Minor [Design doc | https://docs.google.com/document/d/1q2Q8Ux-6uAkH7wtLJpc3jz-GfrDEjlbWlXtf20hvguk/edit?usp=sharing] The Spark stack is currently hard to use in production processes due to the lack of the following features: * Scheduling Spark jobs * Retrying a failed Spark job in a big pipeline * Sharing a context among jobs in a pipeline * Queueing jobs A typical use case for such a platform: wait for new data, process the new data, learn ML models on the new data, compare each model with the previous one, and in case of success overwrite the current production model in its HDFS directory with the new one. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3719) Spark UI: complete/failed stages is better to show the total number of stages
uncleGen created SPARK-3719: --- Summary: Spark UI: complete/failed stages is better to show the total number of stages Key: SPARK-3719 URL: https://issues.apache.org/jira/browse/SPARK-3719 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: uncleGen Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3719) Spark UI: complete/failed stages is better to show the total number of stages
[ https://issues.apache.org/jira/browse/SPARK-3719?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151490#comment-14151490 ] Apache Spark commented on SPARK-3719: - User 'uncleGen' has created a pull request for this issue: https://github.com/apache/spark/pull/2574 Spark UI: complete/failed stages is better to show the total number of stages Key: SPARK-3719 URL: https://issues.apache.org/jira/browse/SPARK-3719 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: uncleGen Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3582) Spark SQL having issue with existing Hive UDFs which take Map as a parameter
[ https://issues.apache.org/jira/browse/SPARK-3582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151511#comment-14151511 ] Saurabh Santhosh commented on SPARK-3582: - Issue resolved by pull request: https://github.com/apache/spark/pull/2506 Spark SQL having issue with existing Hive UDFs which take Map as a parameter Key: SPARK-3582 URL: https://issues.apache.org/jira/browse/SPARK-3582 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Saurabh Santhosh Assignee: Adrian Wang Fix For: 1.2.0 I have a UDF with the following evaluate method: public Text evaluate(Text argument, Map<Text, Text> params) And when I tried invoking this UDF, I got the following error: scala.MatchError: interface java.util.Map (of class java.lang.Class) at org.apache.spark.sql.hive.HiveInspectors$class.javaClassToDataType(HiveInspectors.scala:35) at org.apache.spark.sql.hive.HiveFunctionRegistry.javaClassToDataType(hiveUdfs.scala:37) I had a look at HiveInspectors.scala and was not able to see any resolver for java.util.Map. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3601) Kryo NPE for output operations on Avro complex Objects even after registering.
[ https://issues.apache.org/jira/browse/SPARK-3601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] mohan gaddam updated SPARK-3601: Shepherd: Reynold Xin Kryo NPE for output operations on Avro complex Objects even after registering. -- Key: SPARK-3601 URL: https://issues.apache.org/jira/browse/SPARK-3601 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.0 Environment: local, standalone cluster Reporter: mohan gaddam The Kryo serializer works well when Avro objects have simple data, but when the same Avro object has complex data (like unions/arrays), Kryo fails during output operations; map operations are fine. Note that I have registered all the Avro-generated classes with Kryo. I'm using Java as the programming language. When a complex message is used, it throws an NPE; stack trace as follows: == ERROR scheduler.JobScheduler: Error running job streaming job 1411043845000 ms.0 org.apache.spark.SparkException: Job aborted due to stage failure: Exception while getting task result: com.esotericsoftware.kryo.KryoException: java.lang.NullPointerException Serialization trace: value (xyz.Datum) data (xyz.ResMsg) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688) at scala.Option.foreach(Option.scala:236) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) at akka.actor.ActorCell.invoke(ActorCell.scala:456) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) at akka.dispatch.Mailbox.run(Mailbox.scala:219) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) == In the above exception, Datum and ResMsg are project-specific classes generated by Avro from the following avdl snippet:
{code}
record KeyValueObject {
  union {boolean, int, long, float, double, bytes, string} name;
  union {boolean, int, long, float, double, bytes, string,
         array<union{boolean, int, long, float, double, bytes, string, KeyValueObject}>,
         KeyValueObject} value;
}
record Datum {
  union {boolean, int, long, float, double, bytes, string,
         array<union{boolean, int, long, float, double, bytes, string, KeyValueObject}>,
         KeyValueObject} value;
}
record ResMsg {
  string version;
  string sequence;
  string resourceGUID;
  string GWID;
  string GWTimestamp;
  union {Datum, array<Datum>} data;
}
{code}
Avro message samples are as follows:
{code}
1) {"version": "01", "sequence": "1", "resourceGUID": "001", "GWID": "002", "GWTimestamp": "1409823150737", "data": {"value": 30}}
2) {"version": "01", "sequence": "1", "resource": "sensor-001", "controller": "002", "controllerTimestamp": "1411038710358", "data": {"value": [{"name": "Temperature", "value": 30}, {"name": "Speed", "value": 60}, {"name": "Location", "value": ["+401213.1", "-0750015.1"]}, {"name": "Timestamp", "value": "2014-09-09T08:15:25-05:00"}]}}
{code}
Both 1 and 2 adhere to the Avro schema, so the decoder is able to convert them into Avro objects in the Spark Streaming API. BTW, the messages were pulled from a Kafka source and decoded using a Kafka decoder. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
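For reference, the registration the reporter describes is typically done through a KryoRegistrator; a minimal sketch using the record names from the avdl above (the registrator's package and name are hypothetical):
{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// Register the Avro-generated classes so Kryo doesn't fall back to
// default field serialization of unregistered types.
class AvroKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[KeyValueObject])
    kryo.register(classOf[Datum])
    kryo.register(classOf[ResMsg])
  }
}

// Driver side:
// conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
//     .set("spark.kryo.registrator", "my.pkg.AvroKryoRegistrator")
{code}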
[jira] [Commented] (SPARK-2626) Stop SparkContext in all examples
[ https://issues.apache.org/jira/browse/SPARK-2626?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151575#comment-14151575 ] Apache Spark commented on SPARK-2626: - User 'srowen' has created a pull request for this issue: https://github.com/apache/spark/pull/2575 Stop SparkContext in all examples - Key: SPARK-2626 URL: https://issues.apache.org/jira/browse/SPARK-2626 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Andrew Or Fix For: 1.2.0 Event logs rely on sc.stop() to close the file. If this is never closed, the history server will not be able to find the logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
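Whether or not the linked PR wraps example bodies in try/finally, the point of the fix is that every example must call sc.stop(); a minimal sketch of the pattern:
{code}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("Example"))
try {
  // ... example body ...
} finally {
  sc.stop()  // closes the event log so the history server can find it
}
{code}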
[jira] [Commented] (SPARK-2805) update akka to version 2.3
[ https://issues.apache.org/jira/browse/SPARK-2805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14151646#comment-14151646 ] Quinton Anderson commented on SPARK-2805: - Any progress on this front? update akka to version 2.3 -- Key: SPARK-2805 URL: https://issues.apache.org/jira/browse/SPARK-2805 Project: Spark Issue Type: Sub-task Components: Build, Spark Core Reporter: Anand Avati akka-2.3 is the lowest version available for Scala 2.11. akka-2.3 depends on protobuf 2.5, while Hadoop-1 requires protobuf 2.4.1. In order to reconcile the conflicting dependencies, we need to release an akka-2.3.x-shaded-protobuf artifact which has protobuf 2.5 within. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-1313) Shark- JDBC driver
[ https://issues.apache.org/jira/browse/SPARK-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-1313: - Priority: Minor (was: Blocker) Issue Type: Question (was: Task) Shark- JDBC driver --- Key: SPARK-1313 URL: https://issues.apache.org/jira/browse/SPARK-1313 Project: Spark Issue Type: Question Components: Documentation, Examples, Java API Reporter: Abhishek Tripathi Priority: Minor Labels: Hive, JDBC, Shark Hi, I'm trying to get a JDBC (or any) driver that can connect to Shark from Java and execute Shark/Hive queries. Can you please advise whether such a connector/driver is available? Thanks Abhishek -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1313) Shark- JDBC driver
[ https://issues.apache.org/jira/browse/SPARK-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1313. -- Resolution: Not a Problem This looks like it was a question more than anything, and was answered. Shark- JDBC driver --- Key: SPARK-1313 URL: https://issues.apache.org/jira/browse/SPARK-1313 Project: Spark Issue Type: Question Components: Documentation, Examples, Java API Reporter: Abhishek Tripathi Priority: Minor Labels: Hive, JDBC, Shark Hi, I'm trying to get a JDBC (or any) driver that can connect to Shark from Java and execute Shark/Hive queries. Can you please advise whether such a connector/driver is available? Thanks Abhishek -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-1884) Shark failed to start
[ https://issues.apache.org/jira/browse/SPARK-1884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-1884. -- Resolution: Won't Fix This appears to be a protobuf version mismatch, which suggests Shark is being used with an unsupported version of Hadoop. As Shark is deprecated and unlikely to take steps to support anything else -- and because there is a sort of clear path to a workaround here if one cared to -- I think this is a WontFix too. Shark failed to start - Key: SPARK-1884 URL: https://issues.apache.org/jira/browse/SPARK-1884 Project: Spark Issue Type: Bug Affects Versions: 0.9.1 Environment: ubuntu 14.04, spark 0.9.1, hive 0.13.0, hadoop 2.4.0 (stand alone), scala 2.11.0 Reporter: Wei Cui Priority: Blocker Hadoop, Hive, and Spark work fine. When starting Shark, it fails with the following messages: Starting the Shark Command Line Client 14/05/19 16:47:21 INFO Configuration.deprecation: mapred.input.dir.recursive is deprecated. Instead, use mapreduce.input.fileinputformat.input.dir.recursive 14/05/19 16:47:21 INFO Configuration.deprecation: mapred.max.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.maxsize 14/05/19 16:47:21 INFO Configuration.deprecation: mapred.min.split.size is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize 14/05/19 16:47:21 INFO Configuration.deprecation: mapred.min.split.size.per.rack is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.rack 14/05/19 16:47:21 INFO Configuration.deprecation: mapred.min.split.size.per.node is deprecated. Instead, use mapreduce.input.fileinputformat.split.minsize.per.node 14/05/19 16:47:21 INFO Configuration.deprecation: mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces 14/05/19 16:47:21 INFO Configuration.deprecation: mapred.reduce.tasks.speculative.execution is deprecated. Instead, use mapreduce.reduce.speculative 14/05/19 16:47:22 WARN conf.Configuration: org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@48c724c:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring. 14/05/19 16:47:22 WARN conf.Configuration: org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@48c724c:an attempt to override final parameter: mapreduce.cluster.local.dir; Ignoring. 14/05/19 16:47:22 WARN conf.Configuration: org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@48c724c:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring. 14/05/19 16:47:22 WARN conf.Configuration: org.apache.hadoop.hive.conf.LoopingByteArrayInputStream@48c724c:an attempt to override final parameter: mapreduce.cluster.temp.dir; Ignoring. Logging initialized using configuration in jar:file:/usr/local/shark/lib_managed/jars/edu.berkeley.cs.shark/hive-common/hive-common-0.11.0-shark-0.9.1.jar!/hive-log4j.properties Hive history file=/tmp/root/hive_job_log_root_14857@ubuntu_201405191647_897494215.txt 6.004: [GC 279616K->18440K(1013632K), 0.0438980 secs] 6.445: [Full GC 59125K->7949K(1013632K), 0.0685160 secs] Reloading cached RDDs from previous Shark sessions... 
(use -skipRddReload flag to skip reloading) 7.535: [Full GC 104136K->13059K(1013632K), 0.0885820 secs] 8.459: [Full GC 61237K->18031K(1013632K), 0.0820400 secs] 8.662: [Full GC 29832K->8958K(1013632K), 0.0869700 secs] 8.751: [Full GC 13433K->8998K(1013632K), 0.0856520 secs] 10.435: [Full GC 72246K->14140K(1013632K), 0.1797530 secs] Exception in thread "main" org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1072) at shark.memstore2.TableRecovery$.reloadRdds(TableRecovery.scala:49) at shark.SharkCliDriver.<init>(SharkCliDriver.scala:283) at shark.SharkCliDriver$.main(SharkCliDriver.scala:162) at shark.SharkCliDriver.main(SharkCliDriver.scala) Caused by: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.metastore.HiveMetaStoreClient at org.apache.hadoop.hive.metastore.MetaStoreUtils.newInstance(MetaStoreUtils.java:1139) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.<init>(RetryingMetaStoreClient.java:51) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.getProxy(RetryingMetaStoreClient.java:61) at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:2288) at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:2299) at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1070) ... 4 more Caused by:
[jira] [Commented] (SPARK-3687) Spark hang while processing more than 100 sequence files
[ https://issues.apache.org/jira/browse/SPARK-3687?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151683#comment-14151683 ] Yi Tian commented on SPARK-3687: The stack you printed is from the worker process. I think you should print the stack of the hung executor. Spark hang while processing more than 100 sequence files Key: SPARK-3687 URL: https://issues.apache.org/jira/browse/SPARK-3687 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.2, 1.1.0 Reporter: Ziv Huang In my application, I read more than 100 sequence files into a JavaPairRDD, perform a flatMap to get another JavaRDD, and then use takeOrdered to get the result. Quite often (but not always), the job hangs while executing somewhere around the 120th-150th task. In 1.0.2, the job can hang for several hours, maybe forever (I can't wait for it to complete). When the job hangs, I can't kill it from the web UI. In 1.1.0, the job hangs for a couple of minutes (about 3 minutes, actually), and then the Spark master's web UI shows the job as finished with state FAILED. In addition, the job's stage web UI still hangs, and the execution duration keeps accumulating. For both 1.0.2 and 1.1.0, the job hangs with no error messages anywhere. The current workaround is to use coalesce to reduce the number of partitions to be processed. I never see a job hang if the number of partitions to be processed is no greater than 100. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
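For anyone hitting this: the executor runs as its own JVM (CoarseGrainedExecutorBackend), so the stack must be taken from that process on the worker node, not from the worker daemon. A small illustrative Python helper, assuming the JDK's jstack is on the PATH; the helper name and output path are made up for the example:
{noformat}
# Illustrative helper: dump the stack of a hung executor JVM.
# The executor process is CoarseGrainedExecutorBackend, not the Worker;
# find its pid with jps or ps, then run jstack against it.
import subprocess

def dump_executor_stack(pid, out_path='executor_stack.txt'):
    stack = subprocess.check_output(['jstack', str(pid)])
    with open(out_path, 'w') as f:
        f.write(stack)
{noformat}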
[jira] [Commented] (SPARK-3685) Spark's local dir should accept only local paths
[ https://issues.apache.org/jira/browse/SPARK-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151691#comment-14151691 ] Matthew Farrellee commented on SPARK-3685: -- [~andrewor] thanks for the info. afaik the executor is also in charge of the shuffle-file life-cycle, and breaking that would be complicated. it's probably a cleaner implementation to allow executors to remain and use a policy to prune unused/little-used executors, where "unused/little-used" factors in the amount of data they are holding as well as the CPU they use. you could also go down the path of aging out executors - let their resources go back to the node's pool for reallocation, but don't kill off the process. however, approaches like that become very complex and push implementation details of the workload, which often don't generalize, into the scheduling system. [~andrewor] btw, it should be a warning case (hey, you might have messed up - I see you used hdfs:/ in your file name) instead of an error case. Spark's local dir should accept only local paths Key: SPARK-3685 URL: https://issues.apache.org/jira/browse/SPARK-3685 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.1.0 Reporter: Andrew Or When you try to set local dirs to hdfs:/tmp/foo it doesn't work. What it will try to do is create a folder called hdfs: and put tmp inside it. This is because Util#getOrCreateLocalRootDirs uses java.io.File instead of Hadoop's file system to parse this path. We also need to resolve the path appropriately. This may not have an urgent use case, but it fails silently and does what is least expected. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
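A sketch of the suggested warning behavior (not Spark's actual implementation): parse each configured local dir and warn when it carries a non-local scheme. Function name and messages are illustrative:
{noformat}
# Sketch of the suggested check (not Spark's actual code): warn when a
# configured local dir looks like a non-local URI such as hdfs:/tmp/foo.
from urlparse import urlparse  # Python 2; urllib.parse in Python 3

def check_local_dirs(local_dirs):
    for d in local_dirs.split(','):
        scheme = urlparse(d.strip()).scheme
        if scheme and scheme != 'file':
            print "WARN: spark.local.dir '%s' uses scheme '%s:'; " \
                  "local dirs must be local filesystem paths" % (d, scheme)

check_local_dirs('/tmp/spark,hdfs:/tmp/foo')
{noformat}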
[jira] [Created] (SPARK-3720) support ORC in spark sql
wangfei created SPARK-3720: -- Summary: support ORC in spark sql Key: SPARK-3720 URL: https://issues.apache.org/jira/browse/SPARK-3720 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.1.0 Reporter: wangfei The Optimized Row Columnar (ORC) file format provides a highly efficient way to store data on HDFS. The ORC file format has many advantages, such as: 1. a single file as the output of each task, which reduces the NameNode's load; 2. Hive type support including datetime, decimal, and the complex types (struct, list, map, and union); 3. light-weight indexes stored within the file, making it possible to skip row groups that don't pass predicate filtering and to seek to a given row; 4. block-mode compression based on data type, with run-length encoding for integer columns and dictionary encoding for string columns; 5. concurrent reads of the same file using separate RecordReaders; 6. the ability to split files without scanning for markers; 7. bounded memory requirements for reading or writing; 8. metadata stored using Protocol Buffers, which allows addition and removal of fields. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
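Until native support lands, ORC is already reachable from Spark SQL through HiveQL in a HiveContext. A minimal sketch (table names are illustrative); native support would presumably add a direct API analogous to the existing Parquet one:
{noformat}
# Sketch: using ORC through HiveQL today (table names are illustrative);
# native Spark SQL support would add a direct API like the Parquet one.
from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName='orc_sketch')
hc = HiveContext(sc)
hc.sql("CREATE TABLE logs_orc (ts STRING, msg STRING) STORED AS ORC")
hc.sql("INSERT OVERWRITE TABLE logs_orc SELECT ts, msg FROM logs_text")
print hc.sql("SELECT count(*) FROM logs_orc").collect()
{noformat}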
[jira] [Updated] (SPARK-3720) support ORC in spark sql
[ https://issues.apache.org/jira/browse/SPARK-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wangfei updated SPARK-3720: --- Description: The Optimized Row Columnar (ORC) file format provides a highly efficient way to store data on HDFS. The ORC file format has many advantages, such as: 1. a single file as the output of each task, which reduces the NameNode's load; 2. Hive type support including datetime, decimal, and the complex types (struct, list, map, and union); 3. light-weight indexes stored within the file, making it possible to skip row groups that don't pass predicate filtering and to seek to a given row; 4. block-mode compression based on data type, with run-length encoding for integer columns and dictionary encoding for string columns; 5. concurrent reads of the same file using separate RecordReaders; 6. the ability to split files without scanning for markers; 7. bounded memory requirements for reading or writing; 8. metadata stored using Protocol Buffers, which allows addition and removal of fields. Spark SQL currently supports Parquet; supporting ORC as well would give users more options. was: The Optimized Row Columnar (ORC) file format provides a highly efficient way to store data on HDFS. The ORC file format has many advantages, such as: 1. a single file as the output of each task, which reduces the NameNode's load; 2. Hive type support including datetime, decimal, and the complex types (struct, list, map, and union); 3. light-weight indexes stored within the file, making it possible to skip row groups that don't pass predicate filtering and to seek to a given row; 4. block-mode compression based on data type, with run-length encoding for integer columns and dictionary encoding for string columns; 5. concurrent reads of the same file using separate RecordReaders; 6. the ability to split files without scanning for markers; 7. bounded memory requirements for reading or writing; 8. metadata stored using Protocol Buffers, which allows addition and removal of fields. support ORC in spark sql Key: SPARK-3720 URL: https://issues.apache.org/jira/browse/SPARK-3720 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.1.0 Reporter: wangfei The Optimized Row Columnar (ORC) file format provides a highly efficient way to store data on HDFS. The ORC file format has many advantages, such as: 1. a single file as the output of each task, which reduces the NameNode's load; 2. Hive type support including datetime, decimal, and the complex types (struct, list, map, and union); 3. light-weight indexes stored within the file, making it possible to skip row groups that don't pass predicate filtering and to seek to a given row; 4. block-mode compression based on data type, with run-length encoding for integer columns and dictionary encoding for string columns; 5. concurrent reads of the same file using separate RecordReaders; 6. the ability to split files without scanning for markers; 7. bounded memory requirements for reading or writing; 8. metadata stored using Protocol Buffers, which allows addition and removal of fields. Spark SQL currently supports Parquet; supporting ORC as well would give users more options. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3720) support ORC in spark sql
[ https://issues.apache.org/jira/browse/SPARK-3720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151727#comment-14151727 ] Apache Spark commented on SPARK-3720: - User 'scwf' has created a pull request for this issue: https://github.com/apache/spark/pull/2576 support ORC in spark sql Key: SPARK-3720 URL: https://issues.apache.org/jira/browse/SPARK-3720 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 1.1.0 Reporter: wangfei The Optimized Row Columnar (ORC) file format provides a highly efficient way to store data on HDFS. The ORC file format has many advantages, such as: 1. a single file as the output of each task, which reduces the NameNode's load; 2. Hive type support including datetime, decimal, and the complex types (struct, list, map, and union); 3. light-weight indexes stored within the file, making it possible to skip row groups that don't pass predicate filtering and to seek to a given row; 4. block-mode compression based on data type, with run-length encoding for integer columns and dictionary encoding for string columns; 5. concurrent reads of the same file using separate RecordReaders; 6. the ability to split files without scanning for markers; 7. bounded memory requirements for reading or writing; 8. metadata stored using Protocol Buffers, which allows addition and removal of fields. Spark SQL currently supports Parquet; supporting ORC as well would give users more options. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3522) Make spark-ec2 verbosity configurable
[ https://issues.apache.org/jira/browse/SPARK-3522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151787#comment-14151787 ] Nicholas Chammas commented on SPARK-3522: - Always logging to a file sounds like a good idea. As for changing the shell scripts, doesn't their output get to the user's screen by way of [this {{check_call()}}|https://github.com/apache/spark/blob/657bdff41a27568a981b3e342ad380fe92aa08a0/ec2/spark_ec2.py#L784]? It should be possible to suppress or redirect that output in Python without modifying the shell scripts (here's [an example|https://github.com/apache/spark/pull/2339/files#diff-ada66bbeb2f1327b508232ef6c3805a5R613]). Make spark-ec2 verbosity configurable - Key: SPARK-3522 URL: https://issues.apache.org/jira/browse/SPARK-3522 Project: Spark Issue Type: Improvement Components: EC2 Reporter: Nicholas Chammas Priority: Minor When launching a cluster, {{spark-ec2}} spits out a lot of stuff that feels like debug output. It would be better for the user if {{spark-ec2}} did the following: * default to info output level * allow option to increase verbosity and include debug output This will require converting most of the {{print}} statements in the script to use Python's {{logging}} module and setting output levels ({{INFO}}, {{WARN}}, {{DEBUG}}) for each statement. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
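A sketch of what the proposed conversion could look like (handler setup and file path are illustrative, not the actual patch): print statements become logging calls, the console handler's level is driven by a verbosity flag, and a file handler always captures everything:
{noformat}
# Sketch of the proposed change (names and paths are illustrative):
# console verbosity is configurable, while a file always gets full output.
import logging

def setup_logging(verbose=False):
    root = logging.getLogger()
    root.setLevel(logging.DEBUG)
    console = logging.StreamHandler()
    console.setLevel(logging.DEBUG if verbose else logging.INFO)
    root.addHandler(console)
    logfile = logging.FileHandler('spark-ec2.log')  # always log everything
    logfile.setLevel(logging.DEBUG)
    root.addHandler(logfile)

setup_logging(verbose=False)
logging.info('Launching cluster...')   # was: print 'Launching cluster...'
logging.debug('Raw SSH command: ...')  # hidden unless --verbose is passed
{noformat}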
[jira] [Commented] (SPARK-3627) spark on yarn reports success even though job fails
[ https://issues.apache.org/jira/browse/SPARK-3627?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151784#comment-14151784 ] Apache Spark commented on SPARK-3627: - User 'tgravescs' has created a pull request for this issue: https://github.com/apache/spark/pull/2577 spark on yarn reports success even though job fails --- Key: SPARK-3627 URL: https://issues.apache.org/jira/browse/SPARK-3627 Project: Spark Issue Type: Bug Components: YARN Affects Versions: 1.2.0 Reporter: Thomas Graves Assignee: Thomas Graves Priority: Critical I was running a wordcount job and saving the output to HDFS. If the output directory already exists, YARN reports success even though the job fails, since the job requires that the output directory not exist. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
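A minimal PySpark repro sketch of the failure mode described (paths are illustrative): the save raises because the directory already exists, yet the YARN application status still shows SUCCEEDED:
{noformat}
# Minimal repro sketch (paths are illustrative): saveAsTextFile fails with
# FileAlreadyExistsException when the output directory exists, but YARN
# still reports the application as SUCCEEDED.
from pyspark import SparkContext

sc = SparkContext(appName='wordcount_repro')
counts = (sc.textFile('hdfs:///tmp/input.txt')
            .flatMap(lambda line: line.split())
            .map(lambda w: (w, 1))
            .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile('hdfs:///tmp/existing_output')
{noformat}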
[jira] [Created] (SPARK-3721) Broadcast Variables above 2GB break in PySpark
Brad Miller created SPARK-3721: -- Summary: Broadcast Variables above 2GB break in PySpark Key: SPARK-3721 URL: https://issues.apache.org/jira/browse/SPARK-3721 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.1.0 Reporter: Brad Miller Attempting to reproduce the bug in isolation in an IPython notebook, I've observed the following 3 unique failure modes, all of which seem to be related to broadcast variable size. Note that I'm running Python 2.7.3 on all machines and using the Spark 1.1.0 binaries.
**BLOCK 1** [no problem]
{noformat}
import cPickle
from pyspark import SparkContext

def check_pre_serialized(size):
    msg = cPickle.dumps(range(2 ** size))
    print 'serialized length:', len(msg)
    bvar = sc.broadcast(msg)
    print 'length recovered from broadcast variable:', len(bvar.value)
    print 'correct value recovered:', msg == bvar.value
    bvar.unpersist()

def check_unserialized(size):
    msg = range(2 ** size)
    bvar = sc.broadcast(msg)
    print 'correct value recovered:', msg == bvar.value
    bvar.unpersist()

SparkContext.setSystemProperty('spark.executor.memory', '15g')
SparkContext.setSystemProperty('spark.cores.max', '5')
sc = SparkContext('spark://crosby.research.intel-research.net:7077', 'broadcast_bug')
{noformat}
**BLOCK 2** [no problem]
{noformat}
check_pre_serialized(20)
serialized length: 9374656
length recovered from broadcast variable: 9374656
correct value recovered: True
{noformat}
**BLOCK 3** [no problem]
{noformat}
check_unserialized(20)
correct value recovered: True
{noformat}
**BLOCK 4** [no problem]
{noformat}
check_pre_serialized(27)
serialized length: 1499501632
length recovered from broadcast variable: 1499501632
correct value recovered: True
{noformat}
**BLOCK 5** [no problem]
{noformat}
check_unserialized(27)
correct value recovered: True
{noformat}
**BLOCK 6** [ERROR 1: unhandled error from cPickle.dumps inside sc.broadcast]
{noformat}
check_pre_serialized(28)
.
/home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj)
    354
    355     def dumps(self, obj):
--> 356         return cPickle.dumps(obj, 2)
    357
    358     loads = cPickle.loads
SystemError: error return without exception set
{noformat}
**BLOCK 7** [no problem]
{noformat}
check_unserialized(28)
correct value recovered: True
{noformat}
**BLOCK 8** [ERROR 2: no error occurs and *incorrect result* is returned]
{noformat}
check_pre_serialized(29)
serialized length: 6331339840
length recovered from broadcast variable: 2036372544
correct value recovered: False
{noformat}
**BLOCK 9** [ERROR 3: unhandled error from zlib.compress inside sc.broadcast]
{noformat}
check_unserialized(29)
..
/home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj)
    418
    419     def dumps(self, obj):
--> 420         return zlib.compress(self.serializer.dumps(obj), 1)
    421
    422     def loads(self, obj):
OverflowError: size does not fit in an int
{noformat}
**BLOCK 10** [ERROR 1]
{noformat}
check_pre_serialized(30)
...same as above...
{noformat}
**BLOCK 11** [ERROR 3]
{noformat}
check_unserialized(30)
...same as above...
{noformat}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
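A possible workaround sketch while the underlying limit stands (helper names are hypothetical, not part of the reporter's code): split the pickled payload into chunks below 2GB and broadcast each chunk separately:
{noformat}
# Hypothetical workaround: keep every individual broadcast payload well
# under the 2GB limits observed above by chunking the pickled bytes.
import cPickle

CHUNK = 2 ** 30  # 1GB per chunk

def broadcast_large(sc, obj):
    blob = cPickle.dumps(obj, 2)
    return [sc.broadcast(blob[i:i + CHUNK])
            for i in xrange(0, len(blob), CHUNK)]

def value_of(bvars):
    return cPickle.loads(''.join(b.value for b in bvars))
{noformat}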
[jira] [Updated] (SPARK-3721) Broadcast Variables above 2GB break in PySpark
[ https://issues.apache.org/jira/browse/SPARK-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brad Miller updated SPARK-3721: --- Description: Attempting to reproduce the bug in isolation in iPython notebook I've observed the following 3 unique failure modes, all of which seem to be related to broadcast variable size. Note that I'm running python 2.7.3 on all machines and using the Spark 1.1.0 binaries. **BLOCK 1** [no problem] import cPickle from pyspark import SparkContext def check_pre_serialized(size): msg = cPickle.dumps(range(2 ** size)) print 'serialized length:', len(msg) bvar = sc.broadcast(msg) print 'length recovered from broadcast variable:', len(bvar.value) print 'correct value recovered:', msg == bvar.value bvar.unpersist() def check_unserialized(size): msg = range(2 ** size) bvar = sc.broadcast(msg) print 'correct value recovered:', msg == bvar.value bvar.unpersist() SparkContext.setSystemProperty('spark.executor.memory', '15g') SparkContext.setSystemProperty('spark.cores.max', '5') sc = SparkContext('spark://crosby.research.intel-research.net:7077', 'broadcast_bug') **BLOCK 2** [no problem] check_pre_serialized(20) serialized length: 9374656 length recovered from broadcast variable: 9374656 correct value recovered: True **BLOCK 3** [no problem] check_unserialized(20) correct value recovered: True **BLOCK 4** [no problem] check_pre_serialized(27) serialized length: 1499501632 length recovered from broadcast variable: 1499501632 correct value recovered: True **BLOCK 5** [no problem] check_unserialized(27) correct value recovered: True **BLOCK 6** **[ERROR 1: unhandled error from cPickle.dumps inside sc.broadcast]** check_pre_serialized(28) . /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 354 355 def dumps(self, obj): -- 356 return cPickle.dumps(obj, 2) 357 358 loads = cPickle.loads SystemError: error return without exception set **BLOCK 7** [no problem] check_unserialized(28) correct value recovered: True **BLOCK 8** **[ERROR 2: no error occurs and *incorrect result* is returned]** check_pre_serialized(29) serialized length: 6331339840 length recovered from broadcast variable: 2036372544 correct value recovered: False **BLOCK 9** **[ERROR 3: unhandled error from zlib.compress inside sc.broadcast]** check_unserialized(29) .. /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 418 419 def dumps(self, obj): -- 420 return zlib.compress(self.serializer.dumps(obj), 1) 421 422 def loads(self, obj): OverflowError: size does not fit in an int **BLOCK 10** [ERROR 1] check_pre_serialized(30) ...same as above... **BLOCK 11** [ERROR 3] check_unserialized(30) ...same as above... was: Attempting to reproduce the bug in isolation in iPython notebook I've observed the following 3 unique failure modes, all of which seem to be related to broadcast variable size. Note that I'm running python 2.7.3 on all machines and using the Spark 1.1.0 binaries. 
**BLOCK 1** [no problem] import cPickle from pyspark import SparkContext def check_pre_serialized(size): msg = cPickle.dumps(range(2 ** size)) print 'serialized length:', len(msg) bvar = sc.broadcast(msg) print 'length recovered from broadcast variable:', len(bvar.value) print 'correct value recovered:', msg == bvar.value bvar.unpersist() def check_unserialized(size): msg = range(2 ** size) bvar = sc.broadcast(msg) print 'correct value recovered:', msg == bvar.value bvar.unpersist() SparkContext.setSystemProperty('spark.executor.memory', '15g') SparkContext.setSystemProperty('spark.cores.max', '5') sc = SparkContext('spark://crosby.research.intel-research.net:7077', 'broadcast_bug') **BLOCK 2** [no problem] check_pre_serialized(20) serialized length: 9374656 length recovered from broadcast variable: 9374656 correct value recovered: True **BLOCK 3** [no problem] check_unserialized(20) correct value recovered: True **BLOCK 4** [no problem] check_pre_serialized(27) serialized length: 1499501632 length recovered from broadcast variable: 1499501632 correct value recovered: True **BLOCK 5** [no problem] check_unserialized(27) correct value recovered: True **BLOCK 6** **[ERROR 1: unhandled error from cPickle.dumps inside sc.broadcast]** check_pre_serialized(28) . /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 354 355 def dumps(self, obj): -- 356 return cPickle.dumps(obj, 2) 357 358 loads = cPickle.loads SystemError: error return without exception set **BLOCK 7** [no problem] check_unserialized(28) correct value recovered: True **BLOCK 8** **[ERROR 2: no error occurs and *incorrect result* is returned]** check_pre_serialized(29) serialized
[jira] [Updated] (SPARK-3721) Broadcast Variables above 2GB break in PySpark
[ https://issues.apache.org/jira/browse/SPARK-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brad Miller updated SPARK-3721: --- Description: Attempting to reproduce the bug in isolation in iPython notebook I've observed the following 3 unique failure modes, all of which seem to be related to broadcast variable size. Note that I'm running python 2.7.3 on all machines and using the Spark 1.1.0 binaries. **BLOCK 1** [no problem] {{import cPickle from pyspark import SparkContext def check_pre_serialized(size): msg = cPickle.dumps(range(2 ** size)) print 'serialized length:', len(msg) bvar = sc.broadcast(msg) print 'length recovered from broadcast variable:', len(bvar.value) print 'correct value recovered:', msg == bvar.value bvar.unpersist() def check_unserialized(size): msg = range(2 ** size) bvar = sc.broadcast(msg) print 'correct value recovered:', msg == bvar.value bvar.unpersist() SparkContext.setSystemProperty('spark.executor.memory', '15g') SparkContext.setSystemProperty('spark.cores.max', '5') sc = SparkContext('spark://crosby.research.intel-research.net:7077', 'broadcast_bug')}} **BLOCK 2** [no problem] check_pre_serialized(20) serialized length: 9374656 length recovered from broadcast variable: 9374656 correct value recovered: True **BLOCK 3** [no problem] check_unserialized(20) correct value recovered: True **BLOCK 4** [no problem] check_pre_serialized(27) serialized length: 1499501632 length recovered from broadcast variable: 1499501632 correct value recovered: True **BLOCK 5** [no problem] check_unserialized(27) correct value recovered: True **BLOCK 6** **[ERROR 1: unhandled error from cPickle.dumps inside sc.broadcast]** check_pre_serialized(28) . /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 354 355 def dumps(self, obj): -- 356 return cPickle.dumps(obj, 2) 357 358 loads = cPickle.loads SystemError: error return without exception set **BLOCK 7** [no problem] check_unserialized(28) correct value recovered: True **BLOCK 8** **[ERROR 2: no error occurs and *incorrect result* is returned]** check_pre_serialized(29) serialized length: 6331339840 length recovered from broadcast variable: 2036372544 correct value recovered: False **BLOCK 9** **[ERROR 3: unhandled error from zlib.compress inside sc.broadcast]** check_unserialized(29) .. /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 418 419 def dumps(self, obj): -- 420 return zlib.compress(self.serializer.dumps(obj), 1) 421 422 def loads(self, obj): OverflowError: size does not fit in an int **BLOCK 10** [ERROR 1] check_pre_serialized(30) ...same as above... **BLOCK 11** [ERROR 3] check_unserialized(30) ...same as above... was: Attempting to reproduce the bug in isolation in iPython notebook I've observed the following 3 unique failure modes, all of which seem to be related to broadcast variable size. Note that I'm running python 2.7.3 on all machines and using the Spark 1.1.0 binaries. 
**BLOCK 1** [no problem] import cPickle from pyspark import SparkContext def check_pre_serialized(size): msg = cPickle.dumps(range(2 ** size)) print 'serialized length:', len(msg) bvar = sc.broadcast(msg) print 'length recovered from broadcast variable:', len(bvar.value) print 'correct value recovered:', msg == bvar.value bvar.unpersist() def check_unserialized(size): msg = range(2 ** size) bvar = sc.broadcast(msg) print 'correct value recovered:', msg == bvar.value bvar.unpersist() SparkContext.setSystemProperty('spark.executor.memory', '15g') SparkContext.setSystemProperty('spark.cores.max', '5') sc = SparkContext('spark://crosby.research.intel-research.net:7077', 'broadcast_bug') **BLOCK 2** [no problem] check_pre_serialized(20) serialized length: 9374656 length recovered from broadcast variable: 9374656 correct value recovered: True **BLOCK 3** [no problem] check_unserialized(20) correct value recovered: True **BLOCK 4** [no problem] check_pre_serialized(27) serialized length: 1499501632 length recovered from broadcast variable: 1499501632 correct value recovered: True **BLOCK 5** [no problem] check_unserialized(27) correct value recovered: True **BLOCK 6** **[ERROR 1: unhandled error from cPickle.dumps inside sc.broadcast]** check_pre_serialized(28) . /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 354 355 def dumps(self, obj): -- 356 return cPickle.dumps(obj, 2) 357 358 loads = cPickle.loads SystemError: error return without exception set **BLOCK 7** [no problem] check_unserialized(28) correct value recovered: True **BLOCK 8** **[ERROR 2: no error occurs and *incorrect result* is returned]** check_pre_serialized(29)
[jira] [Updated] (SPARK-3721) Broadcast Variables above 2GB break in PySpark
[ https://issues.apache.org/jira/browse/SPARK-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brad Miller updated SPARK-3721: --- Description: Attempting to reproduce the bug in isolation in iPython notebook I've observed the following 3 unique failure modes, all of which seem to be related to broadcast variable size. Note that I'm running python 2.7.3 on all machines and using the Spark 1.1.0 binaries. **BLOCK 1** [no problem] {{import cPickle}} from pyspark import SparkContext def check_pre_serialized(size): msg = cPickle.dumps(range(2 ** size)) print 'serialized length:', len(msg) bvar = sc.broadcast(msg) print 'length recovered from broadcast variable:', len(bvar.value) print 'correct value recovered:', msg == bvar.value bvar.unpersist() def check_unserialized(size): msg = range(2 ** size) bvar = sc.broadcast(msg) print 'correct value recovered:', msg == bvar.value bvar.unpersist() SparkContext.setSystemProperty('spark.executor.memory', '15g') SparkContext.setSystemProperty('spark.cores.max', '5') sc = SparkContext('spark://crosby.research.intel-research.net:7077', 'broadcast_bug')}} **BLOCK 2** [no problem] check_pre_serialized(20) serialized length: 9374656 length recovered from broadcast variable: 9374656 correct value recovered: True **BLOCK 3** [no problem] check_unserialized(20) correct value recovered: True **BLOCK 4** [no problem] check_pre_serialized(27) serialized length: 1499501632 length recovered from broadcast variable: 1499501632 correct value recovered: True **BLOCK 5** [no problem] check_unserialized(27) correct value recovered: True **BLOCK 6** **[ERROR 1: unhandled error from cPickle.dumps inside sc.broadcast]** check_pre_serialized(28) . /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 354 355 def dumps(self, obj): -- 356 return cPickle.dumps(obj, 2) 357 358 loads = cPickle.loads SystemError: error return without exception set **BLOCK 7** [no problem] check_unserialized(28) correct value recovered: True **BLOCK 8** **[ERROR 2: no error occurs and *incorrect result* is returned]** check_pre_serialized(29) serialized length: 6331339840 length recovered from broadcast variable: 2036372544 correct value recovered: False **BLOCK 9** **[ERROR 3: unhandled error from zlib.compress inside sc.broadcast]** check_unserialized(29) .. /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 418 419 def dumps(self, obj): -- 420 return zlib.compress(self.serializer.dumps(obj), 1) 421 422 def loads(self, obj): OverflowError: size does not fit in an int **BLOCK 10** [ERROR 1] check_pre_serialized(30) ...same as above... **BLOCK 11** [ERROR 3] check_unserialized(30) ...same as above... was: Attempting to reproduce the bug in isolation in iPython notebook I've observed the following 3 unique failure modes, all of which seem to be related to broadcast variable size. Note that I'm running python 2.7.3 on all machines and using the Spark 1.1.0 binaries. 
**BLOCK 1** [no problem] {{import cPickle from pyspark import SparkContext def check_pre_serialized(size): msg = cPickle.dumps(range(2 ** size)) print 'serialized length:', len(msg) bvar = sc.broadcast(msg) print 'length recovered from broadcast variable:', len(bvar.value) print 'correct value recovered:', msg == bvar.value bvar.unpersist() def check_unserialized(size): msg = range(2 ** size) bvar = sc.broadcast(msg) print 'correct value recovered:', msg == bvar.value bvar.unpersist() SparkContext.setSystemProperty('spark.executor.memory', '15g') SparkContext.setSystemProperty('spark.cores.max', '5') sc = SparkContext('spark://crosby.research.intel-research.net:7077', 'broadcast_bug')}} **BLOCK 2** [no problem] check_pre_serialized(20) serialized length: 9374656 length recovered from broadcast variable: 9374656 correct value recovered: True **BLOCK 3** [no problem] check_unserialized(20) correct value recovered: True **BLOCK 4** [no problem] check_pre_serialized(27) serialized length: 1499501632 length recovered from broadcast variable: 1499501632 correct value recovered: True **BLOCK 5** [no problem] check_unserialized(27) correct value recovered: True **BLOCK 6** **[ERROR 1: unhandled error from cPickle.dumps inside sc.broadcast]** check_pre_serialized(28) . /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 354 355 def dumps(self, obj): -- 356 return cPickle.dumps(obj, 2) 357 358 loads = cPickle.loads SystemError: error return without exception set **BLOCK 7** [no problem] check_unserialized(28) correct value recovered: True **BLOCK 8** **[ERROR 2: no error occurs and *incorrect result* is returned]** check_pre_serialized(29)
[jira] [Updated] (SPARK-3721) Broadcast Variables above 2GB break in PySpark
[ https://issues.apache.org/jira/browse/SPARK-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brad Miller updated SPARK-3721: --- Description: Attempting to reproduce the bug in isolation in iPython notebook I've observed the following 3 unique failure modes, all of which seem to be related to broadcast variable size. Note that I'm running python 2.7.3 on all machines and using the Spark 1.1.0 binaries. **BLOCK 1** [no problem] {noformat} import cPickle from pyspark import SparkContext def check_pre_serialized(size): msg = cPickle.dumps(range(2 ** size)) print 'serialized length:', len(msg) bvar = sc.broadcast(msg) print 'length recovered from broadcast variable:', len(bvar.value) print 'correct value recovered:', msg == bvar.value bvar.unpersist() def check_unserialized(size): msg = range(2 ** size) bvar = sc.broadcast(msg) print 'correct value recovered:', msg == bvar.value bvar.unpersist() SparkContext.setSystemProperty('spark.executor.memory', '15g') SparkContext.setSystemProperty('spark.cores.max', '5') sc = SparkContext('spark://crosby.research.intel-research.net:7077', 'broadcast_bug') {noformat} **BLOCK 2** [no problem] {noformat} check_pre_serialized(20) serialized length: 9374656 length recovered from broadcast variable: 9374656 correct value recovered: True {noformat} **BLOCK 3** [no problem] {noformat} check_unserialized(20) correct value recovered: True {noformat} **BLOCK 4** [no problem] {noformat} check_pre_serialized(27) serialized length: 1499501632 length recovered from broadcast variable: 1499501632 correct value recovered: True {noformat} **BLOCK 5** [no problem] {noformat} check_unserialized(27) correct value recovered: True {noformat} **BLOCK 6** **[ERROR 1: unhandled error from cPickle.dumps inside {noformat} sc.broadcast]** check_pre_serialized(28) . /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 354 355 def dumps(self, obj): -- 356 return cPickle.dumps(obj, 2) 357 358 loads = cPickle.loads SystemError: error return without exception set {noformat} **BLOCK 7** [no problem] {noformat} check_unserialized(28) correct value recovered: True {noformat} **BLOCK 8** **[ERROR 2: no error occurs and *incorrect result* is returned]** {noformat} check_pre_serialized(29) serialized length: 6331339840 length recovered from broadcast variable: 2036372544 correct value recovered: False {noformat} **BLOCK 9** **[ERROR 3: unhandled error from zlib.compress inside sc.broadcast]** {noformat} check_unserialized(29) .. /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 418 419 def dumps(self, obj): -- 420 return zlib.compress(self.serializer.dumps(obj), 1) 421 422 def loads(self, obj): OverflowError: size does not fit in an int {noformat} **BLOCK 10** [ERROR 1] {noformat} check_pre_serialized(30) ...same as above... {noformat} **BLOCK 11** [ERROR 3] {noformat} check_unserialized(30) ...same as above... {noformat} was: Attempting to reproduce the bug in isolation in iPython notebook I've observed the following 3 unique failure modes, all of which seem to be related to broadcast variable size. Note that I'm running python 2.7.3 on all machines and using the Spark 1.1.0 binaries. 
**BLOCK 1** [no problem] {noformat} import cPickle from pyspark import SparkContext def check_pre_serialized(size): msg = cPickle.dumps(range(2 ** size)) print 'serialized length:', len(msg) bvar = sc.broadcast(msg) print 'length recovered from broadcast variable:', len(bvar.value) print 'correct value recovered:', msg == bvar.value bvar.unpersist() def check_unserialized(size): msg = range(2 ** size) bvar = sc.broadcast(msg) print 'correct value recovered:', msg == bvar.value bvar.unpersist() SparkContext.setSystemProperty('spark.executor.memory', '15g') SparkContext.setSystemProperty('spark.cores.max', '5') sc = SparkContext('spark://crosby.research.intel-research.net:7077', 'broadcast_bug') {noformat} **BLOCK 2** [no problem] check_pre_serialized(20) serialized length: 9374656 length recovered from broadcast variable: 9374656 correct value recovered: True **BLOCK 3** [no problem] check_unserialized(20) correct value recovered: True **BLOCK 4** [no problem] check_pre_serialized(27) serialized length: 1499501632 length recovered from broadcast variable: 1499501632 correct value recovered: True **BLOCK 5** [no problem] check_unserialized(27) correct value recovered: True **BLOCK 6** **[ERROR 1: unhandled error from cPickle.dumps inside sc.broadcast]** check_pre_serialized(28) . /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 354 355 def dumps(self, obj): -- 356 return cPickle.dumps(obj, 2) 357 358 loads =
[jira] [Updated] (SPARK-3721) Broadcast Variables above 2GB break in PySpark
[ https://issues.apache.org/jira/browse/SPARK-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brad Miller updated SPARK-3721: --- Description: Attempting to reproduce the bug in isolation in iPython notebook I've observed the following 3 unique failure modes, all of which seem to be related to broadcast variable size. Note that I'm running python 2.7.3 on all machines and using the Spark 1.1.0 binaries. **BLOCK 1** [no problem] {noformat} import cPickle from pyspark import SparkContext def check_pre_serialized(size): msg = cPickle.dumps(range(2 ** size)) print 'serialized length:', len(msg) bvar = sc.broadcast(msg) print 'length recovered from broadcast variable:', len(bvar.value) print 'correct value recovered:', msg == bvar.value bvar.unpersist() def check_unserialized(size): msg = range(2 ** size) bvar = sc.broadcast(msg) print 'correct value recovered:', msg == bvar.value bvar.unpersist() SparkContext.setSystemProperty('spark.executor.memory', '15g') SparkContext.setSystemProperty('spark.cores.max', '5') sc = SparkContext('spark://crosby.research.intel-research.net:7077', 'broadcast_bug') {noformat} **BLOCK 2** [no problem] {noformat} check_pre_serialized(20) serialized length: 9374656 length recovered from broadcast variable: 9374656 correct value recovered: True {noformat} **BLOCK 3** [no problem] {noformat} check_unserialized(20) correct value recovered: True {noformat} **BLOCK 4** [no problem] {noformat} check_pre_serialized(27) serialized length: 1499501632 length recovered from broadcast variable: 1499501632 correct value recovered: True {noformat} **BLOCK 5** [no problem] {noformat} check_unserialized(27) correct value recovered: True {noformat} **BLOCK 6** **[ERROR 1: unhandled error from cPickle.dumps inside sc.broadcast]** {noformat} check_pre_serialized(28) . /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 354 355 def dumps(self, obj): -- 356 return cPickle.dumps(obj, 2) 357 358 loads = cPickle.loads SystemError: error return without exception set {noformat} **BLOCK 7** [no problem] {noformat} check_unserialized(28) correct value recovered: True {noformat} **BLOCK 8** **[ERROR 2: no error occurs and *incorrect result* is returned]** {noformat} check_pre_serialized(29) serialized length: 6331339840 length recovered from broadcast variable: 2036372544 correct value recovered: False {noformat} **BLOCK 9** **[ERROR 3: unhandled error from zlib.compress inside sc.broadcast]** {noformat} check_unserialized(29) .. /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 418 419 def dumps(self, obj): -- 420 return zlib.compress(self.serializer.dumps(obj), 1) 421 422 def loads(self, obj): OverflowError: size does not fit in an int {noformat} **BLOCK 10** [ERROR 1] {noformat} check_pre_serialized(30) ...same as above... {noformat} **BLOCK 11** [ERROR 3] {noformat} check_unserialized(30) ...same as above... {noformat} was: Attempting to reproduce the bug in isolation in iPython notebook I've observed the following 3 unique failure modes, all of which seem to be related to broadcast variable size. Note that I'm running python 2.7.3 on all machines and using the Spark 1.1.0 binaries. 
**BLOCK 1** [no problem] {noformat} import cPickle from pyspark import SparkContext def check_pre_serialized(size): msg = cPickle.dumps(range(2 ** size)) print 'serialized length:', len(msg) bvar = sc.broadcast(msg) print 'length recovered from broadcast variable:', len(bvar.value) print 'correct value recovered:', msg == bvar.value bvar.unpersist() def check_unserialized(size): msg = range(2 ** size) bvar = sc.broadcast(msg) print 'correct value recovered:', msg == bvar.value bvar.unpersist() SparkContext.setSystemProperty('spark.executor.memory', '15g') SparkContext.setSystemProperty('spark.cores.max', '5') sc = SparkContext('spark://crosby.research.intel-research.net:7077', 'broadcast_bug') {noformat} **BLOCK 2** [no problem] {noformat} check_pre_serialized(20) serialized length: 9374656 length recovered from broadcast variable: 9374656 correct value recovered: True {noformat} **BLOCK 3** [no problem] {noformat} check_unserialized(20) correct value recovered: True {noformat} **BLOCK 4** [no problem] {noformat} check_pre_serialized(27) serialized length: 1499501632 length recovered from broadcast variable: 1499501632 correct value recovered: True {noformat} **BLOCK 5** [no problem] {noformat} check_unserialized(27) correct value recovered: True {noformat} **BLOCK 6** **[ERROR 1: unhandled error from cPickle.dumps inside {noformat} sc.broadcast]** check_pre_serialized(28) . /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 354 355 def
[jira] [Updated] (SPARK-3721) Broadcast Variables above 2GB break in PySpark
[ https://issues.apache.org/jira/browse/SPARK-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brad Miller updated SPARK-3721: --- Description: Attempting to reproduce the bug in isolation in iPython notebook I've observed the following 3 unique failure modes, all of which seem to be related to broadcast variable size. Note that I'm running python 2.7.3 on all machines and using the Spark 1.1.0 binaries. **BLOCK 1** [no problem] {noformat} import cPickle from pyspark import SparkContext def check_pre_serialized(size): msg = cPickle.dumps(range(2 ** size)) print 'serialized length:', len(msg) bvar = sc.broadcast(msg) print 'length recovered from broadcast variable:', len(bvar.value) print 'correct value recovered:', msg == bvar.value bvar.unpersist() def check_unserialized(size): msg = range(2 ** size) bvar = sc.broadcast(msg) print 'correct value recovered:', msg == bvar.value bvar.unpersist() SparkContext.setSystemProperty('spark.executor.memory', '15g') SparkContext.setSystemProperty('spark.cores.max', '5') sc = SparkContext('spark://crosby.research.intel-research.net:7077', 'broadcast_bug') {noformat} **BLOCK 2** [no problem] check_pre_serialized(20) serialized length: 9374656 length recovered from broadcast variable: 9374656 correct value recovered: True **BLOCK 3** [no problem] check_unserialized(20) correct value recovered: True **BLOCK 4** [no problem] check_pre_serialized(27) serialized length: 1499501632 length recovered from broadcast variable: 1499501632 correct value recovered: True **BLOCK 5** [no problem] check_unserialized(27) correct value recovered: True **BLOCK 6** **[ERROR 1: unhandled error from cPickle.dumps inside sc.broadcast]** check_pre_serialized(28) . /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 354 355 def dumps(self, obj): -- 356 return cPickle.dumps(obj, 2) 357 358 loads = cPickle.loads SystemError: error return without exception set **BLOCK 7** [no problem] check_unserialized(28) correct value recovered: True **BLOCK 8** **[ERROR 2: no error occurs and *incorrect result* is returned]** check_pre_serialized(29) serialized length: 6331339840 length recovered from broadcast variable: 2036372544 correct value recovered: False **BLOCK 9** **[ERROR 3: unhandled error from zlib.compress inside sc.broadcast]** check_unserialized(29) .. /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 418 419 def dumps(self, obj): -- 420 return zlib.compress(self.serializer.dumps(obj), 1) 421 422 def loads(self, obj): OverflowError: size does not fit in an int **BLOCK 10** [ERROR 1] check_pre_serialized(30) ...same as above... **BLOCK 11** [ERROR 3] check_unserialized(30) ...same as above... was: Attempting to reproduce the bug in isolation in iPython notebook I've observed the following 3 unique failure modes, all of which seem to be related to broadcast variable size. Note that I'm running python 2.7.3 on all machines and using the Spark 1.1.0 binaries. 
**BLOCK 1** [no problem] {quote} import cPickle from pyspark import SparkContext def check_pre_serialized(size): msg = cPickle.dumps(range(2 ** size)) print 'serialized length:', len(msg) bvar = sc.broadcast(msg) print 'length recovered from broadcast variable:', len(bvar.value) print 'correct value recovered:', msg == bvar.value bvar.unpersist() def check_unserialized(size): msg = range(2 ** size) bvar = sc.broadcast(msg) print 'correct value recovered:', msg == bvar.value bvar.unpersist() SparkContext.setSystemProperty('spark.executor.memory', '15g') SparkContext.setSystemProperty('spark.cores.max', '5') sc = SparkContext('spark://crosby.research.intel-research.net:7077', 'broadcast_bug') {quote} **BLOCK 2** [no problem] check_pre_serialized(20) serialized length: 9374656 length recovered from broadcast variable: 9374656 correct value recovered: True **BLOCK 3** [no problem] check_unserialized(20) correct value recovered: True **BLOCK 4** [no problem] check_pre_serialized(27) serialized length: 1499501632 length recovered from broadcast variable: 1499501632 correct value recovered: True **BLOCK 5** [no problem] check_unserialized(27) correct value recovered: True **BLOCK 6** **[ERROR 1: unhandled error from cPickle.dumps inside sc.broadcast]** check_pre_serialized(28) . /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 354 355 def dumps(self, obj): -- 356 return cPickle.dumps(obj, 2) 357 358 loads = cPickle.loads SystemError: error return without exception set **BLOCK 7** [no problem] check_unserialized(28) correct value recovered: True **BLOCK 8** **[ERROR 2: no error occurs and *incorrect result* is returned]**
[jira] [Updated] (SPARK-3721) Broadcast Variables above 2GB break in PySpark
[ https://issues.apache.org/jira/browse/SPARK-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brad Miller updated SPARK-3721: --- Description: Attempting to reproduce the bug in isolation in iPython notebook I've observed the following 3 unique failure modes, all of which seem to be related to broadcast variable size. Note that I'm running python 2.7.3 on all machines and using the Spark 1.1.0 binaries. **BLOCK 1** [no problem] {{import cPickle \ from pyspark import SparkContext }} def check_pre_serialized(size): msg = cPickle.dumps(range(2 ** size)) print 'serialized length:', len(msg) bvar = sc.broadcast(msg) print 'length recovered from broadcast variable:', len(bvar.value) print 'correct value recovered:', msg == bvar.value bvar.unpersist() def check_unserialized(size): msg = range(2 ** size) bvar = sc.broadcast(msg) print 'correct value recovered:', msg == bvar.value bvar.unpersist() SparkContext.setSystemProperty('spark.executor.memory', '15g') SparkContext.setSystemProperty('spark.cores.max', '5') sc = SparkContext('spark://crosby.research.intel-research.net:7077', 'broadcast_bug')}} **BLOCK 2** [no problem] check_pre_serialized(20) serialized length: 9374656 length recovered from broadcast variable: 9374656 correct value recovered: True **BLOCK 3** [no problem] check_unserialized(20) correct value recovered: True **BLOCK 4** [no problem] check_pre_serialized(27) serialized length: 1499501632 length recovered from broadcast variable: 1499501632 correct value recovered: True **BLOCK 5** [no problem] check_unserialized(27) correct value recovered: True **BLOCK 6** **[ERROR 1: unhandled error from cPickle.dumps inside sc.broadcast]** check_pre_serialized(28) . /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 354 355 def dumps(self, obj): -- 356 return cPickle.dumps(obj, 2) 357 358 loads = cPickle.loads SystemError: error return without exception set **BLOCK 7** [no problem] check_unserialized(28) correct value recovered: True **BLOCK 8** **[ERROR 2: no error occurs and *incorrect result* is returned]** check_pre_serialized(29) serialized length: 6331339840 length recovered from broadcast variable: 2036372544 correct value recovered: False **BLOCK 9** **[ERROR 3: unhandled error from zlib.compress inside sc.broadcast]** check_unserialized(29) .. /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 418 419 def dumps(self, obj): -- 420 return zlib.compress(self.serializer.dumps(obj), 1) 421 422 def loads(self, obj): OverflowError: size does not fit in an int **BLOCK 10** [ERROR 1] check_pre_serialized(30) ...same as above... **BLOCK 11** [ERROR 3] check_unserialized(30) ...same as above... was: Attempting to reproduce the bug in isolation in iPython notebook I've observed the following 3 unique failure modes, all of which seem to be related to broadcast variable size. Note that I'm running python 2.7.3 on all machines and using the Spark 1.1.0 binaries. 
**BLOCK 1** [no problem] {{import cPickle}} from pyspark import SparkContext def check_pre_serialized(size): msg = cPickle.dumps(range(2 ** size)) print 'serialized length:', len(msg) bvar = sc.broadcast(msg) print 'length recovered from broadcast variable:', len(bvar.value) print 'correct value recovered:', msg == bvar.value bvar.unpersist() def check_unserialized(size): msg = range(2 ** size) bvar = sc.broadcast(msg) print 'correct value recovered:', msg == bvar.value bvar.unpersist() SparkContext.setSystemProperty('spark.executor.memory', '15g') SparkContext.setSystemProperty('spark.cores.max', '5') sc = SparkContext('spark://crosby.research.intel-research.net:7077', 'broadcast_bug')}} **BLOCK 2** [no problem] check_pre_serialized(20) serialized length: 9374656 length recovered from broadcast variable: 9374656 correct value recovered: True **BLOCK 3** [no problem] check_unserialized(20) correct value recovered: True **BLOCK 4** [no problem] check_pre_serialized(27) serialized length: 1499501632 length recovered from broadcast variable: 1499501632 correct value recovered: True **BLOCK 5** [no problem] check_unserialized(27) correct value recovered: True **BLOCK 6** **[ERROR 1: unhandled error from cPickle.dumps inside sc.broadcast]** check_pre_serialized(28) . /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 354 355 def dumps(self, obj): -- 356 return cPickle.dumps(obj, 2) 357 358 loads = cPickle.loads SystemError: error return without exception set **BLOCK 7** [no problem] check_unserialized(28) correct value recovered: True **BLOCK 8** **[ERROR 2: no error occurs and *incorrect result* is returned]** check_pre_serialized(29)
[jira] [Updated] (SPARK-3721) Broadcast Variables above 2GB break in PySpark
[ https://issues.apache.org/jira/browse/SPARK-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brad Miller updated SPARK-3721: --- Description: The bug displays 3 unique failure modes in PySpark, all of which seem to be related to broadcast variable size. Note that the tests below ran python 2.7.3 on all machines and used the Spark 1.1.0 binaries. **BLOCK 1** [no problem] {noformat} import cPickle from pyspark import SparkContext def check_pre_serialized(size): msg = cPickle.dumps(range(2 ** size)) print 'serialized length:', len(msg) bvar = sc.broadcast(msg) print 'length recovered from broadcast variable:', len(bvar.value) print 'correct value recovered:', msg == bvar.value bvar.unpersist() def check_unserialized(size): msg = range(2 ** size) bvar = sc.broadcast(msg) print 'correct value recovered:', msg == bvar.value bvar.unpersist() SparkContext.setSystemProperty('spark.executor.memory', '15g') SparkContext.setSystemProperty('spark.cores.max', '5') sc = SparkContext('spark://crosby.research.intel-research.net:7077', 'broadcast_bug') {noformat} **BLOCK 2** [no problem] {noformat} check_pre_serialized(20) serialized length: 9374656 length recovered from broadcast variable: 9374656 correct value recovered: True {noformat} **BLOCK 3** [no problem] {noformat} check_unserialized(20) correct value recovered: True {noformat} **BLOCK 4** [no problem] {noformat} check_pre_serialized(27) serialized length: 1499501632 length recovered from broadcast variable: 1499501632 correct value recovered: True {noformat} **BLOCK 5** [no problem] {noformat} check_unserialized(27) correct value recovered: True {noformat} **BLOCK 6** **[ERROR 1: unhandled error from cPickle.dumps inside sc.broadcast]** {noformat} check_pre_serialized(28) . /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 354 355 def dumps(self, obj): -- 356 return cPickle.dumps(obj, 2) 357 358 loads = cPickle.loads SystemError: error return without exception set {noformat} **BLOCK 7** [no problem] {noformat} check_unserialized(28) correct value recovered: True {noformat} **BLOCK 8** **[ERROR 2: no error occurs and *incorrect result* is returned]** {noformat} check_pre_serialized(29) serialized length: 6331339840 length recovered from broadcast variable: 2036372544 correct value recovered: False {noformat} **BLOCK 9** **[ERROR 3: unhandled error from zlib.compress inside sc.broadcast]** {noformat} check_unserialized(29) .. /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 418 419 def dumps(self, obj): -- 420 return zlib.compress(self.serializer.dumps(obj), 1) 421 422 def loads(self, obj): OverflowError: size does not fit in an int {noformat} **BLOCK 10** [ERROR 1] {noformat} check_pre_serialized(30) ...same as above... {noformat} **BLOCK 11** [ERROR 3] {noformat} check_unserialized(30) ...same as above... {noformat} was: Attempting to reproduce the bug in isolation in iPython notebook I've observed the following 3 unique failure modes, all of which seem to be related to broadcast variable size. Note that I'm running python 2.7.3 on all machines and using the Spark 1.1.0 binaries. 
**BLOCK 1** [no problem] {noformat} import cPickle from pyspark import SparkContext def check_pre_serialized(size): msg = cPickle.dumps(range(2 ** size)) print 'serialized length:', len(msg) bvar = sc.broadcast(msg) print 'length recovered from broadcast variable:', len(bvar.value) print 'correct value recovered:', msg == bvar.value bvar.unpersist() def check_unserialized(size): msg = range(2 ** size) bvar = sc.broadcast(msg) print 'correct value recovered:', msg == bvar.value bvar.unpersist() SparkContext.setSystemProperty('spark.executor.memory', '15g') SparkContext.setSystemProperty('spark.cores.max', '5') sc = SparkContext('spark://crosby.research.intel-research.net:7077', 'broadcast_bug') {noformat} **BLOCK 2** [no problem] {noformat} check_pre_serialized(20) serialized length: 9374656 length recovered from broadcast variable: 9374656 correct value recovered: True {noformat} **BLOCK 3** [no problem] {noformat} check_unserialized(20) correct value recovered: True {noformat} **BLOCK 4** [no problem] {noformat} check_pre_serialized(27) serialized length: 1499501632 length recovered from broadcast variable: 1499501632 correct value recovered: True {noformat} **BLOCK 5** [no problem] {noformat} check_unserialized(27) correct value recovered: True {noformat} **BLOCK 6** **[ERROR 1: unhandled error from cPickle.dumps inside sc.broadcast]** {noformat} check_pre_serialized(28) . /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 354 355 def dumps(self, obj): -- 356 return
[jira] [Updated] (SPARK-3721) Broadcast Variables above 2GB break in PySpark
[ https://issues.apache.org/jira/browse/SPARK-3721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brad Miller updated SPARK-3721: --- Description: Attempting to reproduce the bug in isolation in iPython notebook I've observed the following 3 unique failure modes, all of which seem to be related to broadcast variable size. Note that I'm running python 2.7.3 on all machines and using the Spark 1.1.0 binaries. **BLOCK 1** [no problem] {quote} import cPickle from pyspark import SparkContext def check_pre_serialized(size): msg = cPickle.dumps(range(2 ** size)) print 'serialized length:', len(msg) bvar = sc.broadcast(msg) print 'length recovered from broadcast variable:', len(bvar.value) print 'correct value recovered:', msg == bvar.value bvar.unpersist() def check_unserialized(size): msg = range(2 ** size) bvar = sc.broadcast(msg) print 'correct value recovered:', msg == bvar.value bvar.unpersist() SparkContext.setSystemProperty('spark.executor.memory', '15g') SparkContext.setSystemProperty('spark.cores.max', '5') sc = SparkContext('spark://crosby.research.intel-research.net:7077', 'broadcast_bug') {quote} **BLOCK 2** [no problem] check_pre_serialized(20) serialized length: 9374656 length recovered from broadcast variable: 9374656 correct value recovered: True **BLOCK 3** [no problem] check_unserialized(20) correct value recovered: True **BLOCK 4** [no problem] check_pre_serialized(27) serialized length: 1499501632 length recovered from broadcast variable: 1499501632 correct value recovered: True **BLOCK 5** [no problem] check_unserialized(27) correct value recovered: True **BLOCK 6** **[ERROR 1: unhandled error from cPickle.dumps inside sc.broadcast]** check_pre_serialized(28) . /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 354 355 def dumps(self, obj): -- 356 return cPickle.dumps(obj, 2) 357 358 loads = cPickle.loads SystemError: error return without exception set **BLOCK 7** [no problem] check_unserialized(28) correct value recovered: True **BLOCK 8** **[ERROR 2: no error occurs and *incorrect result* is returned]** check_pre_serialized(29) serialized length: 6331339840 length recovered from broadcast variable: 2036372544 correct value recovered: False **BLOCK 9** **[ERROR 3: unhandled error from zlib.compress inside sc.broadcast]** check_unserialized(29) .. /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 418 419 def dumps(self, obj): -- 420 return zlib.compress(self.serializer.dumps(obj), 1) 421 422 def loads(self, obj): OverflowError: size does not fit in an int **BLOCK 10** [ERROR 1] check_pre_serialized(30) ...same as above... **BLOCK 11** [ERROR 3] check_unserialized(30) ...same as above... was: Attempting to reproduce the bug in isolation in iPython notebook I've observed the following 3 unique failure modes, all of which seem to be related to broadcast variable size. Note that I'm running python 2.7.3 on all machines and using the Spark 1.1.0 binaries. 
**BLOCK 1** [no problem] {{import cPickle \ from pyspark import SparkContext }} def check_pre_serialized(size): msg = cPickle.dumps(range(2 ** size)) print 'serialized length:', len(msg) bvar = sc.broadcast(msg) print 'length recovered from broadcast variable:', len(bvar.value) print 'correct value recovered:', msg == bvar.value bvar.unpersist() def check_unserialized(size): msg = range(2 ** size) bvar = sc.broadcast(msg) print 'correct value recovered:', msg == bvar.value bvar.unpersist() SparkContext.setSystemProperty('spark.executor.memory', '15g') SparkContext.setSystemProperty('spark.cores.max', '5') sc = SparkContext('spark://crosby.research.intel-research.net:7077', 'broadcast_bug')}} **BLOCK 2** [no problem] check_pre_serialized(20) serialized length: 9374656 length recovered from broadcast variable: 9374656 correct value recovered: True **BLOCK 3** [no problem] check_unserialized(20) correct value recovered: True **BLOCK 4** [no problem] check_pre_serialized(27) serialized length: 1499501632 length recovered from broadcast variable: 1499501632 correct value recovered: True **BLOCK 5** [no problem] check_unserialized(27) correct value recovered: True **BLOCK 6** **[ERROR 1: unhandled error from cPickle.dumps inside sc.broadcast]** check_pre_serialized(28) . /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 354 355 def dumps(self, obj): --> 356 return cPickle.dumps(obj, 2) 357 358 loads = cPickle.loads SystemError: error return without exception set **BLOCK 7** [no problem] check_unserialized(28) correct value recovered: True **BLOCK 8** **[ERROR 2: no error occurs and *incorrect result* is returned]** check_pre_serialized(29) serialized length: 6331339840 length recovered from broadcast variable: 2036372544 correct value recovered: False **BLOCK 9** **[ERROR 3: unhandled error from zlib.compress inside sc.broadcast]** check_unserialized(29) .. /home/spark/greatest/python/pyspark/serializers.py in dumps(self, obj) 418 419 def dumps(self, obj): --> 420 return zlib.compress(self.serializer.dumps(obj), 1) 421 422 def loads(self, obj): OverflowError: size does not fit in an int **BLOCK 10** [ERROR 1] check_pre_serialized(30) ...same as above... **BLOCK 11** [ERROR 3] check_unserialized(30) ...same as above... -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
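Given the three failure modes reported above, one possible stopgap (not from the original report; names are illustrative) is to keep every individual broadcast under 2^31 bytes by splitting the pickled payload into chunks and broadcasting each chunk as its own variable:

{code}
# Hypothetical workaround sketch for Python 2 / Spark 1.1: broadcast a large
# object as several sub-2GB chunks so that no single pickle or zlib.compress
# call inside sc.broadcast ever sees more than 2**31 bytes.
import cPickle

CHUNK = 1 << 30  # 1 GiB per chunk, comfortably below the 2**31 limit

def broadcast_large(sc, obj):
    """Pickle obj, split the bytes, and broadcast one variable per chunk."""
    blob = cPickle.dumps(obj, 2)
    return [sc.broadcast(blob[i:i + CHUNK]) for i in xrange(0, len(blob), CHUNK)]

def read_large(bvars):
    """Reassemble the chunks on the worker and unpickle."""
    return cPickle.loads(''.join(bvar.value for bvar in bvars))
{code}

Whether the extra bookkeeping is acceptable depends on the workload, but it sidesteps all three failure modes because each broadcast stays well below 2 GB.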
[jira] [Created] (SPARK-3722) Spark on yarn docs work
WangTaoTheTonic created SPARK-3722: -- Summary: Spark on yarn docs work Key: SPARK-3722 URL: https://issues.apache.org/jira/browse/SPARK-3722 Project: Spark Issue Type: Improvement Components: Documentation Reporter: WangTaoTheTonic Priority: Minor Adding another way to obtain containers' logs. Fixing an outdated link and a typo. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3722) Spark on yarn docs work
[ https://issues.apache.org/jira/browse/SPARK-3722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151869#comment-14151869 ] Apache Spark commented on SPARK-3722: - User 'WangTaoTheTonic' has created a pull request for this issue: https://github.com/apache/spark/pull/2579 Spark on yarn docs work --- Key: SPARK-3722 URL: https://issues.apache.org/jira/browse/SPARK-3722 Project: Spark Issue Type: Improvement Components: Documentation Reporter: WangTaoTheTonic Priority: Minor Adding another way to obtain containers' logs. Fixing an outdated link and a typo. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3366) Compute best splits distributively in decision tree
[ https://issues.apache.org/jira/browse/SPARK-3366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-3366: - Assignee: Qiping Li Compute best splits distributively in decision tree --- Key: SPARK-3366 URL: https://issues.apache.org/jira/browse/SPARK-3366 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Xiangrui Meng Assignee: Qiping Li The current implementation computes all best splits locally on the driver, which makes the driver a bottleneck for both communication and computation. It would be nice if we could compute the best splits distributively. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
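To give a sense of the shape such a change could take, here is a hedged PySpark sketch (toy data and hypothetical field names; the real MLlib code is Scala) of reducing candidate splits to one winner per node on the cluster, so only a handful of small records ever reach the driver:

{code}
from pyspark import SparkContext

sc = SparkContext('local', 'distributed_best_splits_sketch')

# Toy candidates: (node_id, (feature_index, info_gain)) pairs, standing in for
# the per-partition split statistics each worker would emit.
candidates = sc.parallelize([
    (0, (3, 0.12)), (0, (7, 0.30)),
    (1, (2, 0.25)), (1, (5, 0.19)),
])

# Pick the highest-gain split per node distributively instead of on the driver.
best = candidates.reduceByKey(lambda a, b: a if a[1] >= b[1] else b)
print best.collectAsMap()  # {0: (7, 0.3), 1: (2, 0.25)}
{code}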
[jira] [Updated] (SPARK-3677) Scalastyle is never applied to the sources under yarn/common
[ https://issues.apache.org/jira/browse/SPARK-3677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kousuke Saruta updated SPARK-3677: -- Description: When we run sbt -Pyarn scalastyle or mvn package, scalastyle is not applied to the sources under yarn/common. I think this is caused by the directory structure. was:When we run sbt -Pyarn scalastyle or mvn package, scalastyle is not applied to the sources under yarn/common. Scalastyle is never applied to the sources under yarn/common Key: SPARK-3677 URL: https://issues.apache.org/jira/browse/SPARK-3677 Project: Spark Issue Type: Bug Components: Build, YARN Affects Versions: 1.2.0 Reporter: Kousuke Saruta When we run sbt -Pyarn scalastyle or mvn package, scalastyle is not applied to the sources under yarn/common. I think this is caused by the directory structure. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2626) Stop SparkContext in all examples
[ https://issues.apache.org/jira/browse/SPARK-2626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-2626: --- Labels: starter (was: ) Stop SparkContext in all examples - Key: SPARK-2626 URL: https://issues.apache.org/jira/browse/SPARK-2626 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.0.1 Reporter: Andrew Or Labels: starter Fix For: 1.2.0 Event logs rely on sc.stop() to close the file. If this is never closed, the history server will not be able to find the logs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2805) update akka to version 2.3
[ https://issues.apache.org/jira/browse/SPARK-2805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151976#comment-14151976 ] Patrick Wendell commented on SPARK-2805: I'm working on publishing akka today. update akka to version 2.3 -- Key: SPARK-2805 URL: https://issues.apache.org/jira/browse/SPARK-2805 Project: Spark Issue Type: Sub-task Components: Build, Spark Core Reporter: Anand Avati akka-2.3 is the lowest version available in Scala 2.11 akka-2.3 depends on protobuf 2.5. Hadoop-1 requires protobuf 2.4.1. In order to reconcile the conflicting dependencies, need to release akka-2.3.x-shaded-protobuf artifact which has protobuf 2.5 within. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3479) Have Jenkins show which tests failed in his GitHub messages
[ https://issues.apache.org/jira/browse/SPARK-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14151980#comment-14151980 ] Patrick Wendell commented on SPARK-3479: I think this is more than minor - it would be super useful. Have Jenkins show which tests failed in his GitHub messages --- Key: SPARK-3479 URL: https://issues.apache.org/jira/browse/SPARK-3479 Project: Spark Issue Type: Sub-task Components: Build Reporter: Nicholas Chammas A nice thing to do for starters would be to report which category of tests failed (e.g. Python style vs. Spark unit tests), and then later work on showing the specific failures. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3479) Have Jenkins show which tests failed in his GitHub messages
[ https://issues.apache.org/jira/browse/SPARK-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3479: --- Priority: Major (was: Minor) Have Jenkins show which tests failed in his GitHub messages --- Key: SPARK-3479 URL: https://issues.apache.org/jira/browse/SPARK-3479 Project: Spark Issue Type: Sub-task Components: Build Reporter: Nicholas Chammas A nice thing to do for starters would be to report which category of tests failed (e.g. Python style vs. Spark unit tests), and then later work on showing the specific failures. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3479) Have Jenkins show which tests failed in his GitHub messages
[ https://issues.apache.org/jira/browse/SPARK-3479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-3479: --- Issue Type: Improvement (was: Sub-task) Parent: (was: SPARK-2230) Have Jenkins show which tests failed in his GitHub messages --- Key: SPARK-3479 URL: https://issues.apache.org/jira/browse/SPARK-3479 Project: Spark Issue Type: Improvement Components: Build Reporter: Nicholas Chammas A nice thing to do for starters would be to report which category of tests failed (e.g. Python style vs. Spark unit tests), and then later work on showing the specific failures. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2885) All-pairs similarity via DIMSUM
[ https://issues.apache.org/jira/browse/SPARK-2885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-2885. -- Resolution: Fixed Fix Version/s: 1.2.0 Issue resolved by pull request 1778 [https://github.com/apache/spark/pull/1778] All-pairs similarity via DIMSUM --- Key: SPARK-2885 URL: https://issues.apache.org/jira/browse/SPARK-2885 Project: Spark Issue Type: New Feature Reporter: Reza Zadeh Assignee: Reza Zadeh Fix For: 1.2.0 Attachments: SimilarItemsSmallTest.java Build all-pairs similarity algorithm via DIMSUM. Given a dataset of sparse vector data, the all-pairs similarity problem is to find all similar vector pairs according to a similarity function such as cosine similarity, and a given similarity score threshold. Sometimes, this problem is called a “similarity join”. The brute force approach of considering all pairs quickly breaks, since it scales quadratically. For example, for a million vectors, it is not feasible to check all roughly trillion pairs to see if they are above the similarity threshold. Having said that, there exist clever sampling techniques to focus the computational effort on those pairs that are above the similarity threshold, which makes the problem feasible. DIMSUM has a single parameter (called gamma) to tradeoff computation time vs accuracy. Setting gamma from 1 to the largest magnitude allows tradeoff of computation vs accuracy from low computation to high accuracy. For a very large gamma, all cosine similarities are computed exactly with no sampling. Current PR: https://github.com/apache/spark/pull/1778 Justification for adding to MLlib: - All-pairs similarity is missing from MLlib and has been requested several times, e.g. http://bit.ly/XAFGs8 and separately by Jeremy Freeman (see https://github.com/apache/spark/pull/1778#issuecomment-51300825) - Algorithm is used in large-scale production at Twitter. e.g. see https://blog.twitter.com/2012/dimension-independent-similarity-computation-disco . Twitter also open-sourced their version in scalding: https://github.com/twitter/scalding/pull/833 - When used with the gamma parameter set high, this algorithm becomes the normalized gramian matrix, which is useful in RowMatrix alongside the computeGramianMatrix method already in RowMatrix More details about usage at Twitter: https://blog.twitter.com/2014/all-pairs-similarity-via-dimsum For correctness proof, see Theorem 4.3 in http://stanford.edu/~rezab/papers/dimsum.pdf -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
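To make the quadratic blow-up described above concrete, here is a minimal Python sketch (illustrative only, not the DIMSUM algorithm itself) of the brute-force similarity join that breaks at scale:

{code}
# Brute-force all-pairs cosine similarity join: O(n^2) comparisons.
import itertools
import math

def cosine(u, v):
    """Cosine similarity of two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def similarity_join(vectors, threshold):
    """All pairs (i, j, sim) with sim >= threshold."""
    return [(i, j, cosine(u, v))
            for (i, u), (j, v) in itertools.combinations(enumerate(vectors), 2)
            if cosine(u, v) >= threshold]

print similarity_join([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]], 0.5)
{code}

For a million vectors this is the "roughly trillion pairs" the description mentions, which is exactly the cost DIMSUM's gamma-tuned sampling avoids.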
[jira] [Resolved] (SPARK-2230) Improvements to Jenkins QA Harness
[ https://issues.apache.org/jira/browse/SPARK-2230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2230. Resolution: Fixed This was tracking an earlier initiative to clean up this harness. Since we'll have ongoing work on this, I'm going to close the umbrella JIRA for now. In the future we can just track issues as Project Infra that are related to Jenkins. Improvements to Jenkins QA Harness -- Key: SPARK-2230 URL: https://issues.apache.org/jira/browse/SPARK-2230 Project: Spark Issue Type: Umbrella Components: Project Infra Reporter: Patrick Wendell An umbrella for some improvements I'd like to do. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3032) Potential bug when running sort-based shuffle with sorting using TimSort
[ https://issues.apache.org/jira/browse/SPARK-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Matei Zaharia resolved SPARK-3032. -- Resolution: Fixed Fix Version/s: 1.2.0 1.1.1 Potential bug when running sort-based shuffle with sorting using TimSort Key: SPARK-3032 URL: https://issues.apache.org/jira/browse/SPARK-3032 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.1.0 Reporter: Saisai Shao Assignee: Saisai Shao Priority: Blocker Fix For: 1.1.1, 1.2.0 When using SparkPerf's aggregate-by-key workload to test sort-based shuffle, where the data type for key and value is (String, String), I always hit this issue: {noformat} java.lang.IllegalArgumentException: Comparison method violates its general contract! at org.apache.spark.util.collection.Sorter$SortState.mergeLo(Sorter.java:755) at org.apache.spark.util.collection.Sorter$SortState.mergeAt(Sorter.java:493) at org.apache.spark.util.collection.Sorter$SortState.mergeCollapse(Sorter.java:420) at org.apache.spark.util.collection.Sorter$SortState.access$200(Sorter.java:294) at org.apache.spark.util.collection.Sorter.sort(Sorter.java:128) at org.apache.spark.util.collection.SizeTrackingPairBuffer.destructiveSortedIterator(SizeTrackingPairBuffer.scala:83) at org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:323) at org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:271) at org.apache.spark.util.collection.ExternalSorter.maybeSpill(ExternalSorter.scala:249) at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:220) at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:85) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) {noformat} It seems the current partitionKeyComparator, which uses the hashcode of String as the key comparator, breaks some sorting contracts. I also tested using Int as the key data type, and the test passes, since the hashcode of an Int is itself. So I think partitionDiff + the hashcode of String may potentially break the sorting contracts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3032) Potential bug when running sort-based shuffle with sorting using TimSort
[ https://issues.apache.org/jira/browse/SPARK-3032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152014#comment-14152014 ] Matei Zaharia commented on SPARK-3032: -- Yup, this will appear in 1.1.1. I've merged it into branch-1.1 already if you want to try it. Potential bug when running sort-based shuffle with sorting using TimSort Key: SPARK-3032 URL: https://issues.apache.org/jira/browse/SPARK-3032 Project: Spark Issue Type: Bug Components: Shuffle Affects Versions: 1.1.0 Reporter: Saisai Shao Assignee: Saisai Shao Priority: Blocker Fix For: 1.1.1, 1.2.0 When using SparkPerf's aggregate-by-key workload to test sort-based shuffle, where the data type for key and value is (String, String), I always hit this issue: {noformat} java.lang.IllegalArgumentException: Comparison method violates its general contract! at org.apache.spark.util.collection.Sorter$SortState.mergeLo(Sorter.java:755) at org.apache.spark.util.collection.Sorter$SortState.mergeAt(Sorter.java:493) at org.apache.spark.util.collection.Sorter$SortState.mergeCollapse(Sorter.java:420) at org.apache.spark.util.collection.Sorter$SortState.access$200(Sorter.java:294) at org.apache.spark.util.collection.Sorter.sort(Sorter.java:128) at org.apache.spark.util.collection.SizeTrackingPairBuffer.destructiveSortedIterator(SizeTrackingPairBuffer.scala:83) at org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:323) at org.apache.spark.util.collection.ExternalSorter.spill(ExternalSorter.scala:271) at org.apache.spark.util.collection.ExternalSorter.maybeSpill(ExternalSorter.scala:249) at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:220) at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:85) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:54) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) {noformat} It seems the current partitionKeyComparator, which uses the hashcode of String as the key comparator, breaks some sorting contracts. I also tested using Int as the key data type, and the test passes, since the hashcode of an Int is itself. So I think partitionDiff + the hashcode of String may potentially break the sorting contracts. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-2331) SparkContext.emptyRDD has wrong return type
[ https://issues.apache.org/jira/browse/SPARK-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152018#comment-14152018 ] Patrick Wendell commented on SPARK-2331: Yeah we could have made this a wider type in the public signature. However, it is not possible to do that while maintaining compatibility (others may be relying on this returning an EmptyRDD). For now though you can safely cast it to work around this: {code} scala> sc.emptyRDD[String].asInstanceOf[RDD[String]] res7: org.apache.spark.rdd.RDD[String] = EmptyRDD[3] at emptyRDD at <console>:14 {code} SparkContext.emptyRDD has wrong return type --- Key: SPARK-2331 URL: https://issues.apache.org/jira/browse/SPARK-2331 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Ian Hummel The return type for SparkContext.emptyRDD is EmptyRDD[T]. It should be RDD[T]. That means you have to add extra type annotations on code like the below (which creates a union of RDDs over some subset of paths in a folder) val rdds = Seq(a, b, c).foldLeft[RDD[String]](sc.emptyRDD[String]) { (rdd, path) ⇒ rdd.union(sc.textFile(path)) } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-2331) SparkContext.emptyRDD has wrong return type
[ https://issues.apache.org/jira/browse/SPARK-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell resolved SPARK-2331. Resolution: Won't Fix SparkContext.emptyRDD has wrong return type --- Key: SPARK-2331 URL: https://issues.apache.org/jira/browse/SPARK-2331 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Ian Hummel The return type for SparkContext.emptyRDD is EmptyRDD[T]. It should be RDD[T]. That means you have to add extra type annotations on code like the below (which creates a union of RDDs over some subset of paths in a folder) val rdds = Seq(a, b, c).foldLeft[RDD[String]](sc.emptyRDD[String]) { (rdd, path) ⇒ rdd.union(sc.textFile(path)) } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-2331) SparkContext.emptyRDD has wrong return type
[ https://issues.apache.org/jira/browse/SPARK-2331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152018#comment-14152018 ] Patrick Wendell edited comment on SPARK-2331 at 9/29/14 6:34 PM: - Yeah we could have made this a wider type in the public signature. However, it is not possible to do that while maintaining compatibility (others may be relying on this returning an EmptyRDD). This issue is not related to covariance because here the type parameter is always String. So here the compiler actually does understand that EmptyRDD[String] is a subtype of RDD[String]. The issue with the original example is that the Scala compiler will always try to infer the narrowest type it can. So in a foldLeft expression it will by default assume the resulting type is EmptyRDD unless you upcast it to a more general type like you are doing. And the union operation requires an exact type match on the two RDDs, including the type parameter. was (Author: pwendell): Yeah we could have made this a wider type in the public signature. However, it is not possible to do that while maintaining compatibility (others may be relying on this returning an EmptyRDD). For now though you can safely cast it to work around this: {code} scala> sc.emptyRDD[String].asInstanceOf[RDD[String]] res7: org.apache.spark.rdd.RDD[String] = EmptyRDD[3] at emptyRDD at <console>:14 {code} SparkContext.emptyRDD has wrong return type --- Key: SPARK-2331 URL: https://issues.apache.org/jira/browse/SPARK-2331 Project: Spark Issue Type: Bug Affects Versions: 1.0.0 Reporter: Ian Hummel The return type for SparkContext.emptyRDD is EmptyRDD[T]. It should be RDD[T]. That means you have to add extra type annotations on code like the below (which creates a union of RDDs over some subset of paths in a folder) val rdds = Seq(a, b, c).foldLeft[RDD[String]](sc.emptyRDD[String]) { (rdd, path) ⇒ rdd.union(sc.textFile(path)) } -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3725) Link to building spark returns a 404
Anant Daksh Asthana created SPARK-3725: -- Summary: Link to building spark returns a 404 Key: SPARK-3725 URL: https://issues.apache.org/jira/browse/SPARK-3725 Project: Spark Issue Type: Documentation Reporter: Anant Daksh Asthana Priority: Minor The README.md link to Building Spark returns a 404 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3726) RandomForest: Support for bootstrap options
Joseph K. Bradley created SPARK-3726: Summary: RandomForest: Support for bootstrap options Key: SPARK-3726 URL: https://issues.apache.org/jira/browse/SPARK-3726 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley Priority: Minor RandomForest uses BaggedPoint to simulate bootstrapped samples of the data. The expected size of each sample is the same as the original data (sampling rate = 1.0), and sampling is done with replacement. Adding support for other sampling rates and for sampling without replacement would be useful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
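A hedged sketch of what those options could look like (illustrative NumPy code; MLlib's actual BaggedPoint is Scala and currently fixes the rate at 1.0 with replacement):

{code}
import numpy as np

def bagged_weights(n_instances, n_trees, subsample_rate=1.0, replacement=True):
    """Weight of each instance in each tree's simulated bootstrap sample."""
    if replacement:
        # Sampling with replacement at a given rate is commonly simulated by
        # drawing per-instance counts from Poisson(subsample_rate).
        return np.random.poisson(lam=subsample_rate, size=(n_instances, n_trees))
    # Without replacement, each instance appears at most once per tree.
    return (np.random.rand(n_instances, n_trees) < subsample_rate).astype(int)

print bagged_weights(5, 3)                                         # rate 1.0, with replacement
print bagged_weights(5, 3, subsample_rate=0.6, replacement=False)  # subsampled, without replacement
{code}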
[jira] [Created] (SPARK-3727) DecisionTree, RandomForest: More prediction functionality
Joseph K. Bradley created SPARK-3727: Summary: DecisionTree, RandomForest: More prediction functionality Key: SPARK-3727 URL: https://issues.apache.org/jira/browse/SPARK-3727 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley DecisionTree and RandomForest currently predict the most likely label for classification and the mean for regression. Other info about predictions would be useful. For classification: estimated probability of each possible label. For regression: variance of the estimate. RandomForest could also create aggregate predictions in multiple ways: * Predict mean or median value for regression. * Compute variance of estimates (across all trees) for both classification and regression. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3725) Link to building spark returns a 404
[ https://issues.apache.org/jira/browse/SPARK-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152047#comment-14152047 ] Anant Daksh Asthana commented on SPARK-3725: Would it make sense to add a building-Spark document in the repo? That would make it easier to find documentation, and anyone who has the source would have the docs for it as well. Link to building spark returns a 404 Key: SPARK-3725 URL: https://issues.apache.org/jira/browse/SPARK-3725 Project: Spark Issue Type: Documentation Reporter: Anant Daksh Asthana Priority: Minor Original Estimate: 1m Remaining Estimate: 1m The README.md link to Building Spark returns a 404 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3725) Link to building spark returns a 404
[ https://issues.apache.org/jira/browse/SPARK-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152050#comment-14152050 ] Sean Owen commented on SPARK-3725: -- Yes of course, it's already in the repo and has been for a while. It was just renamed with a redirect from the old URL. But that update hasn't hit the public site yet. Link to building spark returns a 404 Key: SPARK-3725 URL: https://issues.apache.org/jira/browse/SPARK-3725 Project: Spark Issue Type: Documentation Reporter: Anant Daksh Asthana Priority: Minor Original Estimate: 1m Remaining Estimate: 1m The README.md link to Building Spark returns a 404 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-922) Update Spark AMI to Python 2.7
[ https://issues.apache.org/jira/browse/SPARK-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14150695#comment-14150695 ] Andrew Davidson edited comment on SPARK-922 at 9/29/14 7:05 PM: here is how I am launching IPython notebook. I am running as the ec2-user: IPYTHON_OPTS="notebook --pylab inline --no-browser --port=7000" $SPARK_HOME/bin/pyspark Below are all the upgrade commands I ran. I ran into a small problem: the IPython magic %matplotlib inline creates an error; you can work around this by commenting it out. Andy yum install -y pssh yum install -y python27 python27-devel pssh -h /root/spark-ec2/slaves yum install -y python27 python27-devel wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27 pssh -h /root/spark-ec2/slaves "wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27" easy_install-2.7 pip pssh -h /root/spark-ec2/slaves easy_install-2.7 pip pip2.7 install numpy pssh -t0 -h /root/spark-ec2/slaves pip2.7 install numpy pip2.7 install ipython[all] printf "\n# Set Spark Python version\nexport PYSPARK_PYTHON=/usr/bin/python2.7\n" >> /root/spark/conf/spark-env.sh source /root/spark/conf/spark-env.sh was (Author: aedwip): here is how I am launching IPython notebook. I am running as the ec2-user: IPYTHON_OPTS="notebook --pylab inline --no-browser --port=7000" $SPARK_HOME/bin/pyspark Below are all the upgrade commands I ran. Any idea what I missed? Andy yum install -y pssh yum install -y python27 python27-devel pssh -h /root/spark-ec2/slaves yum install -y python27 python27-devel wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27 pssh -h /root/spark-ec2/slaves "wget https://bitbucket.org/pypa/setuptools/raw/bootstrap/ez_setup.py -O - | python27" easy_install-2.7 pip pssh -h /root/spark-ec2/slaves easy_install-2.7 pip pip2.7 install numpy pssh -t0 -h /root/spark-ec2/slaves pip2.7 install numpy pip2.7 install ipython[all] printf "\n# Set Spark Python version\nexport PYSPARK_PYTHON=/usr/bin/python2.7\n" >> /root/spark/conf/spark-env.sh source /root/spark/conf/spark-env.sh Update Spark AMI to Python 2.7 -- Key: SPARK-922 URL: https://issues.apache.org/jira/browse/SPARK-922 Project: Spark Issue Type: Task Components: EC2, PySpark Affects Versions: 0.9.0, 0.9.1, 1.0.0 Reporter: Josh Rosen Fix For: 1.2.0 Many Python libraries only support Python 2.7+, so we should make Python 2.7 the default Python on the Spark AMIs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3729) Null-pointer when constructing a HiveContext when settings are present
Michael Armbrust created SPARK-3729: --- Summary: Null-pointer when constructing a HiveContext when settings are present Key: SPARK-3729 URL: https://issues.apache.org/jira/browse/SPARK-3729 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Priority: Blocker {code} java.lang.NullPointerException at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:307) at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:270) at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:242) at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79) at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:78) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at org.apache.spark.sql.SQLContext.init(SQLContext.scala:78) at org.apache.spark.sql.hive.HiveContext.init(HiveContext.scala:76) ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-3729) Null-pointer when constructing a HiveContext when settings are present
[ https://issues.apache.org/jira/browse/SPARK-3729?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust reassigned SPARK-3729: --- Assignee: Michael Armbrust Null-pointer when constructing a HiveContext when settings are present -- Key: SPARK-3729 URL: https://issues.apache.org/jira/browse/SPARK-3729 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker {code} java.lang.NullPointerException at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:307) at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:270) at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:242) at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79) at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:78) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at org.apache.spark.sql.SQLContext.init(SQLContext.scala:78) at org.apache.spark.sql.hive.HiveContext.init(HiveContext.scala:76) ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1547) Add gradient boosting algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152101#comment-14152101 ] Joseph K. Bradley commented on SPARK-1547: -- This will be great to have! The WIP code and the list of to-do items look good to me. Small comment: For the losses, it would be good to rename residual to either pseudoresidual (following Friedman's paper) or to lossGradient (which is more literal/accurate). It would also be nice to have the loss classes compute the loss itself, so that we can compute that at the end (and later track it along the way). Add gradient boosting algorithm to MLlib Key: SPARK-1547 URL: https://issues.apache.org/jira/browse/SPARK-1547 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.0.0 Reporter: Manish Amde Assignee: Manish Amde This task requires adding the gradient boosting algorithm to Spark MLlib. The implementation needs to adapt the gradient boosting algorithm to the scalable tree implementation. The tasks involves: - Comparing the various tradeoffs and finalizing the algorithm before implementation - Code implementation - Unit tests - Functional tests - Performance tests - Documentation [Ensembles design document (Google doc) | https://docs.google.com/document/d/1J0Q6OP2Ggx0SOtlPgRUkwLASrAkUJw6m6EK12jRDSNg/] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
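A toy Python rendering of that naming suggestion (hypothetical class names; the WIP code under discussion is Scala), with the loss value and its gradient side by side:

{code}
class SquaredErrorLoss(object):
    """Example loss class exposing both the loss value and its gradient."""

    def loss(self, prediction, label):
        return (prediction - label) ** 2

    def lossGradient(self, prediction, label):
        # Gradient of the loss w.r.t. the prediction; the pseudoresidual that
        # the next boosting iteration fits is the negative of this value.
        return 2.0 * (prediction - label)

l = SquaredErrorLoss()
print l.loss(0.8, 1.0), l.lossGradient(0.8, 1.0)  # 0.04 -0.4
{code}

Having loss() available separately is what would let the training loop report (and later track) the objective at each iteration, as the comment suggests.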
[jira] [Commented] (SPARK-3729) Null-pointer when constructing a HiveContext when settings are present
[ https://issues.apache.org/jira/browse/SPARK-3729?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152113#comment-14152113 ] Apache Spark commented on SPARK-3729: - User 'marmbrus' has created a pull request for this issue: https://github.com/apache/spark/pull/2583 Null-pointer when constructing a HiveContext when settings are present -- Key: SPARK-3729 URL: https://issues.apache.org/jira/browse/SPARK-3729 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Michael Armbrust Priority: Blocker {code} java.lang.NullPointerException at org.apache.spark.sql.hive.HiveContext.runHive(HiveContext.scala:307) at org.apache.spark.sql.hive.HiveContext.runSqlHive(HiveContext.scala:270) at org.apache.spark.sql.hive.HiveContext.setConf(HiveContext.scala:242) at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:79) at org.apache.spark.sql.SQLContext$$anonfun$1.apply(SQLContext.scala:78) at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108) at org.apache.spark.sql.SQLContext.init(SQLContext.scala:78) at org.apache.spark.sql.hive.HiveContext.init(HiveContext.scala:76) ... {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3725) Link to building spark returns a 404
[ https://issues.apache.org/jira/browse/SPARK-3725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152138#comment-14152138 ] Sean Owen commented on SPARK-3725: -- No, that links to the raw markdown. Truly, the fix is to rebuild the site. The source is fine. Link to building spark returns a 404 Key: SPARK-3725 URL: https://issues.apache.org/jira/browse/SPARK-3725 Project: Spark Issue Type: Documentation Reporter: Anant Daksh Asthana Priority: Minor Original Estimate: 1m Remaining Estimate: 1m The README.md link to Building Spark returns a 404 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3730) Any one else having building spark recently
[ https://issues.apache.org/jira/browse/SPARK-3730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152140#comment-14152140 ] Sean Owen commented on SPARK-3730: -- (The profile is hadoop-2.3 but that's not the issue.) I have seen this too and it's a {{scalac}} bug as far as I can tell, as you can see from the stack trace. It's not a Spark issue. Any one else having building spark recently --- Key: SPARK-3730 URL: https://issues.apache.org/jira/browse/SPARK-3730 Project: Spark Issue Type: Question Reporter: Anant Daksh Asthana Priority: Minor I get an assertion error in spark/core/src/main/scala/org/apache/spark/HttpServer.scala while trying to build. I am building using mvn -Pyarn -PHadoop-2.3 -DskipTests -Phive clean package Here is the error I get http://pastebin.com/Shi43r53 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3685) Spark's local dir should accept only local paths
[ https://issues.apache.org/jira/browse/SPARK-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152152#comment-14152152 ] Matthew Farrellee commented on SPARK-3685: -- The root of the resource problem is how resources are handed out. YARN is giving you a whole CPU, some amount of memory, some amount of network and some amount of disk to work with. Your executor (like any program) uses different amounts of resources throughout its execution. At points in the execution the resource profile changes; call the demarcated regions phases. So an executor may transition from a high-resource phase to a low-resource phase. In a low-resource phase, you may want to free up resources for other executors, but maintain enough to do basic operations (say: serve a shuffle file). This is a problem that should be solved by the resource manager. In my opinion, a solution w/i Spark that isn't facilitated by the RN is a workaround/hack and should be avoided. An example of an RN-facilitated solution might be a message the executor can send to YARN to indicate its resources can be freed, except for some minimum amount. Spark's local dir should accept only local paths Key: SPARK-3685 URL: https://issues.apache.org/jira/browse/SPARK-3685 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.1.0 Reporter: Andrew Or When you try to set local dirs to hdfs:/tmp/foo it doesn't work. What it will try to do is create a folder called hdfs: and put tmp inside it. This is because in Util#getOrCreateLocalRootDirs we use java.io.File instead of Hadoop's file system to parse this path. We also need to resolve the path appropriately. This may not have an urgent use case, but it fails silently and does what is least expected. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2693) Support for UDAF Hive Aggregates like PERCENTILE
[ https://issues.apache.org/jira/browse/SPARK-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2693: Priority: Critical (was: Major) Support for UDAF Hive Aggregates like PERCENTILE Key: SPARK-2693 URL: https://issues.apache.org/jira/browse/SPARK-2693 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Ravindra Pesala Priority: Critical {code} SELECT MIN(field1), MAX(field2), AVG(field3), PERCENTILE(field4), year,month,day FROM raw_data_table GROUP BY year, month, day MIN, MAX and AVG functions work fine for me, but with PERCENTILE, I get an error as shown below. Exception in thread main java.lang.RuntimeException: No handler for udf class org.apache.hadoop.hive.ql.udf.UDAFPercentile at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.hive.HiveFunctionRegistry$.lookupFunction(hiveUdfs.scala:69) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$4$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:115) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$4$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:113) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165) {code} This aggregate extends UDAF, which we don't yet have a wrapper for. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2693) Support for UDAF Hive Aggregates like PERCENTILE
[ https://issues.apache.org/jira/browse/SPARK-2693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2693: Assignee: Ravindra Pesala Support for UDAF Hive Aggregates like PERCENTILE Key: SPARK-2693 URL: https://issues.apache.org/jira/browse/SPARK-2693 Project: Spark Issue Type: Bug Components: SQL Reporter: Michael Armbrust Assignee: Ravindra Pesala {code} SELECT MIN(field1), MAX(field2), AVG(field3), PERCENTILE(field4), year,month,day FROM raw_data_table GROUP BY year, month, day MIN, MAX and AVG functions work fine for me, but with PERCENTILE, I get an error as shown below. Exception in thread main java.lang.RuntimeException: No handler for udf class org.apache.hadoop.hive.ql.udf.UDAFPercentile at scala.sys.package$.error(package.scala:27) at org.apache.spark.sql.hive.HiveFunctionRegistry$.lookupFunction(hiveUdfs.scala:69) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$4$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:115) at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$4$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:113) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165) {code} This aggregate extends UDAF, which we don't yet have a wrapper for. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3731) RDD caching stops working in pyspark after some time
Milan Straka created SPARK-3731: --- Summary: RDD caching stops working in pyspark after some time Key: SPARK-3731 URL: https://issues.apache.org/jira/browse/SPARK-3731 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 1.1.0 Environment: Linux, 32bit, standalone mode Reporter: Milan Straka Consider a file F which when loaded with sc.textFile and cached takes up slightly more than half of free memory for RDD cache. When in PySpark the following is executed: 1) a = sc.textFile(F) 2) a.cache().count() 3) b = sc.textFile(F) 4) b.cache().count() and then the following is repeated (for example 10 times): a) a.unpersist().cache().count() b) b.unpersist().cache().count() after some time, there are no RDDs cached in memory. Also, since that time, no other RDD ever gets cached (the worker always reports something like "WARN CacheManager: Not enough space to cache partition rdd_23_5 in memory! Free memory is 277478190 bytes.", even if rdd_23_5 is ~50MB). The Executors tab of the Application Detail UI shows that all executors have 0MB memory used (which is consistent with the CacheManager warning). When doing the same in Scala, everything works perfectly. I understand that this is a vague description, but I do not know how to describe the problem better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
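For reference, the reported steps condense into the following runnable PySpark script (the file path and master URL are placeholders to be adapted to the cluster where the bug reproduces):

{code}
from pyspark import SparkContext

sc = SparkContext('local', 'caching_bug_repro')
F = '/path/to/file/F'  # cached size slightly more than half the RDD cache

a = sc.textFile(F)
a.cache().count()
b = sc.textFile(F)
b.cache().count()

for _ in xrange(10):
    # After some iterations nothing stays cached, and the CacheManager
    # "Not enough space to cache partition" warning starts appearing.
    a.unpersist().cache().count()
    b.unpersist().cache().count()
{code}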
[jira] [Updated] (SPARK-3731) RDD caching stops working in pyspark after some time
[ https://issues.apache.org/jira/browse/SPARK-3731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Milan Straka updated SPARK-3731: Attachment: worker.log Sample worker.log showing the problem. For example, consider rdd_1_1. It has size 46.3MB. At the beginning, the caching works, but then it stops -- the last time rdd_1_1 does not fit into the cache, the following is reported: {{14/09/29 21:53:10 WARN CacheManager: Not enough space to cache partition rdd_1_1 in memory! Free memory is 148908945 bytes.}} RDD caching stops working in pyspark after some time Key: SPARK-3731 URL: https://issues.apache.org/jira/browse/SPARK-3731 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 1.1.0 Environment: Linux, 32bit, standalone mode Reporter: Milan Straka Attachments: worker.log Consider a file F which when loaded with sc.textFile and cached takes up slightly more than half of free memory for RDD cache. When in PySpark the following is executed: 1) a = sc.textFile(F) 2) a.cache().count() 3) b = sc.textFile(F) 4) b.cache().count() and then the following is repeated (for example 10 times): a) a.unpersist().cache().count() b) b.unpersist().cache().count() after some time, there are no RDDs cached in memory. Also, since that time, no other RDD ever gets cached (the worker always reports something like "WARN CacheManager: Not enough space to cache partition rdd_23_5 in memory! Free memory is 277478190 bytes.", even if rdd_23_5 is ~50MB). The Executors tab of the Application Detail UI shows that all executors have 0MB memory used (which is consistent with the CacheManager warning). When doing the same in Scala, everything works perfectly. I understand that this is a vague description, but I do not know how to describe the problem better. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3685) Spark's local dir should accept only local paths
[ https://issues.apache.org/jira/browse/SPARK-3685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152228#comment-14152228 ] Andrew Or commented on SPARK-3685: -- Not sure if I fully understand what you mean. If I'm running an executor and I request 30G from the beginning, my application uses all of it to do computation and all is good. After I decommission the executor, I would like to keep 1G just to serve the shuffle files, but this can't be done easily because we need to start a smaller JVM and a smaller container. (Yarn currently doesn't support scaling the size of a container while it's still running.) Either way we need to transfer some state from the bigger JVM to the smaller JVM, and that adds some complexity to the design. The simplest alternative then would be just to write whatever state to an external location and terminate the executor JVM / container without starting a smaller one, and then have a long-running external service to serve these files. One proposal here then is to write these shuffle files to a special location and have the Yarn NM shuffle service serve the files. This is an alternative to DFS shuffle that is, however, highly specific to Yarn. I am doing some initial prototyping of this (the Yarn shuffle) approach to see how this will pan out. Spark's local dir should accept only local paths Key: SPARK-3685 URL: https://issues.apache.org/jira/browse/SPARK-3685 Project: Spark Issue Type: Bug Components: Spark Core, YARN Affects Versions: 1.1.0 Reporter: Andrew Or When you try to set local dirs to hdfs:/tmp/foo it doesn't work. What it will try to do is create a folder called hdfs: and put tmp inside it. This is because in Util#getOrCreateLocalRootDirs we use java.io.File instead of Hadoop's file system to parse this path. We also need to resolve the path appropriately. This may not have an urgent use case, but it fails silently and does what is least expected. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3709) BroadcastSuite.Unpersisting org.apache.spark.broadcast.BroadcastSuite.Unpersisting TorrentBroadcast is flaky
[ https://issues.apache.org/jira/browse/SPARK-3709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152238#comment-14152238 ] Reynold Xin commented on SPARK-3709: Adding stack trace: {code} [info] - Unpersisting TorrentBroadcast on executors only in distributed mode *** FAILED *** [info] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 17, localhost): java.io.IOException: sendMessageReliably failed with ACK that signalled a remote error [info] org.apache.spark.network.nio.ConnectionManager$$anonfun$14.apply(ConnectionManager.scala:864) [info] org.apache.spark.network.nio.ConnectionManager$$anonfun$14.apply(ConnectionManager.scala:856) [info] org.apache.spark.network.nio.ConnectionManager$MessageStatus.markDone(ConnectionManager.scala:61) [info] org.apache.spark.network.nio.ConnectionManager.org$apache$spark$network$nio$ConnectionManager$$handleMessage(ConnectionManager.scala:655) [info] org.apache.spark.network.nio.ConnectionManager$$anon$10.run(ConnectionManager.scala:515) [info] java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [info] java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [info] java.lang.Thread.run(Thread.java:745) [info] Driver stacktrace: [info] at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1192) [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1181) [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1180) [info] at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) [info] at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) [info] at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1180) [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:695) [info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:695) [info] at scala.Option.foreach(Option.scala:236) [info] at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:695) [info] ... {code} BroadcastSuite.Unpersisting org.apache.spark.broadcast.BroadcastSuite.Unpersisting TorrentBroadcast is flaky Key: SPARK-3709 URL: https://issues.apache.org/jira/browse/SPARK-3709 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Wendell Assignee: Cheng Lian Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()
Sotos Matzanas created SPARK-3732: - Summary: Yarn Client: Add option to NOT System.exit() at end of main() Key: SPARK-3732 URL: https://issues.apache.org/jira/browse/SPARK-3732 Project: Spark Issue Type: Improvement Affects Versions: 1.1.0 Reporter: Sotos Matzanas We would like to add the ability to create and submit Spark jobs programmatically via Scala/Java. We have found a way to hack this and submit jobs via Yarn, but since org.apache.spark.deploy.yarn.Client.main() exits with either 0 or 1 at the end, it terminates our own program as well. We would like to add an optional Spark conf param to NOT exit at the end of main(). -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
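A minimal sketch of the requested behavior (both the conf key name and the runClient helper here are hypothetical, not the actual patch):
{code}
// Hypothetical sketch of Client.main() guarding its System.exit() call.
def main(args: Array[String]): Unit = {
  val exitCode = runClient(args)  // hypothetical wrapper around the existing submission logic
  // Default stays true so spark-submit's exit codes keep working for existing callers.
  if (sparkConf.getBoolean("spark.yarn.client.exitOnCompletion", true)) {
    System.exit(exitCode)
  }
}
{code}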
[jira] [Commented] (SPARK-1547) Add gradient boosting algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152251#comment-14152251 ] Manish Amde commented on SPARK-1547: Sure. I like your naming suggestion. I will rebase from the latest master now that the RF PR has been accepted. I will create a WIP PR soon after (with tests and docs) so that we can discuss the code in greater detail. Add gradient boosting algorithm to MLlib Key: SPARK-1547 URL: https://issues.apache.org/jira/browse/SPARK-1547 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.0.0 Reporter: Manish Amde Assignee: Manish Amde This task requires adding the gradient boosting algorithm to Spark MLlib. The implementation needs to adapt the gradient boosting algorithm to the scalable tree implementation. The task involves: - Comparing the various tradeoffs and finalizing the algorithm before implementation - Code implementation - Unit tests - Functional tests - Performance tests - Documentation [Ensembles design document (Google doc) | https://docs.google.com/document/d/1J0Q6OP2Ggx0SOtlPgRUkwLASrAkUJw6m6EK12jRDSNg/] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
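For readers unfamiliar with the technique, here is a conceptual sketch of gradient boosting with squared-error loss on top of MLlib's existing tree trainer; it is illustrative only and not the API proposed in the design doc:
{code}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.rdd.RDD

// Fit each new tree to the residuals left by the ensemble built so far.
def boost(data: RDD[LabeledPoint], numIterations: Int, learningRate: Double): Seq[DecisionTreeModel] = {
  var residuals = data
  (0 until numIterations).map { _ =>
    val tree = DecisionTree.trainRegressor(residuals, Map[Int, Int](), "variance", 3, 32)
    // For squared error, the negative gradient is exactly the residual.
    residuals = residuals.map(p =>
      LabeledPoint(p.label - learningRate * tree.predict(p.features), p.features))
    tree
  }
}
{code}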
[jira] [Commented] (SPARK-3730) Anyone else having trouble building Spark recently
[ https://issues.apache.org/jira/browse/SPARK-3730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152269#comment-14152269 ] Anant Daksh Asthana commented on SPARK-3730: Definitely not a Spark issue. Just thought someone on here might know a solution. Anyone else having trouble building Spark recently --- Key: SPARK-3730 URL: https://issues.apache.org/jira/browse/SPARK-3730 Project: Spark Issue Type: Question Reporter: Anant Daksh Asthana Priority: Minor I get an assertion error in spark/core/src/main/scala/org/apache/spark/HttpServer.scala while trying to build. I am building with mvn -Pyarn -Phadoop-2.3 -DskipTests -Phive clean package Here is the error I get: http://pastebin.com/Shi43r53 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3717) DecisionTree, RandomForest: Partition by feature
[ https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-3717: - Description: h1. Summary Currently, data are partitioned by row/instance for DecisionTree and RandomForest. This JIRA argues for partitioning by feature for training deep trees. This is especially relevant for random forests, which are often trained to be deeper than single decision trees. h1. Details Dataset dimensions and the depth of the tree to be trained are the main problem parameters determining whether it is better to partition features or instances. For random forests (training many deep trees), partitioning features could be much better. Notation: * P = # workers * N = # instances * M = # features * D = depth of tree h2. Partitioning Features Algorithm sketch: * Each worker stores: ** a subset of columns (i.e., a subset of features). If a worker stores feature j, then the worker stores the feature value for all instances (i.e., the whole column). ** all labels * Train one level at a time. * Invariants: ** Each worker stores a mapping: instance → node in current level * On each iteration: ** Each worker: For each node in level, compute (best feature to split, info gain). ** Reduce (P x M) values to M values to find best split for each node. ** Workers who have features used in best splits communicate left/right for relevant instances. Gather total of N bits to master, then broadcast. * Total communication: ** Depth D iterations ** On each iteration, reduce to M values (~8 bytes each), broadcast N values (1 bit each). ** Estimate: D * (M * 8 + N) h2. Partitioning Instances Algorithm sketch: * Train one group of nodes at a time. * Invariants: ** Each worker stores a mapping: instance → node * On each iteration: ** Each worker: For each instance, add to aggregate statistics. ** Aggregate is of size (# nodes in group) x M x (# bins) x (# classes) *** (“# classes” is for classification; 3 for regression) ** Reduce aggregate. ** Master chooses best split for each node in group and broadcasts. * Local training: Once all instances for a node fit on one machine, it can be best to shuffle data and train subtrees locally. This can mean shuffling the entire dataset for each tree trained. * Summing over all iterations, reduce to total of: ** (# nodes in tree) x M x (# bins B) x (# classes C) values (~8 bytes each) ** Estimate: 2^D * M * B * C * 8 h2. Comparing Partitioning Methods Partitioning features costs less than partitioning instances when: * D * (M * 8 + N) < 2^D * M * B * C * 8 * D * N < 2^D * M * B * C * 8 (assuming D * M * 8 is small compared to the right-hand side) * N < [ 2^D * M * B * C * 8 ] / D Example: many instances: * 2 million instances, 3500 features, 100 bins, 5 classes, 6 levels (depth = 5) * Partitioning features: 6 * ( 3500 * 8 + 2*10^6 ) =~ 1.2 * 10^7 * Partitioning instances: 32 * 3500 * 100 * 5 * 8 =~ 4.5 * 10^8
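To make the comparison concrete, the two estimates can be written as small Scala helpers; the numbers below reproduce the worked example (a sketch of the back-of-envelope arithmetic above, not measured costs):
{code}
// Rough communication-cost estimates from the analysis above, in bytes.
def featurePartitionCost(levels: Int, m: Long, n: Long): Double =
  levels * (m * 8.0 + n)                  // D iterations of reduce-M plus broadcast-N

def instancePartitionCost(depth: Int, m: Long, b: Long, c: Long): Double =
  math.pow(2, depth) * m * b * c * 8.0    // ~2^D nodes, each with an M x B x C aggregate

featurePartitionCost(6, 3500, 2000000L)   // ≈ 1.2e7, matching the example
instancePartitionCost(5, 3500, 100, 5)    // ≈ 4.5e8, matching the example
{code}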
[jira] [Commented] (SPARK-3434) Distributed block matrix
[ https://issues.apache.org/jira/browse/SPARK-3434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152275#comment-14152275 ] Reza Zadeh commented on SPARK-3434: --- It looks like Shivaram Venkataraman from the AMPlab has started work on this. I will be meeting with him to see if we can reuse some of his work. Distributed block matrix Key: SPARK-3434 URL: https://issues.apache.org/jira/browse/SPARK-3434 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Xiangrui Meng This JIRA is for discussing distributed matrices stored as block sub-matrices. The main challenge is choosing a partitioning scheme that allows adding linear algebra operations in the future, e.g.: 1. matrix multiplication 2. matrix factorization (QR, LU, ...) Let's discuss the partitioning and storage and how they fit into the above use cases. Questions: 1. Should it be backed by a single RDD that contains all of the sub-matrices, or by many RDDs, each containing only one sub-matrix? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
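To picture the first option, a single RDD keyed by block coordinates might look like the sketch below (the names are illustrative, not a proposed API). A grid partitioner on the (blockRow, blockCol) key would then let operations such as multiplication ship only the blocks each output block needs.
{code}
import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.rdd.RDD

// Option 1: one RDD holding every sub-matrix, keyed by its grid coordinate.
case class BlockMatrix(
    blocks: RDD[((Int, Int), Matrix)],  // (blockRow, blockCol) -> sub-matrix
    rowsPerBlock: Int,
    colsPerBlock: Int)
{code}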
[jira] [Commented] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()
[ https://issues.apache.org/jira/browse/SPARK-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152291#comment-14152291 ] Apache Spark commented on SPARK-3732: - User 'smatzana' has created a pull request for this issue: https://github.com/apache/spark/pull/2584 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()
[ https://issues.apache.org/jira/browse/SPARK-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152294#comment-14152294 ] Marcelo Vanzin commented on SPARK-3732: --- I think that explicit System.exit() could just be removed. Exposing an option for this sounds like overkill. [~tgraves] had some comments about that call in the past, though. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()
[ https://issues.apache.org/jira/browse/SPARK-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152302#comment-14152302 ] Sotos Matzanas commented on SPARK-3732: --- We added the option as insurance for backward compatibility; removing the System.exit() call will obviously work unless somebody is explicitly checking the exit code from spark-submit. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()
[ https://issues.apache.org/jira/browse/SPARK-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152308#comment-14152308 ] Thomas Graves commented on SPARK-3732: -- I think you should just change the name of this jira to add support for programmatically calling the Spark Yarn Client. As you have found, this isn't currently supported, and I wouldn't want to just remove the exit and say it's supported without thinking about other implications. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()
[ https://issues.apache.org/jira/browse/SPARK-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152309#comment-14152309 ] Marcelo Vanzin commented on SPARK-3732: --- Removing the call should work regardless; it's redundant, since the code will just exit normally anyway after that. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()
[ https://issues.apache.org/jira/browse/SPARK-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152318#comment-14152318 ] Marcelo Vanzin commented on SPARK-3732: --- BTW, if the call is removed, it should be possible to do what you want more generically by calling {{SparkSubmit.main}} directly. That's still a little fishy, but it's a lot less fishy than calling Yarn-specific code directly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
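As a sketch of that suggestion, the in-process equivalent of the spark-submit script would look roughly like this (the master, class, and jar values are placeholders):
{code}
import org.apache.spark.deploy.SparkSubmit

// Calls spark-submit's entry point directly instead of the Yarn-specific Client.
SparkSubmit.main(Array(
  "--master", "yarn-cluster",
  "--class", "com.example.MyJob",
  "/path/to/my-job.jar"))
{code}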
[jira] [Commented] (SPARK-3708) Backticks aren't handled correctly in aliases
[ https://issues.apache.org/jira/browse/SPARK-3708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152321#comment-14152321 ] Ravindra Pesala commented on SPARK-3708: I guess you mean HiveContext here, since SqlContext has no backtick support. I will work on this issue. Thank you. Backticks aren't handled correctly in aliases - Key: SPARK-3708 URL: https://issues.apache.org/jira/browse/SPARK-3708 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.1.0 Reporter: Michael Armbrust Here's a failing test case: {code} sql("SELECT k FROM (SELECT `key` AS `k` FROM src) a") {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()
[ https://issues.apache.org/jira/browse/SPARK-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152323#comment-14152323 ] Sotos Matzanas commented on SPARK-3732: --- [~tgraves] this jira is the first step for us to move forward with our own (very limited for now) version of programmatic job submission. I can add another jira to address the bigger issue, but we would like to see this one resolved first. Once we are confident with our solution, we can contribute to the second jira. Let us know what you think. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-3733) Support for programmatically submitting Spark jobs
Sotos Matzanas created SPARK-3733: - Summary: Support for programmatically submitting Spark jobs Key: SPARK-3733 URL: https://issues.apache.org/jira/browse/SPARK-3733 Project: Spark Issue Type: New Feature Affects Versions: 1.1.0 Reporter: Sotos Matzanas Right now it's impossible to submit Spark jobs programmatically via a Scala (or Java) API. We would like to see that in a future version of Spark. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3717) DecisionTree, RandomForest: Partition by feature
[ https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152347#comment-14152347 ] Sung Chung commented on SPARK-3717: --- I think that this would be great as an alternative option. 1. Partitioning by rows (as currently implemented) scales in # of rows. 2. Partitioning by features scales in # of features. With good modularization, I think a lot of tree logic (splitting, building trees) could be shared among the different partitioning schemes. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3630) Identify cause of Kryo+Snappy PARSING_ERROR
[ https://issues.apache.org/jira/browse/SPARK-3630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152351#comment-14152351 ] Arun Ahuja commented on SPARK-3630: --- We have seen this issue as well: {code} com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to uncompress the chunk: PARSING_ERROR(2) at com.esotericsoftware.kryo.io.Input.fill(Input.java:142) at com.esotericsoftware.kryo.io.Input.require(Input.java:155) at com.esotericsoftware.kryo.io.Input.readInt(Input.java:337) at com.esotericsoftware.kryo.util.DefaultClassResolver.readClass(DefaultClassResolver.java:109) at com.esotericsoftware.kryo.Kryo.readClass(Kryo.java:610) at com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:721) at org.apache.spark.serializer.KryoDeserializationStream.readObject(KryoSerializer.scala:133) at org.apache.spark.serializer.DeserializationStream$$anon$1.getNext(Serializer.scala:133) at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71) ... Caused by: java.io.IOException: failed to uncompress the chunk: PARSING_ERROR(2) at org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:362) at org.xerial.snappy.SnappyInputStream.rawRead(SnappyInputStream.java:159) at org.xerial.snappy.SnappyInputStream.read(SnappyInputStream.java:142) at com.esotericsoftware.kryo.io.Input.fill(Input.java:140) {code} This is with Spark 1.1, running on a Yarn cluster. The issue seems to be fairly frequent but does not happen on every run. Identify cause of Kryo+Snappy PARSING_ERROR --- Key: SPARK-3630 URL: https://issues.apache.org/jira/browse/SPARK-3630 Project: Spark Issue Type: Task Components: Spark Core Affects Versions: 1.1.0 Reporter: Andrew Ash Assignee: Ankur Dave A recent GraphX commit caused non-deterministic exceptions in unit tests, so it was reverted (see SPARK-3400). Separately, [~aash] observed the same exception stacktrace in an application-specific Kryo registrator: {noformat} com.esotericsoftware.kryo.KryoException: java.io.IOException: failed to uncompress the chunk: PARSING_ERROR(2) com.esotericsoftware.kryo.io.Input.fill(Input.java:142) com.esotericsoftware.kryo.io.Input.require(Input.java:169) com.esotericsoftware.kryo.io.Input.readInt(Input.java:325) com.esotericsoftware.kryo.io.Input.readFloat(Input.java:624) com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:127) com.esotericsoftware.kryo.serializers.DefaultSerializers$FloatSerializer.read(DefaultSerializers.java:117) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:109) com.esotericsoftware.kryo.serializers.CollectionSerializer.read(CollectionSerializer.java:18) com.esotericsoftware.kryo.Kryo.readClassAndObject(Kryo.java:732) ... {noformat} This ticket is to identify the cause of the exception in the GraphX commit so the faulty commit can be fixed and merged back into master. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
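For reference, an application-specific registrator like the one mentioned above is wired up roughly as follows (the class names are placeholders):
{code}
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// Registers application classes with Kryo; referenced via spark.kryo.registrator.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[Array[Float]])
  }
}
// conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// conf.set("spark.kryo.registrator", "com.example.MyRegistrator")
{code}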
[jira] [Updated] (SPARK-3709) org.apache.spark.broadcast.BroadcastSuite.Unpersisting TorrentBroadcast is flaky
[ https://issues.apache.org/jira/browse/SPARK-3709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-3709: --- Assignee: Reynold Xin (was: Cheng Lian) org.apache.spark.broadcast.BroadcastSuite.Unpersisting TorrentBroadcast is flaky Key: SPARK-3709 URL: https://issues.apache.org/jira/browse/SPARK-3709 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Wendell Assignee: Reynold Xin Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3709) org.apache.spark.broadcast.BroadcastSuite.Unpersisting TorrentBroadcast is flaky
[ https://issues.apache.org/jira/browse/SPARK-3709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152377#comment-14152377 ] Apache Spark commented on SPARK-3709: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/2585 org.apache.spark.broadcast.BroadcastSuite.Unpersisting TorrentBroadcast is flaky Key: SPARK-3709 URL: https://issues.apache.org/jira/browse/SPARK-3709 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Wendell Assignee: Reynold Xin Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3732) Yarn Client: Add option to NOT System.exit() at end of main()
[ https://issues.apache.org/jira/browse/SPARK-3732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14152409#comment-14152409 ] Thomas Graves commented on SPARK-3732: -- I understand your use case and the need for it, but at this point I don't think we want to say we support this without properly addressing the bigger picture. The only publicly supported, non-deprecated API is the spark-submit script, which means we won't guarantee backwards compatibility on the Client. Note that we specifically discussed whether the Client should be public on another PR, and it was decided that it isn't officially supported. The only reason the object was left public was backwards compatibility with how you used to start Spark on Yarn with the spark-class script. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org