[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278381#comment-14278381 ] Pedro Rodriguez commented on SPARK-1405: Worked on some preliminary testing results, but ran into a snag which greatly limited what I could test. Used 4 EC2 r3.2xlarge instances, which totals 32 executors with 215GB memory, so fairly close to the tests run by [~gq]. I have been using the data generator I wrote, but had not previously stress tested it for a large number of topics, and it is failing for some number of topics between 100 and 1000. I made some optimizations with mapPartitions, but still run into GC overhead problems. I am unsure if this is my fault or if breeze is generating lots of garbage somewhere when sampling from the Dirichlet/Multinomial distributions. Extra eyes would be great to see if it looks reasonable from a GC perspective: https://github.com/EntilZha/spark/blob/LDA/mllib/src/main/scala/org/apache/spark/mllib/util/LDADataGenerator.scala Alternatively, [~gq], would it be possible to get the data set you used for testing? Or are you using a different dataset for testing, [~josephkb]? It would be useful to have a common way to compare implementation performance. Here are the test results I did get:
Docs: 250,000
Words: 75,000
Tokens: 30,000,000
Iterations: 15
Topics: 100
Alpha=Beta=.01
Setup Time: 20s
Resampling Time: 80s
Updating Counts Time: 53s
Global Counts Time: 4s
Total Time: 170s (2.83m)
This looks like a good improvement over the original numbers (red/blue plot): extrapolating for the fact that this run used 10x fewer iterations, it would be 28m vs ~45m. I am fairly confident that this time will scale well with the number of topics, based on the previous test results I posted. I would be more than happy to run more benchmark tests if I can get access to the data set used for the other tests, or whatever Joseph is using to test his PR.
I am also going to start working on refactoring into Joseph's API, and will open a PR once that is done, probably later this week. It would be great to have both Gibbs and EM for the next release. > parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib > - > > Key: SPARK-1405 > URL: https://issues.apache.org/jira/browse/SPARK-1405 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xusen Yin >Assignee: Guoqiang Li >Priority: Critical > Labels: features > Attachments: performance_comparison.png > > Original Estimate: 336h > Remaining Estimate: 336h > > Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts > topics from a text corpus. Unlike current machine learning algorithms > in MLlib, which use optimization algorithms such as gradient descent, > LDA uses expectation algorithms such as Gibbs sampling. > In this PR, I prepare an LDA implementation based on Gibbs sampling, with a > wholeTextFiles API (solved yet), a word segmentation (imported from Lucene), > and a Gibbs sampling core. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
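For comparing implementations on common data, the LDA generative process itself is straightforward to sketch. The following is a minimal, hypothetical generator and is NOT the LDADataGenerator linked above: for simplicity it fixes alpha = beta = 1, where a symmetric Dirichlet draw reduces to normalized Exp(1) samples (a real generator would draw Gamma(alpha) variates), and all names and sizes are illustrative.

```scala
import scala.util.Random

// Dirichlet(1, ..., 1) draw: normalized Exp(1) samples (uniform on the simplex).
def dirichlet1(k: Int, rng: Random): Array[Double] = {
  val g = Array.fill(k)(-math.log(rng.nextDouble())) // Exp(1) draws
  val s = g.sum
  g.map(_ / s)
}

// Inverse-CDF draw from a discrete distribution p (assumed to sum to 1).
def sampleIndex(p: Array[Double], rng: Random): Int = {
  var u = rng.nextDouble()
  var i = 0
  while (i < p.length - 1 && u >= p(i)) { u -= p(i); i += 1 }
  i
}

val rng = new Random(42L)
val (numTopics, vocabSize, docLength, numDocs) = (10, 1000, 50, 100)
// phi_k: one word distribution per topic
val topics = Array.fill(numTopics)(dirichlet1(vocabSize, rng))
val docs = Array.fill(numDocs) {
  val theta = dirichlet1(numTopics, rng) // per-document topic mixture
  Array.fill(docLength) {
    val z = sampleIndex(theta, rng) // topic assignment for this token
    sampleIndex(topics(z), rng)     // word id drawn from that topic
  }
}
```

Note the generator allocates only small per-document arrays, which is one place to look when hunting the GC overhead mentioned above.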
[jira] [Updated] (SPARK-5186) Vector.equals and Vector.hashCode are very inefficient and fail on SparseVectors with large size
[ https://issues.apache.org/jira/browse/SPARK-5186?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Derrick Burns updated SPARK-5186: - Summary: Vector.equals and Vector.hashCode are very inefficient and fail on SparseVectors with large size (was: Vector.equals and Vector.hashCode are very inefficient) > Vector.equals and Vector.hashCode are very inefficient and fail on > SparseVectors with large size > - > > Key: SPARK-5186 > URL: https://issues.apache.org/jira/browse/SPARK-5186 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Derrick Burns > Original Estimate: 0.25h > Remaining Estimate: 0.25h > > The implementations of Vector.equals and Vector.hashCode are correct but slow > for SparseVectors that are truly sparse.
[jira] [Commented] (SPARK-5186) Vector.equals and Vector.hashCode are very inefficient
[ https://issues.apache.org/jira/browse/SPARK-5186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278374#comment-14278374 ] Derrick Burns commented on SPARK-5186: -- The aforementioned pull request does fix part of the problem; however, hashCode is broken, not just inefficient. If one calls hashCode on a SparseVector with a large size, one will get an out-of-memory error. > Vector.equals and Vector.hashCode are very inefficient > --- > > Key: SPARK-5186 > URL: https://issues.apache.org/jira/browse/SPARK-5186 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Derrick Burns > Original Estimate: 0.25h > Remaining Estimate: 0.25h > > The implementations of Vector.equals and Vector.hashCode are correct but slow > for SparseVectors that are truly sparse.
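One direction for such a fix, sketched here with hypothetical names (this is a simplified stand-in, not the MLlib Vector API or the actual PR): restrict both operations to the stored non-zero entries, so the cost is O(nnz) and no dense structure of length `size` is ever materialized.

```scala
// Hypothetical sparse vector whose equals/hashCode touch only non-zeros.
// Assumes `indices` is strictly increasing and `values` holds no explicit zeros.
class SV(val size: Int, val indices: Array[Int], val values: Array[Double]) {
  override def equals(other: Any): Boolean = other match {
    case o: SV =>
      size == o.size &&
        java.util.Arrays.equals(indices, o.indices) &&
        java.util.Arrays.equals(values, o.values)
    case _ => false
  }
  // O(nnz) and allocation-free: never expands to a dense array of `size` doubles,
  // so a huge `size` cannot trigger an out-of-memory error here.
  override def hashCode(): Int = {
    var h = size
    var i = 0
    while (i < indices.length) {
      h = 31 * h + indices(i)
      val bits = java.lang.Double.doubleToLongBits(values(i))
      h = 31 * h + (bits ^ (bits >>> 32)).toInt
      i += 1
    }
    h
  }
}
```

A real Vector.equals has the extra wrinkle that a SparseVector and a DenseVector with identical contents must compare equal (and explicitly stored zeros must not break the contract); this sketch ignores that.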
[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278305#comment-14278305 ] Muhammad-Ali A'rabi commented on SPARK-5226: Yeah, of course. It will take me a day or two, so I commented to say that I'm working on it. > Add DBSCAN Clustering Algorithm to MLlib > > > Key: SPARK-5226 > URL: https://issues.apache.org/jira/browse/SPARK-5226 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Muhammad-Ali A'rabi >Priority: Minor > Labels: DBSCAN > > MLlib is all k-means now, and I think we should add some new clustering > algorithms to it. The first candidate, I think, is DBSCAN.
[jira] [Created] (SPARK-5261) In some cases, the value of a word's vector representation is too big
Guoqiang Li created SPARK-5261: -- Summary: In some cases, the value of a word's vector representation is too big Key: SPARK-5261 URL: https://issues.apache.org/jira/browse/SPARK-5261 Project: Spark Issue Type: Bug Components: MLlib Affects Versions: 1.2.0 Reporter: Guoqiang Li
{code}
val word2Vec = new Word2Vec()
word2Vec.
  setVectorSize(100).
  setSeed(42L).
  setNumIterations(5).
  setNumPartitions(36)
{code}
The average absolute value of the word's vector representation is 60731.8.
{code}
val word2Vec = new Word2Vec()
word2Vec.
  setVectorSize(100).
  setSeed(42L).
  setNumIterations(5).
  setNumPartitions(1)
{code}
The average absolute value of the word's vector representation is 0.13889.
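The statistic quoted above can be reproduced from a trained model's word-to-vector map (in MLlib, Word2VecModel.getVectors returns the word-to-embedding map). The helper below is a sketch operating on a plain Map so it stands alone; the name `avgAbs` is hypothetical.

```scala
// Average absolute component value across all word vectors.
// `vectors` stands in for Word2VecModel.getVectors (word -> embedding).
def avgAbs(vectors: Map[String, Array[Float]]): Double = {
  var sum = 0.0
  var n = 0L
  for (vec <- vectors.valuesIterator; v <- vec) {
    sum += math.abs(v)
    n += 1
  }
  if (n == 0) 0.0 else sum / n
}
```

Comparing this value across runs that differ only in setNumPartitions is what exposes the blow-up reported here.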
[jira] [Commented] (SPARK-5193) Make Spark SQL API usable in Java and remove the Java-specific API
[ https://issues.apache.org/jira/browse/SPARK-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278249#comment-14278249 ] Apache Spark commented on SPARK-5193: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/4056 > Make Spark SQL API usable in Java and remove the Java-specific API > -- > > Key: SPARK-5193 > URL: https://issues.apache.org/jira/browse/SPARK-5193 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > The Java version of the SchemaRDD API causes a high maintenance burden for Spark > SQL itself and downstream libraries (e.g. the MLlib pipeline API needs to support > both JavaSchemaRDD and SchemaRDD). We can audit the Scala API and make it > usable for Java, and then we can remove the Java-specific version. > Things to remove include (Java versions of): > - data type > - Row > - SQLContext > - HiveContext > Things to consider: > - Scala and Java have different collection libraries. > - Scala and Java (8) have different closure interfaces. > - Scala and Java can have duplicate definitions of common classes, such as > BigDecimal.
[jira] [Updated] (SPARK-5247) Enable javadoc/scaladoc for public classes in catalyst project
[ https://issues.apache.org/jira/browse/SPARK-5247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5247: --- Assignee: Michael Armbrust > Enable javadoc/scaladoc for public classes in catalyst project > -- > > Key: SPARK-5247 > URL: https://issues.apache.org/jira/browse/SPARK-5247 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Michael Armbrust > > We previously did not generate any docs for the entire catalyst project. > Since we are now defining public APIs in that project (under org.apache.spark.sql > but outside of org.apache.spark.sql.catalyst, such as Row and types._), we should > start generating javadoc/scaladoc for those.
[jira] [Updated] (SPARK-5260) Expose JsonRDD.allKeysWithValueTypes() in a utility class
[ https://issues.apache.org/jira/browse/SPARK-5260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Corey J. Nolet updated SPARK-5260: -- Description: I have found this method extremely useful when implementing my own strategy for inferring a schema from parsed JSON. For now, I've actually copied the method right out of the JsonRDD class into my own project, but I think it would be immensely useful to keep the code in Spark and expose it publicly somewhere else, like an object called JsonSchema. (was: I have found this method extremely useful when implementing my own method for inferring a schema from parsed JSON. For now, I've actually copied the method right out of the JsonRDD class into my own project, but I think it would be immensely useful to keep the code in Spark and expose it publicly somewhere else, like an object called JsonSchema.) > Expose JsonRDD.allKeysWithValueTypes() in a utility class > -- > > Key: SPARK-5260 > URL: https://issues.apache.org/jira/browse/SPARK-5260 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Corey J. Nolet > Fix For: 1.3.0 > > > I have found this method extremely useful when implementing my own strategy > for inferring a schema from parsed JSON. For now, I've actually copied the > method right out of the JsonRDD class into my own project, but I think it > would be immensely useful to keep the code in Spark and expose it publicly > somewhere else, like an object called JsonSchema.
[jira] [Created] (SPARK-5260) Expose JsonRDD.allKeysWithValueTypes() in a utility class
Corey J. Nolet created SPARK-5260: - Summary: Expose JsonRDD.allKeysWithValueTypes() in a utility class Key: SPARK-5260 URL: https://issues.apache.org/jira/browse/SPARK-5260 Project: Spark Issue Type: Improvement Components: SQL Reporter: Corey J. Nolet Fix For: 1.3.0 I have found this method extremely useful when implementing my own method for inferring a schema from parsed JSON. For now, I've actually copied the method right out of the JsonRDD class into my own project, but I think it would be immensely useful to keep the code in Spark and expose it publicly somewhere else, like an object called JsonSchema.
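The kind of inference being described can be sketched independently of JsonRDD. The helper below is hypothetical and heavily simplified: it handles only flat records, and reports plain JVM class names rather than the Spark SQL DataTypes the real allKeysWithValueTypes() produces.

```scala
// Collect, for each key appearing in any record, the set of value types
// observed for it -- the raw material for inferring a schema from JSON.
def keyTypes(records: Seq[Map[String, Any]]): Map[String, Set[String]] =
  records
    .flatMap(_.toSeq) // all (key, value) pairs across records
    .groupBy(_._1)    // group observations by key
    .map { case (k, kvs) =>
      k -> kvs.map {
        case (_, null) => "Null"
        case (_, v)    => v.getClass.getSimpleName
      }.toSet
    }
```

A key mapped to more than one type (e.g. Integer in one record, Double in another) is exactly the case a schema-inference strategy has to resolve, typically by widening to the least common type.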
[jira] [Comment Edited] (SPARK-4894) Add Bernoulli-variant of Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278228#comment-14278228 ] RJ Nowling edited comment on SPARK-4894 at 1/15/15 4:21 AM: [~josephkb], after some thought, I've come around and think your idea of 1 NB class with a Factor type parameter may be the more maintainable choice as well as offering some novel functionality. But there seems to be a lot to figure out (we should be checking the decision tree implementation, for example) and I don't want to hold up what should be a relatively simple change to support Bernoulli NB. Can we create a new JIRA to discuss the NB refactoring? Comments about refactoring: (1) how often is NB used with continuous values? I see that sklearn supports Gaussian NB, but is this used in practice? My understanding is that NB is generally used for text classification with counts or binary values, possibly weighted by TF-IDF. We should probably email the users and dev lists to get user feedback. If no one is asking for it, we should shelve it and focus on other things. (2) after some more reflection, I can see a few more benefits to your suggestion of feature types (e.g., categorical, discrete counts, continuous, binary, etc.). If we created corresponding FeatureLikelihood types (e.g., Bernoulli, Multinomial, Gaussian, etc.), it would promote composition, which would be easier to test, debug, and maintain versus multiple NB subclasses like sklearn. Additionally, if the user can define a type for each feature, then users can mix and match likelihood types as well. Most NB implementations treat all features the same -- what if we had a model that allowed heterogeneous features? If it works well in NB, it could be extended to other parts of MLlib. (There is likely some overlap with decision trees since they support multiple feature types, so we might want to see if there is anything there we can reuse.) 
At the API level, we could provide a basic API which takes {noformat}RDD[Vector[Double]]{noformat} like the current API so that simplicity isn't compromised and provide a more advanced API for power users. was (Author: rnowling): [~josephkb], after some thought, I've come around and think your idea of 1 NB class with a Factor type parameter may be the more maintainable choice as well as offering some novel functionality. But, there seems to be a lot to figure out (we should be checking the decision tree implementation for example) and I don't want to hold up what should be a relatively simple change to support Bernoulli NB. What do you think? Comments about refactoring: (1) how often is NB used with continuous values? I see that sklearn supports Gaussian NB but is this used in practice? My understanding is that NB is generally used for text classification with counts or binary values, possibly weighted by TF-IDF. We should probably email the users and dev lists to get user feedback. If no one is asking for it, we should shelve it and focus on other things. (2) after some more reflection, I can see a few more benefits to your suggestions of feature types (e.g., categorical, discrete counts, continuous, binary, etc.). If we created corresponding FeatureLikelihood types (e.g., Bernoulli, Multinomial, Gaussian, etc.), it would promote composition which would be easier to test, debug, and maintain versus multiple NB subclasses like sklearn. Additionally, if the user can define a type for each feature, then users can mix and match likelihood types as well. Most NB implementations treat all features the same -- what if we had a model that allowed heterogeneous features? If it works well in NB, it could be extended to other parts of MLlib. (There is likely some overlap with decision trees since they support multiple feature types, so we might want to see if there is anything there we can reuse.) 
At the API level, we could provide a basic API which takes {noformat}RDD[Vector[Double]]{noformat} like the current API so that simplicity isn't compromised and provide a more advanced API for power users. > Add Bernoulli-variant of Naive Bayes > > > Key: SPARK-4894 > URL: https://issues.apache.org/jira/browse/SPARK-4894 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.2.0 >Reporter: RJ Nowling >Assignee: RJ Nowling > > MLlib only supports the multinomial-variant of Naive Bayes. The Bernoulli > version of Naive Bayes is more useful for situations where the features are > binary values.
[jira] [Commented] (SPARK-4894) Add Bernoulli-variant of Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278228#comment-14278228 ] RJ Nowling commented on SPARK-4894: --- [~josephkb], after some thought, I've come around and think your idea of 1 NB class with a Factor type parameter may be the more maintainable choice as well as offering some novel functionality. But there seems to be a lot to figure out (we should be checking the decision tree implementation, for example) and I don't want to hold up what should be a relatively simple change to support Bernoulli NB. What do you think? Comments about refactoring: (1) how often is NB used with continuous values? I see that sklearn supports Gaussian NB, but is this used in practice? My understanding is that NB is generally used for text classification with counts or binary values, possibly weighted by TF-IDF. We should probably email the users and dev lists to get user feedback. If no one is asking for it, we should shelve it and focus on other things. (2) after some more reflection, I can see a few more benefits to your suggestion of feature types (e.g., categorical, discrete counts, continuous, binary, etc.). If we created corresponding FeatureLikelihood types (e.g., Bernoulli, Multinomial, Gaussian, etc.), it would promote composition, which would be easier to test, debug, and maintain versus multiple NB subclasses like sklearn. Additionally, if the user can define a type for each feature, then users can mix and match likelihood types as well. Most NB implementations treat all features the same -- what if we had a model that allowed heterogeneous features? If it works well in NB, it could be extended to other parts of MLlib. (There is likely some overlap with decision trees since they support multiple feature types, so we might want to see if there is anything there we can reuse.) 
At the API level, we could provide a basic API which takes {noformat}RDD[Vector[Double]]{noformat} like the current API so that simplicity isn't compromised and provide a more advanced API for power users. > Add Bernoulli-variant of Naive Bayes > > > Key: SPARK-4894 > URL: https://issues.apache.org/jira/browse/SPARK-4894 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.2.0 >Reporter: RJ Nowling >Assignee: RJ Nowling > > MLlib only supports the multinomial-variant of Naive Bayes. The Bernoulli > version of Naive Bayes is more useful for situations where the features are > binary values.
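The two variants discussed above differ only in the per-class log-likelihood term. A minimal sketch (hypothetical helper names; smoothing and class priors omitted), where `logTheta(j)` is log P(feature j | class) and, for Bernoulli, `logThetaNeg(j)` is log(1 - P(feature j | class)):

```scala
// Multinomial NB: counts weight the per-word log-probabilities.
def multinomialLL(x: Array[Double], logTheta: Array[Double]): Double =
  x.indices.map(j => x(j) * logTheta(j)).sum

// Bernoulli NB: every feature contributes, present or not.
def bernoulliLL(x: Array[Double], logTheta: Array[Double],
                logThetaNeg: Array[Double]): Double =
  x.indices.map { j =>
    if (x(j) > 0) logTheta(j) // presence contributes log p_j
    else logThetaNeg(j)       // absence contributes log(1 - p_j)
  }.sum
```

The modeling point behind the issue: in the Bernoulli variant, absent features contribute log(1 - p), so zero entries carry information, which is why it suits binary features better than the multinomial variant does.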
[jira] [Updated] (SPARK-5259) Add task equal() and hashcode() to avoid stage.pendingTasks not accurate while stage was retry
[ https://issues.apache.org/jira/browse/SPARK-5259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SuYan updated SPARK-5259: - Description: 1. When a shuffle stage is retried, there may be 2 TaskSets running; call them taskSet0.0 and taskSet0.1. taskSet0.1 will re-run taskSet0.0's incomplete tasks. If taskSet0.0 has run all the tasks that taskSet0.1 has not yet completed, but which cover the partitions, then the stage's isAvailable is true.
{code}
def isAvailable: Boolean = {
  if (!isShuffleMap) {
    true
  } else {
    numAvailableOutputs == numPartitions
  }
}
{code}
But stage.pendingTasks is not empty, which prevents registering the mapStatus in mapOutputTracker: when a task completes successfully, pendingTasks removes the Task by reference (because Task does not override hashCode() and equals()), i.e. pendingTasks -= task, while numAvailableOutputs is counted by partition ID. Here is the test case to prove it:
{code}
test("Make sure mapStage.pendingtasks is set() " +
  "while MapStage.isAvailable is true while stage was retry ") {
  val firstRDD = new MyRDD(sc, 6, Nil)
  val firstShuffleDep = new ShuffleDependency(firstRDD, null)
  val firstShuyffleId = firstShuffleDep.shuffleId
  val shuffleMapRdd = new MyRDD(sc, 6, List(firstShuffleDep))
  val shuffleDep = new ShuffleDependency(shuffleMapRdd, null)
  val shuffleId = shuffleDep.shuffleId
  val reduceRdd = new MyRDD(sc, 2, List(shuffleDep))
  submit(reduceRdd, Array(0, 1))
  complete(taskSets(0), Seq(
    (Success, makeMapStatus("hostB", 1)),
    (Success, makeMapStatus("hostB", 2)),
    (Success, makeMapStatus("hostC", 3)),
    (Success, makeMapStatus("hostB", 4)),
    (Success, makeMapStatus("hostB", 5)),
    (Success, makeMapStatus("hostC", 6))))
  complete(taskSets(1), Seq(
    (Success, makeMapStatus("hostA", 1)),
    (Success, makeMapStatus("hostB", 2)),
    (Success, makeMapStatus("hostA", 1)),
    (Success, makeMapStatus("hostB", 2)),
    (Success, makeMapStatus("hostA", 1))))
  runEvent(ExecutorLost("exec-hostA"))
  runEvent(CompletionEvent(taskSets(1).tasks(0), Resubmitted, null, null, null, null))
  runEvent(CompletionEvent(taskSets(1).tasks(2), Resubmitted, null, null, null, null))
  runEvent(CompletionEvent(taskSets(1).tasks(0),
    FetchFailed(null, firstShuyffleId, -1, 0, "Fetch Mata data failed"),
    null, null, null, null))
  scheduler.resubmitFailedStages()
  runEvent(CompletionEvent(taskSets(1).tasks(0), Success, makeMapStatus("hostC", 1), null, null, null))
  runEvent(CompletionEvent(taskSets(1).tasks(2), Success, makeMapStatus("hostC", 1), null, null, null))
  runEvent(CompletionEvent(taskSets(1).tasks(4), Success, makeMapStatus("hostC", 1), null, null, null))
  runEvent(CompletionEvent(taskSets(1).tasks(5), Success, makeMapStatus("hostB", 2), null, null, null))
  val stage = scheduler.stageIdToStage(taskSets(1).stageId)
  assert(stage.attemptId == 2)
  assert(stage.isAvailable)
  assert(stage.pendingTasks.size == 0)
}
{code}
was: 1. When a shuffle stage is retried, there may be 2 TaskSets running; call them taskSet0.0 and taskSet0.1. taskSet0.1 will re-run taskSet0.0's incomplete tasks. If taskSet0.0 has run all the tasks that taskSet0.1 has not yet completed, then the stage's isAvailable is true.
{code}
def isAvailable: Boolean = {
  if (!isShuffleMap) {
    true
  } else {
    numAvailableOutputs == numPartitions
  }
}
{code}
But stage.pendingTasks is not empty, which prevents registering the mapStatus in mapOutputTracker: when a task completes successfully, pendingTasks removes the Task by reference (because Task does not override hashCode() and equals()), i.e. pendingTasks -= task, while numAvailableOutputs is counted by partition ID. Here is the test case to prove it:
{code}
test("Make sure mapStage.pendingtasks is set() " +
  "while MapStage.isAvailable is true while stage was retry ") {
  val firstRDD = new MyRDD(sc, 6, Nil)
  val firstShuffleDep = new ShuffleDependency(firstRDD, null)
  val firstShuyffleId = firstShuffleDep.shuffleId
  val shuffleMapRdd = new MyRDD(sc, 6, List(firstShuffleDep))
  val shuffleDep = new ShuffleDependency(shuffleMapRdd, null)
  val shuffleId = shuffleDep.shuffleId
  val reduceRdd = new MyRDD(sc, 2, List(shuffleDep))
  submit(reduceRdd, Array(0, 1))
  complete(taskSets(0), Seq(
    (Success, makeMapStatus("hostB", 1)),
    (Success, makeMapStatus("hostB", 2)),
    (Success, makeMapStatus("hostC", 3)),
    (Success, makeMapStatus("hostB", 4)),
    (Success, makeMapStatus("hostB", 5)),
    (Success, makeMapStatus("hostC", 6))))
  complete(taskSets(1), Seq(
    (Success, makeMapStatus("hostA", 1)),
    (Success, makeMapStatus("hostB", 2)),
    (Success, makeMapStatus("hostA", 1)),
    (Success, makeMapStatus("hostB", 2)),
    (Success,
{code}
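The fix named in the issue title, value equality keyed on the partition rather than on object identity, can be sketched as follows. This is a hypothetical stand-in, not Spark's actual Task class.

```scala
// A task identified by (stageId, partitionId) rather than by reference, so
// that any attempt for the same partition compares equal.
class MyTask(val stageId: Int, val partitionId: Int) {
  override def equals(other: Any): Boolean = other match {
    case t: MyTask => stageId == t.stageId && partitionId == t.partitionId
    case _         => false
  }
  override def hashCode(): Int = 31 * stageId + partitionId
}

// A resubmitted attempt is a different object for the same partition; with
// value equality, `pending -= attempt` still removes the original entry,
// keeping pendingTasks consistent with the partition-based numAvailableOutputs.
val pending = scala.collection.mutable.Set(new MyTask(1, 0), new MyTask(1, 1))
pending -= new MyTask(1, 0)
```

With reference equality (the behavior described in the issue), the `-=` above would remove nothing and the set would stay at size 2.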
[jira] [Updated] (SPARK-5259) Add task equal() and hashcode() to avoid stage.pendingTasks not accurate while stage was retry
[ https://issues.apache.org/jira/browse/SPARK-5259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SuYan updated SPARK-5259: - Description: 1. while shuffle stage was retry, there may have 2 taskSet running. we call the 2 taskSet:taskSet0.0, taskSet0.1, and we know, taskSet0.1 will re-run taskSet0.0's un-complete task if taskSet0.0 was run all the task that the taskSet0.1 not complete yet. then stage is Available is true. {code} def isAvailable: Boolean = { if (!isShuffleMap) { true } else { numAvailableOutputs == numPartitions } } {code} but stage.pending task is not empty, to protect register mapStatus in mapOutputTracker. because if task is complete success, pendingTasks is minus Task in reference-level because the task is not override hashcode() and equals() pendingTask -= task but numAvailableOutputs is according to partitionID. here is the testcase to prove: {code} test("Make sure mapStage.pendingtasks is set() " + "while MapStage.isAvailable is true while stage was retry ") { val firstRDD = new MyRDD(sc, 6, Nil) val firstShuffleDep = new ShuffleDependency(firstRDD, null) val firstShuyffleId = firstShuffleDep.shuffleId val shuffleMapRdd = new MyRDD(sc, 6, List(firstShuffleDep)) val shuffleDep = new ShuffleDependency(shuffleMapRdd, null) val shuffleId = shuffleDep.shuffleId val reduceRdd = new MyRDD(sc, 2, List(shuffleDep)) submit(reduceRdd, Array(0, 1)) complete(taskSets(0), Seq( (Success, makeMapStatus("hostB", 1)), (Success, makeMapStatus("hostB", 2)), (Success, makeMapStatus("hostC", 3)), (Success, makeMapStatus("hostB", 4)), (Success, makeMapStatus("hostB", 5)), (Success, makeMapStatus("hostC", 6)) )) complete(taskSets(1), Seq( (Success, makeMapStatus("hostA", 1)), (Success, makeMapStatus("hostB", 2)), (Success, makeMapStatus("hostA", 1)), (Success, makeMapStatus("hostB", 2)), (Success, makeMapStatus("hostA", 1)) )) runEvent(ExecutorLost("exec-hostA")) runEvent(CompletionEvent(taskSets(1).tasks(0), Resubmitted, null, null, null, null)) 
runEvent(CompletionEvent(taskSets(1).tasks(2), Resubmitted, null, null, null, null)) runEvent(CompletionEvent(taskSets(1).tasks(0), FetchFailed(null, firstShuyffleId, -1, 0, "Fetch Mata data failed"), null, null, null, null)) scheduler.resubmitFailedStages() runEvent(CompletionEvent(taskSets(1).tasks(0), Success, makeMapStatus("hostC", 1), null, null, null)) runEvent(CompletionEvent(taskSets(1).tasks(2), Success, makeMapStatus("hostC", 1), null, null, null)) runEvent(CompletionEvent(taskSets(1).tasks(4), Success, makeMapStatus("hostC", 1), null, null, null)) runEvent(CompletionEvent(taskSets(1).tasks(5), Success, makeMapStatus("hostB", 2), null, null, null)) val stage = scheduler.stageIdToStage(taskSets(1).stageId) assert(stage.attemptId == 2) assert(stage.isAvailable) assert(stage.pendingTasks.size == 0) } {code} was: 1. while shuffle stage was retry, there may have 2 taskSet running. we call the 2 taskSet:taskSet0.0, taskSet0.1, and we know, taskSet0.1 will re-run taskSet0.0's un-complete task if taskSet0.0 was run all the task that the taskSet0.1 not complete yet. then stage is Available is true. def isAvailable: Boolean = { if (!isShuffleMap) { true } else { numAvailableOutputs == numPartitions } } but stage.pending task is not empty, to protect register mapStatus in mapOutputTracker. because if task is complete success, pendingTasks is minus Task in reference-level because the task is not override hashcode() and equals() pendingTask -= task but numAvailableOutputs is according to partitionID. 
here is the testcase to prove: {code} test("Make sure mapStage.pendingtasks is set() " + "while MapStage.isAvailable is true while stage was retry ") { val firstRDD = new MyRDD(sc, 6, Nil) val firstShuffleDep = new ShuffleDependency(firstRDD, null) val firstShuyffleId = firstShuffleDep.shuffleId val shuffleMapRdd = new MyRDD(sc, 6, List(firstShuffleDep)) val shuffleDep = new ShuffleDependency(shuffleMapRdd, null) val shuffleId = shuffleDep.shuffleId val reduceRdd = new MyRDD(sc, 2, List(shuffleDep)) submit(reduceRdd, Array(0, 1)) complete(taskSets(0), Seq( (Success, makeMapStatus("hostB", 1)), (Success, makeMapStatus("hostB", 2)), (Success, makeMapStatus("hostC", 3)), (Success, makeMapStatus("hostB", 4)), (Success, makeMapStatus("hostB", 5)), (Success, makeMapStatus("hostC", 6)) )) complete(taskSets(1), Seq( (Success, makeMapStatus("hostA", 1)), (Success, makeMapStatus("hostB", 2)), (Success, makeMapStatus("hostA", 1)), (Success, makeMapStatus("hostB", 2)), (Success, makeMapStatus("hostA", 1)) )) ru
[jira] [Updated] (SPARK-5259) Add task equal() and hashcode() to avoid stage.pendingTasks not accurate while stage was retry
[ https://issues.apache.org/jira/browse/SPARK-5259?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] SuYan updated SPARK-5259: - Description: 1. When a shuffle map stage is retried, two task sets may be running at the same time. Call them taskSet0.0 and taskSet0.1; taskSet0.1 re-runs the tasks of taskSet0.0 that have not completed. If taskSet0.0 finishes all the tasks that taskSet0.1 has not yet completed, the stage reports itself as available:
{code}
def isAvailable: Boolean = {
  if (!isShuffleMap) {
    true
  } else {
    numAvailableOutputs == numPartitions
  }
}
{code}
However, stage.pendingTasks is still non-empty, which prevents the map statuses from being registered in mapOutputTracker. The cause is that when a task completes successfully, pendingTasks removes the Task by object reference, because Task does not override hashCode() and equals():
{code}
pendingTasks -= task
{code}
whereas numAvailableOutputs is counted by partition ID, so the two sets of bookkeeping can disagree. Here is a test case that reproduces the problem:
{code}
test("Make sure mapStage.pendingTasks is empty " +
    "while MapStage.isAvailable is true after the stage was retried") {
  val firstRDD = new MyRDD(sc, 6, Nil)
  val firstShuffleDep = new ShuffleDependency(firstRDD, null)
  val firstShuffleId = firstShuffleDep.shuffleId
  val shuffleMapRdd = new MyRDD(sc, 6, List(firstShuffleDep))
  val shuffleDep = new ShuffleDependency(shuffleMapRdd, null)
  val shuffleId = shuffleDep.shuffleId
  val reduceRdd = new MyRDD(sc, 2, List(shuffleDep))
  submit(reduceRdd, Array(0, 1))
  complete(taskSets(0), Seq(
    (Success, makeMapStatus("hostB", 1)),
    (Success, makeMapStatus("hostB", 2)),
    (Success, makeMapStatus("hostC", 3)),
    (Success, makeMapStatus("hostB", 4)),
    (Success, makeMapStatus("hostB", 5)),
    (Success, makeMapStatus("hostC", 6))
  ))
  complete(taskSets(1), Seq(
    (Success, makeMapStatus("hostA", 1)),
    (Success, makeMapStatus("hostB", 2)),
    (Success, makeMapStatus("hostA", 1)),
    (Success, makeMapStatus("hostB", 2)),
    (Success, makeMapStatus("hostA", 1))
  ))
  runEvent(ExecutorLost("exec-hostA"))
  runEvent(CompletionEvent(taskSets(1).tasks(0), Resubmitted, null, null, null, null))
  runEvent(CompletionEvent(taskSets(1).tasks(2), Resubmitted, null, null, null, null))
  runEvent(CompletionEvent(taskSets(1).tasks(0),
    FetchFailed(null, firstShuffleId, -1, 0, "Fetch metadata failed"),
    null, null, null, null))
  scheduler.resubmitFailedStages()
  runEvent(CompletionEvent(taskSets(1).tasks(0), Success, makeMapStatus("hostC", 1), null, null, null))
  runEvent(CompletionEvent(taskSets(1).tasks(2), Success, makeMapStatus("hostC", 1), null, null, null))
  runEvent(CompletionEvent(taskSets(1).tasks(4), Success, makeMapStatus("hostC", 1), null, null, null))
  runEvent(CompletionEvent(taskSets(1).tasks(5), Success, makeMapStatus("hostB", 2), null, null, null))
  val stage = scheduler.stageIdToStage(taskSets(1).stageId)
  assert(stage.attemptId == 2)
  assert(stage.isAvailable)
  assert(stage.pendingTasks.size == 0)
}
{code}
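The fix the issue title proposes, value-based equality on Task, can be sketched as follows. This is a hypothetical illustration keyed on (stageId, partitionId), not the actual patch in the pull request; the real Task class carries much more state.

```scala
// Hypothetical sketch of the proposed fix: key Task equality on
// (stageId, partitionId) so that `pendingTasks -= task` removes the
// logical task even when the retried attempt is a different object.
// This stub Task is illustrative only; Spark's Task class has more state.
class Task(val stageId: Int, val partitionId: Int) {
  override def equals(other: Any): Boolean = other match {
    case t: Task => t.stageId == stageId && t.partitionId == partitionId
    case _       => false
  }
  override def hashCode(): Int = 31 * stageId + partitionId
}

val pendingTasks = scala.collection.mutable.HashSet(new Task(0, 0), new Task(0, 1))
// A retried attempt of partition 1 is a distinct object but compares equal,
// so the remove succeeds where reference equality would have failed:
pendingTasks -= new Task(0, 1)
println(pendingTasks.size) // 1
```

With the default reference equality, the same `-=` would silently leave the stale task in pendingTasks, which is exactly the mismatch the test above exposes.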
[jira] [Commented] (SPARK-5259) Add Task equals() and hashCode() to keep stage.pendingTasks accurate when a stage is retried
[ https://issues.apache.org/jira/browse/SPARK-5259?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278222#comment-14278222 ] Apache Spark commented on SPARK-5259: - User 'suyanNone' has created a pull request for this issue: https://github.com/apache/spark/pull/4055
> Add Task equals() and hashCode() to keep stage.pendingTasks accurate
> when a stage is retried
> ---
>
> Key: SPARK-5259
> URL: https://issues.apache.org/jira/browse/SPARK-5259
> Project: Spark
> Issue Type: Bug
>Affects Versions: 1.1.1, 1.2.0
>Reporter: SuYan
> Fix For: 1.2.0
>
>
> 1. When a shuffle map stage is retried, two task sets may be running at the
> same time. Call them taskSet0.0 and taskSet0.1; taskSet0.1 re-runs the tasks
> of taskSet0.0 that have not completed. If taskSet0.0 finishes all the tasks
> that taskSet0.1 has not yet completed, the stage reports itself as available:
> def isAvailable: Boolean = {
>   if (!isShuffleMap) {
>     true
>   } else {
>     numAvailableOutputs == numPartitions
>   }
> }
> However, stage.pendingTasks is still non-empty, which prevents the map
> statuses from being registered in mapOutputTracker. The cause is that when a
> task completes successfully, pendingTasks removes the Task by object
> reference, because Task does not override hashCode() and equals():
> pendingTasks -= task
> whereas numAvailableOutputs is counted by partition ID, so the two sets of
> bookkeeping can disagree.
> Here is a test case that reproduces the problem:
> test("Make sure mapStage.pendingTasks is empty " +
>     "while MapStage.isAvailable is true after the stage was retried") {
>   val firstRDD = new MyRDD(sc, 6, Nil)
>   val firstShuffleDep = new ShuffleDependency(firstRDD, null)
>   val firstShuffleId = firstShuffleDep.shuffleId
>   val shuffleMapRdd = new MyRDD(sc, 6, List(firstShuffleDep))
>   val shuffleDep = new ShuffleDependency(shuffleMapRdd, null)
>   val shuffleId = shuffleDep.shuffleId
>   val reduceRdd = new MyRDD(sc, 2, List(shuffleDep))
>   submit(reduceRdd, Array(0, 1))
>   complete(taskSets(0), Seq(
>     (Success, makeMapStatus("hostB", 1)),
>     (Success, makeMapStatus("hostB", 2)),
>     (Success, makeMapStatus("hostC", 3)),
>     (Success, makeMapStatus("hostB", 4)),
>     (Success, makeMapStatus("hostB", 5)),
>     (Success, makeMapStatus("hostC", 6))
>   ))
>   complete(taskSets(1), Seq(
>     (Success, makeMapStatus("hostA", 1)),
>     (Success, makeMapStatus("hostB", 2)),
>     (Success, makeMapStatus("hostA", 1)),
>     (Success, makeMapStatus("hostB", 2)),
>     (Success, makeMapStatus("hostA", 1))
>   ))
>   runEvent(ExecutorLost("exec-hostA"))
>   runEvent(CompletionEvent(taskSets(1).tasks(0), Resubmitted, null, null, null, null))
>   runEvent(CompletionEvent(taskSets(1).tasks(2), Resubmitted, null, null, null, null))
>   runEvent(CompletionEvent(taskSets(1).tasks(0),
>     FetchFailed(null, firstShuffleId, -1, 0, "Fetch metadata failed"),
>     null, null, null, null))
>   scheduler.resubmitFailedStages()
>   runEvent(CompletionEvent(taskSets(1).tasks(0), Success, makeMapStatus("hostC", 1), null, null, null))
>   runEvent(CompletionEvent(taskSets(1).tasks(2), Success, makeMapStatus("hostC", 1), null, null, null))
>   runEvent(CompletionEvent(taskSets(1).tasks(4), Success, makeMapStatus("hostC", 1), null, null, null))
>   runEvent(CompletionEvent(taskSets(1).tasks(5), Success, makeMapStatus("hostB", 2), null, null, null))
>   val stage = scheduler.stageIdToStage(taskSets(1).stageId)
>   assert(stage.attemptId == 2)
>   assert(stage.isAvailable)
>   assert(stage.pendingTasks.size == 0)
> }
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5259) Add Task equals() and hashCode() to keep stage.pendingTasks accurate when a stage is retried
SuYan created SPARK-5259: Summary: Add Task equals() and hashCode() to keep stage.pendingTasks accurate when a stage is retried Key: SPARK-5259 URL: https://issues.apache.org/jira/browse/SPARK-5259 Project: Spark Issue Type: Bug Affects Versions: 1.2.0, 1.1.1 Reporter: SuYan Fix For: 1.2.0
1. When a shuffle map stage is retried, two task sets may be running at the same time. Call them taskSet0.0 and taskSet0.1; taskSet0.1 re-runs the tasks of taskSet0.0 that have not completed. If taskSet0.0 finishes all the tasks that taskSet0.1 has not yet completed, the stage reports itself as available:
def isAvailable: Boolean = {
  if (!isShuffleMap) {
    true
  } else {
    numAvailableOutputs == numPartitions
  }
}
However, stage.pendingTasks is still non-empty, which prevents the map statuses from being registered in mapOutputTracker. The cause is that when a task completes successfully, pendingTasks removes the Task by object reference, because Task does not override hashCode() and equals():
pendingTasks -= task
whereas numAvailableOutputs is counted by partition ID, so the two sets of bookkeeping can disagree. Here is a test case that reproduces the problem:
test("Make sure mapStage.pendingTasks is empty " +
    "while MapStage.isAvailable is true after the stage was retried") {
  val firstRDD = new MyRDD(sc, 6, Nil)
  val firstShuffleDep = new ShuffleDependency(firstRDD, null)
  val firstShuffleId = firstShuffleDep.shuffleId
  val shuffleMapRdd = new MyRDD(sc, 6, List(firstShuffleDep))
  val shuffleDep = new ShuffleDependency(shuffleMapRdd, null)
  val shuffleId = shuffleDep.shuffleId
  val reduceRdd = new MyRDD(sc, 2, List(shuffleDep))
  submit(reduceRdd, Array(0, 1))
  complete(taskSets(0), Seq(
    (Success, makeMapStatus("hostB", 1)),
    (Success, makeMapStatus("hostB", 2)),
    (Success, makeMapStatus("hostC", 3)),
    (Success, makeMapStatus("hostB", 4)),
    (Success, makeMapStatus("hostB", 5)),
    (Success, makeMapStatus("hostC", 6))
  ))
  complete(taskSets(1), Seq(
    (Success, makeMapStatus("hostA", 1)),
    (Success, makeMapStatus("hostB", 2)),
    (Success, makeMapStatus("hostA", 1)),
    (Success, makeMapStatus("hostB", 2)),
    (Success, makeMapStatus("hostA", 1))
  ))
  runEvent(ExecutorLost("exec-hostA"))
  runEvent(CompletionEvent(taskSets(1).tasks(0), Resubmitted, null, null, null, null))
  runEvent(CompletionEvent(taskSets(1).tasks(2), Resubmitted, null, null, null, null))
  runEvent(CompletionEvent(taskSets(1).tasks(0),
    FetchFailed(null, firstShuffleId, -1, 0, "Fetch metadata failed"),
    null, null, null, null))
  scheduler.resubmitFailedStages()
  runEvent(CompletionEvent(taskSets(1).tasks(0), Success, makeMapStatus("hostC", 1), null, null, null))
  runEvent(CompletionEvent(taskSets(1).tasks(2), Success, makeMapStatus("hostC", 1), null, null, null))
  runEvent(CompletionEvent(taskSets(1).tasks(4), Success, makeMapStatus("hostC", 1), null, null, null))
  runEvent(CompletionEvent(taskSets(1).tasks(5), Success, makeMapStatus("hostB", 2), null, null, null))
  val stage = scheduler.stageIdToStage(taskSets(1).stageId)
  assert(stage.attemptId == 2)
  assert(stage.isAvailable)
  assert(stage.pendingTasks.size == 0)
}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5193) Make Spark SQL API usable in Java and remove the Java-specific API
[ https://issues.apache.org/jira/browse/SPARK-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278163#comment-14278163 ] Apache Spark commented on SPARK-5193: - User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/4054 > Make Spark SQL API usable in Java and remove the Java-specific API > -- > > Key: SPARK-5193 > URL: https://issues.apache.org/jira/browse/SPARK-5193 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Reynold Xin > > The Java version of the SchemaRDD API causes a high maintenance burden for Spark > SQL itself and downstream libraries (e.g. the MLlib pipeline API needs to support > both JavaSchemaRDD and SchemaRDD). We can audit the Scala API, make it > usable from Java, and then remove the Java-specific version. > Things to remove include (the Java version of): > - data type > - Row > - SQLContext > - HiveContext > Things to consider: > - Scala and Java have different collection libraries. > - Scala and Java (8) have different closure interfaces. > - Scala and Java can have duplicate definitions of common classes, such as > BigDecimal. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
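The duplicate-definition point can be seen directly in a Scala REPL: Scala's BigDecimal is a wrapper over Java's, so a unified Scala/Java API has to pick one representation or convert at the boundary. A minimal illustration:

```scala
// Scala and Java each define a BigDecimal class. scala.math.BigDecimal
// wraps java.math.BigDecimal, so an API shared between the two languages
// must either standardize on one type or convert at the boundary.
val scalaBd = scala.math.BigDecimal("1.5")
val javaBd  = new java.math.BigDecimal("1.5")

// The Scala wrapper exposes the underlying Java value, which compares
// equal to a directly constructed java.math.BigDecimal:
println(scalaBd.underlying == javaBd) // true: same unscaled value and scale
```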
[jira] [Commented] (SPARK-5254) Update the user guide to make clear that spark.mllib is not being deprecated
[ https://issues.apache.org/jira/browse/SPARK-5254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278160#comment-14278160 ] Apache Spark commented on SPARK-5254: - User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/4053 > Update the user guide to make clear that spark.mllib is not being deprecated > > > Key: SPARK-5254 > URL: https://issues.apache.org/jira/browse/SPARK-5254 > Project: Spark > Issue Type: Documentation >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > Fix For: 1.3.0, 1.2.1 > > > The current statement in the user guide may deliver a confusing message to > users. spark.ml contains high-level APIs for building ML pipelines, but that > doesn't mean spark.mllib is being deprecated. > First of all, the pipeline API is in its alpha stage and we need to see more > use cases from the community to stabilize it, which may take several > releases. Secondly, the components in spark.ml are simple wrappers over > spark.mllib implementations. Neither the APIs nor the implementations from > spark.mllib are being deprecated. We expect users to use the spark.ml pipeline > APIs to build their ML pipelines, but we will keep supporting and adding > features to spark.mllib. For example, there are many features in review at > https://spark-prs.appspot.com/#mllib. So users should be comfortable with > using spark.mllib features and expect more to come. The user guide needs to > be updated to make this message clear. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5147) write ahead logs from streaming receiver are not purged because cleanupOldBlocks in WriteAheadLogBasedBlockHandler is never called
[ https://issues.apache.org/jira/browse/SPARK-5147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14278161#comment-14278161 ] Saisai Shao commented on SPARK-5147: 1. Currently, the check for whether to delete the WAL happens in clearMetadata(), which runs at checkpoint time, after the batch has finished. 2. This is actually the point I was thinking of, and it can possibly be improved. By throughput I mean that 2 copies in the BlockManager plus 3 copies in HDFS occupy more network bandwidth than 1 copy in the BlockManager plus 3 copies in HDFS; I am not talking about response time. Yes, HDFS replication is much slower than BlockManager replication, but replicating to the BlockManager and HDFS concurrently increases network bandwidth contention and lowers the throughput of the whole system. > write ahead logs from streaming receiver are not purged because > cleanupOldBlocks in WriteAheadLogBasedBlockHandler is never called > -- > > Key: SPARK-5147 > URL: https://issues.apache.org/jira/browse/SPARK-5147 > Project: Spark > Issue Type: Sub-task > Components: Streaming >Affects Versions: 1.2.0 >Reporter: Max Xu >Priority: Blocker > > Hi all, > We are running a Spark streaming application with ReliableKafkaReceiver. We > have "spark.streaming.receiver.writeAheadLog.enable" set to true, so write > ahead logs (WALs) for received data are created under the receivedData/streamId > folder in the checkpoint directory. > However, old WALs are never purged by time; receivedBlockMetadata and > checkpoint files are purged correctly, though. I went through the code: the > WriteAheadLogBasedBlockHandler class in ReceivedBlockHandler.scala is > responsible for cleaning up the old blocks. It has the method cleanupOldBlocks, > which is never called by any class. The ReceiverSupervisorImpl class holds a > WriteAheadLogBasedBlockHandler instance; however, it only calls the storeBlock > method to create WALs and never calls cleanupOldBlocks to purge old WALs. > The size of the WAL folder increases constantly on HDFS. This is preventing > us from running the ReliableKafkaReceiver 24x7. Can somebody please take a > look? > Thanks, > Max -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5258) Clean up exposed classes in sql.hive package
Reynold Xin created SPARK-5258: -- Summary: Clean up exposed classes in sql.hive package Key: SPARK-5258 URL: https://issues.apache.org/jira/browse/SPARK-5258 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Reynold Xin -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-5257) SparseVector indices must be non-negative
Derrick Burns created SPARK-5257: Summary: SparseVector indices must be non-negative Key: SPARK-5257 URL: https://issues.apache.org/jira/browse/SPARK-5257 Project: Spark Issue Type: Documentation Components: MLlib Affects Versions: 1.2.0 Reporter: Derrick Burns Priority: Minor The description of SparseVector suggests only that the indices must be distinct integers. However, the code for the constructor that takes an array of (index, value) tuples assumes that the indices are non-negative. Either the code or the description should be changed. This arose when I generated indices via hashing and converted the hash values to (signed) integers. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
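A common workaround for the hashing scenario described above is to fold the signed hash code into the valid index range before constructing the vector. The helper below is an illustrative sketch, not part of MLlib's API:

```scala
// Hypothetical helper for feature hashing: map a (possibly negative)
// hash code into [0, numFeatures), the non-negative index range that
// SparseVector's constructor assumes. floorMod, unlike %, never
// returns a negative result for a positive modulus.
def hashToIndex(feature: String, numFeatures: Int): Int =
  java.lang.Math.floorMod(feature.hashCode, numFeatures)

val idx = hashToIndex("some-feature", 1 << 20)
println(idx >= 0 && idx < (1 << 20)) // true
```

Note that `hashCode % numFeatures` alone would not suffice, since `%` in Scala (as in Java) can return negative values for negative operands.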
[jira] [Resolved] (SPARK-5254) Update the user guide to make clear that spark.mllib is not being deprecated
[ https://issues.apache.org/jira/browse/SPARK-5254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5254. -- Resolution: Fixed Fix Version/s: 1.2.1 1.3.0 Issue resolved by pull request 4052 [https://github.com/apache/spark/pull/4052] > Update the user guide to make clear that spark.mllib is not being deprecated > > > Key: SPARK-5254 > URL: https://issues.apache.org/jira/browse/SPARK-5254 > Project: Spark > Issue Type: Documentation >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > Fix For: 1.3.0, 1.2.1 > > > The current statement in the user guide may deliver a confusing message to > users. spark.ml contains high-level APIs for building ML pipelines, but that > doesn't mean spark.mllib is being deprecated. > First of all, the pipeline API is in its alpha stage and we need to see more > use cases from the community to stabilize it, which may take several > releases. Secondly, the components in spark.ml are simple wrappers over > spark.mllib implementations. Neither the APIs nor the implementations from > spark.mllib are being deprecated. We expect users to use the spark.ml pipeline > APIs to build their ML pipelines, but we will keep supporting and adding > features to spark.mllib. For example, there are many features in review at > https://spark-prs.appspot.com/#mllib. So users should be comfortable with > using spark.mllib features and expect more to come. The user guide needs to > be updated to make this message clear. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5256) Improving MLlib optimization APIs
[ https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277988#comment-14277988 ] Alexander Ulanov commented on SPARK-5256: - Also, asynchronous gradient update might be a good thing to have. > Improving MLlib optimization APIs > - > > Key: SPARK-5256 > URL: https://issues.apache.org/jira/browse/SPARK-5256 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley > > *Goal*: Improve APIs for optimization > *Motivation*: There have been several disjoint mentions of improving the > optimization APIs to make them more pluggable, extensible, etc. This JIRA is > a place to discuss what API changes are necessary for the long term, and to > provide links to other relevant JIRAs. > Eventually, I hope this leads to a design doc outlining: > * current issues > * requirements such as supporting many types of objective functions, > optimization algorithms, and parameters to those algorithms > * ideal API > * breakdown of smaller JIRAs needed to achieve that API > I will soon create an initial design doc, and I will try to watch this JIRA > and include ideas from JIRA comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5256) Improving MLlib optimization APIs
[ https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277986#comment-14277986 ] Alexander Ulanov commented on SPARK-5256: - I would like to improve the Gradient interface so that it can process something more general than `Label` (which is relevant only to classifiers, not to other machine learning methods) and also allow batch processing. The simplest way I see of doing this is to add another function to the `Gradient` interface: def compute(data: Vector, output: Vector, weights: Vector, cumGradient: Vector): Double In the `Gradient` trait, the existing label-based `compute` should delegate to this one, passing `label` as the output. Of course, one needs to make some adjustments to the LBFGS and GradientDescent optimizers, replacing label: Double with output: Vector. For batch processing, one can stack the data and output points into one long vector (matrices are stored this way in breeze) and pass them through the proposed interface. > Improving MLlib optimization APIs > - > > Key: SPARK-5256 > URL: https://issues.apache.org/jira/browse/SPARK-5256 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley > > *Goal*: Improve APIs for optimization > *Motivation*: There have been several disjoint mentions of improving the > optimization APIs to make them more pluggable, extensible, etc. This JIRA is > a place to discuss what API changes are necessary for the long term, and to > provide links to other relevant JIRAs. > Eventually, I hope this leads to a design doc outlining: > * current issues > * requirements such as supporting many types of objective functions, > optimization algorithms, and parameters to those algorithms > * ideal API > * breakdown of smaller JIRAs needed to achieve that API > I will soon create an initial design doc, and I will try to watch this JIRA > and include ideas from JIRA comments. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
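The proposal above can be sketched as a trait in which the existing label-based compute delegates to the more general vector-output form. Array[Double] stands in for MLlib's Vector type to keep the sketch self-contained; this illustrates the proposed signature only and is not MLlib code.

```scala
// Sketch of the proposed Gradient interface: a general compute() taking a
// vector-valued output, with the single-label form delegating to it.
// Array[Double] replaces MLlib's Vector so the example is self-contained.
trait Gradient {
  def compute(data: Array[Double], output: Array[Double],
              weights: Array[Double], cumGradient: Array[Double]): Double

  // Existing single-label form, expressed through the general one.
  def compute(data: Array[Double], label: Double,
              weights: Array[Double], cumGradient: Array[Double]): Double =
    compute(data, Array(label), weights, cumGradient)
}

// Example instance: squared loss for a linear prediction.
object SquaredLossGradient extends Gradient {
  def compute(data: Array[Double], output: Array[Double],
              weights: Array[Double], cumGradient: Array[Double]): Double = {
    val pred = data.zip(weights).map { case (x, w) => x * w }.sum
    val err  = pred - output(0)
    // Accumulate the gradient of 0.5 * err^2 with respect to the weights.
    for (i <- data.indices) cumGradient(i) += err * data(i)
    0.5 * err * err
  }
}

// With weights (1, 1) and data (1, 2), the prediction is 3, so the loss
// against label 3.0 is zero:
println(SquaredLossGradient.compute(Array(1.0, 2.0), 3.0, Array(1.0, 1.0), new Array[Double](2)))
```

The batch case would stack several (data, output) pairs into the two vector arguments, as the comment suggests for breeze's column-major storage.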
[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277932#comment-14277932 ] Xiangrui Meng commented on SPARK-5226: -- [~angellandros] Before you start coding, could you describe the computation and communication cost of a parallel implementation of DBSCAN as well as the proposed APIs (parameters, methods)? It would help us decide whether to include it in MLlib. > Add DBSCAN Clustering Algorithm to MLlib > > > Key: SPARK-5226 > URL: https://issues.apache.org/jira/browse/SPARK-5226 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Muhammad-Ali A'rabi >Priority: Minor > Labels: DBSCAN > > MLlib is all k-means now, and I think we should add some new clustering > algorithms to it. First candidate is DBSCAN as I think. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-4894) Add Bernoulli-variant of Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277933#comment-14277933 ] Joseph K. Bradley commented on SPARK-4894: -- +1 for small changes, but occasionally larger ones are needed. The jump to the spark.ml package might be a good time to make such a switch. We don't want to try to approach the complexity of Factorie, but it should not be too hard to support continuous variables. Gaussian NB would surely be the first generalization to add. I agree we should think about the API carefully to encourage people to use NB correctly. The items to decide are: * Should we only support discrete variables? If so, then there is nothing to discuss. * If we support continuous variables (and I would argue we should), then we must design a good API. No matter what API we design, people may misuse it by treating a categorical feature as continuous. Documentation is necessary and helpful, but a good API is important too. That API could take several forms: ** 1 NaiveBayes class, with a Factor type parameter ** multiple NaiveBayes classes for different variable types and/or distributions ** a better way of specifying types, such as enumerations for discrete variables Basically, to support continuous features, we will need to think about types, and I agree it should be thought out so it does not become too complex. > Add Bernoulli-variant of Naive Bayes > > > Key: SPARK-4894 > URL: https://issues.apache.org/jira/browse/SPARK-4894 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.2.0 >Reporter: RJ Nowling >Assignee: RJ Nowling > > MLlib only supports the multinomial-variant of Naive Bayes. The Bernoulli > version of Naive Bayes is more useful for situations where the features are > binary values. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
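One way to picture the "enumerations for discrete variables" option discussed above is a sealed feature-type ADT that a NaiveBayes-style API could accept per column, making the categorical/continuous choice explicit in the types. This is a design sketch only, with hypothetical names, not an actual MLlib API:

```scala
// Hypothetical sketch of per-feature type declarations for a Naive Bayes
// API: the type tag determines which likelihood family the model fits,
// making misuse (treating a categorical feature as continuous) visible.
sealed trait FeatureType
case object Categorical extends FeatureType // multinomial/Bernoulli counts
case object Continuous  extends FeatureType // e.g. Gaussian NB

def likelihoodFamily(t: FeatureType): String = t match {
  case Categorical => "multinomial"
  case Continuous  => "gaussian"
}

println(likelihoodFamily(Continuous)) // gaussian
```

A model constructor could then take a `Seq[FeatureType]`, one entry per feature, instead of assuming all features are discrete.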
[jira] [Updated] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5226: - Affects Version/s: (was: 1.2.0) > Add DBSCAN Clustering Algorithm to MLlib > > > Key: SPARK-5226 > URL: https://issues.apache.org/jira/browse/SPARK-5226 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Muhammad-Ali A'rabi >Priority: Minor > Labels: DBSCAN > > MLlib is all k-means now, and I think we should add some new clustering > algorithms to it. First candidate is DBSCAN as I think. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5226: - Target Version/s: (was: 1.2.0) > Add DBSCAN Clustering Algorithm to MLlib > > > Key: SPARK-5226 > URL: https://issues.apache.org/jira/browse/SPARK-5226 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Muhammad-Ali A'rabi >Priority: Minor > Labels: DBSCAN > > MLlib is all k-means now, and I think we should add some new clustering > algorithms to it. First candidate is DBSCAN as I think. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3655) Support sorting of values in addition to keys (i.e. secondary sort)
[ https://issues.apache.org/jira/browse/SPARK-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sandy Ryza updated SPARK-3655: -- Priority: Major (was: Minor) > Support sorting of values in addition to keys (i.e. secondary sort) > --- > > Key: SPARK-3655 > URL: https://issues.apache.org/jira/browse/SPARK-3655 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Affects Versions: 1.1.0, 1.2.0 >Reporter: koert kuipers >Assignee: Koert Kuipers > > Now that spark has a sort based shuffle, can we expect a secondary sort soon? > There are some use cases where getting a sorted iterator of values per key is > helpful. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-5256) Improving MLlib optimization APIs
[ https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-5256: - Comment: was deleted (was: Generalization: grouped optimization) > Improving MLlib optimization APIs > - > > Key: SPARK-5256 > URL: https://issues.apache.org/jira/browse/SPARK-5256 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley > > *Goal*: Improve APIs for optimization > *Motivation*: There have been several disjoint mentions of improving the > optimization APIs to make them more pluggable, extensible, etc. This JIRA is > a place to discuss what API changes are necessary for the long term, and to > provide links to other relevant JIRAs. > Eventually, I hope this leads to a design doc outlining: > * current issues > * requirements such as supporting many types of objective functions, > optimization algorithms, and parameters to those algorithms > * ideal API > * breakdown of smaller JIRAs needed to achieve that API > I will soon create an initial design doc, and I will try to watch this JIRA > and include ideas from JIRA comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-5256) Improving MLlib optimization APIs
[ https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-5256: - Comment: was deleted (was: Generalization: grouped optimization) > Improving MLlib optimization APIs > - > > Key: SPARK-5256 > URL: https://issues.apache.org/jira/browse/SPARK-5256 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley > > *Goal*: Improve APIs for optimization > *Motivation*: There have been several disjoint mentions of improving the > optimization APIs to make them more pluggable, extensible, etc. This JIRA is > a place to discuss what API changes are necessary for the long term, and to > provide links to other relevant JIRAs. > Eventually, I hope this leads to a design doc outlining: > * current issues > * requirements such as supporting many types of objective functions, > optimization algorithms, and parameters to those algorithms > * ideal API > * breakdown of smaller JIRAs needed to achieve that API > I will soon create an initial design doc, and I will try to watch this JIRA > and include ideas from JIRA comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-5256) Improving MLlib optimization APIs
[ https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-5256: - Comment: was deleted (was: Improving "Updater" concept) > Improving MLlib optimization APIs > - > > Key: SPARK-5256 > URL: https://issues.apache.org/jira/browse/SPARK-5256 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley > > *Goal*: Improve APIs for optimization > *Motivation*: There have been several disjoint mentions of improving the > optimization APIs to make them more pluggable, extensible, etc. This JIRA is > a place to discuss what API changes are necessary for the long term, and to > provide links to other relevant JIRAs. > Eventually, I hope this leads to a design doc outlining: > * current issues > * requirements such as supporting many types of objective functions, > optimization algorithms, and parameters to those algorithms > * ideal API > * breakdown of smaller JIRAs needed to achieve that API > I will soon create an initial design doc, and I will try to watch this JIRA > and include ideas from JIRA comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5256) Improving MLlib optimization APIs
[ https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277861#comment-14277861 ] Joseph K. Bradley commented on SPARK-5256: -- Generalization: grouped optimization > Improving MLlib optimization APIs > - > > Key: SPARK-5256 > URL: https://issues.apache.org/jira/browse/SPARK-5256 > Project: Spark > Issue Type: Umbrella > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley > > *Goal*: Improve APIs for optimization > *Motivation*: There have been several disjoint mentions of improving the > optimization APIs to make them more pluggable, extensible, etc. This JIRA is > a place to discuss what API changes are necessary for the long term, and to > provide links to other relevant JIRAs. > Eventually, I hope this leads to a design doc outlining: > * current issues > * requirements such as supporting many types of objective functions, > optimization algorithms, and parameters to those algorithms > * ideal API > * breakdown of smaller JIRAs needed to achieve that API > I will soon create an initial design doc, and I will try to watch this JIRA > and include ideas from JIRA comments. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5256) Improving MLlib optimization APIs
[ https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277858#comment-14277858 ] Joseph K. Bradley commented on SPARK-5256:
Generalization: grouped optimization
[jira] [Commented] (SPARK-5256) Improving MLlib optimization APIs
[ https://issues.apache.org/jira/browse/SPARK-5256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277857#comment-14277857 ] Joseph K. Bradley commented on SPARK-5256:
Improving "Updater" concept
[jira] [Created] (SPARK-5256) Improving MLlib optimization APIs
Joseph K. Bradley created SPARK-5256:
Summary: Improving MLlib optimization APIs
Key: SPARK-5256
URL: https://issues.apache.org/jira/browse/SPARK-5256
Project: Spark
Issue Type: Umbrella
Components: MLlib
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

*Goal*: Improve APIs for optimization
*Motivation*: There have been several disjoint mentions of improving the optimization APIs to make them more pluggable, extensible, etc. This JIRA is a place to discuss what API changes are necessary for the long term, and to provide links to other relevant JIRAs.
Eventually, I hope this leads to a design doc outlining:
* current issues
* requirements such as supporting many types of objective functions, optimization algorithms, and parameters to those algorithms
* ideal API
* breakdown of smaller JIRAs needed to achieve that API
I will soon create an initial design doc, and I will try to watch this JIRA and include ideas from JIRA comments.
[jira] [Commented] (SPARK-4906) Spark master OOMs with exception stack trace stored in JobProgressListener
[ https://issues.apache.org/jira/browse/SPARK-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277849#comment-14277849 ] Mingyu Kim commented on SPARK-4906:
"typically once a few tasks have failed the stage will fail": Is that actually true? Doesn't a stage with N tasks run until all N tasks run (and potentially fail)? We have datasets of thousands of partitions each, so that can easily add up to 50k failed tasks over time. We can certainly allocate more memory, but the concern is that for long-running SparkContexts this memory requirement grows linearly over time, assuming failures happen at a fixed frequency. So we will run into OOMs no matter how large the heap is, as long as enough failures accumulate over time.

> Spark master OOMs with exception stack trace stored in JobProgressListener
> Key: SPARK-4906
> URL: https://issues.apache.org/jira/browse/SPARK-4906
> Project: Spark
> Issue Type: Bug
> Components: Web UI
> Affects Versions: 1.1.1
> Reporter: Mingyu Kim
>
> Spark master was OOMing with a lot of stack traces retained in JobProgressListener. The object dependency goes like the following: JobProgressListener.stageIdToData => StageUIData.taskData => TaskUIData.errorMessage
> Each error message is ~10kb since it has the entire stack trace. As we have a lot of tasks, when all of the tasks across multiple stages go bad, these error messages accounted for 0.5GB of heap at some point.
> Please correct me if I'm wrong, but it looks like all the task info for running applications is kept in memory, which means it's almost always bound to OOM for long-running applications. Would it make sense to fix this, for example, by spilling some UI state to disk?
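The linear-growth argument above is easy to check numerically. A minimal sketch (the ~10 KB message size and 50k failure count are taken from the report; the helper name is illustrative, not Spark code):

```python
# Heap retained by TaskUIData.errorMessage grows linearly with the number of
# failed tasks, since each retained error message holds a full stack trace
# (~10 KB per message in the report).
def retained_error_bytes(failed_tasks, bytes_per_message=10 * 1024):
    """Approximate heap consumed by retained stack traces."""
    return failed_tasks * bytes_per_message

# 50,000 failed tasks at ~10 KB each is roughly 0.5 GB, matching the report.
# At a fixed failure rate this grows without bound over the lifetime of a
# long-running SparkContext, regardless of heap size.
gb = retained_error_bytes(50_000) / (1024 ** 3)
```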
[jira] [Commented] (SPARK-4894) Add Bernoulli-variant of Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1420#comment-1420 ] RJ Nowling commented on SPARK-4894:
Hi [~josephkb], lots to think about! In general, I'm a big fan of multiple small changes over time rather than one big change. They're easier to verify and review. Since MLlib is going through an interface refactoring to become ML anyway, we can focus on the Bernoulli NB change now and worry about a redesign of the API later.
What do you have in mind for other feature and label types? I briefly reviewed Factorie -- their concept of Factors may be overcomplicated for Naive Bayes, but I want to learn more about your ideas. Do you have a few concrete examples of how Factors could be used with NB? And for continuous labels, are you thinking of something like the Gaussian NB in sklearn?
From bioinformatics, I know that folks tend to encode categorical variables incorrectly. E.g., for a DNA sequence consisting of A, T, C, G, and possibly gaps, each position in a sequence should be encoded as four (five) features, one for each nucleotide. When folks try to represent each position as one feature with the bases as numbers (A=1, T=2, etc.), this results in incorrect distance metrics. E.g., ATT will differ from TTT by 1 but ATT will differ from CTT by 2. By using one feature for each of the four (five) possibilities, you get correct distances and can even weight mutations and deletions using BLOSUM matrices and such. For this type of case, I think the solution is education and documentation, not complicated type systems.

> Add Bernoulli-variant of Naive Bayes
> Key: SPARK-4894
> URL: https://issues.apache.org/jira/browse/SPARK-4894
> Project: Spark
> Issue Type: New Feature
> Components: MLlib
> Affects Versions: 1.2.0
> Reporter: RJ Nowling
> Assignee: RJ Nowling
>
> MLlib only supports the multinomial-variant of Naive Bayes. The Bernoulli version of Naive Bayes is more useful for situations where the features are binary values.
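The encoding point in the comment above can be made concrete with a small sketch (pure Python; the dictionary and function names are illustrative). Integer coding makes some single-base substitutions look twice as far apart as others, while one-hot coding scores them all equally:

```python
# Integer coding collapses the four bases onto one numeric axis.
INT_CODE = {"A": 1, "T": 2, "C": 3, "G": 4}

# One-hot coding: one feature per base, as the comment suggests.
ONE_HOT = {"A": (1, 0, 0, 0), "T": (0, 1, 0, 0),
           "C": (0, 0, 1, 0), "G": (0, 0, 0, 1)}

def l1_int(a, b):
    """L1 distance between two sequences under integer coding."""
    return sum(abs(INT_CODE[x] - INT_CODE[y]) for x, y in zip(a, b))

def l1_one_hot(a, b):
    """L1 distance between two sequences under one-hot coding."""
    return sum(abs(p - q)
               for x, y in zip(a, b)
               for p, q in zip(ONE_HOT[x], ONE_HOT[y]))

# ATT vs TTT and ATT vs CTT each differ at exactly one position, yet
# integer coding gives distances 1 and 2; one-hot gives 2 for both.
```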
[jira] [Commented] (SPARK-5254) Update the user guide to make clear that spark.mllib is not being deprecated
[ https://issues.apache.org/jira/browse/SPARK-5254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277768#comment-14277768 ] Apache Spark commented on SPARK-5254:
User 'mengxr' has created a pull request for this issue: https://github.com/apache/spark/pull/4052

> Update the user guide to make clear that spark.mllib is not being deprecated
> Key: SPARK-5254
> URL: https://issues.apache.org/jira/browse/SPARK-5254
> Project: Spark
> Issue Type: Documentation
> Reporter: Xiangrui Meng
> Assignee: Xiangrui Meng
>
> The current statement in the user guide may deliver confusing messages to users. spark.ml contains high-level APIs for building ML pipelines, but that doesn't mean spark.mllib is being deprecated.
> First of all, the pipeline API is in its alpha stage, and we need to see more use cases from the community to stabilize it, which may take several releases. Secondly, the components in spark.ml are simple wrappers over spark.mllib implementations. Neither the APIs nor the implementations from spark.mllib are being deprecated. We expect users to use spark.ml pipeline APIs to build their ML pipelines, but we will keep supporting and adding features to spark.mllib. For example, there are many features in review at https://spark-prs.appspot.com/#mllib. So users should be comfortable using spark.mllib features and expect more to come. The user guide needs to be updated to make this message clear.
[jira] [Updated] (SPARK-5255) Use python doc "note" for experimental tags in tree.py
[ https://issues.apache.org/jira/browse/SPARK-5255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-5255:
Description:
spark/python/pyspark/mllib/tree.py currently has several "EXPERIMENTAL" tags in the docs. Those should use:
{code}
.. note:: Experimental
{code}
(as in, e.g., feature.py)

was:
spark/python/pyspark/mllib/tree.py currently has several "EXPERIMENTAL" tags in the docs. Those should use:
```
.. note:: Experimental
```
(as in, e.g., feature.py)
[jira] [Created] (SPARK-5255) Use python doc "note" for experimental tags in tree.py
Joseph K. Bradley created SPARK-5255:
Summary: Use python doc "note" for experimental tags in tree.py
Key: SPARK-5255
URL: https://issues.apache.org/jira/browse/SPARK-5255
Project: Spark
Issue Type: Improvement
Components: MLlib, PySpark
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley
Priority: Trivial

spark/python/pyspark/mllib/tree.py currently has several "EXPERIMENTAL" tags in the docs. Those should use:
```
.. note:: Experimental
```
(as in, e.g., feature.py)
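For reference, this is how the directive sits inside a Python docstring (the function below is a made-up example, not MLlib code). Sphinx renders `.. note::` as a highlighted admonition box, whereas a bare "EXPERIMENTAL" string is shown as plain text:

```python
def train_model(data):
    """Train a model on the given data (illustrative only).

    .. note:: Experimental

    The directive above is the feature.py-style replacement for a plain
    "EXPERIMENTAL" marker in the docstring text.
    """
    return data
```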
[jira] [Commented] (SPARK-4585) Spark dynamic executor allocation shouldn't use maxExecutors as initial number
[ https://issues.apache.org/jira/browse/SPARK-4585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277716#comment-14277716 ] Apache Spark commented on SPARK-4585:
User 'sryza' has created a pull request for this issue: https://github.com/apache/spark/pull/4051

> Spark dynamic executor allocation shouldn't use maxExecutors as initial number
> Key: SPARK-4585
> URL: https://issues.apache.org/jira/browse/SPARK-4585
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core, YARN
> Affects Versions: 1.1.0
> Reporter: Chengxiang Li
>
> With SPARK-3174, one can configure a minimum and maximum number of executors for a Spark application on Yarn. However, the application always starts with the maximum. It seems more reasonable, at least for Hive on Spark, to start from the minimum and scale up as needed up to the maximum.
[jira] [Created] (SPARK-5254) Update the user guide to make clear that spark.mllib is not being deprecated
Xiangrui Meng created SPARK-5254:
Summary: Update the user guide to make clear that spark.mllib is not being deprecated
Key: SPARK-5254
URL: https://issues.apache.org/jira/browse/SPARK-5254
Project: Spark
Issue Type: Documentation
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng

The current statement in the user guide may deliver confusing messages to users. spark.ml contains high-level APIs for building ML pipelines, but that doesn't mean spark.mllib is being deprecated.
First of all, the pipeline API is in its alpha stage, and we need to see more use cases from the community to stabilize it, which may take several releases. Secondly, the components in spark.ml are simple wrappers over spark.mllib implementations. Neither the APIs nor the implementations from spark.mllib are being deprecated. We expect users to use spark.ml pipeline APIs to build their ML pipelines, but we will keep supporting and adding features to spark.mllib. For example, there are many features in review at https://spark-prs.appspot.com/#mllib. So users should be comfortable using spark.mllib features and expect more to come. The user guide needs to be updated to make this message clear.
[jira] [Updated] (SPARK-3726) RandomForest: Support for bootstrap options
[ https://issues.apache.org/jira/browse/SPARK-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-3726:
Assignee: Manoj Kumar

> RandomForest: Support for bootstrap options
> Key: SPARK-3726
> URL: https://issues.apache.org/jira/browse/SPARK-3726
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Reporter: Joseph K. Bradley
> Assignee: Manoj Kumar
> Priority: Minor
>
> RandomForest uses BaggedPoint to simulate bootstrapped samples of the data. The expected size of each sample is the same as the original data (sampling rate = 1.0), and sampling is done with replacement. Adding support for other sampling rates and for sampling without replacement would be useful.
[jira] [Commented] (SPARK-5199) Input metrics should show up for InputFormats that return CombineFileSplits
[ https://issues.apache.org/jira/browse/SPARK-5199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277704#comment-14277704 ] Apache Spark commented on SPARK-5199:
User 'sryza' has created a pull request for this issue: https://github.com/apache/spark/pull/4050

> Input metrics should show up for InputFormats that return CombineFileSplits
> Key: SPARK-5199
> URL: https://issues.apache.org/jira/browse/SPARK-5199
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Reporter: Sandy Ryza
> Assignee: Sandy Ryza
[jira] [Commented] (SPARK-4894) Add Bernoulli-variant of Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277702#comment-14277702 ] Joseph K. Bradley commented on SPARK-4894:
[~rnowling] Thanks for looking into this issue! I was thinking about 2 possibilities for generalizing NaiveBayes:
* Specify the model type with simple strings, and keep the current API.
** This is simpler and maintains API stability.
* Generalize the model to allow other feature and label types.
** This is what we really should do long-term, since NaiveBayes should not be limited to discrete labels.
** This would require more work, including defining a Factor concept (using the terminology from Probabilistic Graphical Models) and updating the NaiveBayes API to use factors. Different factors would handle discrete and/or continuous variables and would encode different types of distributions. This setup is common in graphical model libraries like [Factorie|http://factorie.cs.umass.edu/index.html].
** Alternative: We could separate NaiveBayes into 2 classes based on discrete and continuous labels, but that might be even more work in the long term (to maintain 2 copies of the API).
What do you think?
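The first option above (a string-valued model type on the existing entry point) could look roughly like this. This is a hedged sketch only: the class and parameter names are invented for illustration and are not MLlib's actual API.

```python
class NaiveBayes:
    """Sketch of selecting the Naive Bayes variant with a plain string."""

    SUPPORTED_TYPES = ("multinomial", "bernoulli")

    def __init__(self, model_type="multinomial"):
        # A plain string keeps the public API stable while allowing new
        # variants to be added later without new classes.
        if model_type not in self.SUPPORTED_TYPES:
            raise ValueError("unknown model type: %s" % model_type)
        self.model_type = model_type
```

The Factor-based generalization discussed above would instead replace the string with pluggable per-feature distributions, at the cost of a much larger API surface.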
[jira] [Created] (SPARK-5253) LinearRegression with L1/L2 (elastic net) using OWLQN in new ML package
DB Tsai created SPARK-5253:
Summary: LinearRegression with L1/L2 (elastic net) using OWLQN in new ML package
Key: SPARK-5253
URL: https://issues.apache.org/jira/browse/SPARK-5253
Project: Spark
Issue Type: New Feature
Components: ML
Reporter: DB Tsai
[jira] [Commented] (SPARK-5193) Make Spark SQL API usable in Java and remove the Java-specific API
[ https://issues.apache.org/jira/browse/SPARK-5193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277656#comment-14277656 ] Apache Spark commented on SPARK-5193:
User 'rxin' has created a pull request for this issue: https://github.com/apache/spark/pull/4049

> Make Spark SQL API usable in Java and remove the Java-specific API
> Key: SPARK-5193
> URL: https://issues.apache.org/jira/browse/SPARK-5193
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Reporter: Reynold Xin
> Assignee: Reynold Xin
>
> Java version of the SchemaRDD API causes high maintenance burden for Spark SQL itself and downstream libraries (e.g. MLlib pipeline API needs to support both JavaSchemaRDD and SchemaRDD). We can audit the Scala API and make it usable for Java, and then we can remove the Java-specific version.
> Things to remove include (Java version of):
> - data type
> - Row
> - SQLContext
> - HiveContext
> Things to consider:
> - Scala and Java have different collection libraries.
> - Scala and Java (8) have different closure interfaces.
> - Scala and Java can have duplicate definitions of common classes, such as BigDecimal.
[jira] [Commented] (SPARK-4746) integration tests should be separated from faster unit tests
[ https://issues.apache.org/jira/browse/SPARK-4746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277651#comment-14277651 ] Imran Rashid commented on SPARK-4746:
btw, if anybody else wants to futz around with the times of the tests, here is my hacky script to parse the sbt output to get the test times; it requires a bit of manual filtering of the sbt output as well: https://gist.github.com/squito/54afab78f41dd11312ad

> integration tests should be separated from faster unit tests
> Key: SPARK-4746
> URL: https://issues.apache.org/jira/browse/SPARK-4746
> Project: Spark
> Issue Type: Bug
> Reporter: Imran Rashid
> Priority: Trivial
>
> Currently there isn't a good way for a developer to skip the longer integration tests. This can slow down local development. See http://apache-spark-developers-list.1001551.n3.nabble.com/Spurious-test-failures-testing-best-practices-td9560.html
> One option is to use scalatest's notion of test tags to tag all integration tests, so they could easily be skipped
[jira] [Commented] (SPARK-4746) integration tests should be separated from faster unit tests
[ https://issues.apache.org/jira/browse/SPARK-4746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277646#comment-14277646 ] Imran Rashid commented on SPARK-4746:
submitted a PR: https://github.com/apache/spark/pull/4048
My decision of what to choose as IntegrationTests was somewhat arbitrary. But it does have the benefit that (a) all unit tests now run in about 5 mins on my laptop and (b) you don't need to do a full {{mvn package}} before running the unit tests.
I took a brief look at what was taking up the time in the remaining tests. Most of the tests are pretty fast; a few take up most of the time. In fact, it looks like we can get the time down by ~50% if we move the slowest remaining tests to integration tests as well:
ConnectionManagerSuite: - sendMessageReliably timeout
MapOutputTrackerSuite: - remote fetch exceeds akka frame size
RDDSuite: - takeSample
XORShiftRandomSuite: - XORShift generates valid random numbers
ExternalSorterSuite: - cleanup of intermediate files in sorter
SparkListenerSuite: - local metrics
ContextCleanerSuite: - automatically cleanup RDD + shuffle + broadcast
ExternalSorterSuite: - cleanup of intermediate files in sorter, bypass merge-sort
ExternalSorterSuite: - sorting without aggregation, with spill
ExternalSorterSuite: - empty partitions with spilling
TaskSetManagerSuite: - abort the job if total size of results is too large
Of course you have to go a little bit deeper to get more gains, but you could go a lot further, e.g. if you move the 60 slowest then you're down to 25% of the time. This may be opening the door to an endless debate about where we draw the line on what is a unit test vs. an integration test. But one nice thing about the test-tag based approach is that we could go finer-grained if we want, and it's pretty easy for a developer to customize which set of tests they want to run when developing locally.
[jira] [Comment Edited] (SPARK-4894) Add Bernoulli-variant of Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277631#comment-14277631 ] RJ Nowling edited comment on SPARK-4894 at 1/14/15 8:50 PM:
Thanks [~lmcguire]! I'll wait until next week in case you have time to put a patch together. In the meantime, here were my thoughts for changes:
1. Add an optional `model` variable to the `NaiveBayes` object and class and `NaiveBayesModel`. It would be a string with a default value of `Multinomial`. For Bernoulli, we can use `Bernoulli`.
2. In `NaiveBayesModel.predict`, we should compute and store `brzPi + brzTheta * testData.toBreeze`. If `testData(i)` is 0, then `brzTheta * testData.toBreeze` will be 0. If Bernoulli is enabled, we add `log(1 - exp(brzTheta)) * (1 - testData.toBreeze)` to account for the probabilities of the 0-valued features. (Breeze may not allow adding/subtracting scalars and vectors/matrices.)
In the current model, no term is added for rows of `testData` that have 0 entries. In the Bernoulli model, we would be adding a separate term for 0-valued features.
Here is the sklearn source for comparison: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/naive_bayes.py (Look at `_joint_log_likelihood` in the `MultinomialNB` and `BernoulliNB` classes.) Note that sklearn adds the neg prob to all features and subtracts it from features with 1-values.
[~mengxr], [~lmcguire], [~josephkb] Any thoughts or comments on any of the above?

was (Author: rnowling):
Thanks [~lmcguire]! I'll wait until next week in case you have time to put a patch together. In the mean time, here were my thoughts for changes:
1. Add an optional `model` variable to the `NaiveBayes` object and class and `NaiveBayesModel`. It would be a string with a default value of `Multinomial`. For Bernoulli, we can use `Bernoulli`.
2. In `NaiveBayesModel.predict`, we should compute and store `brzPi + brzTheta * testData.toBreeze`. If `testData(i)` is 0, then `brzTheta * testData.toBreeze` will be 0. If Bernoulli is enabled, we add `log(1 - exp(brzTheta)) * (1 - testData.toBreeze)` to account for the probabilities for the 0-valued features. (Breeze may not allow adding/subtracting scalars and vectors/matrices.)
In the current model, no term is added for rows of `testData` that have 0 entries. In the Bernoulli model, we would be adding a separate term for 0-valued features.
Here is the sklearn source for comparison: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/naive_bayes.py
Note that sklearn adds the neg prob to all features and subtracts it from features with 1-values.
[~mengxr], [~josephkb] Any thoughts or comments?
[jira] [Commented] (SPARK-3726) RandomForest: Support for bootstrap options
[ https://issues.apache.org/jira/browse/SPARK-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277643#comment-14277643 ] Manoj Kumar commented on SPARK-3726:
[~josephkb] You seem to report issues that I always think I can have a decent shot at :) I would like to submit a PR for this by the end of the week.
[jira] [Commented] (SPARK-4746) integration tests should be separated from faster unit tests
[ https://issues.apache.org/jira/browse/SPARK-4746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277633#comment-14277633 ] Apache Spark commented on SPARK-4746:
User 'squito' has created a pull request for this issue: https://github.com/apache/spark/pull/4048
[jira] [Commented] (SPARK-4894) Add Bernoulli-variant of Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277631#comment-14277631 ] RJ Nowling commented on SPARK-4894:
Thanks [~lmcguire]! I'll wait until next week in case you have time to put a patch together. In the meantime, here were my thoughts for changes:
1. Add an optional `model` variable to the `NaiveBayes` object and class and `NaiveBayesModel`. It would be a string with a default value of `Multinomial`. For Bernoulli, we can use `Bernoulli`.
2. In `NaiveBayesModel.predict`, we should compute and store `brzPi + brzTheta * testData.toBreeze`. If `testData(i)` is 0, then `brzTheta * testData.toBreeze` will be 0. If Bernoulli is enabled, we add `log(1 - exp(brzTheta)) * (1 - testData.toBreeze)` to account for the probabilities of the 0-valued features. (Breeze may not allow adding/subtracting scalars and vectors/matrices.)
In the current model, no term is added for rows of `testData` that have 0 entries. In the Bernoulli model, we would be adding a separate term for 0-valued features.
Here is the sklearn source for comparison: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/naive_bayes.py
Note that sklearn adds the neg prob to all features and subtracts it from features with 1-values.
[~mengxr], [~josephkb] Any thoughts or comments?
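The two joint log-likelihoods described in the comment above can be sketched in plain Python (dense lists rather than Breeze vectors; `pi` holds log class priors, `theta[c][i]` holds log P(feature i | class c), and the binary-feature assumption for the Bernoulli case follows the comment):

```python
import math

def multinomial_jll(pi, theta, x):
    """pi[c] + theta[c] . x: zero-valued features contribute nothing."""
    return [p + sum(t * xi for t, xi in zip(row, x))
            for p, row in zip(pi, theta)]

def bernoulli_jll(pi, theta, x):
    """Adds log(1 - exp(theta)) for each zero-valued (binary) feature."""
    return [p + sum(t * xi + math.log(1.0 - math.exp(t)) * (1 - xi)
                    for t, xi in zip(row, x))
            for p, row in zip(pi, theta)]
```

A feature that is absent (x_i = 0) is ignored by the multinomial score but explicitly penalized by the Bernoulli score, which is exactly the difference the comment describes (cf. `_joint_log_likelihood` in sklearn's `MultinomialNB` and `BernoulliNB`).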
[jira] [Commented] (SPARK-5235) Determine serializability of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277611#comment-14277611 ] Alex Baretta commented on SPARK-5235: - [~rxin] I see your point. Well, listen, I appreciate you merging my PR. I can definitely rework my code to get the SQLContext out of the picture. Still, before you commit a change removing the Serializable trait from SQLContext, you probably want to let everyone know by dropping a note to the dev list first, to give people time to refactor their code.
> Determine serializability of SQLContext
> Key: SPARK-5235
> URL: https://issues.apache.org/jira/browse/SPARK-5235
> Project: Spark
> Issue Type: Sub-task
> Reporter: Alex Baretta
>
> The SQLConf field in SQLContext is neither Serializable nor transient. Here's
> the stack trace I get when running SQL queries against a Parquet file.
> {code}
> Exception in thread "Thread-43" org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.apache.spark.sql.SQLConf
>   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1195)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1184)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1183)
>   at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>   at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1183)
>   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:843)
>   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:779)
>   at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:763)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1364)
>   at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1356)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:487)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:220)
>   at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}
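The root cause in the trace above — a field that is neither Serializable nor transient dragging its enclosing object into a serialization failure — has a direct analog in Python pickling. A hedged sketch with illustrative stand-in classes (not Spark code):

```python
import pickle
import threading

class Conf:
    """Stand-in for SQLConf: holds something that cannot be serialized."""
    def __init__(self):
        self.lock = threading.Lock()  # locks are not picklable

class Context:
    """Stand-in for SQLContext: excludes the bad field, like marking it @transient."""
    def __init__(self):
        self.conf = Conf()

    def __getstate__(self):
        state = self.__dict__.copy()
        state.pop("conf")  # drop the non-serializable field before pickling
        return state

try:
    pickle.dumps(Conf())
    conf_picklable = True
except TypeError:
    conf_picklable = False  # analogous to the NotSerializableException above

blob = pickle.dumps(Context())  # succeeds because the field was excluded
print(conf_picklable)  # False
```

The trade-off mirrors the JIRA discussion: excluding the field makes serialization succeed, but the deserialized object silently comes back without it.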
[jira] [Commented] (SPARK-4894) Add Bernoulli-variant of Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277590#comment-14277590 ] Leah McGuire commented on SPARK-4894: - Thanks [~rnowling]! I can take a look at it this week and at least confirm that the docs are incorrect and outline the changes that need to occur. If I have time I will try to send you a patch this week; if not, I'm happy to help with review and testing for your patch.
[jira] [Commented] (SPARK-4894) Add Bernoulli-variant of Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277578#comment-14277578 ] Xiangrui Meng commented on SPARK-4894: -- [~rnowling] I've assigned this to you. Let's also have [~josephkb] in the loop, who has mentioned this multiple times offline. It would be good if you could briefly describe your plan, e.g., API changes and the implementation. Thanks!
[jira] [Updated] (SPARK-4894) Add Bernoulli-variant of Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4894: - Assignee: RJ Nowling
[jira] [Updated] (SPARK-4894) Add Bernoulli-variant of Naive Bayes
[ https://issues.apache.org/jira/browse/SPARK-4894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-4894: - Affects Version/s: 1.2.0 (was: 1.1.1)
[jira] [Updated] (SPARK-5234) examples for ml don't have sparkContext.stop
[ https://issues.apache.org/jira/browse/SPARK-5234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5234: - Assignee: yuhao yang > examples for ml don't have sparkContext.stop > > > Key: SPARK-5234 > URL: https://issues.apache.org/jira/browse/SPARK-5234 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 1.2.0 > Environment: all >Reporter: yuhao yang >Assignee: yuhao yang >Priority: Trivial > Fix For: 1.3.0, 1.2.1 > > Original Estimate: 1h > Remaining Estimate: 1h > > Not sure why sc.stop() is not in the > org.apache.spark.examples.ml {CrossValidatorExample, SimpleParamsExample, > SimpleTextClassificationPipeline}. > I can prepare a PR if it's not intentional to omit the call to stop. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5234) examples for ml don't have sparkContext.stop
[ https://issues.apache.org/jira/browse/SPARK-5234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5234: - Target Version/s: 1.3.0, 1.2.1 (was: 1.3.0)
[jira] [Updated] (SPARK-5234) examples for ml don't have sparkContext.stop
[ https://issues.apache.org/jira/browse/SPARK-5234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-5234: - Fix Version/s: 1.2.1
[jira] [Resolved] (SPARK-5234) examples for ml don't have sparkContext.stop
[ https://issues.apache.org/jira/browse/SPARK-5234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5234. -- Resolution: Fixed Issue resolved by pull request 4044 [https://github.com/apache/spark/pull/4044]
[jira] [Commented] (SPARK-5235) Determine serializability of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277561#comment-14277561 ] Reynold Xin commented on SPARK-5235: I will merge your PR and we can continue having this discussion. For 1.3 we do plan to stabilize the Spark SQL external APIs. A lot of APIs will actually change (we will likely end up breaking the programming API's binary and source compatibility) to make a better API. This is actually a good chance to do it. Once we are done with that, Spark SQL can graduate from alpha and the APIs will no longer change.
[jira] [Commented] (SPARK-5206) Accumulators are not re-registered during recovering from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-5206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277554#comment-14277554 ] vincent ye commented on SPARK-5206: --- Hi Tathagata, the Accumulator object is created after the StreamingContext (ssc) is created, using: val counter = ssc.sparkContext.accumulator(). The way I create the recoverable ssc is like this:
{code}
val ssc = StreamingContext.getOrCreate("/spark/applicationName", functionToCreateContext())

def functionToCreateContext = {
  val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
  val ssc = new StreamingContext(conf, Seconds(1))
  val counter = ssc.sparkContext.accumulator(0L, "message received")
  ...
  dstream.foreach(counter += 1)
  ...
  ssc
}

ssc.start
ssc.awaitTermination()
{code}
If the app is recovering from checkpoint, it won't execute functionToCreateContext. Then the counter Accumulator won't be instantiated and registered to the Accumulators singleton object. But the counter inside dstream.foreach(counter += 1) will be created by the deserializer on a remote worker, which won't register the counter to the driver. I don't understand what you mean by explicitly referencing the Accumulator object ..
Thanks
> Accumulators are not re-registered during recovering from checkpoint
> Key: SPARK-5206
> URL: https://issues.apache.org/jira/browse/SPARK-5206
> Project: Spark
> Issue Type: Bug
> Components: Streaming
> Affects Versions: 1.1.0
> Reporter: vincent ye
>
> I got the following exception while my streaming application restarts from a crash from checkpoint:
> {code}
> 15/01/12 10:31:06 sparkDriver-akka.actor.default-dispatcher-4 ERROR scheduler.DAGScheduler: Failed to update accumulators for ShuffleMapTask(41, 4)
> java.util.NoSuchElementException: key not found: 1
>   at scala.collection.MapLike$class.default(MapLike.scala:228)
>   at scala.collection.AbstractMap.default(Map.scala:58)
>   at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$1.apply(DAGScheduler.scala:939)
>   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$1.apply(DAGScheduler.scala:938)
>   at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>   at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
>   at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
>   at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
>   at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
>   at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:938)
>   at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1388)
>   at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>   at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>   at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>   at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>   at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>   at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>   at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>   at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>   at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}
> I guess that an Accumulator is registered to the singleton Accumulators in line 58 of org.apache.spark.Accumulable: Accumulators.register(this, true). This code needs to be executed in the driver once, but when the application is recovered from checkpoint it won't be executed in the driver. So when the driver processes it at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:938), it can't find the Accumulator, because it is not re-registered during the recovery.
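The registration gap described in this report — the accumulator is registered at creation time, so a recovered context whose creation function never re-runs leaves the driver-side registry empty — can be modeled in a few lines of Python. This is a toy registry, not Spark internals; all names are illustrative:

```python
import copy

registry = {}  # models the driver-side Accumulators singleton: id -> accumulator

class Acc:
    """Toy accumulator that self-registers on creation, like Accumulable does."""
    _next_id = 0

    def __init__(self, value=0):
        self.id = Acc._next_id
        Acc._next_id += 1
        self.value = value
        registry[self.id] = self  # register(this) happens only at creation

# Fresh start: the accumulator is created and registered on the driver.
driver_acc = Acc(0)
task_side = copy.copy(driver_acc)  # the copy shipped inside a task closure

# "Recovery from checkpoint": the creation function never re-runs,
# so the driver's registry comes back empty.
registry.clear()

# A task result arrives carrying an id the driver no longer knows about.
try:
    registry[task_side.id].value += 1
    update_applied = True
except KeyError:
    update_applied = False  # the "key not found" failure from the report

print(update_applied)  # False
```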
[jira] [Commented] (SPARK-5235) Determine serializability of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277555#comment-14277555 ] Alex Baretta commented on SPARK-5235: - [~rxin] Could be. All I'm saying is that your change was not intended to make SQLContext not Serializable. Even if we all agree that it would be cleaner, "taking the opportunity" offered by this regression to remove the Serializable trait from SQLContext is not a good idea, as there is no emergency here. Printing a warning, noting it in the docs for version 1.3, and then waiting until 1.4 would be a better process.
[jira] [Commented] (SPARK-5235) Determine serializability of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277553#comment-14277553 ] Sean Owen commented on SPARK-5235: -- [~alexbaretta] I suppose my point is that no code can possibly rely on serializing {{SQLContext}}, since even before the recent change it would deserialize with no {{SparkContext}} and be unable to do anything. It can only be accidentally serialized. The issue is that the change would turn a silent error into an explicit one, yes. I'm also not 100% sure there isn't some binary compatibility issue. If not, then perhaps we could stick in a warning by overriding the Java serialization method to print it.
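The last suggestion above — override the serialization hook so accidental serialization at least emits a warning — can be sketched using Python's pickle hook as a stand-in for Java's writeObject. This is an illustrative analog only, not a proposed patch:

```python
import pickle
import warnings

class Context:
    """Stand-in for a context object that is accidentally serializable."""

    def __reduce__(self):
        # Analog of overriding the Java serialization method to warn:
        # serialization still succeeds, but the caller is told it's a bad idea.
        warnings.warn("serializing Context is deprecated and may stop working",
                      DeprecationWarning, stacklevel=2)
        return (Context, ())

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    blob = pickle.dumps(Context())  # still works, but now warns once

print(len(caught))  # 1
```

This matches the migration path discussed in the thread: warn in one release, remove the capability in the next.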
[jira] [Issue Comment Deleted] (SPARK-5206) Accumulators are not re-registered during recovering from checkpoint
[ https://issues.apache.org/jira/browse/SPARK-5206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] vincent ye updated SPARK-5206: -- Comment: was deleted (was: Hi Tathagata, Accumulator object is created after the StreamingContext (ssc) created using: val counter = ssc.sparkContext.accumulator() The way I create the recoverable ssc is like this: val ssc = StreamingContext.getOrCreate("/spark/applicationName", functionToCreateContext()) def functionToCreateContext = { val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount") val ssc = new StreamingContext(conf, Seconds(1)) val counter = ssc.sparkContext.accumulator(0L, "message received") . dstream.foreach(counter += 1) .. ssc } ssc.start ssc.awaitTermination() If the app is recovering from checkpoint, It won't execute functionToCreateContext. Then the counter of Accumulator won't be instantiated and registered to Accumulators singleton object. But the counter inside dstream.foreach(counter += 1) will be created by deserializer on a remote worker which won't register the counter to the driver. I don't understand what you mean explicity referencing the Accumulator object .. 
Thanks ) > Accumulators are not re-registered during recovering from checkpoint > > > Key: SPARK-5206 > URL: https://issues.apache.org/jira/browse/SPARK-5206 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.1.0 >Reporter: vincent ye > > I got exception as following while my streaming application restarts from > crash from checkpoit: > 15/01/12 10:31:06 sparkDriver-akka.actor.default-dispatcher-4 ERROR > scheduler.DAGScheduler: Failed to update accumulators for ShuffleMapTask(41, > 4) > java.util.NoSuchElementException: key not found: 1 > at scala.collection.MapLike$class.default(MapLike.scala:228) > at scala.collection.AbstractMap.default(Map.scala:58) > at scala.collection.mutable.HashMap.apply(HashMap.scala:64) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$1.apply(DAGScheduler.scala:939) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskCompletion$1.apply(DAGScheduler.scala:938) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) > at > scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98) > at > scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226) > at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39) > at scala.collection.mutable.HashMap.foreach(HashMap.scala:98) > at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:938) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1388) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498) > at akka.actor.ActorCell.invoke(ActorCell.scala:456) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237) > at akka.dispatch.Mailbox.run(Mailbox.scala:219) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > I guess that an Accumulator is registered with the singleton Accumulators at line > 58 of org.apache.spark.Accumulable: > Accumulators.register(this, true) > This code needs to be executed in the driver once, but when the application is > recovered from checkpoint, it won't be executed in the driver. So when the > driver processes it at > org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:938), > it can't find the Accumulator because it was not re-registered during the > recovery. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
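The failure mode described above — an object registered in a driver-side singleton only at construction time, then recreated by deserialization without the registration ever re-running — can be reproduced outside Spark. Below is a minimal Python sketch of the mechanism (a toy model: the `Registry` and `Accumulator` classes here are invented for illustration and are not Spark's actual Accumulators code):

```python
import pickle

class Registry:
    """Toy stand-in for Spark's driver-side Accumulators singleton."""
    values = {}

class Accumulator:
    _next_id = 0

    def __init__(self, value=0):
        Accumulator._next_id += 1
        self.id = Accumulator._next_id
        self.value = value
        # Registration happens ONLY at construction time, mirroring
        # Accumulators.register(this, true) in Accumulable's constructor.
        Registry.values[self.id] = self

acc = Accumulator()
blob = pickle.dumps(acc)   # analogous to checkpointing / shipping the closure

Registry.values.clear()    # simulate a restarted driver recovering from checkpoint
restored = pickle.loads(blob)  # unpickling does NOT re-run __init__

print(restored.value)                  # the object itself is intact...
print(restored.id in Registry.values)  # ...but the registry lookup fails (False)
```

The lookup failure is the toy analogue of the `key not found` error in the stack trace: the recovered accumulator exists, but the driver's registry was never repopulated.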
[jira] [Updated] (SPARK-4014) TaskContext.attemptId returns taskId
[ https://issues.apache.org/jira/browse/SPARK-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-4014: -- Target Version/s: 1.0.3, 1.1.2, 1.2.1 Assignee: Josh Rosen Labels: backport-needed (was: ) > TaskContext.attemptId returns taskId > > > Key: SPARK-4014 > URL: https://issues.apache.org/jira/browse/SPARK-4014 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Yin Huai >Assignee: Josh Rosen >Priority: Minor > Labels: backport-needed > Fix For: 1.3.0 > > > In TaskRunner, we assign the taskId of a task to the attemptId of the > corresponding TaskContext. Should we rename attemptId to taskId to avoid > confusion?
[jira] [Resolved] (SPARK-4014) TaskContext.attemptId returns taskId
[ https://issues.apache.org/jira/browse/SPARK-4014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-4014. --- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 3849 [https://github.com/apache/spark/pull/3849] > TaskContext.attemptId returns taskId > > > Key: SPARK-4014 > URL: https://issues.apache.org/jira/browse/SPARK-4014 > Project: Spark > Issue Type: Bug > Components: Spark Core >Reporter: Yin Huai >Priority: Minor > Labels: backport-needed > Fix For: 1.3.0 > > > In TaskRunner, we assign the taskId of a task to the attemptId of the > corresponding TaskContext. Should we rename attemptId to taskId to avoid > confusion?
[jira] [Updated] (SPARK-5235) Determine serializability of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5235: --- Summary: Determine serializability of SQLContext (was: java.io.NotSerializableException: org.apache.spark.sql.SQLConf) > Determine serializability of SQLContext > --- > > Key: SPARK-5235 > URL: https://issues.apache.org/jira/browse/SPARK-5235 > Project: Spark > Issue Type: Bug >Reporter: Alex Baretta > > The SQLConf field in SQLContext is neither Serializable nor transient. Here's > the stack trace I get when running SQL queries against a Parquet file. > {code} > Exception in thread "Thread-43" org.apache.spark.SparkException: Job aborted > due to stage failure: Task not serializable: > java.io.NotSerializableException: org.apache.spark.sql.SQLConf > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1195) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1184) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1183) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1183) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:843) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:779) > at > org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:763) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1364) > at akka.actor.Actor$class.aroundReceive(Actor.scala:465) > at > 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1356) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) > at akka.actor.ActorCell.invoke(ActorCell.scala:487) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) > at akka.dispatch.Mailbox.run(Mailbox.scala:220) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) > at > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > {code}
[jira] [Updated] (SPARK-5235) Determine serializability of SQLContext
[ https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5235: --- Issue Type: Sub-task (was: Bug) Parent: SPARK-5166 > Determine serializability of SQLContext > --- > > Key: SPARK-5235 > URL: https://issues.apache.org/jira/browse/SPARK-5235 > Project: Spark > Issue Type: Sub-task >Reporter: Alex Baretta > > The SQLConf field in SQLContext is neither Serializable nor transient. Here's > the stack trace I get when running SQL queries against a Parquet file. > {code} > Exception in thread "Thread-43" org.apache.spark.SparkException: Job aborted > due to stage failure: Task not serializable: > java.io.NotSerializableException: org.apache.spark.sql.SQLConf > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1195) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1184) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1183) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1183) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:843) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:779) > at > org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:763) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1364) > at akka.actor.Actor$class.aroundReceive(Actor.scala:465) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1356) > at 
akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) > at akka.actor.ActorCell.invoke(ActorCell.scala:487) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) > at akka.dispatch.Mailbox.run(Mailbox.scala:220) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) > at > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > {code}
[jira] [Commented] (SPARK-5019) Update GMM API to use MultivariateGaussian
[ https://issues.apache.org/jira/browse/SPARK-5019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277538#comment-14277538 ] Joseph K. Bradley commented on SPARK-5019: -- [~lewuathe] Will you be able to update yours soon? It would be great to get this patch in right away since it will help finalize the public API. Thanks! If you don't have time, please let us know so that [~tgaloppo] can submit his. > Update GMM API to use MultivariateGaussian > -- > > Key: SPARK-5019 > URL: https://issues.apache.org/jira/browse/SPARK-5019 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.2.0 >Reporter: Joseph K. Bradley >Priority: Blocker > > The GaussianMixtureModel API should expose MultivariateGaussian instances > instead of the means and covariances. This should be fixed as soon as > possible to stabilize the API.
[jira] [Commented] (SPARK-5235) java.io.NotSerializableException: org.apache.spark.sql.SQLConf
[ https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277539#comment-14277539 ] Reynold Xin commented on SPARK-5235: How would we deprecate that though? Are you suggesting we print a runtime warning when SQLContext is being serialized? > java.io.NotSerializableException: org.apache.spark.sql.SQLConf > -- > > Key: SPARK-5235 > URL: https://issues.apache.org/jira/browse/SPARK-5235 > Project: Spark > Issue Type: Bug >Reporter: Alex Baretta > > The SQLConf field in SQLContext is neither Serializable nor transient. Here's > the stack trace I get when running SQL queries against a Parquet file. > {code} > Exception in thread "Thread-43" org.apache.spark.SparkException: Job aborted > due to stage failure: Task not serializable: > java.io.NotSerializableException: org.apache.spark.sql.SQLConf > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1195) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1184) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1183) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1183) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:843) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:779) > at > org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:763) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1364) > at akka.actor.Actor$class.aroundReceive(Actor.scala:465) > at > 
org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1356) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) > at akka.actor.ActorCell.invoke(ActorCell.scala:487) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) > at akka.dispatch.Mailbox.run(Mailbox.scala:220) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) > at > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > {code}
[jira] [Commented] (SPARK-5235) java.io.NotSerializableException: org.apache.spark.sql.SQLConf
[ https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277533#comment-14277533 ] Alex Baretta commented on SPARK-5235: - [~sowen] I would much rather the decision of making SQLContext not Serializable go through some review. Usually API changes are preceded by a period of "deprecation" of the older API. I don't think there is any urgent need here to break code relying on the serializability of SQLContext without warning. But, yes, longer term I'd want to structure my code so that the SQLContext is not commingled with objects that need to be shipped around. > java.io.NotSerializableException: org.apache.spark.sql.SQLConf > -- > > Key: SPARK-5235 > URL: https://issues.apache.org/jira/browse/SPARK-5235 > Project: Spark > Issue Type: Bug >Reporter: Alex Baretta > > The SQLConf field in SQLContext is neither Serializable nor transient. Here's > the stack trace I get when running SQL queries against a Parquet file. 
> {code} > Exception in thread "Thread-43" org.apache.spark.SparkException: Job aborted > due to stage failure: Task not serializable: > java.io.NotSerializableException: org.apache.spark.sql.SQLConf > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1195) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1184) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1183) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1183) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:843) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:779) > at > org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:763) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1364) > at akka.actor.Actor$class.aroundReceive(Actor.scala:465) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1356) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) > at akka.actor.ActorCell.invoke(ActorCell.scala:487) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) > at akka.dispatch.Mailbox.run(Mailbox.scala:220) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) > at > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > {code}
[jira] [Commented] (SPARK-3717) DecisionTree, RandomForest: Partition by feature
[ https://issues.apache.org/jira/browse/SPARK-3717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277485#comment-14277485 ] Joseph K. Bradley commented on SPARK-3717: -- [~bbnsumanth] Please do not misunderstand: We appreciate your efforts towards contributing. However, when someone starts contributing to Spark, it is important for them to start with smaller tasks to get used to the PR review process, code style requirements, etc. We asked you to follow this convention from the beginning and in multiple messages, but I do not see any responses from you to our suggestions. The main issue is that we and other contributors have limited time. If a patch is very large and requires lots of reviewing, it is a very large time commitment. If you become a regular contributor starting with small patches, then other reviewers will get used to reviewing your code and will be more willing to commit the many hours required for reviewing a large feature. Here is what I would recommend: * Keep your current code. You could even prepare it as a package for Spark if you would like others to start using it right away. * Start contributing using small patches. Get used to the review process and coding conventions. * Re-submit your random forest implementation once you have done other smaller contributions. I hope you do not get discouraged from contributing. We simply need to follow regular contribution procedures to make this very large project manageable. Please refer to [https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark] if it is useful. > DecisionTree, RandomForest: Partition by feature > > > Key: SPARK-3717 > URL: https://issues.apache.org/jira/browse/SPARK-3717 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > > h1. Summary > Currently, data are partitioned by row/instance for DecisionTree and > RandomForest. 
This JIRA argues for partitioning by feature for training deep > trees. This is especially relevant for random forests, which are often > trained to be deeper than single decision trees. > h1. Details > Dataset dimensions and the depth of the tree to be trained are the main > problem parameters determining whether it is better to partition features or > instances. For random forests (training many deep trees), partitioning > features could be much better. > Notation: > * P = # workers > * N = # instances > * M = # features > * D = depth of tree > h2. Partitioning Features > Algorithm sketch: > * Each worker stores: > ** a subset of columns (i.e., a subset of features). If a worker stores > feature j, then the worker stores the feature value for all instances (i.e., > the whole column). > ** all labels > * Train one level at a time. > * Invariants: > ** Each worker stores a mapping: instance → node in current level > * On each iteration: > ** Each worker: For each node in level, compute (best feature to split, info > gain). > ** Reduce (P x M) values to M values to find best split for each node. > ** Workers who have features used in best splits communicate left/right for > relevant instances. Gather total of N bits to master, then broadcast. > * Total communication: > ** Depth D iterations > ** On each iteration, reduce to M values (~8 bytes each), broadcast N values > (1 bit each). > ** Estimate: D * (M * 8 + N) > h2. Partitioning Instances > Algorithm sketch: > * Train one group of nodes at a time. > * Invariants: > * Each worker stores a mapping: instance → node > * On each iteration: > ** Each worker: For each instance, add to aggregate statistics. > ** Aggregate is of size (# nodes in group) x M x (# bins) x (# classes) > *** (“# classes” is for classification. 3 for regression) > ** Reduce aggregate. > ** Master chooses best split for each node in group and broadcasts. 
> * Local training: Once all instances for a node fit on one machine, it can be > best to shuffle the data and train subtrees locally. This can mean shuffling > the entire dataset for each tree trained. > * Summing over all iterations, reduce to total of: > ** (# nodes in tree) x M x (# bins B) x (# classes C) values (~8 bytes each) > ** Estimate: 2^D * M * B * C * 8 > h2. Comparing Partitioning Methods > Partitioning features cost < partitioning instances cost when: > * D * (M * 8 + N) < 2^D * M * B * C * 8 > * D * N < 2^D * M * B * C * 8 (assuming D * M * 8 is small compared to the > right hand side) > * N < [ 2^D * M * B * C * 8 ] / D > Example: many instances: > * 2 million instances, 3500 features, 100 bins, 5 classes, 6 levels (depth = > 5) > * Partitioning features: 6 * ( 3500 * 8 + 2*10^6 ) =~ 1.2 * 10^7 > * Partitioning instances: 32 * 3500 * 100 * 5 * 8 =~ 4.5 * 10^8
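The two communication-cost estimates above are simple enough to check numerically. A short script plugging in the example parameters (a sketch of the back-of-the-envelope model from the JIRA description, not a benchmark; the function names are invented):

```python
def cost_partition_features(iterations, num_features, num_instances):
    # Per iteration: reduce M ~8-byte split summaries, broadcast N 1-bit flags.
    return iterations * (num_features * 8 + num_instances)

def cost_partition_instances(depth, num_features, num_bins, num_classes):
    # ~2^D nodes in the tree, each reducing M x B x C ~8-byte statistics.
    return 2 ** depth * num_features * num_bins * num_classes * 8

# Example from the text: 2M instances, 3500 features, 100 bins, 5 classes,
# 6 levels (depth = 5). Note the text uses 6 iterations (levels) for the
# feature-partitioned cost but 2^5 = 32 nodes for the instance-partitioned one.
N, M, B, C = 2_000_000, 3_500, 100, 5

by_feature = cost_partition_features(6, M, N)
by_instance = cost_partition_instances(5, M, B, C)

print(f"{by_feature:.2e}")   # ~1.2e7, matching the estimate above
print(f"{by_instance:.2e}")  # ~4.5e8
```

With these parameters the instance-partitioned reduce moves roughly 35x more data, which is the argument for partitioning by feature when training deep trees.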
[jira] [Updated] (SPARK-5228) Hide tables for "Active Jobs/Completed Jobs/Failed Jobs" when they are empty
[ https://issues.apache.org/jira/browse/SPARK-5228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-5228: -- Assignee: Kousuke Saruta > Hide tables for "Active Jobs/Completed Jobs/Failed Jobs" when they are empty > > > Key: SPARK-5228 > URL: https://issues.apache.org/jira/browse/SPARK-5228 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta > Fix For: 1.3.0 > > > In the current WebUI, tables for Active Stages, Completed Stages, Skipped Stages > and Failed Stages are hidden when they are empty, while tables for Active > Jobs, Completed Jobs and Failed Jobs are not hidden even when they are empty.
[jira] [Resolved] (SPARK-5228) Hide tables for "Active Jobs/Completed Jobs/Failed Jobs" when they are empty
[ https://issues.apache.org/jira/browse/SPARK-5228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-5228. --- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4028 [https://github.com/apache/spark/pull/4028] > Hide tables for "Active Jobs/Completed Jobs/Failed Jobs" when they are empty > > > Key: SPARK-5228 > URL: https://issues.apache.org/jira/browse/SPARK-5228 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 1.3.0 >Reporter: Kousuke Saruta >Assignee: Kousuke Saruta > Fix For: 1.3.0 > > > In the current WebUI, tables for Active Stages, Completed Stages, Skipped Stages > and Failed Stages are hidden when they are empty, while tables for Active > Jobs, Completed Jobs and Failed Jobs are not hidden even when they are empty.
[jira] [Commented] (SPARK-5235) java.io.NotSerializableException: org.apache.spark.sql.SQLConf
[ https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277475#comment-14277475 ] Sean Owen commented on SPARK-5235: -- [~alexbaretta] Well at least that explains why tests didn't fail, that would have been really puzzling. This also at least gives you a quick route to getting your program working and it's more reliable. In a way it's almost a good thing, to me, that it happens to no longer serialize. Separately, I'd be curious what the API-compatibility implications are of making the class not extend Serializable and leave it at that. > java.io.NotSerializableException: org.apache.spark.sql.SQLConf > -- > > Key: SPARK-5235 > URL: https://issues.apache.org/jira/browse/SPARK-5235 > Project: Spark > Issue Type: Bug >Reporter: Alex Baretta > > The SQLConf field in SQLContext is neither Serializable nor transient. Here's > the stack trace I get when running SQL queries against a Parquet file. > {code} > Exception in thread "Thread-43" org.apache.spark.SparkException: Job aborted > due to stage failure: Task not serializable: > java.io.NotSerializableException: org.apache.spark.sql.SQLConf > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1195) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1184) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1183) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1183) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:843) > at > 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:779) > at > org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:763) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1364) > at akka.actor.Actor$class.aroundReceive(Actor.scala:465) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1356) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) > at akka.actor.ActorCell.invoke(ActorCell.scala:487) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) > at akka.dispatch.Mailbox.run(Mailbox.scala:220) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) > at > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > {code}
[jira] [Updated] (SPARK-2909) Indexing for SparseVector in pyspark
[ https://issues.apache.org/jira/browse/SPARK-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-2909: - Assignee: Manoj Kumar > Indexing for SparseVector in pyspark > > > Key: SPARK-2909 > URL: https://issues.apache.org/jira/browse/SPARK-2909 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Reporter: Joseph K. Bradley >Assignee: Manoj Kumar >Priority: Minor > Fix For: 1.3.0 > > > SparseVector in pyspark does not currently support indexing, except by > examining the internal representation. Though indexing is a pricey operation, > it would be useful for, e.g., iterating through a dataset (RDD[LabeledPoint]) > and operating on a single feature.
[jira] [Commented] (SPARK-5235) java.io.NotSerializableException: org.apache.spark.sql.SQLConf
[ https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277460#comment-14277460 ] Alex Baretta commented on SPARK-5235: - [~rxin] [~sowen] My bad! Indeed the SQLContext gets captured in a closure in my own code, passed to a flatMap. I did not notice this previously because the SQLContext was Serializable. > java.io.NotSerializableException: org.apache.spark.sql.SQLConf > -- > > Key: SPARK-5235 > URL: https://issues.apache.org/jira/browse/SPARK-5235 > Project: Spark > Issue Type: Bug >Reporter: Alex Baretta > > The SQLConf field in SQLContext is neither Serializable nor transient. Here's > the stack trace I get when running SQL queries against a Parquet file. > {code} > Exception in thread "Thread-43" org.apache.spark.SparkException: Job aborted > due to stage failure: Task not serializable: > java.io.NotSerializableException: org.apache.spark.sql.SQLConf > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1195) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1184) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1183) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1183) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:843) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:779) > at > org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:763) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1364) > 
at akka.actor.Actor$class.aroundReceive(Actor.scala:465) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1356) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) > at akka.actor.ActorCell.invoke(ActorCell.scala:487) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) > at akka.dispatch.Mailbox.run(Mailbox.scala:220) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) > at > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > {code}
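The mistake Alex describes above — a context object captured by a closure passed to flatMap — is easy to reproduce in miniature: serializing the task pulls in everything the task references, and fails if any of it is not serializable. The usual fix is to extract the plain value the closure needs before defining it. A toy Python sketch (pickle stands in for Spark's closure serializer; FakeSQLContext, FakeSQLConf, and Task are invented names, not Spark APIs):

```python
import pickle

class FakeSQLConf:
    """Stand-in for org.apache.spark.sql.SQLConf: refuses to serialize."""
    def __reduce__(self):
        raise TypeError("FakeSQLConf is not serializable")

class FakeSQLContext:
    """Stand-in for SQLContext: holds a non-serializable conf field."""
    def __init__(self):
        self.conf = FakeSQLConf()
        self.shuffle_partitions = 200

class Task:
    """Stand-in for a closure that is serialized and shipped to executors."""
    def __init__(self, captured):
        self.captured = captured
    def __call__(self, x):
        return x % self.captured if isinstance(self.captured, int) else x

ctx = FakeSQLContext()

# Bad: the "closure" captures the whole context, so serialization fails,
# analogous to the NotSerializableException in this issue.
try:
    pickle.dumps(Task(ctx))
except TypeError as e:
    print("serialization failed:", e)

# Good: capture only the primitive value the closure actually needs.
task = pickle.loads(pickle.dumps(Task(ctx.shuffle_partitions)))
print(task(1001))  # 1001 % 200 = 1
```

The same principle applies in Spark code: read whatever configuration you need from the SQLContext on the driver, and let the closure capture only those extracted values.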
[jira] [Resolved] (SPARK-2909) Indexing for SparseVector in pyspark
[ https://issues.apache.org/jira/browse/SPARK-2909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-2909. -- Resolution: Fixed Fix Version/s: 1.3.0 Issue resolved by pull request 4025 [https://github.com/apache/spark/pull/4025] > Indexing for SparseVector in pyspark > > > Key: SPARK-2909 > URL: https://issues.apache.org/jira/browse/SPARK-2909 > Project: Spark > Issue Type: Improvement > Components: MLlib, PySpark >Reporter: Joseph K. Bradley >Priority: Minor > Fix For: 1.3.0 > > > SparseVector in pyspark does not currently support indexing, except by > examining the internal representation. Though indexing is a pricey operation, > it would be useful for, e.g., iterating through a dataset (RDD[LabeledPoint]) > and operating on a single feature.
[jira] [Commented] (SPARK-1405) parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib
[ https://issues.apache.org/jira/browse/SPARK-1405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277449#comment-14277449 ] Apache Spark commented on SPARK-1405: - User 'jkbradley' has created a pull request for this issue: https://github.com/apache/spark/pull/4047 > parallel Latent Dirichlet Allocation (LDA) atop of spark in MLlib > - > > Key: SPARK-1405 > URL: https://issues.apache.org/jira/browse/SPARK-1405 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Xusen Yin >Assignee: Guoqiang Li >Priority: Critical > Labels: features > Attachments: performance_comparison.png > > Original Estimate: 336h > Remaining Estimate: 336h > > Latent Dirichlet Allocation (a.k.a. LDA) is a topic model which extracts > topics from a text corpus. Unlike current machine learning algorithms > in MLlib, which use optimization algorithms such as gradient descent, > LDA uses expectation algorithms such as Gibbs sampling. > In this PR, I prepare an LDA implementation based on Gibbs sampling, with a > wholeTextFiles API (solved yet), a word segmentation (import from Lucene), > and a Gibbs sampling core.
[jira] [Commented] (SPARK-5235) java.io.NotSerializableException: org.apache.spark.sql.SQLConf
[ https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277414#comment-14277414 ] Alex Baretta commented on SPARK-5235: - [~rxin] I'm sorry to say it's not that easy, especially since I can't really run any of my code off of master: I need to apply a few outstanding PRs pertaining to Parquet in addition to this one to get any of my code to run at all. I'll try to reconstruct the configuration that was failing, but it will take a while. > java.io.NotSerializableException: org.apache.spark.sql.SQLConf > -- > > Key: SPARK-5235 > URL: https://issues.apache.org/jira/browse/SPARK-5235 > Project: Spark > Issue Type: Bug >Reporter: Alex Baretta > > The SQLConf field in SQLContext is neither Serializable nor transient. Here's > the stack trace I get when running SQL queries against a Parquet file. > {code} > Exception in thread "Thread-43" org.apache.spark.SparkException: Job aborted > due to stage failure: Task not serializable: > java.io.NotSerializableException: org.apache.spark.sql.SQLConf > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1195) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1184) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1183) > at > scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) > at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) > at > org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1183) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:843) > at > org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:779) > at > 
org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:763) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1364) > at akka.actor.Actor$class.aroundReceive(Actor.scala:465) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1356) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) > at akka.actor.ActorCell.invoke(ActorCell.scala:487) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) > at akka.dispatch.Mailbox.run(Mailbox.scala:220) > at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) > at > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > {code}
[jira] [Commented] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)
[ https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277334#comment-14277334 ] Shivaram Venkataraman commented on SPARK-3821: -- Regarding the pre-built distributions, AFAIK we don't support full Hadoop2 as in YARN. We run CDH4, which has some parts of Hadoop2, but with MapReduce. There is an open PR to add support for Hadoop2 at https://github.com/mesos/spark-ec2/pull/77 and you can see that it gets the right [prebuilt Spark|https://github.com/mesos/spark-ec2/pull/77/files#diff-1d040c3294246f2b59643d63868fc2adR97] in that case. > Develop an automated way of creating Spark images (AMI, Docker, and others) > --- > > Key: SPARK-3821 > URL: https://issues.apache.org/jira/browse/SPARK-3821 > Project: Spark > Issue Type: Improvement > Components: Build, EC2 >Reporter: Nicholas Chammas >Assignee: Nicholas Chammas > Attachments: packer-proposal.html > > > Right now the creation of Spark AMIs or Docker containers is done manually. > With tools like [Packer|http://www.packer.io/], we should be able to automate > this work, and do so in such a way that multiple types of machine images can > be created from a single template.
[jira] [Commented] (SPARK-5211) Restore HiveMetastoreTypes.toDataType
[ https://issues.apache.org/jira/browse/SPARK-5211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277290#comment-14277290 ] Reynold Xin commented on SPARK-5211: BTW who are the developers using it? > Restore HiveMetastoreTypes.toDataType > - > > Key: SPARK-5211 > URL: https://issues.apache.org/jira/browse/SPARK-5211 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Critical > > It was a public API. Since developers are using it, we need to get it back.
[jira] [Commented] (SPARK-5211) Restore HiveMetastoreTypes.toDataType
[ https://issues.apache.org/jira/browse/SPARK-5211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277292#comment-14277292 ] Reynold Xin commented on SPARK-5211: I'm under the impression that everything in the Hive package that is not HiveContext should not have been a public API. > Restore HiveMetastoreTypes.toDataType > - > > Key: SPARK-5211 > URL: https://issues.apache.org/jira/browse/SPARK-5211 > Project: Spark > Issue Type: Bug > Components: SQL >Reporter: Yin Huai >Priority: Critical > > It was a public API. Since developers are using it, we need to get it back.
[jira] [Updated] (SPARK-5245) Move Decimal from types.decimal to types package
[ https://issues.apache.org/jira/browse/SPARK-5245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5245: --- Summary: Move Decimal from types.decimal to types package (was: Move Decimal and Date into o.a.s.sql.types) > Move Decimal from types.decimal to types package > > > Key: SPARK-5245 > URL: https://issues.apache.org/jira/browse/SPARK-5245 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin > Fix For: 1.3.0 > > > We probably don't want to (and don't need to) create a new package for each type. > We can just put them in org.apache.spark.sql.types.
[jira] [Resolved] (SPARK-5245) Move Decimal from types.decimal to types package
[ https://issues.apache.org/jira/browse/SPARK-5245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-5245. Resolution: Fixed Fix Version/s: 1.3.0 Fixed in https://github.com/apache/spark/pull/4041 > Move Decimal from types.decimal to types package > > > Key: SPARK-5245 > URL: https://issues.apache.org/jira/browse/SPARK-5245 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Adrian Wang > Fix For: 1.3.0 > > > We probably don't want to (and don't need to) create a new package for each type. > We can just put them in org.apache.spark.sql.types.
[jira] [Updated] (SPARK-5245) Move Decimal from types.decimal to types package
[ https://issues.apache.org/jira/browse/SPARK-5245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5245: --- Assignee: Adrian Wang > Move Decimal from types.decimal to types package > > > Key: SPARK-5245 > URL: https://issues.apache.org/jira/browse/SPARK-5245 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Reynold Xin >Assignee: Adrian Wang > Fix For: 1.3.0 > > > We probably don't want to (and don't need to) create a new package for each type. > We can just put them in org.apache.spark.sql.types.
[jira] [Resolved] (SPARK-5248) moving Decimal from types.decimal to types package
[ https://issues.apache.org/jira/browse/SPARK-5248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-5248. Resolution: Fixed Fix Version/s: 1.3.0 Assignee: Adrian Wang Fixed in https://github.com/apache/spark/pull/4041 > moving Decimal from types.decimal to types package > -- > > Key: SPARK-5248 > URL: https://issues.apache.org/jira/browse/SPARK-5248 > Project: Spark > Issue Type: Improvement >Reporter: Adrian Wang >Assignee: Adrian Wang >Priority: Minor > Fix For: 1.3.0 > >
[jira] [Commented] (SPARK-5235) java.io.NotSerializableException: org.apache.spark.sql.SQLConf
[ https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277270#comment-14277270 ] Reynold Xin commented on SPARK-5235: I can merge your PR soon, but [~alexbaretta] can you also run with -Dsun.io.serialization.extendeddebuginfo=true as [~srowen] suggested and find the cause of that? That should take only a minute. > java.io.NotSerializableException: org.apache.spark.sql.SQLConf > -- > > Key: SPARK-5235 > URL: https://issues.apache.org/jira/browse/SPARK-5235 > Project: Spark > Issue Type: Bug >Reporter: Alex Baretta
[jira] [Updated] (SPARK-5235) java.io.NotSerializableException: org.apache.spark.sql.SQLConf
[ https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5235: --- Description: The SQLConf field in SQLContext is neither Serializable nor transient. Here's the stack trace I get when running SQL queries against a Parquet file. {code} Exception in thread "Thread-43" org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.apache.spark.sql.SQLConf at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1195) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1184) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1183) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1183) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:843) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:779) at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:763) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1364) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1356) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at akka.dispatch.Mailbox.run(Mailbox.scala:220) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at 
scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) {code} was: The SQLConf field in SQLContext is neither Serializable nor transient. Here's the stack trace I get when running SQL queries against a Parquet file. Exception in thread "Thread-43" org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.apache.spark.sql.SQLConf at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1195) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1184) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1183) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1183) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:843) at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:779) at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:763) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1364) at akka.actor.Actor$class.aroundReceive(Actor.scala:465) at org.apache.spark.scheduler.DAGSchedulerEventProcessActor.aroundReceive(DAGScheduler.scala:1356) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) at akka.actor.ActorCell.invoke(ActorCell.scala:487) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238) at 
akka.dispatch.Mailbox.run(Mailbox.scala:220) at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393) at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > java.io.NotSerializableException: org.apache.spark.sql.SQLConf >
[jira] [Commented] (SPARK-5235) java.io.NotSerializableException: org.apache.spark.sql.SQLConf
[ https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277258#comment-14277258 ] Alex Baretta commented on SPARK-5235: - Yes, there is a need for a hotfix. [~rxin] committed a change less than 24 hours ago to SQLContext, splitting out SQLConf as a field in SQLContext instead of a mixin. The new field should have been declared either serializable or transient. https://github.com/apache/spark/commit/14e3f114efb906937b2d7b7ac04484b2814a3b48 > java.io.NotSerializableException: org.apache.spark.sql.SQLConf > -- > > Key: SPARK-5235 > URL: https://issues.apache.org/jira/browse/SPARK-5235 > Project: Spark > Issue Type: Bug >Reporter: Alex Baretta
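The fix discussed in this thread, marking the new SQLConf field either serializable or transient, is a one-line @transient annotation in Scala. As a language-neutral sketch of the same idea, here is the analogous pattern in Python, where excluding a non-picklable field in __getstate__ and rebuilding it in __setstate__ plays the role of @transient; every name below is illustrative, not Spark's actual code:

```python
import pickle
import threading

class Conf:
    """Stand-in for SQLConf: holds state that cannot travel over the wire."""
    def __init__(self):
        self._lock = threading.Lock()  # a lock is not picklable, forcing the issue
        self.settings = {"spark.sql.shuffle.partitions": "200"}

class Context:
    """Stand-in for SQLContext: its conf field is skipped during serialization."""
    def __init__(self):
        self.conf = Conf()
        self.app_name = "demo"

    def __getstate__(self):
        state = self.__dict__.copy()
        state.pop("conf", None)  # analogous to marking the field @transient
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        self.conf = Conf()       # rebuilt on deserialization, like a lazy val

ctx = Context()
# Serializing the context now succeeds because the conf field is skipped;
# without __getstate__, pickling would raise the analogue of
# java.io.NotSerializableException.
restored = pickle.loads(pickle.dumps(ctx))
```

As Sean Owen notes in the thread, this is a band-aid: the cleaner fix is to avoid capturing the context in serialized closures at all.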
[jira] [Commented] (SPARK-5235) java.io.NotSerializableException: org.apache.spark.sql.SQLConf
[ https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277251#comment-14277251 ] Sean Owen commented on SPARK-5235: -- @Alex Baretta what version worked? If you're saying a recent change completely broke this, a quick hotfix is appropriate. However, the Jenkins tests are passing (modulo the usual flakes): https://amplab.cs.berkeley.edu/jenkins/view/Spark/ So I wonder if there's a hurry to put this band-aid on. That extended debug info might point straight to an easy real fix; more info on what triggers this would be good to compare with the unit tests that are working. If it turns out to be a mess, sure, the band-aid might be better than nothing in the short term. > java.io.NotSerializableException: org.apache.spark.sql.SQLConf > -- > > Key: SPARK-5235 > URL: https://issues.apache.org/jira/browse/SPARK-5235 > Project: Spark > Issue Type: Bug >Reporter: Alex Baretta
[jira] [Updated] (SPARK-5252) Streaming StatefulNetworkWordCount example hangs
[ https://issues.apache.org/jira/browse/SPARK-5252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lutz Buech updated SPARK-5252: -- Attachment: debug.txt log at DEBUG level > Streaming StatefulNetworkWordCount example hangs > > > Key: SPARK-5252 > URL: https://issues.apache.org/jira/browse/SPARK-5252 > Project: Spark > Issue Type: Bug > Components: Streaming >Affects Versions: 1.2.0 > Environment: Ubuntu Linux >Reporter: Lutz Buech > Attachments: debug.txt > > > Running the stateful network word count example in Python (on one local node): > https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/stateful_network_wordcount.py > At the beginning, when no data is streamed, empty status outputs are > generated, only decorated by the current Time, e.g.: > --- > Time: 2015-01-14 17:58:20 > --- > --- > Time: 2015-01-14 17:58:21 > --- > As soon as I stream some data via netcat, no new status updates will show. > Instead, one line saying > [Stage <n>:> (2 + 0) / 3] > where <n> is some integer number, e.g. 132. There is no further output > on stdout.
[jira] [Commented] (SPARK-5235) java.io.NotSerializableException: org.apache.spark.sql.SQLConf
[ https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277240#comment-14277240 ] Alex Baretta commented on SPARK-5235: - [~sowen] I agree with you that contexts have no business getting serialized and shipped around the cluster. That being said, this issue is a regression, as this simply used to work. Since this is a regression, it is very much appropriate to fix the issue by restoring the previous behavior, and then take time to think of a better design. > java.io.NotSerializableException: org.apache.spark.sql.SQLConf > -- > > Key: SPARK-5235 > URL: https://issues.apache.org/jira/browse/SPARK-5235 > Project: Spark > Issue Type: Bug >Reporter: Alex Baretta
[jira] [Created] (SPARK-5252) Streaming StatefulNetworkWordCount example hangs
Lutz Buech created SPARK-5252: - Summary: Streaming StatefulNetworkWordCount example hangs Key: SPARK-5252 URL: https://issues.apache.org/jira/browse/SPARK-5252 Project: Spark Issue Type: Bug Components: Streaming Affects Versions: 1.2.0 Environment: Ubuntu Linux Reporter: Lutz Buech Running the stateful network word count example in Python (on one local node): https://github.com/apache/spark/blob/master/examples/src/main/python/streaming/stateful_network_wordcount.py At the beginning, when no data is streamed, empty status outputs are generated, only decorated by the current Time, e.g.: --- Time: 2015-01-14 17:58:20 --- --- Time: 2015-01-14 17:58:21 --- As soon as I stream some data via netcat, no new status updates will show. Instead, one line saying [Stage <n>:> (2 + 0) / 3] where <n> is some integer number, e.g. 132. There is no further output on stdout.
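For context on what output the reporter expected: the state maintenance in the linked example is driven by a pure update function passed to updateStateByKey, which folds each batch's counts for a word into its running total. A minimal sketch of such a function (paraphrasing the pattern in stateful_network_wordcount.py, not a verbatim copy):

```python
def update_count(new_values, running_count):
    # new_values: this batch's counts for one word; running_count is the
    # accumulated state, or None the first time the key is seen.
    return sum(new_values) + (running_count or 0)

# Simulating three micro-batches for a single word:
state = None
for batch in ([1, 1], [1], []):
    state = update_count(batch, state)
# state is now 3
```

Since this function is deterministic, a hang with a stuck "[Stage ...]" progress line points at the job scheduling (here, too few cores for the receiver plus the processing tasks is a common cause) rather than the word-count logic itself.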
[jira] [Commented] (SPARK-5235) java.io.NotSerializableException: org.apache.spark.sql.SQLConf
[ https://issues.apache.org/jira/browse/SPARK-5235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14277217#comment-14277217 ] Sean Owen commented on SPARK-5235: -- [~alexbaretta] It certainly may not be your code of course. I mean "people" including the Spark code. But surely the problem is solved exactly by not trying to serialize {{SQLContext}}, no? Despite its declaration, as you've demonstrated, it does not serialize, and was not designed to be used after serialization given the {{@transient}} field. You've suggested a reasonable band-aid on a band-aid, but I would either like to fix the cause or understand why it's actually supposed to act this way. Other Contexts in Spark are not supposed to be serialized. Where I've seen this pattern before in the unit tests, it was certainly a hack for convenience that didn't matter much because it was just a test. Can you run with {{-Dsun.io.serialization.extendeddebuginfo=true}}? This will show exactly what had the reference to {{SQLContext}}. > java.io.NotSerializableException: org.apache.spark.sql.SQLConf > -- > > Key: SPARK-5235 > URL: https://issues.apache.org/jira/browse/SPARK-5235 > Project: Spark > Issue Type: Bug >Reporter: Alex Baretta