[jira] [Updated] (SPARK-6884) random forest predict probabilities functionality (like in sklearn)
[ https://issues.apache.org/jira/browse/SPARK-6884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6884: Affects Version/s: 1.3.0 (was: 1.4.0)

random forest predict probabilities functionality (like in sklearn)
Key: SPARK-6884
URL: https://issues.apache.org/jira/browse/SPARK-6884
Project: Spark
Issue Type: New Feature
Components: MLlib
Affects Versions: 1.3.0
Environment: cross-platform
Reporter: Max Kaznady
Labels: prediction, probability, randomforest, tree
Original Estimate: 72h
Remaining Estimate: 72h

Currently, there is no way to extract the class probabilities from the RandomForest classifier. I implemented a probability predictor by counting votes from the individual trees: adding up the votes for class 1 and dividing by the total number of votes. I opened this ticket to keep track of changes. Will update once I push my code to master.
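A minimal sketch of the vote-counting approach described above, assuming a trained MLlib RandomForestModel on a binary problem; predictProbability is a hypothetical helper for illustration, not an existing API:
{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.tree.model.RandomForestModel

// Estimated P(label = 1) = fraction of trees that vote for class 1.
def predictProbability(model: RandomForestModel, features: Vector): Double = {
  val votes = model.trees.map(_.predict(features)) // each tree predicts 0.0 or 1.0
  votes.count(_ == 1.0).toDouble / votes.length
}
{code}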
[jira] [Commented] (SPARK-6703) Provide a way to discover existing SparkContext's
[ https://issues.apache.org/jira/browse/SPARK-6703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492888#comment-14492888 ] Patrick Wendell commented on SPARK-6703: Hey [~ilganeli] - sure thing. I've pinged a couple of people to provide feedback on the design. Overall I think it won't be a complicated feature to implement. I've added you as the assignee. One note: if it gets very close to the 1.4 code freeze, I may need to help take it across the finish line. But for now, why don't you go ahead; I think we'll be fine.

Provide a way to discover existing SparkContext's
Key: SPARK-6703
URL: https://issues.apache.org/jira/browse/SPARK-6703
Project: Spark
Issue Type: New Feature
Components: Spark Core
Affects Versions: 1.3.0
Reporter: Patrick Wendell

Right now it is difficult to write a Spark application in a way that can be run independently and also be composed with other Spark applications in an environment such as the JobServer, notebook servers, etc., where there is a shared SparkContext. It would be nice to provide a rendezvous point so that applications can learn whether a SparkContext already exists before creating one. The simplest, most surgical way I see to do this is to have an optional static SparkContext singleton that can be retrieved as follows:
{code}
val sc = SparkContext.getOrCreate(conf = new SparkConf())
{code}
You could also have a setter where some outer framework/server can set it for use by multiple downstream applications. A more advanced version of this would have some named registry, but since we only support a single SparkContext in one JVM at this point anyway, this seems sufficient and much simpler. Another advanced option would be to allow plugging in some other notion of configuration you'd pass when retrieving an existing context.
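For illustration, a minimal sketch of what such a rendezvous point could look like. This is an assumed design, not Spark's implementation (getOrCreate did not exist when this was filed), and SparkContextRegistry is a hypothetical name:
{code}
import java.util.concurrent.atomic.AtomicReference
import org.apache.spark.{SparkConf, SparkContext}

object SparkContextRegistry {
  private val active = new AtomicReference[Option[SparkContext]](None)

  // An outer framework/server (JobServer, notebook server) publishes its context.
  def set(sc: SparkContext): Unit = active.set(Some(sc))

  // Downstream applications reuse the shared context, or create one if absent.
  def getOrCreate(conf: SparkConf): SparkContext = synchronized {
    active.get().getOrElse {
      val sc = new SparkContext(conf)
      active.set(Some(sc))
      sc
    }
  }
}
{code}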
[jira] [Updated] (SPARK-6703) Provide a way to discover existing SparkContext's
[ https://issues.apache.org/jira/browse/SPARK-6703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6703: Assignee: Ilya Ganelin

Provide a way to discover existing SparkContext's
Key: SPARK-6703
URL: https://issues.apache.org/jira/browse/SPARK-6703
Project: Spark
Issue Type: New Feature
Components: Spark Core
Affects Versions: 1.3.0
Reporter: Patrick Wendell
Assignee: Ilya Ganelin

Right now it is difficult to write a Spark application in a way that can be run independently and also be composed with other Spark applications in an environment such as the JobServer, notebook servers, etc., where there is a shared SparkContext. It would be nice to provide a rendezvous point so that applications can learn whether a SparkContext already exists before creating one. The simplest, most surgical way I see to do this is to have an optional static SparkContext singleton that can be retrieved as follows:
{code}
val sc = SparkContext.getOrCreate(conf = new SparkConf())
{code}
You could also have a setter where some outer framework/server can set it for use by multiple downstream applications. A more advanced version of this would have some named registry, but since we only support a single SparkContext in one JVM at this point anyway, this seems sufficient and much simpler. Another advanced option would be to allow plugging in some other notion of configuration you'd pass when retrieving an existing context.
[jira] [Commented] (SPARK-3727) Trees and ensembles: More prediction functionality
[ https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492928#comment-14492928 ] Joseph K. Bradley commented on SPARK-3727: [~maxkaznady] [~mqk] I split this into some subtasks, and we can add others later (for boosted trees, for regression, etc.). It would be great if you can follow the spark.ml tree API JIRA (linked above) and take a look at it once it's posted. That (and the ProbabilisticClassifier class) will give you an idea of what's entailed in adding these under the Pipelines API. Do you have preferences on how to split up these tasks? If you can figure that out, I'll be happy to assign them. Thanks!

Trees and ensembles: More prediction functionality
Key: SPARK-3727
URL: https://issues.apache.org/jira/browse/SPARK-3727
Project: Spark
Issue Type: Improvement
Components: MLlib
Reporter: Joseph K. Bradley

DecisionTree and RandomForest currently predict the most likely label for classification and the mean for regression. Other info about predictions would be useful.
For classification: estimated probability of each possible label
For regression: variance of estimate
RandomForest could also create aggregate predictions in multiple ways:
* Predict mean or median value for regression.
* Compute variance of estimates (across all trees) for both classification and regression.
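As a concrete illustration of the aggregate predictions listed above, a hedged sketch that returns the across-tree mean and variance for a forest used in regression; predictWithVariance is a hypothetical helper, not an MLlib API:
{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.tree.model.RandomForestModel

def predictWithVariance(model: RandomForestModel, features: Vector): (Double, Double) = {
  val preds = model.trees.map(_.predict(features)) // one prediction per tree
  val mean = preds.sum / preds.length
  val variance = preds.map(p => (p - mean) * (p - mean)).sum / preds.length
  (mean, variance)
}
{code}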
[jira] [Commented] (SPARK-3727) DecisionTree, RandomForest: More prediction functionality
[ https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492839#comment-14492839 ] Max Kaznady commented on SPARK-3727: I implemented the same thing but for PySpark. Since there is no existing function, should I just call the function predict_proba like in sklearn? Also, does it make sense to open a new ticket for this, since it's so specific? Thanks, Max

DecisionTree, RandomForest: More prediction functionality
Key: SPARK-3727
URL: https://issues.apache.org/jira/browse/SPARK-3727
Project: Spark
Issue Type: Improvement
Components: MLlib
Reporter: Joseph K. Bradley

DecisionTree and RandomForest currently predict the most likely label for classification and the mean for regression. Other info about predictions would be useful.
For classification: estimated probability of each possible label
For regression: variance of estimate
RandomForest could also create aggregate predictions in multiple ways:
* Predict mean or median value for regression.
* Compute variance of estimates (across all trees) for both classification and regression.
[jira] [Commented] (SPARK-6113) Stabilize DecisionTree and ensembles APIs
[ https://issues.apache.org/jira/browse/SPARK-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492959#comment-14492959 ] Max Kaznady commented on SPARK-6113: [~josephkb] Is it possible to host the API design doc on something other than Google Docs? My corporate policy (like most others') forbids access to Google Docs, so I cannot download the file.

Stabilize DecisionTree and ensembles APIs
Key: SPARK-6113
URL: https://issues.apache.org/jira/browse/SPARK-6113
Project: Spark
Issue Type: Sub-task
Components: MLlib, PySpark
Affects Versions: 1.4.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Critical

*Issue*: The APIs for DecisionTree and ensembles (RandomForests and GradientBoostedTrees) have been experimental for a long time. The API has become very convoluted because trees and ensembles have many, many variants, some of which we have added incrementally without a long-term design.
*Proposal*: This JIRA is for discussing changes required to finalize the APIs. After we discuss, I will make a PR to update the APIs and make them non-Experimental. This will require making many breaking changes; see the design doc for details.
[Design doc | https://docs.google.com/document/d/1rJ_DZinyDG3PkYkAKSsQlY0QgCeefn4hUv7GsPkzBP4]: This outlines current issues and the proposed API.
[jira] [Resolved] (SPARK-6662) Allow variable substitution in spark.yarn.historyServer.address
[ https://issues.apache.org/jira/browse/SPARK-6662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves resolved SPARK-6662. Resolution: Fixed Fix Version/s: 1.4.0 Assignee: Cheolsoo Park

Allow variable substitution in spark.yarn.historyServer.address
Key: SPARK-6662
URL: https://issues.apache.org/jira/browse/SPARK-6662
Project: Spark
Issue Type: Wish
Components: YARN
Affects Versions: 1.3.0
Reporter: Cheolsoo Park
Assignee: Cheolsoo Park
Priority: Minor
Labels: yarn
Fix For: 1.4.0

In Spark on YARN, an explicit hostname and port number need to be set for spark.yarn.historyServer.address in SparkConf to make the HISTORY link work. If the history server address is known and static, this is usually not a problem. But in the cloud, that is usually not true. Particularly in EMR, the history server always runs on the same node as the RM, so I could simply set it to {{$\{yarn.resourcemanager.hostname\}:18080}} if variable substitution were allowed. In fact, Hadoop configuration already implements variable substitution, so if this property were read via YarnConf, this would be easily achievable.
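For context, a small sketch of the Hadoop-style substitution the reporter relies on: Configuration.get() expands ${...} references against other properties at read time. The hostname value below is illustrative:
{code}
import org.apache.hadoop.conf.Configuration

val conf = new Configuration()
conf.set("yarn.resourcemanager.hostname", "rm-host.example.com")
conf.set("spark.yarn.historyServer.address", "${yarn.resourcemanager.hostname}:18080")

// Substitution happens on read: prints "rm-host.example.com:18080".
println(conf.get("spark.yarn.historyServer.address"))
{code}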
[jira] [Commented] (SPARK-3727) DecisionTree, RandomForest: More prediction functionality
[ https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492887#comment-14492887 ] Joseph K. Bradley commented on SPARK-3727: Thanks for your initial work on this ticket! The main issue with this extension is API stability: modifying the existing classes would also require us to update model save/load versioning, default constructors (to ensure binary compatibility), etc. I just linked a JIRA which discusses updating the tree and ensemble APIs under the spark.ml package, which will permit us to redesign the APIs (and make it easier to specify class probabilities or stats for regression). What I'd like to do is get the tree API updates in (this week), and then we could work together to make the class probabilities available under the new API. Does that sound good? Also, if you're new to contributing to Spark, please make sure to check out: [https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark] Thanks!

DecisionTree, RandomForest: More prediction functionality
Key: SPARK-3727
URL: https://issues.apache.org/jira/browse/SPARK-3727
Project: Spark
Issue Type: Improvement
Components: MLlib
Reporter: Joseph K. Bradley

DecisionTree and RandomForest currently predict the most likely label for classification and the mean for regression. Other info about predictions would be useful.
For classification: estimated probability of each possible label
For regression: variance of estimate
RandomForest could also create aggregate predictions in multiple ways:
* Predict mean or median value for regression.
* Compute variance of estimates (across all trees) for both classification and regression.
[jira] [Updated] (SPARK-6703) Provide a way to discover existing SparkContext's
[ https://issues.apache.org/jira/browse/SPARK-6703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Patrick Wendell updated SPARK-6703: Priority: Critical (was: Major)

Provide a way to discover existing SparkContext's
Key: SPARK-6703
URL: https://issues.apache.org/jira/browse/SPARK-6703
Project: Spark
Issue Type: New Feature
Components: Spark Core
Affects Versions: 1.3.0
Reporter: Patrick Wendell
Assignee: Ilya Ganelin
Priority: Critical

Right now it is difficult to write a Spark application in a way that can be run independently and also be composed with other Spark applications in an environment such as the JobServer, notebook servers, etc., where there is a shared SparkContext. It would be nice to provide a rendezvous point so that applications can learn whether a SparkContext already exists before creating one. The simplest, most surgical way I see to do this is to have an optional static SparkContext singleton that can be retrieved as follows:
{code}
val sc = SparkContext.getOrCreate(conf = new SparkConf())
{code}
You could also have a setter where some outer framework/server can set it for use by multiple downstream applications. A more advanced version of this would have some named registry, but since we only support a single SparkContext in one JVM at this point anyway, this seems sufficient and much simpler. Another advanced option would be to allow plugging in some other notion of configuration you'd pass when retrieving an existing context.
[jira] [Updated] (SPARK-6884) Random forest: predict class probabilities
[ https://issues.apache.org/jira/browse/SPARK-6884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6884: Assignee: Max Kaznady

Random forest: predict class probabilities
Key: SPARK-6884
URL: https://issues.apache.org/jira/browse/SPARK-6884
Project: Spark
Issue Type: Sub-task
Components: MLlib
Affects Versions: 1.3.0
Environment: cross-platform
Reporter: Max Kaznady
Assignee: Max Kaznady
Labels: prediction, probability, randomforest, tree
Original Estimate: 72h
Remaining Estimate: 72h

Currently, there is no way to extract the class probabilities from the RandomForest classifier. I implemented a probability predictor by counting votes from the individual trees: adding up the votes for class 1 and dividing by the total number of votes. I opened this ticket to keep track of changes. Will update once I push my code to master.
[jira] [Updated] (SPARK-6884) Random forest: predict class probabilities
[ https://issues.apache.org/jira/browse/SPARK-6884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-6884: Assignee: (was: Max Kaznady)

Random forest: predict class probabilities
Key: SPARK-6884
URL: https://issues.apache.org/jira/browse/SPARK-6884
Project: Spark
Issue Type: Sub-task
Components: MLlib
Affects Versions: 1.3.0
Environment: cross-platform
Reporter: Max Kaznady
Labels: prediction, probability, randomforest, tree
Original Estimate: 72h
Remaining Estimate: 72h

Currently, there is no way to extract the class probabilities from the RandomForest classifier. I implemented a probability predictor by counting votes from the individual trees: adding up the votes for class 1 and dividing by the total number of votes. I opened this ticket to keep track of changes. Will update once I push my code to master.
[jira] [Commented] (SPARK-6703) Provide a way to discover existing SparkContext's
[ https://issues.apache.org/jira/browse/SPARK-6703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492898#comment-14492898 ] Patrick Wendell commented on SPARK-6703: /cc [~velvia]

Provide a way to discover existing SparkContext's
Key: SPARK-6703
URL: https://issues.apache.org/jira/browse/SPARK-6703
Project: Spark
Issue Type: New Feature
Components: Spark Core
Affects Versions: 1.3.0
Reporter: Patrick Wendell
Assignee: Ilya Ganelin
Priority: Critical

Right now it is difficult to write a Spark application in a way that can be run independently and also be composed with other Spark applications in an environment such as the JobServer, notebook servers, etc., where there is a shared SparkContext. It would be nice to provide a rendezvous point so that applications can learn whether a SparkContext already exists before creating one. The simplest, most surgical way I see to do this is to have an optional static SparkContext singleton that can be retrieved as follows:
{code}
val sc = SparkContext.getOrCreate(conf = new SparkConf())
{code}
You could also have a setter where some outer framework/server can set it for use by multiple downstream applications. A more advanced version of this would have some named registry, but since we only support a single SparkContext in one JVM at this point anyway, this seems sufficient and much simpler. Another advanced option would be to allow plugging in some other notion of configuration you'd pass when retrieving an existing context.
[jira] [Commented] (SPARK-6113) Stabilize DecisionTree and ensembles APIs
[ https://issues.apache.org/jira/browse/SPARK-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492989#comment-14492989 ] Max Kaznady commented on SPARK-6113: Other places need serious improvement as well; LogisticRegressionWithLBFGS is another example. All LogisticRegression classifiers need a logistic function. I found this ticket, but I'm not sure why it's closed: https://issues.apache.org/jira/browse/SPARK-3585 I think LogisticRegression and RandomForest should use the same name for the predict_proba function. I would just call it that, since then at least PySpark is consistent with the sklearn library. Internally, the logistic function should be implemented as a single function, not hard-coded in multiple places the way it is now. That's another ticket. Aside: I haven't looked at LogisticRegressionWithSGD, but it fails horribly sometimes: the algorithm either diverges or gets stuck in a local minimum.

Stabilize DecisionTree and ensembles APIs
Key: SPARK-6113
URL: https://issues.apache.org/jira/browse/SPARK-6113
Project: Spark
Issue Type: Sub-task
Components: MLlib, PySpark
Affects Versions: 1.4.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Critical

*Issue*: The APIs for DecisionTree and ensembles (RandomForests and GradientBoostedTrees) have been experimental for a long time. The API has become very convoluted because trees and ensembles have many, many variants, some of which we have added incrementally without a long-term design.
*Proposal*: This JIRA is for discussing changes required to finalize the APIs. After we discuss, I will make a PR to update the APIs and make them non-Experimental. This will require making many breaking changes; see the design doc for details.
[Design doc | https://docs.google.com/document/d/1rJ_DZinyDG3PkYkAKSsQlY0QgCeefn4hUv7GsPkzBP4]: This outlines current issues and the proposed API.
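A minimal sketch of the "single shared logistic function" suggestion above, rather than hard-coding the formula in each classifier; the object name is illustrative, not an MLlib API:
{code}
object LossFunctionsSketch {
  /** Numerically stable logistic (sigmoid) function: 1 / (1 + e^(-x)). */
  def sigmoid(x: Double): Double =
    if (x >= 0) 1.0 / (1.0 + math.exp(-x))
    else { val ex = math.exp(x); ex / (1.0 + ex) }
}

// Every logistic-regression variant would then compute
// P(y = 1 | x) = sigmoid(weights dot x + intercept) through this one function.
{code}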
[jira] [Commented] (SPARK-3727) DecisionTree, RandomForest: More prediction functionality
[ https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492906#comment-14492906 ] Max Kaznady commented on SPARK-3727: Yes, probabilities have to be added to other models too, like LogisticRegression. Right now they are hard-coded in two places but not exposed in PySpark. I think it makes sense to split into PySpark, then classification, then probabilities, and then to group the different types of algorithms that all output probabilities: Logistic Regression, Random Forest, etc. We can also add probabilities for trees by counting the number of 1s and 0s at the leaves. What do you think?

DecisionTree, RandomForest: More prediction functionality
Key: SPARK-3727
URL: https://issues.apache.org/jira/browse/SPARK-3727
Project: Spark
Issue Type: Improvement
Components: MLlib
Reporter: Joseph K. Bradley

DecisionTree and RandomForest currently predict the most likely label for classification and the mean for regression. Other info about predictions would be useful.
For classification: estimated probability of each possible label
For regression: variance of estimate
RandomForest could also create aggregate predictions in multiple ways:
* Predict mean or median value for regression.
* Compute variance of estimates (across all trees) for both classification and regression.
[jira] [Commented] (SPARK-6884) Random forest: predict class probabilities
[ https://issues.apache.org/jira/browse/SPARK-6884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492904#comment-14492904 ] Joseph K. Bradley commented on SPARK-6884: I'd recommend: under spark.ml, have RandomForestClassifier (currently being added) extend ProbabilisticClassifier.

Random forest: predict class probabilities
Key: SPARK-6884
URL: https://issues.apache.org/jira/browse/SPARK-6884
Project: Spark
Issue Type: Sub-task
Components: MLlib
Affects Versions: 1.3.0
Environment: cross-platform
Reporter: Max Kaznady
Labels: prediction, probability, randomforest, tree
Original Estimate: 72h
Remaining Estimate: 72h

Currently, there is no way to extract the class probabilities from the RandomForest classifier. I implemented a probability predictor by counting votes from the individual trees: adding up the votes for class 1 and dividing by the total number of votes. I opened this ticket to keep track of changes. Will update once I push my code to master.
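A hedged sketch of what extending ProbabilisticClassifier implies for a forest: expose a raw per-class vote-count vector, then normalize it into class probabilities. The class and method names below follow the spark.ml pattern but are illustrative, not the actual API:
{code}
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Each tree is modeled here as a function from features to a predicted class index.
class ForestProbabilityModelSketch(trees: Array[Vector => Double], numClasses: Int) {
  // Raw prediction: per-class vote counts accumulated across the trees.
  def predictRaw(features: Vector): Vector = {
    val votes = new Array[Double](numClasses)
    trees.foreach(tree => votes(tree(features).toInt) += 1.0)
    Vectors.dense(votes)
  }

  // Class probabilities: the vote counts normalized to sum to 1.
  def predictProbability(features: Vector): Vector = {
    val votes = predictRaw(features).toArray
    val total = votes.sum
    Vectors.dense(votes.map(_ / total))
  }
}
{code}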
[jira] [Updated] (SPARK-3727) Trees and ensembles: More prediction functionality
[ https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-3727: Summary: Trees and ensembles: More prediction functionality (was: DecisionTree, RandomForest: More prediction functionality)

Trees and ensembles: More prediction functionality
Key: SPARK-3727
URL: https://issues.apache.org/jira/browse/SPARK-3727
Project: Spark
Issue Type: Improvement
Components: MLlib
Reporter: Joseph K. Bradley

DecisionTree and RandomForest currently predict the most likely label for classification and the mean for regression. Other info about predictions would be useful.
For classification: estimated probability of each possible label
For regression: variance of estimate
RandomForest could also create aggregate predictions in multiple ways:
* Predict mean or median value for regression.
* Compute variance of estimates (across all trees) for both classification and regression.
[jira] [Commented] (SPARK-6703) Provide a way to discover existing SparkContext's
[ https://issues.apache.org/jira/browse/SPARK-6703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492909#comment-14492909 ] Ilya Ganelin commented on SPARK-6703: Patrick - what's the timeline for the 1.4 release? Just want to have a sense for it so I can schedule accordingly. Thank you, Ilya Ganelin

Provide a way to discover existing SparkContext's
Key: SPARK-6703
URL: https://issues.apache.org/jira/browse/SPARK-6703
Project: Spark
Issue Type: New Feature
Components: Spark Core
Affects Versions: 1.3.0
Reporter: Patrick Wendell
Assignee: Ilya Ganelin
Priority: Critical

Right now it is difficult to write a Spark application in a way that can be run independently and also be composed with other Spark applications in an environment such as the JobServer, notebook servers, etc., where there is a shared SparkContext. It would be nice to provide a rendezvous point so that applications can learn whether a SparkContext already exists before creating one. The simplest, most surgical way I see to do this is to have an optional static SparkContext singleton that can be retrieved as follows:
{code}
val sc = SparkContext.getOrCreate(conf = new SparkConf())
{code}
You could also have a setter where some outer framework/server can set it for use by multiple downstream applications. A more advanced version of this would have some named registry, but since we only support a single SparkContext in one JVM at this point anyway, this seems sufficient and much simpler. Another advanced option would be to allow plugging in some other notion of configuration you'd pass when retrieving an existing context.
[jira] [Commented] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java
[ https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492867#comment-14492867 ] Joseph K. Bradley commented on SPARK-6682: Do you mean (a) tests to make sure the examples work, or (b) treating the examples as tests themselves? We should not do (b) since it mixes tests and examples. For (a), we don't have a great solution currently, although I think we should (at some point) add a script for running all of the examples to make sure they run. I don't think we need performance tests for examples since they are meant to be short usage examples, not end solutions or applications.

Deprecate static train and use builder instead for Scala/Java
Key: SPARK-6682
URL: https://issues.apache.org/jira/browse/SPARK-6682
Project: Spark
Issue Type: Improvement
Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

In MLlib, we have for some time been unofficially moving away from the old static train() methods and moving towards builder patterns. This JIRA is to discuss this move and (hopefully) make it official.
Old static train() API:
{code}
val myModel = NaiveBayes.train(myData, ...)
{code}
New builder pattern API:
{code}
val nb = new NaiveBayes().setLambda(0.1)
val myModel = nb.train(myData)
{code}
Pros of the builder pattern:
* Much less code when algorithms have many parameters. Since Java does not support default arguments, we required *many* duplicated static train() methods (for each prefix set of arguments).
* Helps to enforce default parameters. Users should ideally not have to even think about setting parameters if they just want to try an algorithm quickly.
* Matches spark.ml API
Cons of the builder pattern:
* In Python APIs, static train methods are more Pythonic.
Proposal:
* Scala/Java: We should start deprecating the old static train() methods. We must keep them for API stability, but deprecating will help with API consistency, making it clear that everyone should use the builder pattern. As we deprecate them, we should make sure that the builder pattern supports all parameters.
* Python: Keep static train methods.
CC: [~mengxr]
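A hedged sketch of the deprecation mechanics proposed above: the static train() stays for binary compatibility but carries @deprecated and delegates to the builder. The class name, signatures, and elided training body are illustrative, not the actual MLlib source:
{code}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

class NaiveBayesSketch(private var lambda: Double = 1.0) {
  def setLambda(lambda: Double): NaiveBayesSketch = { this.lambda = lambda; this }
  def run(data: RDD[LabeledPoint]): Unit = ??? // training elided in this sketch
}

object NaiveBayesSketch {
  @deprecated("Use new NaiveBayesSketch().setLambda(...).run(data) instead.", "1.4.0")
  def train(data: RDD[LabeledPoint], lambda: Double): Unit =
    new NaiveBayesSketch().setLambda(lambda).run(data)
}
{code}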
[jira] [Commented] (SPARK-6884) random forest predict probabilities functionality (like in sklearn)
[ https://issues.apache.org/jira/browse/SPARK-6884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492892#comment-14492892 ] Joseph K. Bradley commented on SPARK-6884: Is this not a duplicate of [SPARK-3727]? Perhaps the best way to split up the work will be to make a subtask for trees and a separate subtask for ensembles. I'll go ahead and do that.

random forest predict probabilities functionality (like in sklearn)
Key: SPARK-6884
URL: https://issues.apache.org/jira/browse/SPARK-6884
Project: Spark
Issue Type: New Feature
Components: MLlib
Affects Versions: 1.3.0
Environment: cross-platform
Reporter: Max Kaznady
Labels: prediction, probability, randomforest, tree
Original Estimate: 72h
Remaining Estimate: 72h

Currently, there is no way to extract the class probabilities from the RandomForest classifier. I implemented a probability predictor by counting votes from the individual trees: adding up the votes for class 1 and dividing by the total number of votes. I opened this ticket to keep track of changes. Will update once I push my code to master.
[jira] [Created] (SPARK-6885) Decision trees: predict class probabilities
Joseph K. Bradley created SPARK-6885: Summary: Decision trees: predict class probabilities
Key: SPARK-6885
URL: https://issues.apache.org/jira/browse/SPARK-6885
Project: Spark
Issue Type: Sub-task
Components: ML
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

Under spark.ml, have DecisionTreeClassifier (currently being added) extend ProbabilisticClassifier.
[jira] [Resolved] (SPARK-5988) Model import/export for PowerIterationClusteringModel
[ https://issues.apache.org/jira/browse/SPARK-5988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-5988. Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5450 [https://github.com/apache/spark/pull/5450]

Model import/export for PowerIterationClusteringModel
Key: SPARK-5988
URL: https://issues.apache.org/jira/browse/SPARK-5988
Project: Spark
Issue Type: Sub-task
Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Xusen Yin
Fix For: 1.4.0

Add save/load for PowerIterationClusteringModel
[jira] [Created] (SPARK-6884) random forest predict probabilities functionality (like in sklearn)
Max Kaznady created SPARK-6884: Summary: random forest predict probabilities functionality (like in sklearn)
Key: SPARK-6884
URL: https://issues.apache.org/jira/browse/SPARK-6884
Project: Spark
Issue Type: New Feature
Components: MLlib
Affects Versions: 1.4.0
Environment: cross-platform
Reporter: Max Kaznady

Currently, there is no way to extract the class probabilities from the RandomForest classifier. I implemented a probability predictor by counting votes from the individual trees: adding up the votes for class 1 and dividing by the total number of votes. I opened this ticket to keep track of changes. Will update once I push my code to master.
[jira] [Commented] (SPARK-3727) Trees and ensembles: More prediction functionality
[ https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492931#comment-14492931 ] Joseph K. Bradley commented on SPARK-3727: [~maxkaznady] Implementations should be done in Scala; the PySpark API will be a wrapper. The API update JIRA I'm referencing should clear up some of the other questions.

Trees and ensembles: More prediction functionality
Key: SPARK-3727
URL: https://issues.apache.org/jira/browse/SPARK-3727
Project: Spark
Issue Type: Improvement
Components: MLlib
Reporter: Joseph K. Bradley

DecisionTree and RandomForest currently predict the most likely label for classification and the mean for regression. Other info about predictions would be useful.
For classification: estimated probability of each possible label
For regression: variance of estimate
RandomForest could also create aggregate predictions in multiple ways:
* Predict mean or median value for regression.
* Compute variance of estimates (across all trees) for both classification and regression.
[jira] [Updated] (SPARK-6887) ColumnBuilder misses FloatType
[ https://issues.apache.org/jira/browse/SPARK-6887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yin Huai updated SPARK-6887: Description: To reproduce ...
{code}
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

val schema = StructType(StructField("c", FloatType, true) :: Nil)
val rdd = sc.parallelize(1 to 100).map(i => Row(i.toFloat))
sqlContext.createDataFrame(rdd, schema).registerTempTable("test")
sqlContext.sql("cache table test")
sqlContext.table("test").show
{code}
The exception is ...
{code}
15/04/13 15:00:12 INFO DAGScheduler: Job 0 failed: collect at SparkPlan.scala:88, took 0.474392 s
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 0.0 failed 1 times, most recent failure: Lost task 5.0 in stage 0.0 (TID 5, localhost): java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableFloat cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableLong
at org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setLong(SpecificMutableRow.scala:292)
at org.apache.spark.sql.columnar.compression.LongDelta$Decoder.next(compressionSchemes.scala:539)
at org.apache.spark.sql.columnar.compression.CompressibleColumnAccessor$class.extractSingle(CompressibleColumnAccessor.scala:37)
at org.apache.spark.sql.columnar.NativeColumnAccessor.extractSingle(ColumnAccessor.scala:64)
at org.apache.spark.sql.columnar.BasicColumnAccessor.extractTo(ColumnAccessor.scala:54)
at org.apache.spark.sql.columnar.NativeColumnAccessor.org$apache$spark$sql$columnar$NullableColumnAccessor$$super$extractTo(ColumnAccessor.scala:64)
at org.apache.spark.sql.columnar.NullableColumnAccessor$class.extractTo(NullableColumnAccessor.scala:52)
at org.apache.spark.sql.columnar.NativeColumnAccessor.extractTo(ColumnAccessor.scala:64)
at org.apache.spark.sql.columnar.InMemoryColumnarTableScan$$anonfun$8$$anonfun$13$$anon$2.next(InMemoryColumnarTableScan.scala:295)
at org.apache.spark.sql.columnar.InMemoryColumnarTableScan$$anonfun$8$$anonfun$13$$anon$2.next(InMemoryColumnarTableScan.scala:290)
at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:130)
at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:126)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:210)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}

ColumnBuilder misses FloatType
Key: SPARK-6887
URL: https://issues.apache.org/jira/browse/SPARK-6887
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.4.0
Reporter: Yin Huai
Assignee: Yin Huai
Fix For: 1.4.0

To reproduce ...
{code}
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

val schema = StructType(StructField("c", FloatType, true) :: Nil)
val rdd = sc.parallelize(1 to 100).map(i => Row(i.toFloat))
sqlContext.createDataFrame(rdd, schema).registerTempTable("test")
sqlContext.sql("cache table test")
sqlContext.table("test").show
{code}
The exception is ...
{code}
15/04/13 15:00:12 INFO DAGScheduler: Job 0 failed: collect at SparkPlan.scala:88, took 0.474392 s
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 0.0 failed 1 times, most recent failure: Lost task 5.0 in stage 0.0 (TID 5, localhost): java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableFloat cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableLong
at org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setLong(SpecificMutableRow.scala:292)
at
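A hedged, self-contained illustration of the failure mode the title describes, under the assumption that the in-memory columnar cache picks a builder by matching on the column's DataType and that a missing FloatType case routed float columns through a mismatched (long-based) path, producing the MutableFloat to MutableLong cast error above. The builder names are illustrative strings, not the real internal classes:
{code}
import org.apache.spark.sql.types._

def builderFor(dataType: DataType): String = dataType match {
  case IntegerType => "IntColumnBuilder"
  case LongType    => "LongColumnBuilder"
  case DoubleType  => "DoubleColumnBuilder"
  case FloatType   => "FloatColumnBuilder"   // the reportedly missing case
  case _           => "GenericColumnBuilder" // fallback; wrong for a primitive type
}
{code}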
[jira] [Assigned] (SPARK-6887) ColumnBuilder misses FloatType
[ https://issues.apache.org/jira/browse/SPARK-6887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6887: Assignee: Apache Spark (was: Yin Huai)

ColumnBuilder misses FloatType
Key: SPARK-6887
URL: https://issues.apache.org/jira/browse/SPARK-6887
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.4.0
Reporter: Yin Huai
Assignee: Apache Spark
Fix For: 1.4.0

To reproduce ...
{code}
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

val schema = StructType(StructField("c", FloatType, true) :: Nil)
val rdd = sc.parallelize(1 to 100).map(i => Row(i.toFloat))
sqlContext.createDataFrame(rdd, schema).registerTempTable("test")
sqlContext.sql("cache table test")
sqlContext.table("test").show
{code}
The exception is ...
{code}
15/04/13 15:00:12 INFO DAGScheduler: Job 0 failed: collect at SparkPlan.scala:88, took 0.474392 s
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 0.0 failed 1 times, most recent failure: Lost task 5.0 in stage 0.0 (TID 5, localhost): java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableFloat cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableLong
at org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setLong(SpecificMutableRow.scala:292)
at org.apache.spark.sql.columnar.compression.LongDelta$Decoder.next(compressionSchemes.scala:539)
at org.apache.spark.sql.columnar.compression.CompressibleColumnAccessor$class.extractSingle(CompressibleColumnAccessor.scala:37)
at org.apache.spark.sql.columnar.NativeColumnAccessor.extractSingle(ColumnAccessor.scala:64)
at org.apache.spark.sql.columnar.BasicColumnAccessor.extractTo(ColumnAccessor.scala:54)
at org.apache.spark.sql.columnar.NativeColumnAccessor.org$apache$spark$sql$columnar$NullableColumnAccessor$$super$extractTo(ColumnAccessor.scala:64)
at org.apache.spark.sql.columnar.NullableColumnAccessor$class.extractTo(NullableColumnAccessor.scala:52)
at org.apache.spark.sql.columnar.NativeColumnAccessor.extractTo(ColumnAccessor.scala:64)
at org.apache.spark.sql.columnar.InMemoryColumnarTableScan$$anonfun$8$$anonfun$13$$anon$2.next(InMemoryColumnarTableScan.scala:295)
at org.apache.spark.sql.columnar.InMemoryColumnarTableScan$$anonfun$8$$anonfun$13$$anon$2.next(InMemoryColumnarTableScan.scala:290)
at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:130)
at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:126)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:210)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}
[jira] [Commented] (SPARK-6865) Decide on semantics for string identifiers in DataFrame API
[ https://issues.apache.org/jira/browse/SPARK-6865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493193#comment-14493193 ] Reynold Xin commented on SPARK-6865: As discussed offline, it would make more sense to go with option 1, i.e.:
- a string str is treated as a quoted identifier in SQL, equivalent to `str`.
- * is a special case in which it refers to all the columns in a data frame. (Note that this means we cannot have a column named *, which I think is fine.)
The reason is that strings are already quoted, and programmers expect them to be quoted literals without extra escaping. We will need to fix our resolver with respect to dots.

Decide on semantics for string identifiers in DataFrame API
Key: SPARK-6865
URL: https://issues.apache.org/jira/browse/SPARK-6865
Project: Spark
Issue Type: Sub-task
Components: SQL
Reporter: Michael Armbrust
Priority: Blocker

There are two options:
- Quoted identifiers: the strings are treated as though they were in backticks in SQL. Any weird characters (spaces, etc.) are considered part of the identifier. Kind of weird given that `*` is already a special identifier explicitly allowed by the API.
- Unquoted parsed identifiers: would allow users to specify things like tableAlias.*. However, this would also require explicit use of `backticks` for identifiers with weird characters in them.
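A hedged illustration of the "option 1" semantics above, assuming a sqlContext in scope (1.3-era DataFrame API); the column names are made up for the example:
{code}
val df = sqlContext.createDataFrame(Seq((1, 2))).toDF("a.b", "c")

df.select("*")    // special case: expands to all columns, a.b and c
df.select("a.b")  // exactly the column literally named "a.b": the string acts as
                  // if backtick-quoted, so the dot needs no extra escaping
{code}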
[jira] [Updated] (SPARK-4638) Spark's MLlib SVM classification to include Kernels like Gaussian / (RBF) to find non linear boundaries
[ https://issues.apache.org/jira/browse/SPARK-4638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mandar Chandorkar updated SPARK-4638: Attachment: kernels-1.3.patch Patch for the kernels implementation, taken against the current branch-1.3 of Apache Spark.

Spark's MLlib SVM classification to include Kernels like Gaussian / (RBF) to find non linear boundaries
Key: SPARK-4638
URL: https://issues.apache.org/jira/browse/SPARK-4638
Project: Spark
Issue Type: New Feature
Components: MLlib
Reporter: madankumar s
Labels: Gaussian, Kernels, SVM
Attachments: kernels-1.3.patch

SPARK MLlib Classification Module: Add kernel functionalities to the SVM classifier to find non-linear patterns.
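For reference, a minimal sketch of the Gaussian/RBF kernel this patch concerns, K(x, y) = exp(-gamma * ||x - y||^2); the function name is illustrative, not part of the attached patch:
{code}
import org.apache.spark.mllib.linalg.Vector

def rbfKernel(x: Vector, y: Vector, gamma: Double): Double = {
  require(x.size == y.size, "vectors must have the same dimension")
  var sqDist = 0.0
  var i = 0
  while (i < x.size) {
    val d = x(i) - y(i)
    sqDist += d * d
    i += 1
  }
  math.exp(-gamma * sqDist)
}
{code}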
[jira] [Resolved] (SPARK-5972) Cache residuals for GradientBoostedTrees during training
[ https://issues.apache.org/jira/browse/SPARK-5972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-5972. Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5330 [https://github.com/apache/spark/pull/5330]

Cache residuals for GradientBoostedTrees during training
Key: SPARK-5972
URL: https://issues.apache.org/jira/browse/SPARK-5972
Project: Spark
Issue Type: Improvement
Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Manoj Kumar
Priority: Minor
Fix For: 1.4.0

In gradient boosting, the current model's prediction is re-computed for each training instance on every iteration. The current residual (cumulative prediction of previously trained trees in the ensemble) should be cached. That could reduce both computation (only computing the prediction of the most recently trained tree) and communication (only sending the most recently trained tree to the workers).
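A hedged sketch of the caching idea: keep the cumulative prediction per instance and fold in only the newest tree each iteration, instead of re-running every tree over the data. The function name and RDD layout are illustrative, not the merged implementation:
{code}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.rdd.RDD

def updatePredictions(
    cached: RDD[(LabeledPoint, Double)],  // (instance, cumulative prediction)
    newTree: DecisionTreeModel,
    learningRate: Double): RDD[(LabeledPoint, Double)] = {
  cached.map { case (point, pred) =>
    // Only the most recently trained tree is evaluated and shipped to workers.
    (point, pred + learningRate * newTree.predict(point.features))
  }
}
{code}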
[jira] [Comment Edited] (SPARK-6511) Publish hadoop provided build with instructions for different distros
[ https://issues.apache.org/jira/browse/SPARK-6511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493183#comment-14493183 ] Patrick Wendell edited comment on SPARK-6511 at 4/13/15 10:11 PM: Just as an example, I tried to wire Spark to work with stock Hadoop 2.6. Here is how I got it running after doing a hadoop-provided build. This is pretty clunky, so I wonder if we should just support setting HADOOP_HOME or something, and we can automatically find and add the jar files present within that folder.
{code}
export SPARK_DIST_CLASSPATH=$(find /tmp/hadoop-2.6.0/ -name "*.jar" | tr "\n" ":")
./bin/spark-shell
{code}
[~vanzin] for your CDH packages, what do you end up setting SPARK_DIST_CLASSPATH to? /cc [~srowen]

was (Author: pwendell): Just as an example, I tried to wire Spark to work with stock Hadoop 2.6. Here is how I got it running after doing a hadoop-provided build. This is pretty clunky, so I wonder if we should just support setting HADOOP_HOME or something, and we can automatically find and add the jar files present within that folder.
{code}
export SPARK_DIST_CLASSPATH=$(find /tmp/hadoop-2.6.0/ -name "*.jar" | tr "\n" ":")
./bin/spark-shell
{code}
[~vanzin] for your CDH packages, what do you end up setting SPARK_DIST_CLASSPATH to?

Publish hadoop provided build with instructions for different distros
Key: SPARK-6511
URL: https://issues.apache.org/jira/browse/SPARK-6511
Project: Spark
Issue Type: Improvement
Components: Build
Reporter: Patrick Wendell

Currently we publish a series of binaries with different Hadoop client jars. This mostly works, but some users have reported compatibility issues with different distributions. One improvement moving forward might be to publish a binary build that simply asks you to set HADOOP_HOME to pick up the Hadoop client location. That way it would work across multiple distributions, even if they have subtle incompatibilities with upstream Hadoop. I think a first step for this would be to produce such a build for the community and see how well it works. One potential issue is that our fancy excludes and dependency re-writing won't work with the simpler "append Hadoop's classpath to Spark" approach. Also, how we deal with the Hive dependency is unclear, i.e. should we continue to bundle Spark's Hive (which has some fixes for dependency conflicts) or do we allow for linking against vanilla Hive at runtime.
[jira] [Created] (SPARK-6886) Big closure in PySpark will fail during shuffle
Davies Liu created SPARK-6886: Summary: Big closure in PySpark will fail during shuffle
Key: SPARK-6886
URL: https://issues.apache.org/jira/browse/SPARK-6886
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 1.2.1, 1.3.0, 1.4.0
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Blocker

Reported by beifei.zhou (beifei.zhou at ximalaya.com):
I am using spark to process bid datasets. However, there is always a problem when executing reduceByKey on a large dataset, whereas it works with a smaller dataset. May I ask how I could solve this issue? The error is always like this:
{code}
15/04/09 11:27:46 ERROR Executor: Exception in task 3.0 in stage 1.0 (TID 5)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/Users/nali/Softwares/spark/python/pyspark/worker.py", line 90, in main
    command = pickleSer.loads(command.value)
  File "/Users/nali/Softwares/spark/python/pyspark/broadcast.py", line 106, in value
    self._value = self.load(self._path)
  File "/Users/nali/Softwares/spark/python/pyspark/broadcast.py", line 87, in load
    with open(path, 'rb', 1 << 20) as f:
IOError: [Errno 2] No such file or directory: '/private/var/folders/_x/n59vb1b54pl96lvldz2lr_v4gn/T/spark-37d8ecbc-9ac9-4aa2-be23-12823f4cd1ed/pyspark-1e3d5904-a5b6-4222-a146-91bfdb4a33a7/tmp8XMhgG'
{code}
Here I attach my code:
{code}
import codecs
from pyspark import SparkContext, SparkConf
from operator import add
import operator
from pyspark.storagelevel import StorageLevel

def combine_dict(a, b):
    a.update(b)
    return a

conf = SparkConf()
sc = SparkContext(appName = "tag")
al_tag_dict = sc.textFile('albumtag.txt').map(lambda x: x.split(',')).map(lambda x: {x[0]: x[1:]}).reduce(lambda a, b: combine_dict(a, b))

result = sc.textFile('uidAlbumscore.txt')\
    .map(lambda x: x.split(','))\
    .filter(lambda x: x[1] in al_tag_dict.keys())\
    .map(lambda x: (x[0], al_tag_dict[x[1]], float(x[2])))\
    .map(lambda x: map(lambda a: ((x[0], a), x[2]), x[1]))\
    .flatMap(lambda x: x)\
    .map(lambda x: (str(x[0][0]), x[1]))\
    .reduceByKey(add)\
#.map(lambda x: x[0][0]+','+x[0][1]+','+str(x[1])+'\n')\
#.reduce(add)
#codecs.open('tag_score.txt','w','utf-8').write(result)
print result.first()
{code}
[jira] [Resolved] (SPARK-6130) support if not exists for insert overwrite into partition in hiveQl
[ https://issues.apache.org/jira/browse/SPARK-6130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-6130. Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 4865 [https://github.com/apache/spark/pull/4865]

support if not exists for insert overwrite into partition in hiveQl
Key: SPARK-6130
URL: https://issues.apache.org/jira/browse/SPARK-6130
Project: Spark
Issue Type: New Feature
Components: SQL
Reporter: Adrian Wang
Fix For: 1.4.0

Standard syntax:
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2]
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;
FROM from_statement
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2]
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2] ...;
Hive extension (dynamic partition inserts):
INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
INSERT INTO TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
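A hedged usage sketch of the new clause, following the grammar above: with IF NOT EXISTS, the overwrite is skipped when the target partition already exists. Table, partition, and column names are illustrative:
{code}
sqlContext.sql("""
  INSERT OVERWRITE TABLE logs PARTITION (ds='2015-04-13') IF NOT EXISTS
  SELECT id, msg FROM staging_logs
""")
{code}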
[jira] [Assigned] (SPARK-6886) Big closure in PySpark will fail during shuffle
[ https://issues.apache.org/jira/browse/SPARK-6886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6886: Assignee: Davies Liu (was: Apache Spark)

Big closure in PySpark will fail during shuffle
Key: SPARK-6886
URL: https://issues.apache.org/jira/browse/SPARK-6886
Project: Spark
Issue Type: Bug
Components: PySpark
Affects Versions: 1.2.1, 1.3.0, 1.4.0
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Blocker

Reported by beifei.zhou (beifei.zhou at ximalaya.com):
I am using spark to process bid datasets. However, there is always a problem when executing reduceByKey on a large dataset, whereas it works with a smaller dataset. May I ask how I could solve this issue? The error is always like this:
{code}
15/04/09 11:27:46 ERROR Executor: Exception in task 3.0 in stage 1.0 (TID 5)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/Users/nali/Softwares/spark/python/pyspark/worker.py", line 90, in main
    command = pickleSer.loads(command.value)
  File "/Users/nali/Softwares/spark/python/pyspark/broadcast.py", line 106, in value
    self._value = self.load(self._path)
  File "/Users/nali/Softwares/spark/python/pyspark/broadcast.py", line 87, in load
    with open(path, 'rb', 1 << 20) as f:
IOError: [Errno 2] No such file or directory: '/private/var/folders/_x/n59vb1b54pl96lvldz2lr_v4gn/T/spark-37d8ecbc-9ac9-4aa2-be23-12823f4cd1ed/pyspark-1e3d5904-a5b6-4222-a146-91bfdb4a33a7/tmp8XMhgG'
{code}
Here I attach my code:
{code}
import codecs
from pyspark import SparkContext, SparkConf
from operator import add
import operator
from pyspark.storagelevel import StorageLevel

def combine_dict(a, b):
    a.update(b)
    return a

conf = SparkConf()
sc = SparkContext(appName = "tag")
al_tag_dict = sc.textFile('albumtag.txt').map(lambda x: x.split(',')).map(lambda x: {x[0]: x[1:]}).reduce(lambda a, b: combine_dict(a, b))

result = sc.textFile('uidAlbumscore.txt')\
    .map(lambda x: x.split(','))\
    .filter(lambda x: x[1] in al_tag_dict.keys())\
    .map(lambda x: (x[0], al_tag_dict[x[1]], float(x[2])))\
    .map(lambda x: map(lambda a: ((x[0], a), x[2]), x[1]))\
    .flatMap(lambda x: x)\
    .map(lambda x: (str(x[0][0]), x[1]))\
    .reduceByKey(add)\
#.map(lambda x: x[0][0]+','+x[0][1]+','+str(x[1])+'\n')\
#.reduce(add)
#codecs.open('tag_score.txt','w','utf-8').write(result)
print result.first()
{code}
[jira] [Commented] (SPARK-6886) Big closure in PySpark will fail during shuffle
[ https://issues.apache.org/jira/browse/SPARK-6886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493068#comment-14493068 ] Apache Spark commented on SPARK-6886: - User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/5496 Big closure in PySpark will fail during shuffle --- Key: SPARK-6886 URL: https://issues.apache.org/jira/browse/SPARK-6886 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.1, 1.3.0, 1.4.0 Reporter: Davies Liu Assignee: Davies Liu Priority: Blocker Reported by beifei.zhou beifei.zhou at ximalaya.com: I am using Spark to process big datasets. However, there is always a problem when executing reduceByKey on a large dataset, whereas a smaller dataset works fine. May I ask how I could solve this issue? The error is always like this:
{code}
15/04/09 11:27:46 ERROR Executor: Exception in task 3.0 in stage 1.0 (TID 5)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/Users/nali/Softwares/spark/python/pyspark/worker.py", line 90, in main
    command = pickleSer.loads(command.value)
  File "/Users/nali/Softwares/spark/python/pyspark/broadcast.py", line 106, in value
    self._value = self.load(self._path)
  File "/Users/nali/Softwares/spark/python/pyspark/broadcast.py", line 87, in load
    with open(path, 'rb', 1 << 20) as f:
IOError: [Errno 2] No such file or directory: '/private/var/folders/_x/n59vb1b54pl96lvldz2lr_v4gn/T/spark-37d8ecbc-9ac9-4aa2-be23-12823f4cd1ed/pyspark-1e3d5904-a5b6-4222-a146-91bfdb4a33a7/tmp8XMhgG'
{code}
Here I attach my code:
{code}
import codecs
from pyspark import SparkContext, SparkConf
from operator import add
import operator
from pyspark.storagelevel import StorageLevel

def combine_dict(a, b):
    a.update(b)
    return a

conf = SparkConf()
sc = SparkContext(appName = "tag")
al_tag_dict = sc.textFile('albumtag.txt').map(lambda x: x.split(',')).map(lambda x: {x[0]: x[1:]}).reduce(lambda a, b: combine_dict(a, b))
result = sc.textFile('uidAlbumscore.txt')\
    .map(lambda x: x.split(','))\
    .filter(lambda x: x[1] in al_tag_dict.keys())\
    .map(lambda x: (x[0], al_tag_dict[x[1]], float(x[2])))\
    .map(lambda x: map(lambda a: ((x[0], a), x[2]), x[1]))\
    .flatMap(lambda x: x)\
    .map(lambda x: (str(x[0][0]), x[1]))\
    .reduceByKey(add)
    #.map(lambda x: x[0][0]+','+x[0][1]+','+str(x[1])+'\n')\
    #.reduce(add)
#codecs.open('tag_score.txt','w','utf-8').write(result)
print result.first()
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6886) Big closure in PySpark will fail during shuffle
[ https://issues.apache.org/jira/browse/SPARK-6886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6886: --- Assignee: Apache Spark (was: Davies Liu) Big closure in PySpark will fail during shuffle --- Key: SPARK-6886 URL: https://issues.apache.org/jira/browse/SPARK-6886 Project: Spark Issue Type: Bug Components: PySpark Affects Versions: 1.2.1, 1.3.0, 1.4.0 Reporter: Davies Liu Assignee: Apache Spark Priority: Blocker Reported by beifei.zhou beifei.zhou at ximalaya.com: I am using Spark to process big datasets. However, there is always a problem when executing reduceByKey on a large dataset, whereas a smaller dataset works fine. May I ask how I could solve this issue? The error is always like this:
{code}
15/04/09 11:27:46 ERROR Executor: Exception in task 3.0 in stage 1.0 (TID 5)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/Users/nali/Softwares/spark/python/pyspark/worker.py", line 90, in main
    command = pickleSer.loads(command.value)
  File "/Users/nali/Softwares/spark/python/pyspark/broadcast.py", line 106, in value
    self._value = self.load(self._path)
  File "/Users/nali/Softwares/spark/python/pyspark/broadcast.py", line 87, in load
    with open(path, 'rb', 1 << 20) as f:
IOError: [Errno 2] No such file or directory: '/private/var/folders/_x/n59vb1b54pl96lvldz2lr_v4gn/T/spark-37d8ecbc-9ac9-4aa2-be23-12823f4cd1ed/pyspark-1e3d5904-a5b6-4222-a146-91bfdb4a33a7/tmp8XMhgG'
{code}
Here I attach my code:
{code}
import codecs
from pyspark import SparkContext, SparkConf
from operator import add
import operator
from pyspark.storagelevel import StorageLevel

def combine_dict(a, b):
    a.update(b)
    return a

conf = SparkConf()
sc = SparkContext(appName = "tag")
al_tag_dict = sc.textFile('albumtag.txt').map(lambda x: x.split(',')).map(lambda x: {x[0]: x[1:]}).reduce(lambda a, b: combine_dict(a, b))
result = sc.textFile('uidAlbumscore.txt')\
    .map(lambda x: x.split(','))\
    .filter(lambda x: x[1] in al_tag_dict.keys())\
    .map(lambda x: (x[0], al_tag_dict[x[1]], float(x[2])))\
    .map(lambda x: map(lambda a: ((x[0], a), x[2]), x[1]))\
    .flatMap(lambda x: x)\
    .map(lambda x: (str(x[0][0]), x[1]))\
    .reduceByKey(add)
    #.map(lambda x: x[0][0]+','+x[0][1]+','+str(x[1])+'\n')\
    #.reduce(add)
#codecs.open('tag_score.txt','w','utf-8').write(result)
print result.first()
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
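Independent of the broadcast bug tracked here, a common way to avoid shipping a large dictionary inside every task closure is an explicit broadcast variable. Below is a rough Scala sketch of the same kind of pipeline; it is not the fix in the linked PR, and names such as loadAlbumTags are invented for illustration.
{code}
// Sketch only: ship the large lookup table once per executor as a
// broadcast variable instead of capturing it in every task closure.
val albumTags: Map[String, Seq[String]] = loadAlbumTags()  // hypothetical loader
val albumTagsB = sc.broadcast(albumTags)

val result = sc.textFile("uidAlbumscore.txt")
  .map(_.split(','))
  .filter(parts => albumTagsB.value.contains(parts(1)))
  .flatMap(parts => albumTagsB.value(parts(1))
    .map(tag => ((parts(0), tag), parts(2).toDouble)))
  .reduceByKey(_ + _)
{code}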
[jira] [Assigned] (SPARK-6368) Build a specialized serializer for Exchange operator.
[ https://issues.apache.org/jira/browse/SPARK-6368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6368: --- Assignee: Yin Huai (was: Apache Spark) Build a specialized serializer for Exchange operator. -- Key: SPARK-6368 URL: https://issues.apache.org/jira/browse/SPARK-6368 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Assignee: Yin Huai Priority: Critical Kryo is still pretty slow because it works on individual objects and is relatively expensive to allocate. For the Exchange operator, because the schemas for key and value are already defined, we can create a specialized serializer to handle those specific schemas. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6368) Build a specialized serializer for Exchange operator.
[ https://issues.apache.org/jira/browse/SPARK-6368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6368: --- Assignee: Apache Spark (was: Yin Huai) Build a specialized serializer for Exchange operator. -- Key: SPARK-6368 URL: https://issues.apache.org/jira/browse/SPARK-6368 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Assignee: Apache Spark Priority: Critical Kryo is still pretty slow because it works on individual objects and is relatively expensive to allocate. For the Exchange operator, because the schemas for key and value are already defined, we can create a specialized serializer to handle those specific schemas. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6368) Build a specialized serializer for Exchange operator.
[ https://issues.apache.org/jira/browse/SPARK-6368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493092#comment-14493092 ] Apache Spark commented on SPARK-6368: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/5497 Build a specialized serializer for Exchange operator. -- Key: SPARK-6368 URL: https://issues.apache.org/jira/browse/SPARK-6368 Project: Spark Issue Type: Improvement Components: SQL Reporter: Yin Huai Assignee: Yin Huai Priority: Critical Kryo is still pretty slow because it works on individual objects and is relatively expensive to allocate. For the Exchange operator, because the schemas for key and value are already defined, we can create a specialized serializer to handle those specific schemas. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
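To make the idea concrete, here is a toy illustration (not the actual patch) of what schema-specialized serialization buys: with a fixed row shape, here an Int key and a Long value, fields can be written directly to the stream with no type tags, reflection, or per-object allocation.
{code}
// Toy sketch under the assumption of a known (Int, Long) schema.
import java.io.{DataInputStream, DataOutputStream}

def writePair(out: DataOutputStream, key: Int, value: Long): Unit = {
  out.writeInt(key)    // fixed-width field, schema known at plan time
  out.writeLong(value) // no generic serializer dispatch needed
}

def readPair(in: DataInputStream): (Int, Long) =
  (in.readInt(), in.readLong())
{code}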
[jira] [Commented] (SPARK-6703) Provide a way to discover existing SparkContext's
[ https://issues.apache.org/jira/browse/SPARK-6703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493179#comment-14493179 ] Patrick Wendell commented on SPARK-6703: Yes, ideally we get it into 1.4 - though I think the ultimate solution here could be a very small patch. Provide a way to discover existing SparkContext's - Key: SPARK-6703 URL: https://issues.apache.org/jira/browse/SPARK-6703 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Ilya Ganelin Priority: Critical Right now it is difficult to write a Spark application in a way that can be run independently and also be composed with other Spark applications in an environment such as the JobServer, notebook servers, etc where there is a shared SparkContext. It would be nice to provide a rendez-vous point so that applications can learn whether an existing SparkContext already exists before creating one. The most simple/surgical way I see to do this is to have an optional static SparkContext singleton that people can be retrieved as follows: {code} val sc = SparkContext.getOrCreate(conf = new SparkConf()) {code} And you could also have a setter where some outer framework/server can set it for use by multiple downstream applications. A more advanced version of this would have some named registry or something, but since we only support a single SparkContext in one JVM at this point anyways, this seems sufficient and much simpler. Another advanced option would be to allow plugging in some other notion of configuration you'd pass when retrieving an existing context. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
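A rough sketch of what such a rendez-vous point could look like, per the description above (object and method names here are placeholders, not the final API):
{code}
// Placeholder names; illustrates the optional-singleton idea only.
import org.apache.spark.{SparkConf, SparkContext}

object ActiveContext {
  private var active: Option[SparkContext] = None

  // An outer framework (job server, notebook server) can set the shared context.
  def setActive(sc: SparkContext): Unit = synchronized { active = Some(sc) }

  // Applications learn whether a context already exists before creating one.
  def getOrCreate(conf: SparkConf): SparkContext = synchronized {
    active.getOrElse {
      val sc = new SparkContext(conf)
      active = Some(sc)
      sc
    }
  }
}
{code}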
[jira] [Commented] (SPARK-4638) Spark's MLlib SVM classification to include Kernels like Gaussian / (RBF) to find non linear boundaries
[ https://issues.apache.org/jira/browse/SPARK-4638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493180#comment-14493180 ] Sean Owen commented on SPARK-4638: -- [~mandar2812] Spark does not use patches in JIRA but uses pull requests. Also changes should be vs master, not a branch. https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark Spark's MLlib SVM classification to include Kernels like Gaussian / (RBF) to find non linear boundaries --- Key: SPARK-4638 URL: https://issues.apache.org/jira/browse/SPARK-4638 Project: Spark Issue Type: New Feature Components: MLlib Reporter: madankumar s Labels: Gaussian, Kernels, SVM Attachments: kernels-1.3.patch SPARK MLlib Classification Module: Add Kernel functionalities to SVM Classifier to find non linear patterns -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5632) not able to resolve dot('.') in field name
[ https://issues.apache.org/jira/browse/SPARK-5632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5632: --- Issue Type: Sub-task (was: Bug) Parent: SPARK-6116 not able to resolve dot('.') in field name -- Key: SPARK-5632 URL: https://issues.apache.org/jira/browse/SPARK-5632 Project: Spark Issue Type: Sub-task Components: SQL Affects Versions: 1.2.0, 1.3.0 Environment: Spark cluster: EC2 m1.small + Spark 1.2.0 Cassandra cluster: EC2 m3.xlarge + Cassandra 2.1.2 Reporter: Lishu Liu Priority: Blocker My Cassandra table task_trace has a field sm.result, which contains a dot in the name, so SQL tried to look up sm instead of the full name 'sm.result'. Here is my code:
{code}
scala> import org.apache.spark.sql.cassandra.CassandraSQLContext
scala> val cc = new CassandraSQLContext(sc)
scala> val task_trace = cc.jsonFile("/task_trace.json")
scala> task_trace.registerTempTable("task_trace")
scala> cc.setKeyspace("cerberus_data_v4")
scala> val res = cc.sql("SELECT received_datetime, task_body.cerberus_id, task_body.sm.result FROM task_trace WHERE task_id = 'fff7304e-9984-4b45-b10c-0423a96745ce'")
res: org.apache.spark.sql.SchemaRDD = SchemaRDD[57] at RDD at SchemaRDD.scala:108
== Query Plan ==
== Physical Plan ==
java.lang.RuntimeException: No such struct field sm in cerberus_batch_id, cerberus_id, couponId, coupon_code, created, description, domain, expires, message_id, neverShowAfter, neverShowBefore, offerTitle, screenshots, sm.result, sm.task, startDate, task_id, url, uuid, validationDateTime, validity
{code}
The full schema looks like this:
{code}
scala> task_trace.printSchema()
root
 |-- received_datetime: long (nullable = true)
 |-- task_body: struct (nullable = true)
 |    |-- cerberus_batch_id: string (nullable = true)
 |    |-- cerberus_id: string (nullable = true)
 |    |-- couponId: integer (nullable = true)
 |    |-- coupon_code: string (nullable = true)
 |    |-- created: string (nullable = true)
 |    |-- description: string (nullable = true)
 |    |-- domain: string (nullable = true)
 |    |-- expires: string (nullable = true)
 |    |-- message_id: string (nullable = true)
 |    |-- neverShowAfter: string (nullable = true)
 |    |-- neverShowBefore: string (nullable = true)
 |    |-- offerTitle: string (nullable = true)
 |    |-- screenshots: array (nullable = true)
 |    |    |-- element: string (containsNull = false)
 |    |-- sm.result: struct (nullable = true)
 |    |    |-- cerberus_batch_id: string (nullable = true)
 |    |    |-- cerberus_id: string (nullable = true)
 |    |    |-- code: string (nullable = true)
 |    |    |-- couponId: integer (nullable = true)
 |    |    |-- created: string (nullable = true)
 |    |    |-- description: string (nullable = true)
 |    |    |-- domain: string (nullable = true)
 |    |    |-- expires: string (nullable = true)
 |    |    |-- message_id: string (nullable = true)
 |    |    |-- neverShowAfter: string (nullable = true)
 |    |    |-- neverShowBefore: string (nullable = true)
 |    |    |-- offerTitle: string (nullable = true)
 |    |    |-- result: struct (nullable = true)
 |    |    |    |-- post: struct (nullable = true)
 |    |    |    |    |-- alchemy_out_of_stock: struct (nullable = true)
 |    |    |    |    |    |-- ci: double (nullable = true)
 |    |    |    |    |    |-- value: boolean (nullable = true)
 |    |    |    |    |-- meta: struct (nullable = true)
 |    |    |    |    |    |-- None_tx_value: array (nullable = true)
 |    |    |    |    |    |    |-- element: string (containsNull = false)
 |    |    |    |    |    |-- exceptions: array (nullable = true)
 |    |    |    |    |    |    |-- element: string (containsNull = false)
 |    |    |    |    |    |-- no_input_value: array (nullable = true)
 |    |    |    |    |    |    |-- element: string (containsNull = false)
 |    |    |    |    |    |-- not_mapped: array (nullable = true)
 |    |    |    |    |    |    |-- element: string (containsNull = false)
 |    |    |    |    |    |-- not_transformed: array (nullable = true)
 |    |    |    |    |    |    |-- element: array (containsNull = false)
 |    |    |    |    |    |    |    |-- element: string (containsNull = false)
 |    |    |    |    |-- now_price_checkout: struct (nullable = true)
 |    |    |    |    |    |-- ci: double (nullable = true)
 |    |    |    |    |    |-- value: double (nullable = true)
 |    |    |    |    |-- shipping_price: struct (nullable = true)
 |    |    |    |    |    |-- ci: double
{code}
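One thing worth trying as a workaround, though it may well fail in the affected versions since the resolver's handling of dots is exactly what this issue tracks, is backtick-quoting the literal field name:
{code}
// Untested against this exact setup; shown only to illustrate the quoting.
val res = cc.sql("SELECT received_datetime, task_body.cerberus_id, " +
  "task_body.`sm.result` FROM task_trace " +
  "WHERE task_id = 'fff7304e-9984-4b45-b10c-0423a96745ce'")
{code}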
[jira] [Commented] (SPARK-6511) Publish hadoop provided build with instructions for different distros
[ https://issues.apache.org/jira/browse/SPARK-6511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493183#comment-14493183 ] Patrick Wendell commented on SPARK-6511: Just as an example, I tried to wire Spark to work with stock Hadoop 2.6. Here is how I got it running after doing a hadoop-provided build. This is pretty clunky, so I wonder if we should just support setting HADOOP_HOME or similar, so that we can automatically find and add the jar files present within that folder.
{code}
export SPARK_DIST_CLASSPATH=$(find /tmp/hadoop-2.6.0/ -name "*.jar" | tr "\n" ":")
./bin/spark-shell
{code}
[~vanzin] for your CDH packages, what do you end up setting SPARK_DIST_CLASSPATH to? Publish hadoop provided build with instructions for different distros --- Key: SPARK-6511 URL: https://issues.apache.org/jira/browse/SPARK-6511 Project: Spark Issue Type: Improvement Components: Build Reporter: Patrick Wendell Currently we publish a series of binaries with different Hadoop client jars. This mostly works, but some users have reported compatibility issues with different distributions. One improvement moving forward might be to publish a binary build that simply asks you to set HADOOP_HOME to pick up the Hadoop client location. That way it would work across multiple distributions, even if they have subtle incompatibilities with upstream Hadoop. I think a first step for this would be to produce such a build for the community and see how well it works. One potential issue is that our fancy excludes and dependency re-writing won't work with the simpler "append Hadoop's classpath to Spark" approach. Also, how we deal with the Hive dependency is unclear, i.e. should we continue to bundle Spark's Hive (which has some fixes for dependency conflicts) or do we allow for linking against vanilla Hive at runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6511) Publish hadoop provided build with instructions for different distros
[ https://issues.apache.org/jira/browse/SPARK-6511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493200#comment-14493200 ] Sean Owen commented on SPARK-6511: -- Yeah, that might be the fastest way to find all the jars at once; they occur in various places in the raw Hadoop distro. That one-liner is really not too bad. I don't know that it's so great to then also start modifying the classpath based on HADOOP_HOME, as this might not be what the end user wants, or might interfere with an explicitly configured classpath. In something like CDH they're all laid out in one directory per component, so they are easier to find, but that isn't much different. I don't see that the distro sets SPARK_DIST_CLASSPATH, but it sets things like SPARK_LIBRARY_PATH in spark-env.sh to ${SPARK_HOME}/lib. I actually don't see where the Hadoop deps come in, but it is going to be something similar. The effect is about the same: to add all of the Hadoop client and YARN jars to the classpath too. Publish hadoop provided build with instructions for different distros --- Key: SPARK-6511 URL: https://issues.apache.org/jira/browse/SPARK-6511 Project: Spark Issue Type: Improvement Components: Build Reporter: Patrick Wendell Currently we publish a series of binaries with different Hadoop client jars. This mostly works, but some users have reported compatibility issues with different distributions. One improvement moving forward might be to publish a binary build that simply asks you to set HADOOP_HOME to pick up the Hadoop client location. That way it would work across multiple distributions, even if they have subtle incompatibilities with upstream Hadoop. I think a first step for this would be to produce such a build for the community and see how well it works. One potential issue is that our fancy excludes and dependency re-writing won't work with the simpler "append Hadoop's classpath to Spark" approach. Also, how we deal with the Hive dependency is unclear, i.e. should we continue to bundle Spark's Hive (which has some fixes for dependency conflicts) or do we allow for linking against vanilla Hive at runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-1701) Inconsistent naming: slice or partition
[ https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thomas Graves reassigned SPARK-1701: Assignee: Thomas Graves Inconsistent naming: slice or partition --- Key: SPARK-1701 URL: https://issues.apache.org/jira/browse/SPARK-1701 Project: Spark Issue Type: Improvement Components: Documentation, Spark Core Reporter: Daniel Darabos Assignee: Thomas Graves Priority: Minor Labels: starter Fix For: 1.2.0 Throughout the documentation and code slice and partition are used interchangeably. (Or so it seems to me.) It would avoid some confusion for new users to settle on one name. I think partition is winning, since that is the name of the class representing the concept. This should not be much more complicated to do than a search replace. I can take a stab at it, if you agree. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6742) Spark pushes down filters in old parquet path that reference partitioning columns
[ https://issues.apache.org/jira/browse/SPARK-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-6742. - Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5390 [https://github.com/apache/spark/pull/5390] Spark pushes down filters in old parquet path that reference partitioning columns - Key: SPARK-6742 URL: https://issues.apache.org/jira/browse/SPARK-6742 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.1 Reporter: Yash Datta Fix For: 1.4.0 Create a table with multiple fields partitioned on the 'market' column. Run a query like:
{code}
SELECT start_sp_time, end_sp_time, imsi, imei, enb_common_enbid
FROM csl_data_parquet
WHERE (((technology = 'FDD') AND (bandclass = '800') AND (region = 'R15') AND (market = 'LA metro'))
    OR ((technology = 'FDD') AND (bandclass = '1900') AND (region = 'R15') AND (market = 'Indianapolis')))
  AND start_sp_time >= 1.4158368E9 AND end_sp_time < 1.4159232E9
  AND dt >= '2014-11-13-00-00' AND dt < '2014-11-14-00-00'
ORDER BY end_sp_time DESC LIMIT 100
{code}
The OR filter is pushed down in this case, resulting in a column-not-found exception from Parquet. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5632) not able to resolve dot('.') in field name
[ https://issues.apache.org/jira/browse/SPARK-5632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin updated SPARK-5632: --- Priority: Blocker (was: Major) Target Version/s: 1.4.0 Affects Version/s: 1.3.0 not able to resolve dot('.') in field name -- Key: SPARK-5632 URL: https://issues.apache.org/jira/browse/SPARK-5632 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.2.0, 1.3.0 Environment: Spark cluster: EC2 m1.small + Spark 1.2.0 Cassandra cluster: EC2 m3.xlarge + Cassandra 2.1.2 Reporter: Lishu Liu Priority: Blocker My Cassandra table task_trace has a field sm.result, which contains a dot in the name, so SQL tried to look up sm instead of the full name 'sm.result'. Here is my code:
{code}
scala> import org.apache.spark.sql.cassandra.CassandraSQLContext
scala> val cc = new CassandraSQLContext(sc)
scala> val task_trace = cc.jsonFile("/task_trace.json")
scala> task_trace.registerTempTable("task_trace")
scala> cc.setKeyspace("cerberus_data_v4")
scala> val res = cc.sql("SELECT received_datetime, task_body.cerberus_id, task_body.sm.result FROM task_trace WHERE task_id = 'fff7304e-9984-4b45-b10c-0423a96745ce'")
res: org.apache.spark.sql.SchemaRDD = SchemaRDD[57] at RDD at SchemaRDD.scala:108
== Query Plan ==
== Physical Plan ==
java.lang.RuntimeException: No such struct field sm in cerberus_batch_id, cerberus_id, couponId, coupon_code, created, description, domain, expires, message_id, neverShowAfter, neverShowBefore, offerTitle, screenshots, sm.result, sm.task, startDate, task_id, url, uuid, validationDateTime, validity
{code}
The full schema looks like this:
{code}
scala> task_trace.printSchema()
root
 |-- received_datetime: long (nullable = true)
 |-- task_body: struct (nullable = true)
 |    |-- cerberus_batch_id: string (nullable = true)
 |    |-- cerberus_id: string (nullable = true)
 |    |-- couponId: integer (nullable = true)
 |    |-- coupon_code: string (nullable = true)
 |    |-- created: string (nullable = true)
 |    |-- description: string (nullable = true)
 |    |-- domain: string (nullable = true)
 |    |-- expires: string (nullable = true)
 |    |-- message_id: string (nullable = true)
 |    |-- neverShowAfter: string (nullable = true)
 |    |-- neverShowBefore: string (nullable = true)
 |    |-- offerTitle: string (nullable = true)
 |    |-- screenshots: array (nullable = true)
 |    |    |-- element: string (containsNull = false)
 |    |-- sm.result: struct (nullable = true)
 |    |    |-- cerberus_batch_id: string (nullable = true)
 |    |    |-- cerberus_id: string (nullable = true)
 |    |    |-- code: string (nullable = true)
 |    |    |-- couponId: integer (nullable = true)
 |    |    |-- created: string (nullable = true)
 |    |    |-- description: string (nullable = true)
 |    |    |-- domain: string (nullable = true)
 |    |    |-- expires: string (nullable = true)
 |    |    |-- message_id: string (nullable = true)
 |    |    |-- neverShowAfter: string (nullable = true)
 |    |    |-- neverShowBefore: string (nullable = true)
 |    |    |-- offerTitle: string (nullable = true)
 |    |    |-- result: struct (nullable = true)
 |    |    |    |-- post: struct (nullable = true)
 |    |    |    |    |-- alchemy_out_of_stock: struct (nullable = true)
 |    |    |    |    |    |-- ci: double (nullable = true)
 |    |    |    |    |    |-- value: boolean (nullable = true)
 |    |    |    |    |-- meta: struct (nullable = true)
 |    |    |    |    |    |-- None_tx_value: array (nullable = true)
 |    |    |    |    |    |    |-- element: string (containsNull = false)
 |    |    |    |    |    |-- exceptions: array (nullable = true)
 |    |    |    |    |    |    |-- element: string (containsNull = false)
 |    |    |    |    |    |-- no_input_value: array (nullable = true)
 |    |    |    |    |    |    |-- element: string (containsNull = false)
 |    |    |    |    |    |-- not_mapped: array (nullable = true)
 |    |    |    |    |    |    |-- element: string (containsNull = false)
 |    |    |    |    |    |-- not_transformed: array (nullable = true)
 |    |    |    |    |    |    |-- element: array (containsNull = false)
 |    |    |    |    |    |    |    |-- element: string (containsNull = false)
 |    |    |    |    |-- now_price_checkout: struct (nullable = true)
 |    |    |    |    |    |-- ci: double (nullable = true)
 |    |    |    |    |    |-- value: double (nullable = true)
 |    |    |    |    |-- shipping_price: struct (nullable = true)
{code}
[jira] [Comment Edited] (SPARK-6865) Decide on semantics for string identifiers in DataFrame API
[ https://issues.apache.org/jira/browse/SPARK-6865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493193#comment-14493193 ] Reynold Xin edited comment on SPARK-6865 at 4/13/15 10:26 PM: -- As discussed offline, it would make more sense to go with option 1, i.e. - "str" is treated as a quoted identifier in SQL, equivalent to `str`. - "*" is a special case in which it refers to all the columns in a data frame. (Note that this means we cannot have a column named "*", which I think is fine.) The reason is that strings are already quoted, and programmers expect them to be quoted literals without extra escaping. We will need to fix our resolver with respect to dots. was (Author: rxin): As discussed offline, it would make more sense to go with option 1, i.e. - "str" is treated as a quoted identifier in SQL, equivalent to `str`. - "*" is a special case in which it refers to all the columns in a data frame. (Note that this means we cannot have a column named "*", which I think is fine.) The reason is that strings are already quoted, and programmers expect them to be quoted literals without extra escaping. We will need to fix our resolver with respect to dots. Decide on semantics for string identifiers in DataFrame API --- Key: SPARK-6865 URL: https://issues.apache.org/jira/browse/SPARK-6865 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Priority: Blocker There are two options: - Quoted Identifiers: meaning that the strings are treated as though they were in backticks in SQL. Any weird characters (spaces, or, etc) are considered part of the identifier. Kind of weird given that `*` is already a special identifier explicitly allowed by the API - Unquoted parsed identifiers: would allow users to specify things like tableAlias.* However, would also require explicit use of `backticks` for identifiers with weird characters in them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
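Concretely, under option 1 the DataFrame API would behave as follows (df is any DataFrame; column names are invented for illustration):
{code}
// Illustration only, per the semantics described in the comment above.
df.select("a.b")   // looks up the column literally named "a.b",
                   // i.e. the SQL backtick-quoted identifier `a.b`
df.select("*")     // the one special case: expands to all columns
{code}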
[jira] [Resolved] (SPARK-6865) Decide on semantics for string identifiers in DataFrame API
[ https://issues.apache.org/jira/browse/SPARK-6865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Reynold Xin resolved SPARK-6865. Resolution: Fixed Fix Version/s: 1.4.0 Assignee: Reynold Xin This is now decided. Decide on semantics for string identifiers in DataFrame API --- Key: SPARK-6865 URL: https://issues.apache.org/jira/browse/SPARK-6865 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Michael Armbrust Assignee: Reynold Xin Priority: Blocker Fix For: 1.4.0 There are two options: - Quoted Identifiers: meaning that the strings are treated as though they were in backticks in SQL. Any weird characters (spaces, or, etc) are considered part of the identifier. Kind of weird given that `*` is already a special identifier explicitly allowed by the API - Unquoted parsed identifiers: would allow users to specify things like tableAlias.* However, would also require explicit use of `backticks` for identifiers with weird characters in them. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-2873) Support disk spilling in Spark SQL aggregation / join
[ https://issues.apache.org/jira/browse/SPARK-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust updated SPARK-2873: Priority: Blocker (was: Major) Support disk spilling in Spark SQL aggregation / join - Key: SPARK-2873 URL: https://issues.apache.org/jira/browse/SPARK-2873 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.0.0 Reporter: guowei Priority: Blocker -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1701) Inconsistent naming: slice or partition
[ https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493147#comment-14493147 ] Nicholas Chammas commented on SPARK-1701: - [~tgraves] - Shouldn't this issue be assigned to [~farrellee]? Inconsistent naming: slice or partition --- Key: SPARK-1701 URL: https://issues.apache.org/jira/browse/SPARK-1701 Project: Spark Issue Type: Improvement Components: Documentation, Spark Core Reporter: Daniel Darabos Assignee: Thomas Graves Priority: Minor Labels: starter Fix For: 1.2.0 Throughout the documentation and code slice and partition are used interchangeably. (Or so it seems to me.) It would avoid some confusion for new users to settle on one name. I think partition is winning, since that is the name of the class representing the concept. This should not be much more complicated to do than a search replace. I can take a stab at it, if you agree. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-4638) Spark's MLlib SVM classification to include Kernels like Gaussian / (RBF) to find non linear boundaries
[ https://issues.apache.org/jira/browse/SPARK-4638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493174#comment-14493174 ] Mandar Chandorkar edited comment on SPARK-4638 at 4/13/15 10:05 PM: Patch for the kernels implementation, taken against the current branch-1.3 of Apache Spark. [~amuise]: The patch is as you requested; if there are any problems with it, or if you need anything else, I will do my best to supply it. Thank you. was (Author: mandar2812): Patch for the kernels implementation taken against the current branch-1.3 of apache spark Spark's MLlib SVM classification to include Kernels like Gaussian / (RBF) to find non linear boundaries --- Key: SPARK-4638 URL: https://issues.apache.org/jira/browse/SPARK-4638 Project: Spark Issue Type: New Feature Components: MLlib Reporter: madankumar s Labels: Gaussian, Kernels, SVM Attachments: kernels-1.3.patch SPARK MLlib Classification Module: Add Kernel functionalities to SVM Classifier to find non linear patterns -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6887) ColumnBuilder misses FloatType
Yin Huai created SPARK-6887: --- Summary: ColumnBuilder misses FloatType Key: SPARK-6887 URL: https://issues.apache.org/jira/browse/SPARK-6887 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Yin Huai Assignee: Yin Huai Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6887) ColumnBuilder misses FloatType
[ https://issues.apache.org/jira/browse/SPARK-6887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6887: --- Assignee: Yin Huai (was: Apache Spark) ColumnBuilder misses FloatType -- Key: SPARK-6887 URL: https://issues.apache.org/jira/browse/SPARK-6887 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Yin Huai Assignee: Yin Huai Fix For: 1.4.0 To reproduce ...
{code}
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

val schema = StructType(StructField("c", FloatType, true) :: Nil)
val rdd = sc.parallelize(1 to 100).map(i => Row(i.toFloat))
sqlContext.createDataFrame(rdd, schema).registerTempTable("test")
sqlContext.sql("cache table test")
sqlContext.table("test").show
{code}
The exception is ...
{code}
15/04/13 15:00:12 INFO DAGScheduler: Job 0 failed: collect at SparkPlan.scala:88, took 0.474392 s
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 0.0 failed 1 times, most recent failure: Lost task 5.0 in stage 0.0 (TID 5, localhost): java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableFloat cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableLong
	at org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setLong(SpecificMutableRow.scala:292)
	at org.apache.spark.sql.columnar.compression.LongDelta$Decoder.next(compressionSchemes.scala:539)
	at org.apache.spark.sql.columnar.compression.CompressibleColumnAccessor$class.extractSingle(CompressibleColumnAccessor.scala:37)
	at org.apache.spark.sql.columnar.NativeColumnAccessor.extractSingle(ColumnAccessor.scala:64)
	at org.apache.spark.sql.columnar.BasicColumnAccessor.extractTo(ColumnAccessor.scala:54)
	at org.apache.spark.sql.columnar.NativeColumnAccessor.org$apache$spark$sql$columnar$NullableColumnAccessor$$super$extractTo(ColumnAccessor.scala:64)
	at org.apache.spark.sql.columnar.NullableColumnAccessor$class.extractTo(NullableColumnAccessor.scala:52)
	at org.apache.spark.sql.columnar.NativeColumnAccessor.extractTo(ColumnAccessor.scala:64)
	at org.apache.spark.sql.columnar.InMemoryColumnarTableScan$$anonfun$8$$anonfun$13$$anon$2.next(InMemoryColumnarTableScan.scala:295)
	at org.apache.spark.sql.columnar.InMemoryColumnarTableScan$$anonfun$8$$anonfun$13$$anon$2.next(InMemoryColumnarTableScan.scala:290)
	at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
	at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:130)
	at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:126)
	at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
	at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.run(Task.scala:64)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:210)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6887) ColumnBuilder misses FloatType
[ https://issues.apache.org/jira/browse/SPARK-6887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493186#comment-14493186 ] Apache Spark commented on SPARK-6887: - User 'yhuai' has created a pull request for this issue: https://github.com/apache/spark/pull/5499 ColumnBuilder misses FloatType -- Key: SPARK-6887 URL: https://issues.apache.org/jira/browse/SPARK-6887 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 1.4.0 Reporter: Yin Huai Assignee: Yin Huai Fix For: 1.4.0 To reproduce ...
{code}
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

val schema = StructType(StructField("c", FloatType, true) :: Nil)
val rdd = sc.parallelize(1 to 100).map(i => Row(i.toFloat))
sqlContext.createDataFrame(rdd, schema).registerTempTable("test")
sqlContext.sql("cache table test")
sqlContext.table("test").show
{code}
The exception is ...
{code}
15/04/13 15:00:12 INFO DAGScheduler: Job 0 failed: collect at SparkPlan.scala:88, took 0.474392 s
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 0.0 failed 1 times, most recent failure: Lost task 5.0 in stage 0.0 (TID 5, localhost): java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableFloat cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableLong
	at org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setLong(SpecificMutableRow.scala:292)
	at org.apache.spark.sql.columnar.compression.LongDelta$Decoder.next(compressionSchemes.scala:539)
	at org.apache.spark.sql.columnar.compression.CompressibleColumnAccessor$class.extractSingle(CompressibleColumnAccessor.scala:37)
	at org.apache.spark.sql.columnar.NativeColumnAccessor.extractSingle(ColumnAccessor.scala:64)
	at org.apache.spark.sql.columnar.BasicColumnAccessor.extractTo(ColumnAccessor.scala:54)
	at org.apache.spark.sql.columnar.NativeColumnAccessor.org$apache$spark$sql$columnar$NullableColumnAccessor$$super$extractTo(ColumnAccessor.scala:64)
	at org.apache.spark.sql.columnar.NullableColumnAccessor$class.extractTo(NullableColumnAccessor.scala:52)
	at org.apache.spark.sql.columnar.NativeColumnAccessor.extractTo(ColumnAccessor.scala:64)
	at org.apache.spark.sql.columnar.InMemoryColumnarTableScan$$anonfun$8$$anonfun$13$$anon$2.next(InMemoryColumnarTableScan.scala:295)
	at org.apache.spark.sql.columnar.InMemoryColumnarTableScan$$anonfun$8$$anonfun$13$$anon$2.next(InMemoryColumnarTableScan.scala:290)
	at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
	at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:130)
	at org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:126)
	at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
	at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.run(Task.scala:64)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:210)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6820) Convert NAs to null type in SparkR DataFrames
[ https://issues.apache.org/jira/browse/SPARK-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493223#comment-14493223 ] Antonio Piccolboni commented on SPARK-6820: --- For the distinction between NAs and NULLs in R, see http://www.r-bloggers.com/r-na-vs-null/ This seems a fairly dangerous move, but I don't have a good alternative to suggest. This is a valid data frame:
{code}
dd <- structure(list(c.1..2..NA. = c(1, 2, NA), V2 = list(1, 2, NULL)),
  .Names = c("c.1..2..NA.", "V2"), row.names = c(NA, -3L), class = "data.frame")
dd[3,1] == dd[3,2][[1]]
{code}
How often real code relies on list columns that can contain NULLs, I am not sure. Convert NAs to null type in SparkR DataFrames - Key: SPARK-6820 URL: https://issues.apache.org/jira/browse/SPARK-6820 Project: Spark Issue Type: New Feature Components: SparkR, SQL Reporter: Shivaram Venkataraman While converting RDD or local R DataFrame to a SparkR DataFrame we need to handle missing values or NAs. We should convert NAs to SparkSQL's null type to handle the conversion correctly -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6889) Streamline contribution process with update to Contribution wiki, JIRA rules
[ https://issues.apache.org/jira/browse/SPARK-6889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493252#comment-14493252 ] Sean Owen commented on SPARK-6889: -- For those that would like to comment directly on the two documents in an easier-to-use interface, they are available at: SparkProjectMechanicsChallenges https://docs.google.com/document/d/1eV7hWvVPLuZEtvjl72_qYx1iraYKuPzoZWFxtdL99QI/edit?usp=sharing ContributingToSpark https://docs.google.com/document/d/1tB9-f9lmxhC32QlOo4E8Z7eGDwHx1_Q3O8uCmRXQTo8/edit?usp=sharing But you can comment here too, and I will eventually update the PDFs if needed so that the latest discussion can be seen here promptly. Streamline contribution process with update to Contribution wiki, JIRA rules Key: SPARK-6889 URL: https://issues.apache.org/jira/browse/SPARK-6889 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Sean Owen Assignee: Sean Owen Attachments: ContributingtoSpark.pdf, SparkProjectMechanicsChallenges.pdf From about 6 months of intimate experience with the Spark JIRA and the reality of the JIRA / PR flow, I've observed some challenges, problems and growing pains that have begun to encumber the project mechanics. In the attached SparkProjectMechanicsChallenges.pdf document, I've collected these observations and a few statistics that summarize much of what I've seen. From side conversations with several of you, I think some of these will resonate. (Read it first for this to make sense.) I'd like to improve just one aspect to start: the contribution process. A lot of inbound contribution effort gets misdirected, and can burn a lot of cycles for everyone, and that's a barrier to scaling up further and to general happiness. I'd like to propose for discussion a change to the wiki pages, and a change to some JIRA settings. *Wiki* - Replace https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark with proposed text (NewContributingToSpark.pdf) - Delete https://cwiki.apache.org/confluence/display/SPARK/Reviewing+and+Merging+Patches as it is subsumed by the new text - Move the IDE Setup section to https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools - Delete https://cwiki.apache.org/confluence/display/SPARK/Jira+Permissions+Scheme as it's a bit out of date and not all that useful *JIRA* Now: Start by removing everyone from the 'Developer' role and add them to 'Contributor'. Right now Developer has no permission that Contributor doesn't. We may reuse Developer later for some level between Committer and Contributor. Later, with Apache admin assistance: - Make Component and Affects Version required for new JIRAs - Set default priority to Minor and type to Question for new JIRAs. If defaults aren't changed, by default it can't be that important - Only let Committers set Target Version and Fix Version - Only let Committers set Blocker Priority -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6703) Provide a way to discover existing SparkContext's
[ https://issues.apache.org/jira/browse/SPARK-6703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493303#comment-14493303 ] Evan Chan commented on SPARK-6703: -- Hey folks, thought I would just put in my 2 cents as the author of the Spark Jobserver. What is the envisioned way for multiple applications to share the same SparkContext? Code has to be running in the same JVM, and for most applications there must already exist some shared knowledge of the framework or environment. This will affect whether this feature is useful or not. For example, the Spark Jobserver requires jobs to implement an interface, and also manages creation of the SparkContext. That way, jobs get the SparkContext through a method call, and we can have other method calls to do things like input validation. What I'm saying is that this feature would have little value to existing job server users, as jobs in the job server already have a way to discover the existing context, and to implement a good RESTful API, for example. Another thing to think about is SQLContext and HiveContext. I realize there is the JDBC server, but in the job server we have a way to pass in alternative forms of the contexts. I suppose you could then add this method to a static SQLContext singleton as well. Provide a way to discover existing SparkContext's - Key: SPARK-6703 URL: https://issues.apache.org/jira/browse/SPARK-6703 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Ilya Ganelin Priority: Critical Right now it is difficult to write a Spark application in a way that can be run independently and also be composed with other Spark applications in an environment such as the JobServer, notebook servers, etc where there is a shared SparkContext. It would be nice to provide a rendez-vous point so that applications can learn whether an existing SparkContext already exists before creating one. The most simple/surgical way I see to do this is to have an optional static SparkContext singleton that people can be retrieved as follows: {code} val sc = SparkContext.getOrCreate(conf = new SparkConf()) {code} And you could also have a setter where some outer framework/server can set it for use by multiple downstream applications. A more advanced version of this would have some named registry or something, but since we only support a single SparkContext in one JVM at this point anyways, this seems sufficient and much simpler. Another advanced option would be to allow plugging in some other notion of configuration you'd pass when retrieving an existing context. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6881) Change the checkpoint directory name from checkpoints to checkpoint
[ https://issues.apache.org/jira/browse/SPARK-6881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shivaram Venkataraman resolved SPARK-6881. -- Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5493 [https://github.com/apache/spark/pull/5493] Change the checkpoint directory name from checkpoints to checkpoint --- Key: SPARK-6881 URL: https://issues.apache.org/jira/browse/SPARK-6881 Project: Spark Issue Type: Improvement Components: SparkR Affects Versions: 1.4.0 Reporter: Hao Priority: Trivial Fix For: 1.4.0 The name "checkpoint" (not "checkpoints") is what is listed in .gitignore. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6511) Publish hadoop provided build with instructions for different distros
[ https://issues.apache.org/jira/browse/SPARK-6511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493274#comment-14493274 ] Patrick Wendell commented on SPARK-6511: Can we just run HADOOP_HOME/bin/hadoop classpath and then capture the result? I'm wondering if there is a standard interface here we can expect most Hadoop distributions to have. Publish hadoop provided build with instructions for different distros --- Key: SPARK-6511 URL: https://issues.apache.org/jira/browse/SPARK-6511 Project: Spark Issue Type: Improvement Components: Build Reporter: Patrick Wendell Currently we publish a series of binaries with different Hadoop client jars. This mostly works, but some users have reported compatibility issues with different distributions. One improvement moving forward might be to publish a binary build that simply asks you to set HADOOP_HOME to pick up the Hadoop client location. That way it would work across multiple distributions, even if they have subtle incompatibilities with upstream Hadoop. I think a first step for this would be to produce such a build for the community and see how well it works. One potential issue is that our fancy excludes and dependency re-writing won't work with the simpler "append Hadoop's classpath to Spark" approach. Also, how we deal with the Hive dependency is unclear, i.e. should we continue to bundle Spark's Hive (which has some fixes for dependency conflicts) or do we allow for linking against vanilla Hive at runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
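If that works out, the hadoop-provided instructions could reduce to something like the following, untested across distros but relying only on the standard launcher script, which prints the full client classpath:
{code}
export SPARK_DIST_CLASSPATH=$("$HADOOP_HOME"/bin/hadoop classpath)
./bin/spark-shell
{code}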
[jira] [Created] (SPARK-6890) Local cluster mode in Mac is broken
Davies Liu created SPARK-6890: - Summary: Local cluster mode in Mac is broken Key: SPARK-6890 URL: https://issues.apache.org/jira/browse/SPARK-6890 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Davies Liu Assignee: Andrew Or Priority: Blocker The worker cannot be launched. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6151) schemaRDD to parquetfile with saveAsParquetFile control the HDFS block size
[ https://issues.apache.org/jira/browse/SPARK-6151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491881#comment-14491881 ] Littlestar edited comment on SPARK-6151 at 4/14/15 1:04 AM: The HDFS block size is set once, when you first install Hadoop. The block size can be chosen when a file is created, but Spark has no way to pass it through. FSDataOutputStream org.apache.hadoop.fs.FileSystem.create(Path f, boolean overwrite, int bufferSize, short replication, long blockSize) throws IOException was (Author: cnstar9988): The HDFS block size is set once, when you first install Hadoop. The block size can be chosen when a file is created. FSDataOutputStream org.apache.hadoop.fs.FileSystem.create(Path f, boolean overwrite, int bufferSize, short replication, long blockSize) throws IOException schemaRDD to parquetfile with saveAsParquetFile control the HDFS block size --- Key: SPARK-6151 URL: https://issues.apache.org/jira/browse/SPARK-6151 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 1.2.1 Reporter: Littlestar Priority: Trivial How can a SchemaRDD written with saveAsParquetFile control the HDFS block size? Maybe a Configuration option is needed. Related questions from others: http://apache-spark-user-list.1001560.n3.nabble.com/HDFS-block-size-for-parquet-output-tt21183.html http://qnalist.com/questions/5054892/spark-sql-parquet-and-impala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
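For reference, the Hadoop API quoted in the comment above can be used like this (all values here are arbitrary); the block size is only settable at file-creation time:
{code}
// Illustration of FileSystem.create(Path, overwrite, bufferSize, replication, blockSize).
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val out = fs.create(new Path("/tmp/example.dat"),
  true,                 // overwrite
  4096,                 // bufferSize
  3.toShort,            // replication
  256L * 1024 * 1024)   // blockSize: 256 MB
out.close()
{code}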
[jira] [Resolved] (SPARK-6877) Add code generation support for Min
[ https://issues.apache.org/jira/browse/SPARK-6877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-6877. - Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5487 [https://github.com/apache/spark/pull/5487] Add code generation support for Min --- Key: SPARK-6877 URL: https://issues.apache.org/jira/browse/SPARK-6877 Project: Spark Issue Type: New Feature Components: SQL Reporter: Liang-Chi Hsieh Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6890) Local cluster mode is broken
[ https://issues.apache.org/jira/browse/SPARK-6890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493448#comment-14493448 ] Andrew Or commented on SPARK-6890: -- I'm not actively working on this. Feel free to fix it since you and Nishkam have more experience in that part of the code. Local cluster mode is broken Key: SPARK-6890 URL: https://issues.apache.org/jira/browse/SPARK-6890 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Davies Liu Assignee: Andrew Or Priority: Critical In master, local cluster mode is broken. If I run `bin/spark-submit --master local-cluster[2,1,512]`, my executors keep failing with a class-not-found exception. It appears that the assembly jar is not added to the executors' class paths. I suspect that this is caused by https://github.com/apache/spark/pull/5085.
{code}
Exception in thread "main" java.lang.NoClassDefFoundError: scala/Option
	at java.lang.Class.getDeclaredMethods0(Native Method)
	at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
	at java.lang.Class.getMethod0(Class.java:2774)
	at java.lang.Class.getMethod(Class.java:1663)
	at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
	at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)
Caused by: java.lang.ClassNotFoundException: scala.Option
	at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6891) ExecutorAllocationManager will request negative number executors
meiyoula created SPARK-6891: --- Summary: ExecutorAllocationManager will request negative number executors Key: SPARK-6891 URL: https://issues.apache.org/jira/browse/SPARK-6891 Project: Spark Issue Type: Bug Components: Spark Core Reporter: meiyoula Priority: Critical Below is the exception:
{code}
15/04/14 10:10:18 ERROR Utils: Uncaught exception in thread spark-dynamic-executor-allocation-0
java.lang.IllegalArgumentException: Attempted to request a negative number of executor(s) -1 from the cluster manager. Please specify a positive number!
	at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:342)
	at org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1170)
	at org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:294)
	at org.apache.spark.ExecutorAllocationManager.addOrCancelExecutorRequests(ExecutorAllocationManager.scala:263)
	at org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:230)
	at org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply$mcV$sp(ExecutorAllocationManager.scala:189)
	at org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189)
	at org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189)
	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1723)
	at org.apache.spark.ExecutorAllocationManager$$anon$1.run(ExecutorAllocationManager.scala:189)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
	at java.lang.Thread.run(Thread.java:722)
{code}
Below are the configurations I set:
{code}
spark.dynamicAllocation.enabled              true
spark.dynamicAllocation.minExecutors         0
spark.dynamicAllocation.initialExecutors     3
spark.dynamicAllocation.maxExecutors         7
spark.dynamicAllocation.executorIdleTimeout  30
spark.shuffle.service.enabled                true
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
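Whatever the root cause turns out to be, the shape of the guard that avoids this error is simple (variable and helper names below are invented, not the actual internals): clamp the computed target before asking the cluster manager, which rejects negative numbers.
{code}
// Sketch only: never pass a negative executor count to the cluster manager.
val target = computeTargetExecutors()        // hypothetical; may briefly go negative
sc.requestTotalExecutors(math.max(target, 0))
{code}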
[jira] [Commented] (SPARK-6889) Streamline contribution process with update to Contribution wiki, JIRA rules
[ https://issues.apache.org/jira/browse/SPARK-6889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493254#comment-14493254 ] Patrick Wendell commented on SPARK-6889: Thanks for posting this Sean. Overall, I think this is a big improvement. Some comments on the proposed JIRA workflow changes: 1. I think logically Affects Version/s is required only for bugs, right? Is there a well defined meaning for Affects Version/s for a new feature that is distinct from Target Version/s? 2. I am not sure you can restrict certain priority levels to certain roles, but if so that would be really nice. Streamline contribution process with update to Contribution wiki, JIRA rules Key: SPARK-6889 URL: https://issues.apache.org/jira/browse/SPARK-6889 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Sean Owen Assignee: Sean Owen Attachments: ContributingtoSpark.pdf, SparkProjectMechanicsChallenges.pdf From about 6 months of intimate experience with the Spark JIRA and the reality of the JIRA / PR flow, I've observed some challenges, problems and growing pains that have begun to encumber the project mechanics. In the attached SparkProjectMechanicsChallenges.pdf document, I've collected these observations and a few statistics that summarize much of what I've seen. From side conversations with several of you, I think some of these will resonate. (Read it first for this to make sense.) I'd like to improve just one aspect to start: the contribution process. A lot of inbound contribution effort gets misdirected, and can burn a lot of cycles for everyone, and that's a barrier to scaling up further and to general happiness. I'd like to propose for discussion a change to the wiki pages, and a change to some JIRA settings. *Wiki* - Replace https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark with proposed text (NewContributingToSpark.pdf) - Delete https://cwiki.apache.org/confluence/display/SPARK/Reviewing+and+Merging+Patches as it is subsumed by the new text - Move the IDE Setup section to https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools - Delete https://cwiki.apache.org/confluence/display/SPARK/Jira+Permissions+Scheme as it's a bit out of date and not all that useful *JIRA* Now: Start by removing everyone from the 'Developer' role and add them to 'Contributor'. Right now Developer has no permission that Contributor doesn't. We may reuse Developer later for some level between Committer and Contributor. Later, with Apache admin assistance: - Make Component and Affects Version required for new JIRAs - Set default priority to Minor and type to Question for new JIRAs. If defaults aren't changed, by default it can't be that important - Only let Committers set Target Version and Fix Version - Only let Committers set Blocker Priority -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-5931) Use consistent naming for time properties
[ https://issues.apache.org/jira/browse/SPARK-5931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-5931: - Assignee: Ilya Ganelin (was: Andrew Or) Use consistent naming for time properties - Key: SPARK-5931 URL: https://issues.apache.org/jira/browse/SPARK-5931 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Or Assignee: Ilya Ganelin Fix For: 1.4.0 This is SPARK-5932's sister issue. The naming of existing time configs is inconsistent. We currently have the following throughout the code base: {code} spark.network.timeout // seconds spark.executor.heartbeatInterval // milliseconds spark.storage.blockManagerSlaveTimeoutMs // milliseconds spark.yarn.scheduler.heartbeat.interval-ms // milliseconds {code} Instead, my proposal is to simplify the config name itself and make everything accept time using the following format: 5s, 2ms, 100us. For instance: {code} spark.network.timeout = 5s spark.executor.heartbeatInterval = 500ms spark.storage.blockManagerSlaveTimeout = 100ms spark.yarn.scheduler.heartbeatInterval = 400ms {code} All existing configs that are relevant will be deprecated in favor of the new ones. We should do this soon before we keep introducing more time configs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
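For illustration, a minimal parser for the proposed format ("5s", "500ms", "100us") could look like the sketch below; this is only a sketch of the idea, not the utility Spark actually ships.
{code}
import java.util.concurrent.TimeUnit

object TimeString {
  private val Pattern = """(\d+)(us|ms|s|m|h)""".r

  // Parse strings such as "5s" or "500ms" into microseconds.
  def toMicros(str: String): Long = str.trim match {
    case Pattern(num, unit) =>
      val n = num.toLong
      unit match {
        case "us" => n
        case "ms" => TimeUnit.MILLISECONDS.toMicros(n)
        case "s"  => TimeUnit.SECONDS.toMicros(n)
        case "m"  => TimeUnit.MINUTES.toMicros(n)
        case "h"  => TimeUnit.HOURS.toMicros(n)
      }
    case other => throw new NumberFormatException(s"Invalid time string: $other")
  }
}
{code}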
[jira] [Commented] (SPARK-5111) HiveContext and Thriftserver cannot work in secure cluster beyond hadoop2.5
[ https://issues.apache.org/jira/browse/SPARK-5111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493270#comment-14493270 ] Yu Gao commented on SPARK-5111: --- Hi Zhan, which Spark version is going to have this fix? We ran into the same issue with Hadoop 2.6 + Kerberos, so we would like to see this fixed in Spark. Thanks. HiveContext and Thriftserver cannot work in secure cluster beyond hadoop2.5 --- Key: SPARK-5111 URL: https://issues.apache.org/jira/browse/SPARK-5111 Project: Spark Issue Type: Bug Components: SQL Reporter: Zhan Zhang Fails due to a java.lang.NoSuchFieldError: SASL_PROPS error. We need to backport some Hive 0.14 fixes into Spark, since there is no effort to upgrade Spark to Hive 0.14 support. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-5931) Use consistent naming for time properties
[ https://issues.apache.org/jira/browse/SPARK-5931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-5931. Resolution: Fixed Fix Version/s: 1.4.0 Use consistent naming for time properties - Key: SPARK-5931 URL: https://issues.apache.org/jira/browse/SPARK-5931 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 1.0.0 Reporter: Andrew Or Assignee: Ilya Ganelin Fix For: 1.4.0 This is SPARK-5932's sister issue. The naming of existing time configs is inconsistent. We currently have the following throughout the code base: {code} spark.network.timeout // seconds spark.executor.heartbeatInterval // milliseconds spark.storage.blockManagerSlaveTimeoutMs // milliseconds spark.yarn.scheduler.heartbeat.interval-ms // milliseconds {code} Instead, my proposal is to simplify the config name itself and make everything accept time using the following format: 5s, 2ms, 100us. For instance: {code} spark.network.timeout = 5s spark.executor.heartbeatInterval = 500ms spark.storage.blockManagerSlaveTimeout = 100ms spark.yarn.scheduler.heartbeatInterval = 400ms {code} All existing configs that are relevant will be deprecated in favor of the new ones. We should do this soon before we keep introducing more time configs. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6703) Provide a way to discover existing SparkContext's
[ https://issues.apache.org/jira/browse/SPARK-6703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6703: --- Assignee: Apache Spark (was: Ilya Ganelin) Provide a way to discover existing SparkContext's - Key: SPARK-6703 URL: https://issues.apache.org/jira/browse/SPARK-6703 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Apache Spark Priority: Critical Right now it is difficult to write a Spark application in a way that can be run independently and also be composed with other Spark applications in an environment such as the JobServer, notebook servers, etc. where there is a shared SparkContext. It would be nice to provide a rendezvous point so that applications can learn whether an existing SparkContext already exists before creating one. The most simple/surgical way I see to do this is to have an optional static SparkContext singleton that can be retrieved as follows: {code} val sc = SparkContext.getOrCreate(conf = new SparkConf()) {code} And you could also have a setter where some outer framework/server can set it for use by multiple downstream applications. A more advanced version of this would have some named registry or something, but since we only support a single SparkContext in one JVM at this point anyway, this seems sufficient and much simpler. Another advanced option would be to allow plugging in some other notion of configuration you'd pass when retrieving an existing context. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6703) Provide a way to discover existing SparkContext's
[ https://issues.apache.org/jira/browse/SPARK-6703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493306#comment-14493306 ] Apache Spark commented on SPARK-6703: - User 'ilganeli' has created a pull request for this issue: https://github.com/apache/spark/pull/5501 Provide a way to discover existing SparkContext's - Key: SPARK-6703 URL: https://issues.apache.org/jira/browse/SPARK-6703 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Ilya Ganelin Priority: Critical Right now it is difficult to write a Spark application in a way that can be run independently and also be composed with other Spark applications in an environment such as the JobServer, notebook servers, etc. where there is a shared SparkContext. It would be nice to provide a rendezvous point so that applications can learn whether an existing SparkContext already exists before creating one. The most simple/surgical way I see to do this is to have an optional static SparkContext singleton that can be retrieved as follows: {code} val sc = SparkContext.getOrCreate(conf = new SparkConf()) {code} And you could also have a setter where some outer framework/server can set it for use by multiple downstream applications. A more advanced version of this would have some named registry or something, but since we only support a single SparkContext in one JVM at this point anyway, this seems sufficient and much simpler. Another advanced option would be to allow plugging in some other notion of configuration you'd pass when retrieving an existing context. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6703) Provide a way to discover existing SparkContext's
[ https://issues.apache.org/jira/browse/SPARK-6703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6703: --- Assignee: Ilya Ganelin (was: Apache Spark) Provide a way to discover existing SparkContext's - Key: SPARK-6703 URL: https://issues.apache.org/jira/browse/SPARK-6703 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 1.3.0 Reporter: Patrick Wendell Assignee: Ilya Ganelin Priority: Critical Right now it is difficult to write a Spark application in a way that can be run independently and also be composed with other Spark applications in an environment such as the JobServer, notebook servers, etc. where there is a shared SparkContext. It would be nice to provide a rendezvous point so that applications can learn whether an existing SparkContext already exists before creating one. The most simple/surgical way I see to do this is to have an optional static SparkContext singleton that can be retrieved as follows: {code} val sc = SparkContext.getOrCreate(conf = new SparkConf()) {code} And you could also have a setter where some outer framework/server can set it for use by multiple downstream applications. A more advanced version of this would have some named registry or something, but since we only support a single SparkContext in one JVM at this point anyway, this seems sufficient and much simpler. Another advanced option would be to allow plugging in some other notion of configuration you'd pass when retrieving an existing context. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
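The shape of the proposed singleton is easy to sketch against a stand-in class (the actual change is in pull request 5501 and may differ):
{code}
// Stand-in for SparkContext, to keep the sketch self-contained.
class Context(val conf: Map[String, String])

object Context {
  private var active: Option[Context] = None

  // The setter mentioned in the description: an outer framework/server
  // can install a shared context for downstream applications.
  def setActive(ctx: Context): Unit = synchronized { active = Some(ctx) }

  // Return the existing context if one was already created, else create one.
  def getOrCreate(conf: Map[String, String]): Context = synchronized {
    active.getOrElse {
      val ctx = new Context(conf)
      active = Some(ctx)
      ctx
    }
  }
}
{code}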
[jira] [Updated] (SPARK-6890) Local cluster mode in Mac is broken
[ https://issues.apache.org/jira/browse/SPARK-6890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-6890: - Affects Version/s: 1.4.0 Local cluster mode in Mac is broken --- Key: SPARK-6890 URL: https://issues.apache.org/jira/browse/SPARK-6890 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Davies Liu Assignee: Andrew Or Priority: Blocker The worker cannot be launched. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6890) Local cluster mode is broken
[ https://issues.apache.org/jira/browse/SPARK-6890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-6890: - Assignee: Marcelo Vanzin (was: Andrew Or) Local cluster mode is broken Key: SPARK-6890 URL: https://issues.apache.org/jira/browse/SPARK-6890 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Davies Liu Assignee: Marcelo Vanzin Priority: Critical In master, local cluster mode is broken. If I run `bin/spark-submit --master local-cluster[2,1,512]`, my executors keep failing with class not found exception. It appears that the assembly jar is not added to the executors' class paths. I suspect that this is caused by https://github.com/apache/spark/pull/5085. {code} Exception in thread main java.lang.NoClassDefFoundError: scala/Option at java.lang.Class.getDeclaredMethods0(Native Method) at java.lang.Class.privateGetDeclaredMethods(Class.java:2531) at java.lang.Class.getMethod0(Class.java:2774) at java.lang.Class.getMethod(Class.java:1663) at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494) at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486) Caused by: java.lang.ClassNotFoundException: scala.Option at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6890) Local cluster mode is broken
[ https://issues.apache.org/jira/browse/SPARK-6890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493447#comment-14493447 ] Marcelo Vanzin commented on SPARK-6890: --- Also, another possible way to fix this is to pass the location of the assembly jar to the {{Main}} class, instead of the current code. That was my original suggestion when Nishkam was working on this. It makes the code a little uglier (to allow for plumbing that path through the code), but it would allow maintaining the behavior added by that patch while probably fixing this issue. Let me know if you're working on this, Andrew, otherwise I can do that. Local cluster mode is broken Key: SPARK-6890 URL: https://issues.apache.org/jira/browse/SPARK-6890 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Davies Liu Assignee: Andrew Or Priority: Critical In master, local cluster mode is broken. If I run `bin/spark-submit --master local-cluster[2,1,512]`, my executors keep failing with class not found exception. It appears that the assembly jar is not added to the executors' class paths. I suspect that this is caused by https://github.com/apache/spark/pull/5085. {code} Exception in thread main java.lang.NoClassDefFoundError: scala/Option at java.lang.Class.getDeclaredMethods0(Native Method) at java.lang.Class.privateGetDeclaredMethods(Class.java:2531) at java.lang.Class.getMethod0(Class.java:2774) at java.lang.Class.getMethod(Class.java:1663) at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494) at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486) Caused by: java.lang.ClassNotFoundException: scala.Option at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
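As a rough illustration of the suggestion above (hypothetical names, not the actual launcher code), the assembly jar location could be resolved once and plumbed into the executor classpath:
{code}
import java.io.File

object AssemblyClasspath {
  // Locate a spark-assembly*.jar under SPARK_HOME/lib; the layout assumed
  // here is illustrative.
  def findAssemblyJar(sparkHome: String): Option[File] = {
    val libDir = new File(sparkHome, "lib")
    Option(libDir.listFiles()).getOrElse(Array.empty[File])
      .find(f => f.getName.startsWith("spark-assembly") && f.getName.endsWith(".jar"))
  }

  // Prepend the assembly jar to whatever entries the launcher already has.
  def executorClasspath(sparkHome: String, extraEntries: Seq[String]): String =
    (findAssemblyJar(sparkHome).map(_.getAbsolutePath).toSeq ++ extraEntries)
      .mkString(File.pathSeparator)
}
{code}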
[jira] [Assigned] (SPARK-6890) Local cluster mode is broken
[ https://issues.apache.org/jira/browse/SPARK-6890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6890: --- Assignee: Apache Spark (was: Marcelo Vanzin) Local cluster mode is broken Key: SPARK-6890 URL: https://issues.apache.org/jira/browse/SPARK-6890 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Davies Liu Assignee: Apache Spark Priority: Critical In master, local cluster mode is broken. If I run `bin/spark-submit --master local-cluster[2,1,512]`, my executors keep failing with class not found exception. It appears that the assembly jar is not added to the executors' class paths. I suspect that this is caused by https://github.com/apache/spark/pull/5085. {code} Exception in thread main java.lang.NoClassDefFoundError: scala/Option at java.lang.Class.getDeclaredMethods0(Native Method) at java.lang.Class.privateGetDeclaredMethods(Class.java:2531) at java.lang.Class.getMethod0(Class.java:2774) at java.lang.Class.getMethod(Class.java:1663) at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494) at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486) Caused by: java.lang.ClassNotFoundException: scala.Option at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6890) Local cluster mode is broken
[ https://issues.apache.org/jira/browse/SPARK-6890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493478#comment-14493478 ] Apache Spark commented on SPARK-6890: - User 'vanzin' has created a pull request for this issue: https://github.com/apache/spark/pull/5504 Local cluster mode is broken Key: SPARK-6890 URL: https://issues.apache.org/jira/browse/SPARK-6890 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Davies Liu Assignee: Marcelo Vanzin Priority: Critical In master, local cluster mode is broken. If I run `bin/spark-submit --master local-cluster[2,1,512]`, my executors keep failing with class not found exception. It appears that the assembly jar is not added to the executors' class paths. I suspect that this is caused by https://github.com/apache/spark/pull/5085. {code} Exception in thread main java.lang.NoClassDefFoundError: scala/Option at java.lang.Class.getDeclaredMethods0(Native Method) at java.lang.Class.privateGetDeclaredMethods(Class.java:2531) at java.lang.Class.getMethod0(Class.java:2774) at java.lang.Class.getMethod(Class.java:1663) at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494) at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486) Caused by: java.lang.ClassNotFoundException: scala.Option at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-6890) Local cluster mode is broken
[ https://issues.apache.org/jira/browse/SPARK-6890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-6890: --- Assignee: Marcelo Vanzin (was: Apache Spark) Local cluster mode is broken Key: SPARK-6890 URL: https://issues.apache.org/jira/browse/SPARK-6890 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Davies Liu Assignee: Marcelo Vanzin Priority: Critical In master, local cluster mode is broken. If I run `bin/spark-submit --master local-cluster[2,1,512]`, my executors keep failing with class not found exception. It appears that the assembly jar is not added to the executors' class paths. I suspect that this is caused by https://github.com/apache/spark/pull/5085. {code} Exception in thread main java.lang.NoClassDefFoundError: scala/Option at java.lang.Class.getDeclaredMethods0(Native Method) at java.lang.Class.privateGetDeclaredMethods(Class.java:2531) at java.lang.Class.getMethod0(Class.java:2774) at java.lang.Class.getMethod(Class.java:1663) at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494) at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486) Caused by: java.lang.ClassNotFoundException: scala.Option at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5111) HiveContext and Thriftserver cannot work in secure cluster beyond hadoop2.5
[ https://issues.apache.org/jira/browse/SPARK-5111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493483#comment-14493483 ] Zhan Zhang commented on SPARK-5111: --- [~crystal_gaoyu] I am not sure. You could patch Spark yourself and give it a try. HiveContext and Thriftserver cannot work in secure cluster beyond hadoop2.5 --- Key: SPARK-5111 URL: https://issues.apache.org/jira/browse/SPARK-5111 Project: Spark Issue Type: Bug Components: SQL Reporter: Zhan Zhang Fails due to a java.lang.NoSuchFieldError: SASL_PROPS error. We need to backport some Hive 0.14 fixes into Spark, since there is no effort to upgrade Spark to Hive 0.14 support. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-6888) Make DriverQuirks editable
Rene Treffer created SPARK-6888: --- Summary: Make DriverQuirks editable Key: SPARK-6888 URL: https://issues.apache.org/jira/browse/SPARK-6888 Project: Spark Issue Type: Improvement Components: SQL Reporter: Rene Treffer Priority: Minor JDBC type conversion is currently handled by Spark with the help of DriverQuirks (org.apache.spark.sql.jdbc.DriverQuirks). However, some cases can't be resolved, e.g. MySQL BIGINT UNSIGNED. (Other UNSIGNED conversions won't work either, but could be resolved automatically by using the next larger type.) An invalid type conversion (e.g. loading an unsigned bigint with the highest bit set as a long value) causes the JDBC driver to throw an exception. The target type is determined automatically and bound to the resulting DataFrame, where it's immutable. Alternative solutions: - Subqueries, which produce extra load on the server - SQLContext / jdbc methods with schema support - Making it possible to change the schema of data frames -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
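A user-editable quirk for the BIGINT UNSIGNED case might look something like the sketch below; the interface here is hypothetical, not the existing DriverQuirks API.
{code}
import java.sql.Types

// Hypothetical hook: return Some(targetTypeName) to override the default
// JDBC-to-Spark type mapping for a column.
trait TypeQuirk {
  def overrideType(sqlType: Int, typeName: String): Option[String]
}

object MySQLUnsignedQuirk extends TypeQuirk {
  def overrideType(sqlType: Int, typeName: String): Option[String] =
    if (sqlType == Types.BIGINT && typeName.toUpperCase.contains("UNSIGNED"))
      Some("decimal(20,0)") // wide enough for 2^64 - 1, which a signed long is not
    else None
}
{code}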
[jira] [Resolved] (SPARK-6872) external sort need to copy
[ https://issues.apache.org/jira/browse/SPARK-6872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-6872. - Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 5481 [https://github.com/apache/spark/pull/5481] external sort need to copy -- Key: SPARK-6872 URL: https://issues.apache.org/jira/browse/SPARK-6872 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java
[ https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493218#comment-14493218 ] Yu Ishikawa edited comment on SPARK-6682 at 4/14/15 12:37 AM: -- I meant (a). I agree that we should only add a script for running all of the examples to make sure they run. I think adding unit test suites for the examples is a better way to go. Although this point may not be in the scope of this issue, it is a good time to add test suites along with it. Thanks. was (Author: yuu.ishik...@gmail.com): I meant (a). I agree with that we only add a script for running all of the examples to make sure they run. I think adding unit testing suites for examples is a better way to do. Although this point may be not the scope of this issue, it is a good timing to add test suites with this issue. Thanks. Deprecate static train and use builder instead for Scala/Java - Key: SPARK-6682 URL: https://issues.apache.org/jira/browse/SPARK-6682 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley In MLlib, we have for some time been unofficially moving away from the old static train() methods and moving towards builder patterns. This JIRA is to discuss this move and (hopefully) make it official. Old static train() API: {code} val myModel = NaiveBayes.train(myData, ...) {code} New builder pattern API: {code} val nb = new NaiveBayes().setLambda(0.1) val myModel = nb.train(myData) {code} Pros of the builder pattern: * Much less code when algorithms have many parameters. Since Java does not support default arguments, we required *many* duplicated static train() methods (for each prefix set of arguments). * Helps to enforce default parameters. Users should ideally not have to even think about setting parameters if they just want to try an algorithm quickly. * Matches spark.ml API Cons of the builder pattern: * In Python APIs, static train methods are more Pythonic. Proposal: * Scala/Java: We should start deprecating the old static train() methods. We must keep them for API stability, but deprecating will help with API consistency, making it clear that everyone should use the builder pattern. As we deprecate them, we should make sure that the builder pattern supports all parameters. * Python: Keep static train methods. CC: [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
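The migration path the proposal implies can be sketched as a deprecated static train() that simply delegates to the builder; the signatures below are simplified stand-ins, not MLlib's exact API.
{code}
class NaiveBayes {
  private var lambda: Double = 1.0
  def setLambda(value: Double): this.type = { lambda = value; this }
  // Stand-in for training; in MLlib this would return a real model.
  def run(data: Seq[(Double, Array[Double])]): String =
    s"model(lambda=$lambda, n=${data.size})"
}

object NaiveBayes {
  // Kept for API stability, but deprecated in favor of the builder.
  @deprecated("Use new NaiveBayes().setLambda(...).run(...) instead", "1.4.0")
  def train(data: Seq[(Double, Array[Double])], lambda: Double): String =
    new NaiveBayes().setLambda(lambda).run(data)
}
{code}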
[jira] [Updated] (SPARK-6890) Local cluster mode is broken
[ https://issues.apache.org/jira/browse/SPARK-6890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-6890: - Summary: Local cluster mode is broken (was: Local cluster mode in Mac is broken) Local cluster mode is broken Key: SPARK-6890 URL: https://issues.apache.org/jira/browse/SPARK-6890 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Davies Liu Assignee: Andrew Or Priority: Critical In master, local cluster mode is broken. If I run `bin/spark-submit --master local-cluster[2,1,512]`, my executors keep failing with class not found exception. It appears that the assembly jar is not added to the executors' class paths. I suspect that this is caused by https://github.com/apache/spark/pull/5085. {code} Exception in thread main java.lang.NoClassDefFoundError: scala/Option at java.lang.Class.getDeclaredMethods0(Native Method) at java.lang.Class.privateGetDeclaredMethods(Class.java:2531) at java.lang.Class.getMethod0(Class.java:2774) at java.lang.Class.getMethod(Class.java:1663) at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494) at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486) Caused by: java.lang.ClassNotFoundException: scala.Option at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6890) Local cluster mode in Mac is broken
[ https://issues.apache.org/jira/browse/SPARK-6890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or updated SPARK-6890: - Priority: Critical (was: Blocker) Local cluster mode in Mac is broken --- Key: SPARK-6890 URL: https://issues.apache.org/jira/browse/SPARK-6890 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Davies Liu Assignee: Andrew Or Priority: Critical In master, local cluster mode is broken. If I run `bin/spark-submit --master local-cluster[2,1,512]`, my executors keep failing with class not found exception. It appears that the assembly jar is not added to the executors' class paths. I suspect that this is caused by https://github.com/apache/spark/pull/5085. {code} Exception in thread main java.lang.NoClassDefFoundError: scala/Option at java.lang.Class.getDeclaredMethods0(Native Method) at java.lang.Class.privateGetDeclaredMethods(Class.java:2531) at java.lang.Class.getMethod0(Class.java:2774) at java.lang.Class.getMethod(Class.java:1663) at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494) at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486) Caused by: java.lang.ClassNotFoundException: scala.Option at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6889) Streamline contribution process with update to Contribution wiki, JIRA rules
[ https://issues.apache.org/jira/browse/SPARK-6889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493394#comment-14493394 ] Nicholas Chammas commented on SPARK-6889: - Thanks for continuing to work on improving the contribution process, [~srowen]. The changes you are proposing look great to me (especially regarding JIRA workflow and permissions), and I wholeheartedly agree with your conclusion about directing non-committer attention and energy appropriately: {quote} I first want to figure out how to better direct the enthusiasm of new contributors instead, to make less wasted effort and thus less work all around. {quote} I'd like to add a few suggestions for everyone's consideration. \\ 1. Our contribution guide will work better when a) it is more visible and b) specific parts of it can be referenced easily. a) Visibility: Maybe it's just me, but to me the wiki feels like a dusty warehouse in a quiet part of town. People just don't go there often. The high-traffic area we already have to present contribution guidelines is [in the repo itself|https://github.com/apache/spark/blob/master/CONTRIBUTING.md]. I would favor moving the contributing guide wholesale there and reducing the wiki version to a link. b) Easy References: Our contributing guide is already quite lengthy. For the newcomer it will definitely be onerous to read through. This is unavoidable for the time being, but it does mean that people will continue to (as they have been) contribute without reading the whole guide or reading the guide at all. This means we'll want to direct people to the appropriate parts of the guide when relevant. So I think being able to link to specific sections is very important. \\ 2. We need to give more importance, culturally, to the process of turning down or redirecting work that does not fit Spark's current roadmap. And that also needs to be reflected in our contribution guide. Not having this culture, as far as I can tell, is the #2 reason we have so many open, stale PRs, which amount to wasted work and unhappy contributors. (The #1 reason is that there is simply not enough committer time to go around). This is addressed in the proposed contributing guide under the sub-section Choosing What to Contribute, but I think it needs to be much more prominent and easily reference-able. To me, this is much more important than describing the mechanics of using JIRA/GitHub (though, of course, that is still necessary). To provide a motivating example, take a look at the [contributing guide for the Phabricator project|https://secure.phabricator.com/book/phabcontrib/article/contributing_code/]. There is a large section dedicated to explaining why a patch might be rejected. Furthermore, the guide gives top prominence to the importance of coordinating first before contributing non-trivial changes. [Phabricator - Contributing Code|https://secure.phabricator.com/book/phabcontrib/article/contributing_code/]: {quote} h3. Coordinate First ... h3. Rejecting Patches If you send us a patch without coordinating it with us first, it will probably be immediately rejected, or sit in limbo for a long time and eventually be rejected. The reasons we do this vary from patch to patch, but some of the most common reasons are: ... {quote} More importantly, the Phabricator core devs back up this guide with effective action. 
For example, take a look at [this exchange|https://secure.phabricator.com/D9724#79498] between Evan Priestley (one of the project's leads) and a contributor, where Evan gives a firm but appropriate no to a proposed patch. [Phabricator - Allow searching pholio mocks by project|https://secure.phabricator.com/D9724#79498]: {quote} Phabricator moves pretty quickly, especially given how small the core team is. A big part of that is being aggressive about avoiding and reducing technical debt. This patch -- and patches like it -- add technical debt by solving a problem with a planned long-term solution in a short-term way. The benefit you get from us saying no here is that the project as a whole moves faster. {quote} I would love to see more Spark committers doing this on a regular basis. I'm sure people will at first feel uncomfortable about turning down work directly because it somehow feels rude, even if that work doesn't fit Spark's roadmap or is somehow otherwise off. But with the right communication and the long-term health of the project in mind, we can make it into a good habit that benefits both committers and contributors. Streamline contribution process with update to Contribution wiki, JIRA rules Key: SPARK-6889 URL: https://issues.apache.org/jira/browse/SPARK-6889 Project: Spark Issue Type: Improvement Components:
[jira] [Resolved] (SPARK-6303) Remove unnecessary Average in GeneratedAggregate
[ https://issues.apache.org/jira/browse/SPARK-6303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-6303. - Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 4996 [https://github.com/apache/spark/pull/4996] Remove unnecessary Average in GeneratedAggregate Key: SPARK-6303 URL: https://issues.apache.org/jira/browse/SPARK-6303 Project: Spark Issue Type: Improvement Components: SQL Reporter: Liang-Chi Hsieh Priority: Minor Fix For: 1.4.0 Because {{Average}} is a {{PartialAggregate}}, we never get an {{Average}} node when reaching {{HashAggregation}} to prepare {{GeneratedAggregate}}. That is why SQLQuerySuite already has a working test for {{avg}} with codegen. However, {{GeneratedAggregate}} still contains a case to deal with {{Average}}. Based on the above, this case is never executed, so we can remove it from {{GeneratedAggregate}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6511) Publish hadoop provided build with instructions for different distros
[ https://issues.apache.org/jira/browse/SPARK-6511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493283#comment-14493283 ] Marcelo Vanzin commented on SPARK-6511: --- I think {{hadoop classpath}} would be safer w.r.t. compatibility, if you don't mind the extra overhead (it launches a JVM). One thing to remember in that case is to use the {{--config}} parameter to point to the actual config directory being used. Publish hadoop provided build with instructions for different distros --- Key: SPARK-6511 URL: https://issues.apache.org/jira/browse/SPARK-6511 Project: Spark Issue Type: Improvement Components: Build Reporter: Patrick Wendell Currently we publish a series of binaries with different Hadoop client jars. This mostly works, but some users have reported compatibility issues with different distributions. One improvement moving forward might be to publish a binary build that simply asks you to set HADOOP_HOME to pick up the Hadoop client location. That way it would work across multiple distributions, even if they have subtle incompatibilities with upstream Hadoop. I think a first step for this would be to produce such a build for the community and see how well it works. One potential issue is that our fancy excludes and dependency re-writing won't work with the simpler approach of appending Hadoop's classpath to Spark's. Also, how we deal with the Hive dependency is unclear, i.e. should we continue to bundle Spark's Hive (which has some fixes for dependency conflicts) or allow linking against vanilla Hive at runtime. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
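If the {{hadoop classpath}} route is taken, computing it at launch time is simple to sketch; the snippet below assumes the hadoop launcher is on PATH and, per the comment above, passes --config with the config directory actually in use.
{code}
import scala.sys.process._

object HadoopClasspath {
  // Shell out to `hadoop --config <confDir> classpath` and return the result.
  def apply(confDir: String): String =
    Seq("hadoop", "--config", confDir, "classpath").!!.trim
}
{code}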
[jira] [Assigned] (SPARK-5888) Add OneHotEncoder as a Transformer
[ https://issues.apache.org/jira/browse/SPARK-5888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5888: --- Assignee: Apache Spark (was: Sandy Ryza) Add OneHotEncoder as a Transformer -- Key: SPARK-5888 URL: https://issues.apache.org/jira/browse/SPARK-5888 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xiangrui Meng Assignee: Apache Spark `OneHotEncoder` takes a categorical column and outputs a vector column, which stores the category info in binary form. {code} val ohe = new OneHotEncoder() .setInputCol("countryIndex") .setOutputCol("countries") {code} It should read the category info from the metadata and assign feature names properly in the output column. We need to discuss the default naming scheme and whether we should let it process multiple categorical columns at the same time. One category (the most frequent one) should be removed from the output to make the output columns linearly independent. Or this could be an option turned on by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-5888) Add OneHotEncoder as a Transformer
[ https://issues.apache.org/jira/browse/SPARK-5888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-5888: --- Assignee: Sandy Ryza (was: Apache Spark) Add OneHotEncoder as a Transformer -- Key: SPARK-5888 URL: https://issues.apache.org/jira/browse/SPARK-5888 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xiangrui Meng Assignee: Sandy Ryza `OneHotEncoder` takes a categorical column and outputs a vector column, which stores the category info in binary form. {code} val ohe = new OneHotEncoder() .setInputCol("countryIndex") .setOutputCol("countries") {code} It should read the category info from the metadata and assign feature names properly in the output column. We need to discuss the default naming scheme and whether we should let it process multiple categorical columns at the same time. One category (the most frequent one) should be removed from the output to make the output columns linearly independent. Or this could be an option turned on by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5888) Add OneHotEncoder as a Transformer
[ https://issues.apache.org/jira/browse/SPARK-5888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493302#comment-14493302 ] Apache Spark commented on SPARK-5888: - User 'sryza' has created a pull request for this issue: https://github.com/apache/spark/pull/5500 Add OneHotEncoder as a Transformer -- Key: SPARK-5888 URL: https://issues.apache.org/jira/browse/SPARK-5888 Project: Spark Issue Type: Sub-task Components: ML Reporter: Xiangrui Meng Assignee: Sandy Ryza `OneHotEncoder` takes a categorical column and outputs a vector column, which stores the category info in binary form. {code} val ohe = new OneHotEncoder() .setInputCol("countryIndex") .setOutputCol("countries") {code} It should read the category info from the metadata and assign feature names properly in the output column. We need to discuss the default naming scheme and whether we should let it process multiple categorical columns at the same time. One category (the most frequent one) should be removed from the output to make the output columns linearly independent. Or this could be an option turned on by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
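Conceptually the transform does the following (sketched on plain Scala collections rather than DataFrames; this version drops the last category, whereas the description suggests dropping the most frequent one):
{code}
object OneHot {
  // Encode a category index as a binary vector. With dropLast = true the
  // final category is encoded as the all-zeros vector, keeping the output
  // columns linearly independent.
  def encode(index: Int, numCategories: Int, dropLast: Boolean = true): Array[Double] = {
    val size = if (dropLast) numCategories - 1 else numCategories
    val v = Array.fill(size)(0.0)
    if (index < size) v(index) = 1.0
    v
  }
}

// e.g. with 3 countries: encode(0, 3) yields Array(1.0, 0.0),
// and encode(2, 3) yields Array(0.0, 0.0)
{code}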
[jira] [Updated] (SPARK-4766) ML Estimator Params should subclass Transformer Params
[ https://issues.apache.org/jira/browse/SPARK-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-4766: - Target Version/s: 1.4.0 (was: 1.3.0) ML Estimator Params should subclass Transformer Params -- Key: SPARK-4766 URL: https://issues.apache.org/jira/browse/SPARK-4766 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 1.2.0 Reporter: Joseph K. Bradley Currently, in spark.ml, both Transformers and Estimators extend the same Params classes. There should be one Params class for the Transformer and one for the Estimator, where the Estimator params class extends the Transformer one. E.g., it is weird to be able to do: {code} val model: LogisticRegressionModel = ... model.getMaxIter() {code} It's also weird to be able to: * Wrap LogisticRegressionModel (a Transformer) with CrossValidator * Pass a set of ParamMaps to CrossValidator which includes parameter LogisticRegressionModel.maxIter * (CrossValidator would try to set that parameter.) * I'm not sure if this would cause a failure or just be a noop. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
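The proposed split can be sketched with simplified stand-in traits: the model (a Transformer) only sees prediction-time params, while the estimator's params trait extends the model's and adds training-time params such as maxIter.
{code}
// Prediction-time params, shared by estimator and model.
trait ProbabilisticClassifierParams {
  var threshold: Double = 0.5
}

// Training-time params extend the prediction-time ones.
trait LogisticRegressionParams extends ProbabilisticClassifierParams {
  var maxIter: Int = 100
}

class LogisticRegressionModel extends ProbabilisticClassifierParams // no maxIter here
class LogisticRegression extends LogisticRegressionParams           // produces the model
{code}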
[jira] [Commented] (SPARK-6890) Local cluster mode is broken
[ https://issues.apache.org/jira/browse/SPARK-6890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493432#comment-14493432 ] Marcelo Vanzin commented on SPARK-6890: --- Do you have `SPARK_PREPEND_CLASSES` set by any chance? BTW, personally, I was against the change that causes this failure, and again personally I wouldn't really be against reverting it. It seems to cause more issues than it solves. [/cc [~nravi]] Local cluster mode is broken Key: SPARK-6890 URL: https://issues.apache.org/jira/browse/SPARK-6890 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 1.4.0 Reporter: Davies Liu Assignee: Andrew Or Priority: Critical In master, local cluster mode is broken. If I run `bin/spark-submit --master local-cluster[2,1,512]`, my executors keep failing with class not found exception. It appears that the assembly jar is not added to the executors' class paths. I suspect that this is caused by https://github.com/apache/spark/pull/5085. {code} Exception in thread main java.lang.NoClassDefFoundError: scala/Option at java.lang.Class.getDeclaredMethods0(Native Method) at java.lang.Class.privateGetDeclaredMethods(Class.java:2531) at java.lang.Class.getMethod0(Class.java:2774) at java.lang.Class.getMethod(Class.java:1663) at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494) at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486) Caused by: java.lang.ClassNotFoundException: scala.Option at java.net.URLClassLoader$1.run(URLClassLoader.java:366) at java.net.URLClassLoader$1.run(URLClassLoader.java:355) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6891) ExecutorAllocationManager will request negative number executors
[ https://issues.apache.org/jira/browse/SPARK-6891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] meiyoula updated SPARK-6891: Description: Below is the exception: 15/04/14 10:10:18 ERROR Utils: Uncaught exception in thread spark-dynamic-executor-allocation-0 java.lang.IllegalArgumentException: Attempted to request a negative number of executor(s) -1 from the cluster manager. Please specify a positive number! at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:342) at org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1170) at org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:294) at org.apache.spark.ExecutorAllocationManager.addOrCancelExecutorRequests(ExecutorAllocationManager.scala:263) at org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:230) at org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply$mcV$sp(ExecutorAllocationManager.scala:189) at org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189) at org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1723) at org.apache.spark.ExecutorAllocationManager$$anon$1.run(ExecutorAllocationManager.scala:189) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471) at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603) at java.lang.Thread.run(Thread.java:722) Below are the configurations I set: spark.dynamicAllocation.enabled true spark.dynamicAllocation.minExecutors 0 spark.dynamicAllocation.initialExecutors 3 spark.dynamicAllocation.maxExecutors 7 spark.dynamicAllocation.executorIdleTimeout 30 spark.shuffle.service.enabled true -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-6889) Streamline contribution process with update to Contribution wiki, JIRA rules
[ https://issues.apache.org/jira/browse/SPARK-6889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-6889: - Attachment: ContributingtoSpark.pdf SparkProjectMechanicsChallenges.pdf Streamline contribution process with update to Contribution wiki, JIRA rules Key: SPARK-6889 URL: https://issues.apache.org/jira/browse/SPARK-6889 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Sean Owen Assignee: Sean Owen Attachments: ContributingtoSpark.pdf, SparkProjectMechanicsChallenges.pdf From about 6 months of intimate experience with the Spark JIRA and the reality of the JIRA / PR flow, I've observed some challenges, problems and growing pains that have begun to encumber the project mechanics. In the attached SparkProjectMechanicsChallenges.pdf document, I've collected these observations and a few statistics that summarize much of what I've seen. From side conversations with several of you, I think some of these will resonate. (Read it first for this to make sense.) I'd like to improve just one aspect to start: the contribution process. A lot of inbound contribution effort gets misdirected, and can burn a lot of cycles for everyone, and that's a barrier to scaling up further and to general happiness. I'd like to propose for discussion a change to the wiki pages, and a change to some JIRA settings. *Wiki* - Replace https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark with proposed text (NewContributingToSpark.pdf) - Delete https://cwiki.apache.org/confluence/display/SPARK/Reviewing+and+Merging+Patches as it is subsumed by the new text - Move the IDE Setup section to https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools - Delete https://cwiki.apache.org/confluence/display/SPARK/Jira+Permissions+Scheme as it's a bit out of date and not all that useful *JIRA* Now: Start by removing everyone from the 'Developer' role and add them to 'Contributor'. Right now Developer has no permission that Contributor doesn't. We may reuse Developer later for some level between Committer and Contributor. Later, with Apache admin assistance: - Make Component and Affects Version required for new JIRAs - Set default priority to Minor and type to Question for new JIRAs. If defaults aren't changed, by default it can't be that important - Only let Committers set Target Version and Fix Version - Only let Committers set Blocker Priority -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-5941) Unit Test loads the table `src` twice for leftsemijoin.q
[ https://issues.apache.org/jira/browse/SPARK-5941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-5941. - Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 4506 [https://github.com/apache/spark/pull/4506] Unit Test loads the table `src` twice for leftsemijoin.q Key: SPARK-5941 URL: https://issues.apache.org/jira/browse/SPARK-5941 Project: Spark Issue Type: Bug Components: SQL Reporter: Cheng Hao Fix For: 1.4.0 In leftsemijoin.q, there is already a data loading command for the table sales, but TestHive also creates/loads the table sales, which causes duplicated records to be inserted into the sales table. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493251#comment-14493251 ] Kannan Rajah commented on SPARK-1529: - You can use the Compare functionality to see a single page of diffs across commits. Here is the link: https://github.com/rkannan82/spark/compare/4aaf48d46d13129f0f9bdafd771dd80fe568a7dc...rkannan82:7195353a31f7cfb087ec804b597b01fb362bc3f6 A few clarifications. 1. There are 2 reasons for introducing a FileSystem abstraction in Spark instead of directly using Hadoop FileSystem. - There are Spark shuffle specific APIs that needed abstraction. Please take a look at this code: https://github.com/rkannan82/spark/blob/dfs_shuffle/core/src/main/scala/org/apache/spark/storage/FileSystem.scala - For local file system access, we can choose to circumvent using Hadoop's local file system implementation if it's not efficient. If you look at LocalFileSystem.scala, for most APIs it just delegates to the old code using Spark's disk block manager, etc. In fact, we can just look at this single class and determine if we will hit any performance degradation for the default Apache shuffle code path. https://github.com/rkannan82/spark/blob/dfs_shuffle/core/src/main/scala/org/apache/spark/storage/LocalFileSystem.scala 2. During the write phase, we shuffle to HDFS instead of the local file system. While reading back, we don't use the Netty-based transport that Apache shuffle uses. Instead we have a new implementation called DFSShuffleClient that reads from HDFS. That is the main difference. https://github.com/rkannan82/spark/blob/dfs_shuffle/network/shuffle/src/main/java/org/apache/spark/network/shuffle/DFSShuffleClient.java Support setting spark.local.dirs to a hadoop FileSystem Key: SPARK-1529 URL: https://issues.apache.org/jira/browse/SPARK-1529 Project: Spark Issue Type: Improvement Components: Spark Core Reporter: Patrick Wendell Assignee: Kannan Rajah Attachments: Spark Shuffle using HDFS.pdf In some environments, like with MapR, local volumes are accessed through the Hadoop filesystem interface. We should allow setting spark.local.dir to a Hadoop filesystem location. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
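The kind of shuffle-oriented abstraction described above might have a surface like the sketch below (the method set is illustrative, not the actual code in the linked branch): a local implementation can delegate to the existing disk block manager while a DFS implementation targets HDFS.
{code}
import java.io.{InputStream, OutputStream}

// Minimal shuffle file-system surface; implementations decide whether paths
// resolve to local disk or to a Hadoop FileSystem such as HDFS.
trait ShuffleFileSystem {
  def create(path: String): OutputStream
  def open(path: String): InputStream
  def exists(path: String): Boolean
  def delete(path: String): Boolean
}
{code}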
[jira] [Commented] (SPARK-6889) Streamline contribution process with update to Contribution wiki, JIRA rules
[ https://issues.apache.org/jira/browse/SPARK-6889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493299#comment-14493299 ] Marcelo Vanzin commented on SPARK-6889: --- I left a couple of comments on the docs themselves, but overall this looks like a good direction to follow. I've seen some projects with public issue trackers (KDE is one, I believe Mozilla does the same) where all new bugs are created with status unconfirmed, and only someone with appropriate permissions can transition the bug to open. That is similar to Sean's suggestion of having the default be Question / Minor, in a way, but I wonder how much that actually helps with managing open issues. Streamline contribution process with update to Contribution wiki, JIRA rules Key: SPARK-6889 URL: https://issues.apache.org/jira/browse/SPARK-6889 Project: Spark Issue Type: Improvement Components: Documentation Reporter: Sean Owen Assignee: Sean Owen Attachments: ContributingtoSpark.pdf, SparkProjectMechanicsChallenges.pdf From about 6 months of intimate experience with the Spark JIRA and the reality of the JIRA / PR flow, I've observed some challenges, problems and growing pains that have begun to encumber the project mechanics. In the attached SparkProjectMechanicsChallenges.pdf document, I've collected these observations and a few statistics that summarize much of what I've seen. From side conversations with several of you, I think some of these will resonate. (Read it first for this to make sense.) I'd like to improve just one aspect to start: the contribution process. A lot of inbound contribution effort gets misdirected, and can burn a lot of cycles for everyone, and that's a barrier to scaling up further and to general happiness. I'd like to propose for discussion a change to the wiki pages, and a change to some JIRA settings. *Wiki* - Replace https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark with proposed text (NewContributingToSpark.pdf) - Delete https://cwiki.apache.org/confluence/display/SPARK/Reviewing+and+Merging+Patches as it is subsumed by the new text - Move the IDE Setup section to https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools - Delete https://cwiki.apache.org/confluence/display/SPARK/Jira+Permissions+Scheme as it's a bit out of date and not all that useful *JIRA* Now: Start by removing everyone from the 'Developer' role and add them to 'Contributor'. Right now Developer has no permission that Contributor doesn't. We may reuse Developer later for some level between Committer and Contributor. Later, with Apache admin assistance: - Make Component and Affects Version required for new JIRAs - Set default priority to Minor and type to Question for new JIRAs. If defaults aren't changed, by default it can't be that important - Only let Committers set Target Version and Fix Version - Only let Committers set Blocker Priority -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-4848) Allow different Worker configurations in standalone cluster
[ https://issues.apache.org/jira/browse/SPARK-4848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrew Or closed SPARK-4848. Resolution: Fixed Fix Version/s: 1.4.0 Assignee: Nathan Kronenfeld Target Version/s: 1.4.0
Allow different Worker configurations in standalone cluster --- Key: SPARK-4848 URL: https://issues.apache.org/jira/browse/SPARK-4848 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 1.0.0 Environment: stand-alone spark cluster Reporter: Nathan Kronenfeld Assignee: Nathan Kronenfeld Fix For: 1.4.0 Original Estimate: 24h Remaining Estimate: 24h
On a stand-alone spark cluster, much of the determination of worker specifics, especially when one has multiple instances per node, is done only on the master. The master loops over instances and starts a worker per instance on each node. This means that if your workers have values of SPARK_WORKER_INSTANCES or SPARK_WORKER_WEBUI_PORT that differ from each other (or from the master), all values except the master's are ignored. SPARK_WORKER_PORT appears to be unread in the scripts but is read in code; I'm not sure how it will behave, since all instances will read the same value from the environment. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
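A minimal sketch of why the master's values win, assuming the launch behavior works roughly as the report above describes (the names below are invented for illustration; the real logic lives in the standalone start scripts): the instance count and ports are read once, from the master's environment, and every worker is started with those values regardless of its own environment.
{code}
// Hypothetical sketch of the launch behavior described in SPARK-4848;
// not the actual script logic.
val instances = sys.env.getOrElse("SPARK_WORKER_INSTANCES", "1").toInt   // master's value only
val webUiPort = sys.env.getOrElse("SPARK_WORKER_WEBUI_PORT", "8081").toInt

def startWorker(node: String, instance: Int): Unit =
  println(s"starting worker $instance on $node, webui port ${webUiPort + instance - 1}")

// Every node gets the master's settings; the workers' own environments
// (their SPARK_WORKER_INSTANCES, SPARK_WORKER_WEBUI_PORT) are never consulted.
for (node <- Seq("node1", "node2"); i <- 1 to instances)
  startWorker(node, i)
{code}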
[jira] [Resolved] (SPARK-5794) add jar should return 0
[ https://issues.apache.org/jira/browse/SPARK-5794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-5794. - Resolution: Fixed Fix Version/s: 1.4.0 Issue resolved by pull request 4586 [https://github.com/apache/spark/pull/4586] add jar should return 0 --- Key: SPARK-5794 URL: https://issues.apache.org/jira/browse/SPARK-5794 Project: Spark Issue Type: Bug Components: SQL Reporter: Adrian Wang Priority: Minor Fix For: 1.4.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3727) Trees and ensembles: More prediction functionality
[ https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493455#comment-14493455 ] Michael Kuhlen commented on SPARK-3727: --- [~josephkb] The design document is great, thanks for sharing. It looks like a big step forward. I'd be happy to work on either or both of the subtasks, but note that I'm going to have to be a weekend warrior on this stuff (busy at work during the week). I'm going to start by familiarizing myself with spark.ml and the new API, to see if and how to port over the changes I've made so far.
Trees and ensembles: More prediction functionality -- Key: SPARK-3727 URL: https://issues.apache.org/jira/browse/SPARK-3727 Project: Spark Issue Type: Improvement Components: MLlib Reporter: Joseph K. Bradley
DecisionTree and RandomForest currently predict the most likely label for classification and the mean for regression. Other information about predictions would be useful:
* For classification: the estimated probability of each possible label
* For regression: the variance of the estimate
RandomForest could also create aggregate predictions in multiple ways:
* Predict the mean or median value for regression.
* Compute the variance of estimates (across all trees) for both classification and regression.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
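For the classification item above, here is a minimal sketch of what per-label probability estimates could look like if computed by counting tree votes, built on the existing MLlib 1.3 RandomForestModel. This is one possible construction for illustration, not the design under discussion:
{code}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.tree.model.RandomForestModel

// Illustrative sketch: estimate P(label | features) as each label's share of
// the trees' hard votes.
def predictProbabilities(model: RandomForestModel, features: Vector): Map[Double, Double] = {
  val votes = model.trees.map(_.predict(features))    // one hard vote per tree
  votes.groupBy(identity)
    .mapValues(_.length.toDouble / votes.length)      // vote share ~ probability
    .toMap
}
{code}
The same trees array would support the variance idea for regression: take the variance of the per-tree predictions instead of their vote shares.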
[jira] [Commented] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java
[ https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493218#comment-14493218 ] Yu Ishikawa commented on SPARK-6682: I meant (a). I agree that we should just add a script for running all of the examples to make sure they run. That said, I think adding unit test suites for the examples would be a better approach. Although that may be outside the scope of this issue, it would be good timing to add test suites along with it. Thanks.
Deprecate static train and use builder instead for Scala/Java - Key: SPARK-6682 URL: https://issues.apache.org/jira/browse/SPARK-6682 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 1.3.0 Reporter: Joseph K. Bradley
In MLlib, we have for some time been unofficially moving away from the old static train() methods and moving towards builder patterns. This JIRA is to discuss this move and (hopefully) make it official.
Old static train() API:
{code}
val myModel = NaiveBayes.train(myData, ...)
{code}
New builder pattern API:
{code}
val nb = new NaiveBayes().setLambda(0.1)
val myModel = nb.train(myData)
{code}
Pros of the builder pattern:
* Much less code when algorithms have many parameters. Since Java does not support default arguments, we required *many* duplicated static train() methods (for each prefix set of arguments).
* Helps to enforce default parameters. Users should ideally not have to even think about setting parameters if they just want to try an algorithm quickly.
* Matches the spark.ml API
Cons of the builder pattern:
* In Python APIs, static train methods are more Pythonic.
Proposal:
* Scala/Java: We should start deprecating the old static train() methods. We must keep them for API stability, but deprecating them will help with API consistency, making it clear that everyone should use the builder pattern. As we deprecate them, we should make sure that the builder pattern supports all parameters.
* Python: Keep the static train methods.
CC: [~mengxr] -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
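To make the Scala/Java part of the proposal concrete, here is a sketch of what deprecating a static train() in favor of the builder might look like. NaiveBayes is used only as an example; the parameter handling, data type, deprecation message, and version string are all placeholders, not the actual MLlib signatures:
{code}
// Hypothetical sketch of the deprecation step; message, version, and
// parameter list are placeholders.
class NaiveBayes {
  private var lambda: Double = 1.0  // default enforced in exactly one place

  def setLambda(value: Double): this.type = { lambda = value; this }

  def run(data: Seq[(Double, Array[Double])]): Unit = {
    // ... fit the model using `lambda` ...
  }
}

object NaiveBayes {
  // Kept for API stability, but steering users toward the builder.
  @deprecated("use new NaiveBayes().setLambda(...).run(...) instead", "1.4.0")
  def train(data: Seq[(Double, Array[Double])], lambda: Double = 1.0): Unit =
    new NaiveBayes().setLambda(lambda).run(data)  // forward to the builder
}
{code}
Forwarding the deprecated method to the builder keeps a single code path, which is what makes it easy to verify that the builder supports every parameter the static methods did.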
[jira] [Commented] (SPARK-6511) Publish hadoop provided build with instructions for different distros
[ https://issues.apache.org/jira/browse/SPARK-6511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493217#comment-14493217 ] Marcelo Vanzin commented on SPARK-6511: --- We add a bunch of things to that variable, but the main thing is this:
{code}
$HADOOP_HOME/client/*
{code}
I'm not sure whether other distributions place libraries in that location, or if that's a CDH-only thing.
Publish hadoop provided build with instructions for different distros --- Key: SPARK-6511 URL: https://issues.apache.org/jira/browse/SPARK-6511 Project: Spark Issue Type: Improvement Components: Build Reporter: Patrick Wendell
Currently we publish a series of binaries with different Hadoop client jars. This mostly works, but some users have reported compatibility issues with different distributions. One improvement moving forward might be to publish a binary build that simply asks you to set HADOOP_HOME to pick up the Hadoop client location. That way it would work across multiple distributions, even if they have subtle incompatibilities with upstream Hadoop. I think a first step for this would be to produce such a build for the community and see how well it works. One potential issue is that our fancy excludes and dependency re-writing won't work with the simpler approach of appending Hadoop's classpath to Spark's. Also, how we deal with the Hive dependency is unclear: should we continue to bundle Spark's Hive (which has some fixes for dependency conflicts), or should we allow linking against vanilla Hive at runtime? -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
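For illustration, a small sketch of what the "append Hadoop's classpath" idea amounts to: expanding $HADOOP_HOME/client/* into concrete jar paths the way a launcher for such a build might. Whether distributions other than CDH ship client jars under client/ is exactly the open question in the comment above:
{code}
import java.io.File

// Hypothetical sketch: turn $HADOOP_HOME/client/* into a classpath string.
// The client/ layout is the CDH convention mentioned above and may not hold
// for other distributions.
val hadoopHome = sys.env.getOrElse("HADOOP_HOME", sys.error("HADOOP_HOME is not set"))
val clientDir = new File(hadoopHome, "client")
val jars = Option(clientDir.listFiles()).getOrElse(Array.empty[File])
  .filter(_.getName.endsWith(".jar"))
val extraClasspath = jars.map(_.getAbsolutePath).mkString(File.pathSeparator)
println(extraClasspath)
{code}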