[jira] [Updated] (SPARK-6884) random forest predict probabilities functionality (like in sklearn)

2015-04-13 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6884:
-
Affects Version/s: (was: 1.4.0)
   1.3.0

 random forest predict probabilities functionality (like in sklearn)
 ---

 Key: SPARK-6884
 URL: https://issues.apache.org/jira/browse/SPARK-6884
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0
 Environment: cross-platform
Reporter: Max Kaznady
  Labels: prediction, probability, randomforest, tree
   Original Estimate: 72h
  Remaining Estimate: 72h

 Currently, there is no way to extract the class probabilities from the 
 RandomForest classifier. I implemented a probability predictor by counting 
 votes from individual trees and adding up their votes for 1 and then 
 dividing by the total number of votes.
 I opened this ticket to keep track of changes. Will update once I push my 
 code to master.
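
 A minimal sketch of the vote-counting idea above, written against the MLlib 1.3 
 RandomForestModel API (the helper name is illustrative, not an existing MLlib 
 method):
 {code}
 import org.apache.spark.mllib.linalg.Vector
 import org.apache.spark.mllib.tree.model.RandomForestModel

 // Estimate P(label = 1) for a binary classifier: each tree votes 0.0 or 1.0,
 // and the probability is the fraction of trees voting 1.
 def predictProbabilityOfOne(model: RandomForestModel, features: Vector): Double = {
   val votes = model.trees.map(_.predict(features))
   votes.count(_ == 1.0).toDouble / votes.length
 }
 {code}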






[jira] [Commented] (SPARK-6703) Provide a way to discover existing SparkContext's

2015-04-13 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492888#comment-14492888
 ] 

Patrick Wendell commented on SPARK-6703:


Hey [~ilganeli] - sure thing. I've pinged a couple of people to provide 
feedback on the design. Overall I think it won't be a complicated feature to 
implement. I've added you as the assignee. One note: if it gets very close to 
the 1.4 code freeze, I may need to help take it across the finish line. But for 
now, why don't you go ahead; I think we'll be fine.

 Provide a way to discover existing SparkContext's
 -

 Key: SPARK-6703
 URL: https://issues.apache.org/jira/browse/SPARK-6703
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Patrick Wendell

 Right now it is difficult to write a Spark application in a way that can be 
 run independently and also be composed with other Spark applications in an 
 environment such as the JobServer, notebook servers, etc., where there is a 
 shared SparkContext.
 It would be nice to provide a rendez-vous point so that applications can 
 learn whether a SparkContext already exists before creating one.
 The simplest/most surgical way I see to do this is to have an optional static 
 SparkContext singleton that people can retrieve as follows:
 {code}
 val sc = SparkContext.getOrCreate(conf = new SparkConf())
 {code}
 You could also have a setter where some outer framework/server can set it 
 for use by multiple downstream applications.
 A more advanced version of this would have some named registry or something, 
 but since we only support a single SparkContext in one JVM at this point 
 anyways, this seems sufficient and much simpler. Another advanced option 
 would be to allow plugging in some other notion of configuration you'd pass 
 when retrieving an existing context.
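
 A minimal sketch of that rendez-vous point (the standalone object is a stand-in 
 for what would really live on the SparkContext companion object):
 {code}
 import org.apache.spark.{SparkConf, SparkContext}

 object SparkContextRegistry {
   private var active: Option[SparkContext] = None

   // Setter for an outer framework/server that owns the shared context.
   def setActiveContext(sc: SparkContext): Unit = synchronized { active = Some(sc) }

   // Return the shared context if one is registered, otherwise create one.
   def getOrCreate(conf: SparkConf): SparkContext = synchronized {
     active.getOrElse {
       val sc = new SparkContext(conf)
       active = Some(sc)
       sc
     }
   }
 }
 {code}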






[jira] [Updated] (SPARK-6703) Provide a way to discover existing SparkContext's

2015-04-13 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6703:
---
Assignee: Ilya Ganelin

 Provide a way to discover existing SparkContext's
 -

 Key: SPARK-6703
 URL: https://issues.apache.org/jira/browse/SPARK-6703
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Patrick Wendell
Assignee: Ilya Ganelin

 Right now it is difficult to write a Spark application in a way that can be 
 run independently and also be composed with other Spark applications in an 
 environment such as the JobServer, notebook servers, etc., where there is a 
 shared SparkContext.
 It would be nice to provide a rendez-vous point so that applications can 
 learn whether a SparkContext already exists before creating one.
 The simplest/most surgical way I see to do this is to have an optional static 
 SparkContext singleton that people can retrieve as follows:
 {code}
 val sc = SparkContext.getOrCreate(conf = new SparkConf())
 {code}
 You could also have a setter where some outer framework/server can set it 
 for use by multiple downstream applications.
 A more advanced version of this would have some named registry or something, 
 but since we only support a single SparkContext in one JVM at this point 
 anyways, this seems sufficient and much simpler. Another advanced option 
 would be to allow plugging in some other notion of configuration you'd pass 
 when retrieving an existing context.






[jira] [Commented] (SPARK-3727) Trees and ensembles: More prediction functionality

2015-04-13 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492928#comment-14492928
 ] 

Joseph K. Bradley commented on SPARK-3727:
--

[~maxkaznady] [~mqk] I split this into some subtasks, and we can add others 
later (for boosted trees, for regression, etc.).  It will be great if you can 
follow the spark.ml tree API JIRA (linked above) and take a look at it once 
it's posted.  That (and the ProbabilisticClassifier class) will give you an 
idea of what's entailed in adding these under the Pipelines API.

Do you have preferences on how to split up these tasks?  If you can figure that 
out, I'll be happy to assign them.  Thanks!

 Trees and ensembles: More prediction functionality
 --

 Key: SPARK-3727
 URL: https://issues.apache.org/jira/browse/SPARK-3727
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley

 DecisionTree and RandomForest currently predict the most likely label for 
 classification and the mean for regression.  Other info about predictions 
 would be useful.
 For classification: estimated probability of each possible label
 For regression: variance of estimate
 RandomForest could also create aggregate predictions in multiple ways:
 * Predict mean or median value for regression.
 * Compute variance of estimates (across all trees) for both classification 
 and regression.
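
 A rough sketch of the aggregate predictions listed above for regression forests 
 (the helper is illustrative, not an existing MLlib method):
 {code}
 import org.apache.spark.mllib.linalg.Vector
 import org.apache.spark.mllib.tree.model.RandomForestModel

 // Collect per-tree regression predictions and report their mean, median,
 // and variance across the forest.
 def aggregateRegressionPredictions(
     model: RandomForestModel, features: Vector): (Double, Double, Double) = {
   val preds = model.trees.map(_.predict(features)).sorted
   val mean = preds.sum / preds.length
   val median = preds(preds.length / 2)
   val variance = preds.map(p => (p - mean) * (p - mean)).sum / preds.length
   (mean, median, variance)
 }
 {code}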






[jira] [Commented] (SPARK-3727) DecisionTree, RandomForest: More prediction functionality

2015-04-13 Thread Max Kaznady (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492839#comment-14492839
 ] 

Max Kaznady commented on SPARK-3727:


I implemented the same thing but for PySpark. Since there is no existing 
function, should I just call the function predict_proba like in sklearn? 

Also, does it make sense to open a new ticket for this, since it's so specific?

Thanks,
Max

 DecisionTree, RandomForest: More prediction functionality
 -

 Key: SPARK-3727
 URL: https://issues.apache.org/jira/browse/SPARK-3727
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley

 DecisionTree and RandomForest currently predict the most likely label for 
 classification and the mean for regression.  Other info about predictions 
 would be useful.
 For classification: estimated probability of each possible label
 For regression: variance of estimate
 RandomForest could also create aggregate predictions in multiple ways:
 * Predict mean or median value for regression.
 * Compute variance of estimates (across all trees) for both classification 
 and regression.






[jira] [Commented] (SPARK-6113) Stabilize DecisionTree and ensembles APIs

2015-04-13 Thread Max Kaznady (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492959#comment-14492959
 ] 

Max Kaznady commented on SPARK-6113:


[~josephkb] Is it possible to host the API design doc on something other than 
Google Docs? My corporate policy (and most others) forbids access to Google 
Docs, so I cannot download the file.

 Stabilize DecisionTree and ensembles APIs
 -

 Key: SPARK-6113
 URL: https://issues.apache.org/jira/browse/SPARK-6113
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Affects Versions: 1.4.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Critical

 *Issue*: The APIs for DecisionTree and ensembles (RandomForests and 
 GradientBoostedTrees) have been experimental for a long time.  The API has 
 become very convoluted because trees and ensembles have many, many variants, 
 some of which we have added incrementally without a long-term design.
 *Proposal*: This JIRA is for discussing changes required to finalize the 
 APIs.  After we discuss, I will make a PR to update the APIs and make them 
 non-Experimental.  This will require making many breaking changes; see the 
 design doc for details.
 [Design doc | 
 https://docs.google.com/document/d/1rJ_DZinyDG3PkYkAKSsQlY0QgCeefn4hUv7GsPkzBP4]:
  This outlines current issues and the proposed API.






[jira] [Resolved] (SPARK-6662) Allow variable substitution in spark.yarn.historyServer.address

2015-04-13 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves resolved SPARK-6662.
--
   Resolution: Fixed
Fix Version/s: 1.4.0
 Assignee: Cheolsoo Park

 Allow variable substitution in spark.yarn.historyServer.address
 ---

 Key: SPARK-6662
 URL: https://issues.apache.org/jira/browse/SPARK-6662
 Project: Spark
  Issue Type: Wish
  Components: YARN
Affects Versions: 1.3.0
Reporter: Cheolsoo Park
Assignee: Cheolsoo Park
Priority: Minor
  Labels: yarn
 Fix For: 1.4.0


 In Spark on YARN, an explicit hostname and port number need to be set for 
 spark.yarn.historyServer.address in SparkConf to make the HISTORY link work. If 
 the history server address is known and static, this is usually not a problem.
 But in the cloud, that is usually not true. Particularly in EMR, the history 
 server always runs on the same node as the RM. So I could simply set it to 
 {{$\{yarn.resourcemanager.hostname\}:18080}} if variable substitution were 
 allowed.
 In fact, Hadoop configuration already implements variable substitution, so if 
 this property is read via YarnConf, this is easily achievable.
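
 A small sketch of the mechanism this relies on (the property value shown is 
 illustrative): Hadoop's Configuration performs ${...} substitution on get(), so 
 reading the address through a YarnConfiguration resolves the RM hostname.
 {code}
 import org.apache.hadoop.yarn.conf.YarnConfiguration

 val yarnConf = new YarnConfiguration()
 yarnConf.set("spark.yarn.historyServer.address",
   "${yarn.resourcemanager.hostname}:18080")
 // get() expands ${yarn.resourcemanager.hostname} from the YARN configuration.
 val resolvedAddress = yarnConf.get("spark.yarn.historyServer.address")
 {code}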






[jira] [Commented] (SPARK-3727) DecisionTree, RandomForest: More prediction functionality

2015-04-13 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492887#comment-14492887
 ] 

Joseph K. Bradley commented on SPARK-3727:
--

Thanks for your initial work on this ticket!  The main issue with this 
extension is API stability: modifying the existing classes would also require 
us to update model save/load versioning, default constructors to ensure 
binary compatibility, etc.

I just linked a JIRA which discusses updating the tree and ensemble APIs under 
the spark.ml package, which will permit us to redesign the APIs (and make it 
easier to specify class probabilities or stats for regression).  What I'd like 
to do is get the tree API updates in (this week), and then we could work 
together to make the class probabilities available under the new API.

Does that sound good?

Also, if you're new to contributing to Spark, please make sure to check out: 
[https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark]

Thanks!

 DecisionTree, RandomForest: More prediction functionality
 -

 Key: SPARK-3727
 URL: https://issues.apache.org/jira/browse/SPARK-3727
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley

 DecisionTree and RandomForest currently predict the most likely label for 
 classification and the mean for regression.  Other info about predictions 
 would be useful.
 For classification: estimated probability of each possible label
 For regression: variance of estimate
 RandomForest could also create aggregate predictions in multiple ways:
 * Predict mean or median value for regression.
 * Compute variance of estimates (across all trees) for both classification 
 and regression.






[jira] [Updated] (SPARK-6703) Provide a way to discover existing SparkContext's

2015-04-13 Thread Patrick Wendell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Patrick Wendell updated SPARK-6703:
---
Priority: Critical  (was: Major)

 Provide a way to discover existing SparkContext's
 -

 Key: SPARK-6703
 URL: https://issues.apache.org/jira/browse/SPARK-6703
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Patrick Wendell
Assignee: Ilya Ganelin
Priority: Critical

 Right now it is difficult to write a Spark application in a way that can be 
 run independently and also be composed with other Spark applications in an 
 environment such as the JobServer, notebook servers, etc., where there is a 
 shared SparkContext.
 It would be nice to provide a rendez-vous point so that applications can 
 learn whether a SparkContext already exists before creating one.
 The simplest/most surgical way I see to do this is to have an optional static 
 SparkContext singleton that people can retrieve as follows:
 {code}
 val sc = SparkContext.getOrCreate(conf = new SparkConf())
 {code}
 You could also have a setter where some outer framework/server can set it 
 for use by multiple downstream applications.
 A more advanced version of this would have some named registry or something, 
 but since we only support a single SparkContext in one JVM at this point 
 anyways, this seems sufficient and much simpler. Another advanced option 
 would be to allow plugging in some other notion of configuration you'd pass 
 when retrieving an existing context.






[jira] [Updated] (SPARK-6884) Random forest: predict class probabilities

2015-04-13 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6884:
-
Assignee: Max Kaznady

 Random forest: predict class probabilities
 --

 Key: SPARK-6884
 URL: https://issues.apache.org/jira/browse/SPARK-6884
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.3.0
 Environment: cross-platform
Reporter: Max Kaznady
Assignee: Max Kaznady
  Labels: prediction, probability, randomforest, tree
   Original Estimate: 72h
  Remaining Estimate: 72h

 Currently, there is no way to extract the class probabilities from the 
 RandomForest classifier. I implemented a probability predictor by counting 
 votes from individual trees and adding up their votes for 1 and then 
 dividing by the total number of votes.
 I opened this ticket to keep track of changes. Will update once I push my 
 code to master.






[jira] [Updated] (SPARK-6884) Random forest: predict class probabilities

2015-04-13 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6884:
-
Assignee: (was: Max Kaznady)

 Random forest: predict class probabilities
 --

 Key: SPARK-6884
 URL: https://issues.apache.org/jira/browse/SPARK-6884
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.3.0
 Environment: cross-platform
Reporter: Max Kaznady
  Labels: prediction, probability, randomforest, tree
   Original Estimate: 72h
  Remaining Estimate: 72h

 Currently, there is no way to extract the class probabilities from the 
 RandomForest classifier. I implemented a probability predictor by counting 
 votes from individual trees and adding up their votes for 1 and then 
 dividing by the total number of votes.
 I opened this ticket to keep track of changes. Will update once I push my 
 code to master.






[jira] [Commented] (SPARK-6703) Provide a way to discover existing SparkContext's

2015-04-13 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492898#comment-14492898
 ] 

Patrick Wendell commented on SPARK-6703:


/cc [~velvia]

 Provide a way to discover existing SparkContext's
 -

 Key: SPARK-6703
 URL: https://issues.apache.org/jira/browse/SPARK-6703
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Patrick Wendell
Assignee: Ilya Ganelin
Priority: Critical

 Right now it is difficult to write a Spark application in a way that can be 
 run independently and also be composed with other Spark applications in an 
 environment such as the JobServer, notebook servers, etc., where there is a 
 shared SparkContext.
 It would be nice to provide a rendez-vous point so that applications can 
 learn whether a SparkContext already exists before creating one.
 The simplest/most surgical way I see to do this is to have an optional static 
 SparkContext singleton that people can retrieve as follows:
 {code}
 val sc = SparkContext.getOrCreate(conf = new SparkConf())
 {code}
 You could also have a setter where some outer framework/server can set it 
 for use by multiple downstream applications.
 A more advanced version of this would have some named registry or something, 
 but since we only support a single SparkContext in one JVM at this point 
 anyways, this seems sufficient and much simpler. Another advanced option 
 would be to allow plugging in some other notion of configuration you'd pass 
 when retrieving an existing context.






[jira] [Commented] (SPARK-6113) Stabilize DecisionTree and ensembles APIs

2015-04-13 Thread Max Kaznady (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492989#comment-14492989
 ] 

Max Kaznady commented on SPARK-6113:


Other places need serious improvement as well; LogisticRegressionWithLBFGS is 
another example.
 
All LogisticRegression classifiers need a logistic function. I found this 
ticket, but I’m not sure why it’s closed:
https://issues.apache.org/jira/browse/SPARK-3585
 
I think LogisticRegression and RandomForest should have the same name for the 
predict_proba function. I would just call it that, since then at least PySpark 
is consistent with the sklearn library.
 
Internally, the logistic function should be implemented as a single function, not 
hard-coded in multiple places the way it is now. That’s another ticket.
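
A tiny sketch of the shared helper argued for here (object name and placement 
are hypothetical):
{code}
// One logistic (sigmoid) function for every logistic-regression code path,
// instead of re-implementing it inline in each predictor.
object LogisticFunctions {
  def sigmoid(margin: Double): Double = 1.0 / (1.0 + math.exp(-margin))
}
{code}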
 
Aside: I haven’t looked at LogisticRegressionWithSGD, but it fails horribly 
sometimes: the algorithm either diverges or gets stuck in local minima.


 Stabilize DecisionTree and ensembles APIs
 -

 Key: SPARK-6113
 URL: https://issues.apache.org/jira/browse/SPARK-6113
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib, PySpark
Affects Versions: 1.4.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley
Priority: Critical

 *Issue*: The APIs for DecisionTree and ensembles (RandomForests and 
 GradientBoostedTrees) have been experimental for a long time.  The API has 
 become very convoluted because trees and ensembles have many, many variants, 
 some of which we have added incrementally without a long-term design.
 *Proposal*: This JIRA is for discussing changes required to finalize the 
 APIs.  After we discuss, I will make a PR to update the APIs and make them 
 non-Experimental.  This will require making many breaking changes; see the 
 design doc for details.
 [Design doc | 
 https://docs.google.com/document/d/1rJ_DZinyDG3PkYkAKSsQlY0QgCeefn4hUv7GsPkzBP4]:
  This outlines current issues and the proposed API.






[jira] [Commented] (SPARK-3727) DecisionTree, RandomForest: More prediction functionality

2015-04-13 Thread Max Kaznady (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492906#comment-14492906
 ] 

Max Kaznady commented on SPARK-3727:


Yes, probabilities have to be added to other models too, like 
LogisticRegression. Right now they are hard-coded in two places but not 
exposed in PySpark.

I think it makes sense to split into PySpark, then classification, then 
probabilities, and then group different types of algorithms, all of which 
output probabilities: Logistic Regression, Random Forest, etc.

We can also add probabilities for trees by counting the number of 1s and 0s at the leaves.

What do you think?

 DecisionTree, RandomForest: More prediction functionality
 -

 Key: SPARK-3727
 URL: https://issues.apache.org/jira/browse/SPARK-3727
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley

 DecisionTree and RandomForest currently predict the most likely label for 
 classification and the mean for regression.  Other info about predictions 
 would be useful.
 For classification: estimated probability of each possible label
 For regression: variance of estimate
 RandomForest could also create aggregate predictions in multiple ways:
 * Predict mean or median value for regression.
 * Compute variance of estimates (across all trees) for both classification 
 and regression.






[jira] [Commented] (SPARK-6884) Random forest: predict class probabilities

2015-04-13 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492904#comment-14492904
 ] 

Joseph K. Bradley commented on SPARK-6884:
--

I'd recommend: Under spark.ml, have RandomForestClassifier (currently being 
added) extend ProbabilisticClassifier.

 Random forest: predict class probabilities
 --

 Key: SPARK-6884
 URL: https://issues.apache.org/jira/browse/SPARK-6884
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.3.0
 Environment: cross-platform
Reporter: Max Kaznady
  Labels: prediction, probability, randomforest, tree
   Original Estimate: 72h
  Remaining Estimate: 72h

 Currently, there is no way to extract the class probabilities from the 
 RandomForest classifier. I implemented a probability predictor by counting 
 votes from individual trees and adding up their votes for 1 and then 
 dividing by the total number of votes.
 I opened this ticket to keep track of changes. Will update once I push my 
 code to master.






[jira] [Updated] (SPARK-3727) Trees and ensembles: More prediction functionality

2015-04-13 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-3727:
-
Summary: Trees and ensembles: More prediction functionality  (was: 
DecisionTree, RandomForest: More prediction functionality)

 Trees and ensembles: More prediction functionality
 --

 Key: SPARK-3727
 URL: https://issues.apache.org/jira/browse/SPARK-3727
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley

 DecisionTree and RandomForest currently predict the most likely label for 
 classification and the mean for regression.  Other info about predictions 
 would be useful.
 For classification: estimated probability of each possible label
 For regression: variance of estimate
 RandomForest could also create aggregate predictions in multiple ways:
 * Predict mean or median value for regression.
 * Compute variance of estimates (across all trees) for both classification 
 and regression.






[jira] [Commented] (SPARK-6703) Provide a way to discover existing SparkContext's

2015-04-13 Thread Ilya Ganelin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492909#comment-14492909
 ] 

Ilya Ganelin commented on SPARK-6703:
-

Patrick - what's the timeline for the 1.4 release? Just want to have a
sense of it so I can schedule accordingly.

Thank you, 
Ilya Ganelin










The information contained in this e-mail is confidential and/or proprietary to 
Capital One and/or its affiliates. The information transmitted herewith is 
intended only for use by the individual or entity to which it is addressed.  If 
the reader of this message is not the intended recipient, you are hereby 
notified that any review, retransmission, dissemination, distribution, copying 
or other use of, or taking of any action in reliance upon this information is 
strictly prohibited. If you have received this communication in error, please 
contact the sender and delete the material from your computer.



 Provide a way to discover existing SparkContext's
 -

 Key: SPARK-6703
 URL: https://issues.apache.org/jira/browse/SPARK-6703
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Patrick Wendell
Assignee: Ilya Ganelin
Priority: Critical

 Right now it is difficult to write a Spark application in a way that can be 
 run independently and also be composed with other Spark applications in an 
 environment such as the JobServer, notebook servers, etc., where there is a 
 shared SparkContext.
 It would be nice to provide a rendez-vous point so that applications can 
 learn whether a SparkContext already exists before creating one.
 The simplest/most surgical way I see to do this is to have an optional static 
 SparkContext singleton that people can retrieve as follows:
 {code}
 val sc = SparkContext.getOrCreate(conf = new SparkConf())
 {code}
 You could also have a setter where some outer framework/server can set it 
 for use by multiple downstream applications.
 A more advanced version of this would have some named registry or something, 
 but since we only support a single SparkContext in one JVM at this point 
 anyways, this seems sufficient and much simpler. Another advanced option 
 would be to allow plugging in some other notion of configuration you'd pass 
 when retrieving an existing context.






[jira] [Commented] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java

2015-04-13 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492867#comment-14492867
 ] 

Joseph K. Bradley commented on SPARK-6682:
--

Do you mean (a) tests to make sure the examples work, or (b) treating the 
examples as tests themselves?  We should not do (b) since it mixes tests and 
examples.

For (a), we don't have a great solution currently, although I think we should 
(at some point) add a script for running all of the examples to make sure they 
run.  I don't think we need performance tests for examples since they are meant 
to be short usage examples, not end solutions or applications.

 Deprecate static train and use builder instead for Scala/Java
 -

 Key: SPARK-6682
 URL: https://issues.apache.org/jira/browse/SPARK-6682
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 In MLlib, we have for some time been unofficially moving away from the old 
 static train() methods and moving towards builder patterns.  This JIRA is to 
 discuss this move and (hopefully) make it official.
 Old static train() API:
 {code}
 val myModel = NaiveBayes.train(myData, ...)
 {code}
 New builder pattern API:
 {code}
 val nb = new NaiveBayes().setLambda(0.1)
 val myModel = nb.train(myData)
 {code}
 Pros of the builder pattern:
 * Much less code when algorithms have many parameters.  Since Java does not 
 support default arguments, we required *many* duplicated static train() 
 methods (for each prefix set of arguments).
 * Helps to enforce default parameters.  Users should ideally not have to even 
 think about setting parameters if they just want to try an algorithm quickly.
 * Matches spark.ml API
 Cons of the builder pattern:
 * In Python APIs, static train methods are more Pythonic.
 Proposal:
 * Scala/Java: We should start deprecating the old static train() methods.  We 
 must keep them for API stability, but deprecating will help with API 
 consistency, making it clear that everyone should use the builder pattern.  
 As we deprecate them, we should make sure that the builder pattern supports 
 all parameters.
 * Python: Keep static train methods.
 CC: [~mengxr]
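
 A hedged sketch of what the Scala/Java deprecation above could look like (the 
 wrapper object stands in for the real NaiveBayes companion object, and the 
 message and version strings are placeholders):
 {code}
 import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
 import org.apache.spark.mllib.regression.LabeledPoint
 import org.apache.spark.rdd.RDD

 object NaiveBayesStatics {
   // Old-style static entry point, kept for API stability but deprecated;
   // it simply forwards to the builder pattern.
   @deprecated("Use new NaiveBayes().setLambda(lambda).run(input) instead", "1.4.0")
   def train(input: RDD[LabeledPoint], lambda: Double): NaiveBayesModel =
     new NaiveBayes().setLambda(lambda).run(input)
 }
 {code}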






[jira] [Commented] (SPARK-6884) random forest predict probabilities functionality (like in sklearn)

2015-04-13 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492892#comment-14492892
 ] 

Joseph K. Bradley commented on SPARK-6884:
--

Is this not a duplicate of [SPARK-3727]?  Perhaps the best way to split up the 
work will be to make a subtask for trees, and a separate subtask for ensembles. 
 I'll go ahead and do that.

 random forest predict probabilities functionality (like in sklearn)
 ---

 Key: SPARK-6884
 URL: https://issues.apache.org/jira/browse/SPARK-6884
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.3.0
 Environment: cross-platform
Reporter: Max Kaznady
  Labels: prediction, probability, randomforest, tree
   Original Estimate: 72h
  Remaining Estimate: 72h

 Currently, there is no way to extract the class probabilities from the 
 RandomForest classifier. I implemented a probability predictor by counting 
 votes from individual trees and adding up their votes for 1 and then 
 dividing by the total number of votes.
 I opened this ticket to keep track of changes. Will update once I push my 
 code to master.






[jira] [Created] (SPARK-6885) Decision trees: predict class probabilities

2015-04-13 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-6885:


 Summary: Decision trees: predict class probabilities
 Key: SPARK-6885
 URL: https://issues.apache.org/jira/browse/SPARK-6885
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley


Under spark.ml, have DecisionTreeClassifier (currently being added) extend 
ProbabilisticClassifier.






[jira] [Resolved] (SPARK-5988) Model import/export for PowerIterationClusteringModel

2015-04-13 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-5988.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5450
[https://github.com/apache/spark/pull/5450]

 Model import/export for PowerIterationClusteringModel
 -

 Key: SPARK-5988
 URL: https://issues.apache.org/jira/browse/SPARK-5988
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Xusen Yin
 Fix For: 1.4.0


 Add save/load for PowerIterationClusteringModel






[jira] [Created] (SPARK-6884) random forest predict probabilities functionality (like in sklearn)

2015-04-13 Thread Max Kaznady (JIRA)
Max Kaznady created SPARK-6884:
--

 Summary: random forest predict probabilities functionality (like 
in sklearn)
 Key: SPARK-6884
 URL: https://issues.apache.org/jira/browse/SPARK-6884
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.4.0
 Environment: cross-platform
Reporter: Max Kaznady


Currently, there is no way to extract the class probabilities from the 
RandomForest classifier. I implemented a probability predictor by counting 
votes from individual trees and adding up their votes for 1 and then dividing 
by the total number of votes.

I opened this ticket to keep track of changes. Will update once I push my code 
to master.






[jira] [Commented] (SPARK-3727) Trees and ensembles: More prediction functionality

2015-04-13 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14492931#comment-14492931
 ] 

Joseph K. Bradley commented on SPARK-3727:
--

[~maxkaznady] Implementations should be done in Scala; the PySpark API will be 
a wrapper.  The API update JIRA I'm referencing should clear up some of the 
other questions.

 Trees and ensembles: More prediction functionality
 --

 Key: SPARK-3727
 URL: https://issues.apache.org/jira/browse/SPARK-3727
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley

 DecisionTree and RandomForest currently predict the most likely label for 
 classification and the mean for regression.  Other info about predictions 
 would be useful.
 For classification: estimated probability of each possible label
 For regression: variance of estimate
 RandomForest could also create aggregate predictions in multiple ways:
 * Predict mean or median value for regression.
 * Compute variance of estimates (across all trees) for both classification 
 and regression.






[jira] [Updated] (SPARK-6887) ColumnBuilder misses FloatType

2015-04-13 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai updated SPARK-6887:

Description: 
To reproduce ...
{code}
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

val schema = StructType(StructField("c", FloatType, true) :: Nil)

val rdd = sc.parallelize(1 to 100).map(i => Row(i.toFloat))

sqlContext.createDataFrame(rdd, schema).registerTempTable("test")

sqlContext.sql("cache table test")

sqlContext.table("test").show
{code}
The exception is ...
{code}
15/04/13 15:00:12 INFO DAGScheduler: Job 0 failed: collect at 
SparkPlan.scala:88, took 0.474392 s
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in 
stage 0.0 failed 1 times, most recent failure: Lost task 5.0 in stage 0.0 (TID 
5, localhost): java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.MutableFloat cannot be cast to 
org.apache.spark.sql.catalyst.expressions.MutableLong
at 
org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setLong(SpecificMutableRow.scala:292)
at 
org.apache.spark.sql.columnar.compression.LongDelta$Decoder.next(compressionSchemes.scala:539)
at 
org.apache.spark.sql.columnar.compression.CompressibleColumnAccessor$class.extractSingle(CompressibleColumnAccessor.scala:37)
at 
org.apache.spark.sql.columnar.NativeColumnAccessor.extractSingle(ColumnAccessor.scala:64)
at 
org.apache.spark.sql.columnar.BasicColumnAccessor.extractTo(ColumnAccessor.scala:54)
at 
org.apache.spark.sql.columnar.NativeColumnAccessor.org$apache$spark$sql$columnar$NullableColumnAccessor$$super$extractTo(ColumnAccessor.scala:64)
at 
org.apache.spark.sql.columnar.NullableColumnAccessor$class.extractTo(NullableColumnAccessor.scala:52)
at 
org.apache.spark.sql.columnar.NativeColumnAccessor.extractTo(ColumnAccessor.scala:64)
at 
org.apache.spark.sql.columnar.InMemoryColumnarTableScan$$anonfun$8$$anonfun$13$$anon$2.next(InMemoryColumnarTableScan.scala:295)
at 
org.apache.spark.sql.columnar.InMemoryColumnarTableScan$$anonfun$8$$anonfun$13$$anon$2.next(InMemoryColumnarTableScan.scala:290)
at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
at 
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:130)
at 
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:126)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at 
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:210)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
{code}

 ColumnBuilder misses FloatType
 --

 Key: SPARK-6887
 URL: https://issues.apache.org/jira/browse/SPARK-6887
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Yin Huai
Assignee: Yin Huai
 Fix For: 1.4.0


 To reproduce ...
 {code}
 import org.apache.spark.sql.types._
 import org.apache.spark.sql.Row
 val schema = StructType(StructField("c", FloatType, true) :: Nil)
 val rdd = sc.parallelize(1 to 100).map(i => Row(i.toFloat))
 sqlContext.createDataFrame(rdd, schema).registerTempTable("test")
 sqlContext.sql("cache table test")
 sqlContext.table("test").show
 {code}
 The exception is ...
 {code}
 15/04/13 15:00:12 INFO DAGScheduler: Job 0 failed: collect at 
 SparkPlan.scala:88, took 0.474392 s
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in 
 stage 0.0 failed 1 times, most recent failure: Lost task 5.0 in stage 0.0 
 (TID 5, localhost): java.lang.ClassCastException: 
 org.apache.spark.sql.catalyst.expressions.MutableFloat cannot be cast to 
 org.apache.spark.sql.catalyst.expressions.MutableLong
   at 
 org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setLong(SpecificMutableRow.scala:292)
   at 
 

[jira] [Assigned] (SPARK-6887) ColumnBuilder misses FloatType

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6887:
---

Assignee: Apache Spark  (was: Yin Huai)

 ColumnBuilder misses FloatType
 --

 Key: SPARK-6887
 URL: https://issues.apache.org/jira/browse/SPARK-6887
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Yin Huai
Assignee: Apache Spark
 Fix For: 1.4.0


 To reproduce ...
 {code}
 import org.apache.spark.sql.types._
 import org.apache.spark.sql.Row
 val schema = StructType(StructField("c", FloatType, true) :: Nil)
 val rdd = sc.parallelize(1 to 100).map(i => Row(i.toFloat))
 sqlContext.createDataFrame(rdd, schema).registerTempTable("test")
 sqlContext.sql("cache table test")
 sqlContext.table("test").show
 {code}
 The exception is ...
 {code}
 15/04/13 15:00:12 INFO DAGScheduler: Job 0 failed: collect at 
 SparkPlan.scala:88, took 0.474392 s
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in 
 stage 0.0 failed 1 times, most recent failure: Lost task 5.0 in stage 0.0 
 (TID 5, localhost): java.lang.ClassCastException: 
 org.apache.spark.sql.catalyst.expressions.MutableFloat cannot be cast to 
 org.apache.spark.sql.catalyst.expressions.MutableLong
   at 
 org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setLong(SpecificMutableRow.scala:292)
   at 
 org.apache.spark.sql.columnar.compression.LongDelta$Decoder.next(compressionSchemes.scala:539)
   at 
 org.apache.spark.sql.columnar.compression.CompressibleColumnAccessor$class.extractSingle(CompressibleColumnAccessor.scala:37)
   at 
 org.apache.spark.sql.columnar.NativeColumnAccessor.extractSingle(ColumnAccessor.scala:64)
   at 
 org.apache.spark.sql.columnar.BasicColumnAccessor.extractTo(ColumnAccessor.scala:54)
   at 
 org.apache.spark.sql.columnar.NativeColumnAccessor.org$apache$spark$sql$columnar$NullableColumnAccessor$$super$extractTo(ColumnAccessor.scala:64)
   at 
 org.apache.spark.sql.columnar.NullableColumnAccessor$class.extractTo(NullableColumnAccessor.scala:52)
   at 
 org.apache.spark.sql.columnar.NativeColumnAccessor.extractTo(ColumnAccessor.scala:64)
   at 
 org.apache.spark.sql.columnar.InMemoryColumnarTableScan$$anonfun$8$$anonfun$13$$anon$2.next(InMemoryColumnarTableScan.scala:295)
   at 
 org.apache.spark.sql.columnar.InMemoryColumnarTableScan$$anonfun$8$$anonfun$13$$anon$2.next(InMemoryColumnarTableScan.scala:290)
   at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
   at 
 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:130)
   at 
 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:126)
   at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
   at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
   at org.apache.spark.scheduler.Task.run(Task.scala:64)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:210)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 {code}






[jira] [Commented] (SPARK-6865) Decide on semantics for string identifiers in DataFrame API

2015-04-13 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493193#comment-14493193
 ] 

Reynold Xin commented on SPARK-6865:


As discussed offline, it would make more sense to go with option 1, i.e.

- str is treated as a quoted identifier in SQL, equivalent to `str`.
- * is a special case in which it refers to all the columns in a data frame. 
(Note that this means we cannot have a column named *, which I think is fine.)

The reason is that strings are already quoted, and programmers expect them to 
be quoted literals without extra escaping.

We will need to fix our resolver with respect to dots.
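
An illustrative example of option 1 (df is a placeholder DataFrame):
{code}
// A plain string is a quoted identifier, so the dot is not parsed:
df.select("*")       // special case: all columns of df
df.select("a.b")     // the single column literally named "a.b"
df.select(df("a.b")) // equivalent, via the DataFrame's apply method
{code}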


 Decide on semantics for string identifiers in DataFrame API
 ---

 Key: SPARK-6865
 URL: https://issues.apache.org/jira/browse/SPARK-6865
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Priority: Blocker

 There are two options:
  - Quoted Identifiers: meaning that the strings are treated as though they 
 were in backticks in SQL.  Any weird characters (spaces, or, etc) are 
 considered part of the identifier.  Kind of weird given that `*` is already a 
 special identifier explicitly allowed by the API
  - Unquoted parsed identifiers: would allow users to specify things like 
 tableAlias.*  However, would also require explicit use of `backticks` for 
 identifiers with weird characters in them.






[jira] [Updated] (SPARK-4638) Spark's MLlib SVM classification to include Kernels like Gaussian / (RBF) to find non linear boundaries

2015-04-13 Thread Mandar Chandorkar (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mandar Chandorkar updated SPARK-4638:
-
Attachment: kernels-1.3.patch

Patch for the kernels implementation, taken against the current branch-1.3 of 
Apache Spark.

 Spark's MLlib SVM classification to include Kernels like Gaussian / (RBF) to 
 find non linear boundaries
 ---

 Key: SPARK-4638
 URL: https://issues.apache.org/jira/browse/SPARK-4638
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: madankumar s
  Labels: Gaussian, Kernels, SVM
 Attachments: kernels-1.3.patch


 SPARK MLlib Classification Module:
 Add kernel functionality to the SVM classifier to find non-linear patterns






[jira] [Resolved] (SPARK-5972) Cache residuals for GradientBoostedTrees during training

2015-04-13 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-5972.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5330
[https://github.com/apache/spark/pull/5330]

 Cache residuals for GradientBoostedTrees during training
 

 Key: SPARK-5972
 URL: https://issues.apache.org/jira/browse/SPARK-5972
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley
Assignee: Manoj Kumar
Priority: Minor
 Fix For: 1.4.0


 In gradient boosting, the current model's prediction is re-computed for each 
 training instance on every iteration.  The current residual (cumulative 
 prediction of previously trained trees in the ensemble) should be cached.  
 That could reduce both computation (only computing the prediction of the most 
 recently trained tree) and communication (only sending the most recently 
 trained tree to the workers).
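
 A rough sketch of the caching idea (the function shape is illustrative, not the 
 GradientBoostedTrees internals):
 {code}
 import org.apache.spark.mllib.regression.LabeledPoint
 import org.apache.spark.mllib.tree.model.DecisionTreeModel
 import org.apache.spark.rdd.RDD

 // Keep the running ensemble prediction per instance and fold in only the most
 // recently trained tree, instead of re-evaluating every tree each iteration.
 def updateCachedPredictions(
     data: RDD[LabeledPoint],
     cachedPredictions: RDD[Double], // cumulative prediction of trees 1..k-1
     newTree: DecisionTreeModel,
     treeWeight: Double): RDD[Double] = {
   data.zip(cachedPredictions).map { case (point, oldPrediction) =>
     oldPrediction + treeWeight * newTree.predict(point.features)
   }
 }
 {code}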






[jira] [Comment Edited] (SPARK-6511) Publish hadoop provided build with instructions for different distros

2015-04-13 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493183#comment-14493183
 ] 

Patrick Wendell edited comment on SPARK-6511 at 4/13/15 10:11 PM:
--

Just as an example I tried to wire Spark to work with stock Hadoop 2.6. Here is 
how I got it running after doing a hadoop-provided build. This is pretty 
clunky, so I wonder if we should just support setting HADOOP_HOME or something 
and we can automatically find and add the jar files present within that folder.

{code}
export SPARK_DIST_CLASSPATH=$(find /tmp/hadoop-2.6.0/ -name "*.jar" | tr "\n" ";")
./bin/spark-shell
{code}

[~vanzin] for your CDH packages, what do you end up setting 
SPARK_DIST_CLASSPATH to?

/cc [~srowen]


was (Author: pwendell):
Just as an example I tried to wire Spark to work with stock Hadoop 2.6. Here is 
how I got it running after doing a hadoop-provided build. This is pretty 
clunky, so I wonder if we should just support setting HADOOP_HOME or something 
and we can automatically find and add the jar files present within that folder.

{code}
export SPARK_DIST_CLASSPATH=$(find /tmp/hadoop-2.6.0/ -name "*.jar" | tr "\n" ";")
./bin/spark-shell
{code}

[~vanzin] for your CDH packages, what do you end up setting 
SPARK_DIST_CLASSPATH to?

 Publish hadoop provided build with instructions for different distros
 ---

 Key: SPARK-6511
 URL: https://issues.apache.org/jira/browse/SPARK-6511
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Patrick Wendell

 Currently we publish a series of binaries with different Hadoop client jars. 
 This mostly works, but some users have reported compatibility issues with 
 different distributions.
 One improvement moving forward might be to publish a binary build that simply 
 asks you to set HADOOP_HOME to pick up the Hadoop client location. That way 
 it would work across multiple distributions, even if they have subtle 
 incompatibilities with upstream Hadoop.
 I think a first step for this would be to produce such a build for the 
 community and see how well it works. One potential issue is that our fancy 
 excludes and dependency re-writing won't work with the simpler "append 
 Hadoop's classpath to Spark" approach. Also, how we deal with the Hive dependency is 
 unclear, i.e. should we continue to bundle Spark's Hive (which has some fixes 
 for dependency conflicts) or do we allow for linking against vanilla Hive at 
 runtime.






[jira] [Created] (SPARK-6886) Big closure in PySpark will fail during shuffle

2015-04-13 Thread Davies Liu (JIRA)
Davies Liu created SPARK-6886:
-

 Summary: Big closure in PySpark will fail during shuffle
 Key: SPARK-6886
 URL: https://issues.apache.org/jira/browse/SPARK-6886
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.3.0, 1.2.1, 1.4.0
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Blocker


Reported by beifei.zhou (beifei.zhou at ximalaya.com): 

I am using Spark to process bid datasets. However, there is always a problem when 
executing reduceByKey on a large dataset, whereas a smaller dataset works fine.  May 
I ask you how I could solve this issue?

The error is always like this:
{code}
15/04/09 11:27:46 ERROR Executor: Exception in task 3.0 in stage 1.0 (TID 5)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/Users/nali/Softwares/spark/python/pyspark/worker.py", line 90, in main
    command = pickleSer.loads(command.value)
  File "/Users/nali/Softwares/spark/python/pyspark/broadcast.py", line 106, in 
value
    self._value = self.load(self._path)
  File "/Users/nali/Softwares/spark/python/pyspark/broadcast.py", line 87, in 
load
    with open(path, 'rb', 1 << 20) as f:
IOError: [Errno 2] No such file or directory: 
'/private/var/folders/_x/n59vb1b54pl96lvldz2lr_v4gn/T/spark-37d8ecbc-9ac9-4aa2-be23-12823f4cd1ed/pyspark-1e3d5904-a5b6-4222-a146-91bfdb4a33a7/tmp8XMhgG'
{code}

Here I attach my code:
{code}
import codecs
from pyspark import SparkContext, SparkConf
from operator import add 
import operator
from pyspark.storagelevel import StorageLevel

def combine_dict(a,b):
a.update(b)
return a
conf = SparkConf()
sc = SparkContext(appName = "tag")
al_tag_dict = sc.textFile('albumtag.txt').map(lambda x: 
x.split(',')).map(lambda x: {x[0]: x[1:]}).reduce(lambda a, b: 
combine_dict(a,b))

result = sc.textFile('uidAlbumscore.txt')\
.map(lambda x: x.split(','))\
.filter(lambda x: x[1] in al_tag_dict.keys())\
.map(lambda x: (x[0], al_tag_dict[x[1]], float(x[2])))\
.map(lambda x: map(lambda a: ((x[0], a), x[2]), x[1]))\
.flatMap(lambda x: x)\ 
.map(lambda x: (str(x[0][0]), x[1]))\
.reduceByKey(add)\
#.map(lambda x: x[0][0]+','+x[0][1]+','+str(x[1])+'\n')\
#.reduce(add)
#codecs.open('tag_score.txt','w','utf-8').write(result)
print result.first()
{code}






[jira] [Resolved] (SPARK-6130) support if not exists for insert overwrite into partition in hiveQl

2015-04-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-6130.
-
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 4865
[https://github.com/apache/spark/pull/4865]

 support if not exists for insert overwrite into partition in hiveQl
 ---

 Key: SPARK-6130
 URL: https://issues.apache.org/jira/browse/SPARK-6130
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Adrian Wang
 Fix For: 1.4.0


 Standard syntax:
 INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 
 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
 INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] 
 select_statement1 FROM from_statement;
  
 Hive extension (multiple inserts):
 FROM from_statement
 INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 
 ...) [IF NOT EXISTS]] select_statement1
 [INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] 
 select_statement2]
 [INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;
 FROM from_statement
 INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] 
 select_statement1
 [INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2]
 [INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] 
 select_statement2] ...;
  
 Hive extension (dynamic partition inserts):
 INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] 
 ...) select_statement FROM from_statement;
 INSERT INTO TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) 
 select_statement FROM from_statement;



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6886) Big closure in PySpark will fail during shuffle

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6886:
---

Assignee: Davies Liu  (was: Apache Spark)

 Big closure in PySpark will fail during shuffle
 ---

 Key: SPARK-6886
 URL: https://issues.apache.org/jira/browse/SPARK-6886
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.1, 1.3.0, 1.4.0
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Blocker

 Reported by beifei.zhou <beifei.zhou at ximalaya.com>: 
 I am using Spark to process bid datasets. However, there is always a problem 
 when executing reduceByKey on a large dataset, whereas a smaller dataset works 
 fine. May I ask how I could solve this issue?
 The error is always like this:
 {code}
 15/04/09 11:27:46 ERROR Executor: Exception in task 3.0 in stage 1.0 (TID 5)
 org.apache.spark.api.python.PythonException: Traceback (most recent call 
 last):
   File "/Users/nali/Softwares/spark/python/pyspark/worker.py", line 90, in main
     command = pickleSer.loads(command.value)
   File "/Users/nali/Softwares/spark/python/pyspark/broadcast.py", line 106, in value
     self._value = self.load(self._path)
   File "/Users/nali/Softwares/spark/python/pyspark/broadcast.py", line 87, in load
     with open(path, 'rb', 1 << 20) as f:
 IOError: [Errno 2] No such file or directory: 
 '/private/var/folders/_x/n59vb1b54pl96lvldz2lr_v4gn/T/spark-37d8ecbc-9ac9-4aa2-be23-12823f4cd1ed/pyspark-1e3d5904-a5b6-4222-a146-91bfdb4a33a7/tmp8XMhgG'
 {code}
 Here I attach my code:
 {code}
 import codecs
 from pyspark import SparkContext, SparkConf
 from operator import add 
 import operator
 from pyspark.storagelevel import StorageLevel
 def combine_dict(a,b):
 a.update(b)
 return a
 conf = SparkConf()
 sc = SparkContext(appName="tag")
 al_tag_dict = sc.textFile('albumtag.txt').map(lambda x: x.split(','))\
     .map(lambda x: {x[0]: x[1:]}).reduce(lambda a, b: combine_dict(a, b))
 result = sc.textFile('uidAlbumscore.txt')\
     .map(lambda x: x.split(','))\
     .filter(lambda x: x[1] in al_tag_dict.keys())\
     .map(lambda x: (x[0], al_tag_dict[x[1]], float(x[2])))\
     .map(lambda x: map(lambda a: ((x[0], a), x[2]), x[1]))\
     .flatMap(lambda x: x)\
     .map(lambda x: (str(x[0][0]), x[1]))\
     .reduceByKey(add)\
 #.map(lambda x: x[0][0]+','+x[0][1]+','+str(x[1])+'\n')\
 #.reduce(add)
 #codecs.open('tag_score.txt','w','utf-8').write(result)
 print result.first()
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6886) Big closure in PySpark will fail during shuffle

2015-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493068#comment-14493068
 ] 

Apache Spark commented on SPARK-6886:
-

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/5496

 Big closure in PySpark will fail during shuffle
 ---

 Key: SPARK-6886
 URL: https://issues.apache.org/jira/browse/SPARK-6886
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.1, 1.3.0, 1.4.0
Reporter: Davies Liu
Assignee: Davies Liu
Priority: Blocker

 Reported by beifei.zhou <beifei.zhou at ximalaya.com>: 
 I am using Spark to process bid datasets. However, there is always a problem 
 when executing reduceByKey on a large dataset, whereas a smaller dataset works 
 fine. May I ask how I could solve this issue?
 The error is always like this:
 {code}
 15/04/09 11:27:46 ERROR Executor: Exception in task 3.0 in stage 1.0 (TID 5)
 org.apache.spark.api.python.PythonException: Traceback (most recent call 
 last):
   File "/Users/nali/Softwares/spark/python/pyspark/worker.py", line 90, in main
     command = pickleSer.loads(command.value)
   File "/Users/nali/Softwares/spark/python/pyspark/broadcast.py", line 106, in value
     self._value = self.load(self._path)
   File "/Users/nali/Softwares/spark/python/pyspark/broadcast.py", line 87, in load
     with open(path, 'rb', 1 << 20) as f:
 IOError: [Errno 2] No such file or directory: 
 '/private/var/folders/_x/n59vb1b54pl96lvldz2lr_v4gn/T/spark-37d8ecbc-9ac9-4aa2-be23-12823f4cd1ed/pyspark-1e3d5904-a5b6-4222-a146-91bfdb4a33a7/tmp8XMhgG'
 {code}
 Here I attach my code:
 {code}
 import codecs
 from pyspark import SparkContext, SparkConf
 from operator import add 
 import operator
 from pyspark.storagelevel import StorageLevel
 def combine_dict(a,b):
 a.update(b)
 return a
 conf = SparkConf()
 sc = SparkContext(appName="tag")
 al_tag_dict = sc.textFile('albumtag.txt').map(lambda x: x.split(','))\
     .map(lambda x: {x[0]: x[1:]}).reduce(lambda a, b: combine_dict(a, b))
 result = sc.textFile('uidAlbumscore.txt')\
     .map(lambda x: x.split(','))\
     .filter(lambda x: x[1] in al_tag_dict.keys())\
     .map(lambda x: (x[0], al_tag_dict[x[1]], float(x[2])))\
     .map(lambda x: map(lambda a: ((x[0], a), x[2]), x[1]))\
     .flatMap(lambda x: x)\
     .map(lambda x: (str(x[0][0]), x[1]))\
     .reduceByKey(add)\
 #.map(lambda x: x[0][0]+','+x[0][1]+','+str(x[1])+'\n')\
 #.reduce(add)
 #codecs.open('tag_score.txt','w','utf-8').write(result)
 print result.first()
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6886) Big closure in PySpark will fail during shuffle

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6886:
---

Assignee: Apache Spark  (was: Davies Liu)

 Big closure in PySpark will fail during shuffle
 ---

 Key: SPARK-6886
 URL: https://issues.apache.org/jira/browse/SPARK-6886
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.2.1, 1.3.0, 1.4.0
Reporter: Davies Liu
Assignee: Apache Spark
Priority: Blocker

 Reported by beifei.zhou <beifei.zhou at ximalaya.com>: 
 I am using Spark to process bid datasets. However, there is always a problem 
 when executing reduceByKey on a large dataset, whereas a smaller dataset works 
 fine. May I ask how I could solve this issue?
 The error is always like this:
 {code}
 15/04/09 11:27:46 ERROR Executor: Exception in task 3.0 in stage 1.0 (TID 5)
 org.apache.spark.api.python.PythonException: Traceback (most recent call 
 last):
   File "/Users/nali/Softwares/spark/python/pyspark/worker.py", line 90, in main
     command = pickleSer.loads(command.value)
   File "/Users/nali/Softwares/spark/python/pyspark/broadcast.py", line 106, in value
     self._value = self.load(self._path)
   File "/Users/nali/Softwares/spark/python/pyspark/broadcast.py", line 87, in load
     with open(path, 'rb', 1 << 20) as f:
 IOError: [Errno 2] No such file or directory: 
 '/private/var/folders/_x/n59vb1b54pl96lvldz2lr_v4gn/T/spark-37d8ecbc-9ac9-4aa2-be23-12823f4cd1ed/pyspark-1e3d5904-a5b6-4222-a146-91bfdb4a33a7/tmp8XMhgG'
 {code}
 Here I attach my code:
 {code}
 import codecs
 from pyspark import SparkContext, SparkConf
 from operator import add 
 import operator
 from pyspark.storagelevel import StorageLevel
 def combine_dict(a,b):
 a.update(b)
 return a
 conf = SparkConf()
 sc = SparkContext(appName="tag")
 al_tag_dict = sc.textFile('albumtag.txt').map(lambda x: x.split(','))\
     .map(lambda x: {x[0]: x[1:]}).reduce(lambda a, b: combine_dict(a, b))
 result = sc.textFile('uidAlbumscore.txt')\
     .map(lambda x: x.split(','))\
     .filter(lambda x: x[1] in al_tag_dict.keys())\
     .map(lambda x: (x[0], al_tag_dict[x[1]], float(x[2])))\
     .map(lambda x: map(lambda a: ((x[0], a), x[2]), x[1]))\
     .flatMap(lambda x: x)\
     .map(lambda x: (str(x[0][0]), x[1]))\
     .reduceByKey(add)\
 #.map(lambda x: x[0][0]+','+x[0][1]+','+str(x[1])+'\n')\
 #.reduce(add)
 #codecs.open('tag_score.txt','w','utf-8').write(result)
 print result.first()
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6368) Build a specialized serializer for Exchange operator.

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6368:
---

Assignee: Yin Huai  (was: Apache Spark)

 Build a specialized serializer for Exchange operator. 
 --

 Key: SPARK-6368
 URL: https://issues.apache.org/jira/browse/SPARK-6368
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Critical

 Kryo is still pretty slow because it works on individual objects and is 
 relatively expensive to allocate. For the Exchange operator, because the schemas 
 for key and value are already defined, we can create a specialized serializer 
 to handle those specific schemas. 
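
To make the reasoning above concrete, here is a toy illustration in plain Python (not Spark code) of why a serializer specialized to a known schema can beat a generic one:
{code}
# Toy sketch only. When every row is known to be, say, (long key, double value),
# the bytes can be packed with one fixed layout; a generic serializer has to
# re-describe the object structure for every single row.
import pickle
import struct

ROW = struct.Struct("<qd")   # fixed layout: 8-byte long + 8-byte double

def write_specialized(rows, out):
    for key, value in rows:
        out.write(ROW.pack(key, value))      # 16 bytes per row, no per-object metadata

def write_generic(rows, out):
    for row in rows:
        out.write(pickle.dumps(row))         # structure and types encoded again for each row
{code}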



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6368) Build a specialized serializer for Exchange operator.

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6368:
---

Assignee: Apache Spark  (was: Yin Huai)

 Build a specialized serializer for Exchange operator. 
 --

 Key: SPARK-6368
 URL: https://issues.apache.org/jira/browse/SPARK-6368
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yin Huai
Assignee: Apache Spark
Priority: Critical

 Kryo is still pretty slow because it works on individual objects and is 
 relatively expensive to allocate. For the Exchange operator, because the schemas 
 for key and value are already defined, we can create a specialized serializer 
 to handle those specific schemas. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6368) Build a specialized serializer for Exchange operator.

2015-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493092#comment-14493092
 ] 

Apache Spark commented on SPARK-6368:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/5497

 Build a specialized serializer for Exchange operator. 
 --

 Key: SPARK-6368
 URL: https://issues.apache.org/jira/browse/SPARK-6368
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Yin Huai
Assignee: Yin Huai
Priority: Critical

 Kryo is still pretty slow because it works on individual objects and is 
 relatively expensive to allocate. For the Exchange operator, because the schemas 
 for key and value are already defined, we can create a specialized serializer 
 to handle those specific schemas. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6703) Provide a way to discover existing SparkContext's

2015-04-13 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493179#comment-14493179
 ] 

Patrick Wendell commented on SPARK-6703:


Yes, ideally we get it into 1.4 - though I think the ultimate solution here 
could be a very small patch.
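
To give a flavor of how small such a patch could be, here is a rough sketch of the rendez-vous pattern in Python; the names (get_or_create_context, set_active_context) are placeholders for illustration, not the API that will actually ship:
{code}
# Rough sketch only; the real change would live inside SparkContext itself.
import threading
from pyspark import SparkConf, SparkContext

_active_context = None
_lock = threading.RLock()

def set_active_context(sc):
    """Lets an outer framework (job server, notebook server) register its context."""
    global _active_context
    with _lock:
        _active_context = sc

def get_or_create_context(conf=None):
    """Returns the registered SparkContext, creating one only if none exists yet."""
    global _active_context
    with _lock:
        if _active_context is None:
            _active_context = SparkContext(conf=conf or SparkConf())
        return _active_context
{code}
An application that may run standalone or inside a shared server would then call get_or_create_context() instead of constructing a SparkContext directly.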

 Provide a way to discover existing SparkContext's
 -

 Key: SPARK-6703
 URL: https://issues.apache.org/jira/browse/SPARK-6703
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Patrick Wendell
Assignee: Ilya Ganelin
Priority: Critical

 Right now it is difficult to write a Spark application in a way that can be 
 run independently and also be composed with other Spark applications in an 
 environment such as the JobServer, notebook servers, etc where there is a 
 shared SparkContext.
 It would be nice to provide a rendez-vous point so that applications can 
 learn whether an existing SparkContext already exists before creating one.
 The most simple/surgical way I see to do this is to have an optional static 
 SparkContext singleton that people can be retrieved as follows:
 {code}
 val sc = SparkContext.getOrCreate(conf = new SparkConf())
 {code}
 And you could also have a setter where some outer framework/server can set it 
 for use by multiple downstream applications.
 A more advanced version of this would have some named registry or something, 
 but since we only support a single SparkContext in one JVM at this point 
 anyways, this seems sufficient and much simpler. Another advanced option 
 would be to allow plugging in some other notion of configuration you'd pass 
 when retrieving an existing context.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4638) Spark's MLlib SVM classification to include Kernels like Gaussian / (RBF) to find non linear boundaries

2015-04-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493180#comment-14493180
 ] 

Sean Owen commented on SPARK-4638:
--

[~mandar2812] Spark does not use patches in JIRA; it uses pull requests. Also, 
changes should be made against master, not a branch.
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

 Spark's MLlib SVM classification to include Kernels like Gaussian / (RBF) to 
 find non linear boundaries
 ---

 Key: SPARK-4638
 URL: https://issues.apache.org/jira/browse/SPARK-4638
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: madankumar s
  Labels: Gaussian, Kernels, SVM
 Attachments: kernels-1.3.patch


 SPARK MLlib Classification Module:
 Add Kernel functionalities to SVM Classifier to find non linear patterns



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5632) not able to resolve dot('.') in field name

2015-04-13 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-5632:
---
Issue Type: Sub-task  (was: Bug)
Parent: SPARK-6116

 not able to resolve dot('.') in field name
 --

 Key: SPARK-5632
 URL: https://issues.apache.org/jira/browse/SPARK-5632
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 1.2.0, 1.3.0
 Environment: Spark cluster: EC2 m1.small + Spark 1.2.0
 Cassandra cluster: EC2 m3.xlarge + Cassandra 2.1.2
Reporter: Lishu Liu
Priority: Blocker

 My Cassandra table task_trace has a field sm.result, which contains a dot in the 
 name. So SQL tried to look up sm instead of the full name 'sm.result'. 
 Here is my code: 
 {code}
 scala> import org.apache.spark.sql.cassandra.CassandraSQLContext
 scala> val cc = new CassandraSQLContext(sc)
 scala> val task_trace = cc.jsonFile("/task_trace.json")
 scala> task_trace.registerTempTable("task_trace")
 scala> cc.setKeyspace("cerberus_data_v4")
 scala> val res = cc.sql("SELECT received_datetime, task_body.cerberus_id, 
 task_body.sm.result FROM task_trace WHERE task_id = 
 'fff7304e-9984-4b45-b10c-0423a96745ce'")
 res: org.apache.spark.sql.SchemaRDD = 
 SchemaRDD[57] at RDD at SchemaRDD.scala:108
 == Query Plan ==
 == Physical Plan ==
 java.lang.RuntimeException: No such struct field sm in cerberus_batch_id, 
 cerberus_id, couponId, coupon_code, created, description, domain, expires, 
 message_id, neverShowAfter, neverShowBefore, offerTitle, screenshots, 
 sm.result, sm.task, startDate, task_id, url, uuid, validationDateTime, 
 validity
 {code}
 The full schema looks like this:
 {code}
 scala> task_trace.printSchema()
 root
  \|-- received_datetime: long (nullable = true)
  \|-- task_body: struct (nullable = true)
  \|\|-- cerberus_batch_id: string (nullable = true)
  \|\|-- cerberus_id: string (nullable = true)
  \|\|-- couponId: integer (nullable = true)
  \|\|-- coupon_code: string (nullable = true)
  \|\|-- created: string (nullable = true)
  \|\|-- description: string (nullable = true)
  \|\|-- domain: string (nullable = true)
  \|\|-- expires: string (nullable = true)
  \|\|-- message_id: string (nullable = true)
  \|\|-- neverShowAfter: string (nullable = true)
  \|\|-- neverShowBefore: string (nullable = true)
  \|\|-- offerTitle: string (nullable = true)
  \|\|-- screenshots: array (nullable = true)
  \|\|\|-- element: string (containsNull = false)
  \|\|-- sm.result: struct (nullable = true)
  \|\|\|-- cerberus_batch_id: string (nullable = true)
  \|\|\|-- cerberus_id: string (nullable = true)
  \|\|\|-- code: string (nullable = true)
  \|\|\|-- couponId: integer (nullable = true)
  \|\|\|-- created: string (nullable = true)
  \|\|\|-- description: string (nullable = true)
  \|\|\|-- domain: string (nullable = true)
  \|\|\|-- expires: string (nullable = true)
  \|\|\|-- message_id: string (nullable = true)
  \|\|\|-- neverShowAfter: string (nullable = true)
  \|\|\|-- neverShowBefore: string (nullable = true)
  \|\|\|-- offerTitle: string (nullable = true)
  \|\|\|-- result: struct (nullable = true)
  \|\|\|\|-- post: struct (nullable = true)
  \|\|\|\|\|-- alchemy_out_of_stock: struct (nullable = true)
  \|\|\|\|\|\|-- ci: double (nullable = true)
  \|\|\|\|\|\|-- value: boolean (nullable = true)
  \|\|\|\|\|-- meta: struct (nullable = true)
  \|\|\|\|\|\|-- None_tx_value: array (nullable = true)
  \|\|\|\|\|\|\|-- element: string (containsNull = 
 false)
  \|\|\|\|\|\|-- exceptions: array (nullable = true)
  \|\|\|\|\|\|\|-- element: string (containsNull = 
 false)
  \|\|\|\|\|\|-- no_input_value: array (nullable = true)
  \|\|\|\|\|\|\|-- element: string (containsNull = 
 false)
  \|\|\|\|\|\|-- not_mapped: array (nullable = true)
  \|\|\|\|\|\|\|-- element: string (containsNull = 
 false)
  \|\|\|\|\|\|-- not_transformed: array (nullable = true)
  \|\|\|\|\|\|\|-- element: array (containsNull = 
 false)
  \|\|\|\|\|\|\|\|-- element: string (containsNull 
 = false)
  \|\|\|\|\|-- now_price_checkout: struct (nullable = true)
  \|\|\|\|\|\|-- ci: double (nullable = true)
  \|\|\|\|\|\|-- value: double (nullable = true)
  \|\|\|\|\|-- shipping_price: struct (nullable = true)
  \|\|\|\|\|\|-- ci: double 

[jira] [Commented] (SPARK-6511) Publish hadoop provided build with instructions for different distros

2015-04-13 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493183#comment-14493183
 ] 

Patrick Wendell commented on SPARK-6511:


Just as an example, I tried to wire Spark to work with stock Hadoop 2.6. Here is 
how I got it running after doing a hadoop-provided build. This is pretty 
clunky, so I wonder if we should just support setting HADOOP_HOME or something, 
and then automatically find and add the jar files present within that folder.

{code}
export SPARK_DIST_CLASSPATH=$(find /tmp/hadoop-2.6.0/ -name "*.jar" | tr "\n" ":")
./bin/spark-shell
{code}

[~vanzin] for your CDH packages, what do you end up setting 
SPARK_DIST_CLASSPATH to?

 Publish hadoop provided build with instructions for different distros
 ---

 Key: SPARK-6511
 URL: https://issues.apache.org/jira/browse/SPARK-6511
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Patrick Wendell

 Currently we publish a series of binaries with different Hadoop client jars. 
 This mostly works, but some users have reported compatibility issues with 
 different distributions.
 One improvement moving forward might be to publish a binary build that simply 
 asks you to set HADOOP_HOME to pick up the Hadoop client location. That way 
 it would work across multiple distributions, even if they have subtle 
 incompatibilities with upstream Hadoop.
 I think a first step for this would be to produce such a build for the 
 community and see how well it works. One potential issue is that our fancy 
 excludes and dependency re-writing won't work with the simpler append 
 Hadoop's classpath to Spark. Also, how we deal with the Hive dependency is 
 unclear, i.e. should we continue to bundle Spark's Hive (which has some fixes 
 for dependency conflicts) or do we allow for linking against vanilla Hive at 
 runtime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6511) Publish hadoop provided build with instructions for different distros

2015-04-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493200#comment-14493200
 ] 

Sean Owen commented on SPARK-6511:
--

Yeah, that might be the fastest way to find all the jars at once; they occur in 
various places in the raw Hadoop distro, so this one-liner is really not too bad. 
I don't know that it's so great to then also start modifying the classpath 
based on HADOOP_HOME, as this might not be what the end user wants, or it might 
interfere with an explicitly configured classpath.

In something like CDH they're all laid out in one directory per component, so 
they are easier to find, but that isn't much different. I don't see that the distro 
sets SPARK_DIST_CLASSPATH, but it sets things like SPARK_LIBRARY_PATH in 
spark-env.sh to ${SPARK_HOME}/lib. I actually don't see where the Hadoop deps 
come in, but it is going to be something similar. The effect is about the same: 
to add all of the Hadoop client and YARN jars to the classpath too.

 Publish hadoop provided build with instructions for different distros
 ---

 Key: SPARK-6511
 URL: https://issues.apache.org/jira/browse/SPARK-6511
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Patrick Wendell

 Currently we publish a series of binaries with different Hadoop client jars. 
 This mostly works, but some users have reported compatibility issues with 
 different distributions.
 One improvement moving forward might be to publish a binary build that simply 
 asks you to set HADOOP_HOME to pick up the Hadoop client location. That way 
 it would work across multiple distributions, even if they have subtle 
 incompatibilities with upstream Hadoop.
 I think a first step for this would be to produce such a build for the 
 community and see how well it works. One potential issue is that our fancy 
 excludes and dependency re-writing won't work with the simpler append 
 Hadoop's classpath to Spark. Also, how we deal with the Hive dependency is 
 unclear, i.e. should we continue to bundle Spark's Hive (which has some fixes 
 for dependency conflicts) or do we allow for linking against vanilla Hive at 
 runtime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-1701) Inconsistent naming: slice or partition

2015-04-13 Thread Thomas Graves (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas Graves reassigned SPARK-1701:


Assignee: Thomas Graves

 Inconsistent naming: slice or partition
 ---

 Key: SPARK-1701
 URL: https://issues.apache.org/jira/browse/SPARK-1701
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Spark Core
Reporter: Daniel Darabos
Assignee: Thomas Graves
Priority: Minor
  Labels: starter
 Fix For: 1.2.0


 Throughout the documentation and code, "slice" and "partition" are used 
 interchangeably. (Or so it seems to me.) It would avoid some confusion for 
 new users to settle on one name. I think "partition" is winning, since that 
 is the name of the class representing the concept.
 This should not be much more complicated to do than a search & replace. I can 
 take a stab at it, if you agree.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6742) Spark pushes down filters in old parquet path that reference partitioning columns

2015-04-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-6742.
-
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5390
[https://github.com/apache/spark/pull/5390]

 Spark pushes down filters in old parquet path that reference partitioning 
 columns
 -

 Key: SPARK-6742
 URL: https://issues.apache.org/jira/browse/SPARK-6742
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.1
Reporter: Yash Datta
 Fix For: 1.4.0


 Create a table with multiple fields, partitioned on the 'market' column. Run a 
 query like: 
 SELECT start_sp_time, end_sp_time, imsi, imei,  enb_common_enbid FROM 
 csl_data_parquet WHERE (((technology = 'FDD') AND (bandclass = '800') AND 
 (region = 'R15') AND (market = 'LA metro')) OR ((technology = 'FDD') AND 
 (bandclass = '1900') AND (region = 'R15') AND (market = 'Indianapolis'))) AND 
 start_sp_time = 1.4158368E9 AND end_sp_time < 1.4159232E9 AND dt = 
 '2014-11-13-00-00' AND dt < '2014-11-14-00-00' ORDER BY end_sp_time DESC 
 LIMIT 100
 The OR filter is pushed down in this case, resulting in a 'column not found' 
 exception from Parquet. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5632) not able to resolve dot('.') in field name

2015-04-13 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin updated SPARK-5632:
---
 Priority: Blocker  (was: Major)
 Target Version/s: 1.4.0
Affects Version/s: 1.3.0

 not able to resolve dot('.') in field name
 --

 Key: SPARK-5632
 URL: https://issues.apache.org/jira/browse/SPARK-5632
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.2.0, 1.3.0
 Environment: Spark cluster: EC2 m1.small + Spark 1.2.0
 Cassandra cluster: EC2 m3.xlarge + Cassandra 2.1.2
Reporter: Lishu Liu
Priority: Blocker

 My Cassandra table task_trace has a field sm.result, which contains a dot in the 
 name. So SQL tried to look up sm instead of the full name 'sm.result'. 
 Here is my code: 
 {code}
 scala> import org.apache.spark.sql.cassandra.CassandraSQLContext
 scala> val cc = new CassandraSQLContext(sc)
 scala> val task_trace = cc.jsonFile("/task_trace.json")
 scala> task_trace.registerTempTable("task_trace")
 scala> cc.setKeyspace("cerberus_data_v4")
 scala> val res = cc.sql("SELECT received_datetime, task_body.cerberus_id, 
 task_body.sm.result FROM task_trace WHERE task_id = 
 'fff7304e-9984-4b45-b10c-0423a96745ce'")
 res: org.apache.spark.sql.SchemaRDD = 
 SchemaRDD[57] at RDD at SchemaRDD.scala:108
 == Query Plan ==
 == Physical Plan ==
 java.lang.RuntimeException: No such struct field sm in cerberus_batch_id, 
 cerberus_id, couponId, coupon_code, created, description, domain, expires, 
 message_id, neverShowAfter, neverShowBefore, offerTitle, screenshots, 
 sm.result, sm.task, startDate, task_id, url, uuid, validationDateTime, 
 validity
 {code}
 The full schema looks like this:
 {code}
 scala> task_trace.printSchema()
 root
  \|-- received_datetime: long (nullable = true)
  \|-- task_body: struct (nullable = true)
  \|\|-- cerberus_batch_id: string (nullable = true)
  \|\|-- cerberus_id: string (nullable = true)
  \|\|-- couponId: integer (nullable = true)
  \|\|-- coupon_code: string (nullable = true)
  \|\|-- created: string (nullable = true)
  \|\|-- description: string (nullable = true)
  \|\|-- domain: string (nullable = true)
  \|\|-- expires: string (nullable = true)
  \|\|-- message_id: string (nullable = true)
  \|\|-- neverShowAfter: string (nullable = true)
  \|\|-- neverShowBefore: string (nullable = true)
  \|\|-- offerTitle: string (nullable = true)
  \|\|-- screenshots: array (nullable = true)
  \|\|\|-- element: string (containsNull = false)
  \|\|-- sm.result: struct (nullable = true)
  \|\|\|-- cerberus_batch_id: string (nullable = true)
  \|\|\|-- cerberus_id: string (nullable = true)
  \|\|\|-- code: string (nullable = true)
  \|\|\|-- couponId: integer (nullable = true)
  \|\|\|-- created: string (nullable = true)
  \|\|\|-- description: string (nullable = true)
  \|\|\|-- domain: string (nullable = true)
  \|\|\|-- expires: string (nullable = true)
  \|\|\|-- message_id: string (nullable = true)
  \|\|\|-- neverShowAfter: string (nullable = true)
  \|\|\|-- neverShowBefore: string (nullable = true)
  \|\|\|-- offerTitle: string (nullable = true)
  \|\|\|-- result: struct (nullable = true)
  \|\|\|\|-- post: struct (nullable = true)
  \|\|\|\|\|-- alchemy_out_of_stock: struct (nullable = true)
  \|\|\|\|\|\|-- ci: double (nullable = true)
  \|\|\|\|\|\|-- value: boolean (nullable = true)
  \|\|\|\|\|-- meta: struct (nullable = true)
  \|\|\|\|\|\|-- None_tx_value: array (nullable = true)
  \|\|\|\|\|\|\|-- element: string (containsNull = 
 false)
  \|\|\|\|\|\|-- exceptions: array (nullable = true)
  \|\|\|\|\|\|\|-- element: string (containsNull = 
 false)
  \|\|\|\|\|\|-- no_input_value: array (nullable = true)
  \|\|\|\|\|\|\|-- element: string (containsNull = 
 false)
  \|\|\|\|\|\|-- not_mapped: array (nullable = true)
  \|\|\|\|\|\|\|-- element: string (containsNull = 
 false)
  \|\|\|\|\|\|-- not_transformed: array (nullable = true)
  \|\|\|\|\|\|\|-- element: array (containsNull = 
 false)
  \|\|\|\|\|\|\|\|-- element: string (containsNull 
 = false)
  \|\|\|\|\|-- now_price_checkout: struct (nullable = true)
  \|\|\|\|\|\|-- ci: double (nullable = true)
  \|\|\|\|\|\|-- value: double (nullable = true)
  \|\|\|\|\|-- shipping_price: struct (nullable = true)
  \|\|\|  

[jira] [Comment Edited] (SPARK-6865) Decide on semantics for string identifiers in DataFrame API

2015-04-13 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493193#comment-14493193
 ] 

Reynold Xin edited comment on SPARK-6865 at 4/13/15 10:26 PM:
--

As discussed offline, it would make more sense to go with option 1, i.e.

- "str" is treated as a quoted identifier in SQL, equivalent to `str`.
- "*" is a special case in which it refers to all the columns in a data frame. 
(Note that this means we cannot have a column named "*", which I think is fine.)

The reason is that strings are already quoted, and programmers expect them to 
be quoted literals without extra escaping.

We will need to fix our resolver with respect to dots.



was (Author: rxin):
As discussed offline, it would make more sense to go with option 1, i.e.

- "str" is treated as a quoted identifier in SQL, equivalent to `str`.
- "*" is a special case in which it refers to all the columns in a data frame. 
(Note that this means we cannot have a column named "*", which I think is fine.)

The reason is that strings are already quoted, and programmers expect them to 
be quoted literals without extra escaping.

We will need to fix our resolver with respect to dots.
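
To make the chosen semantics concrete, here is a small illustration in the Python DataFrame API (the column names and the sqlContext are assumed for the example):
{code}
# Under option 1, a plain string is a quoted identifier, so a column whose name
# literally contains a dot is addressed directly; "*" remains the one special case.
df = sqlContext.createDataFrame([(1, 2)], ["a.b", "c"])

df.select("a.b")   # the column literally named "a.b", equivalent to `a.b` in SQL
df.select("*")     # special case: all columns of the data frame
{code}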


 Decide on semantics for string identifiers in DataFrame API
 ---

 Key: SPARK-6865
 URL: https://issues.apache.org/jira/browse/SPARK-6865
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Priority: Blocker

 There are two options:
  - Quoted Identifiers: meaning that the strings are treated as though they 
 were in backticks in SQL.  Any weird characters (spaces, or, etc) are 
 considered part of the identifier.  Kind of weird given that `*` is already a 
 special identifier explicitly allowed by the API
  - Unquoted parsed identifiers: would allow users to specify things like 
 tableAlias.*. However, it would also require explicit use of `backticks` for 
 identifiers with weird characters in them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6865) Decide on semantics for string identifiers in DataFrame API

2015-04-13 Thread Reynold Xin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Reynold Xin resolved SPARK-6865.

   Resolution: Fixed
Fix Version/s: 1.4.0
 Assignee: Reynold Xin

This is now decided.

 Decide on semantics for string identifiers in DataFrame API
 ---

 Key: SPARK-6865
 URL: https://issues.apache.org/jira/browse/SPARK-6865
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Michael Armbrust
Assignee: Reynold Xin
Priority: Blocker
 Fix For: 1.4.0


 There are two options:
  - Quoted Identifiers: meaning that the strings are treated as though they 
 were in backticks in SQL.  Any weird characters (spaces, or, etc) are 
 considered part of the identifier.  Kind of weird given that `*` is already a 
 special identifier explicitly allowed by the API
  - Unquoted parsed identifiers: would allow users to specify things like 
 tableAlias.*. However, it would also require explicit use of `backticks` for 
 identifiers with weird characters in them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-2873) Support disk spilling in Spark SQL aggregation / join

2015-04-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-2873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust updated SPARK-2873:

Priority: Blocker  (was: Major)

 Support disk spilling in Spark SQL aggregation / join
 -

 Key: SPARK-2873
 URL: https://issues.apache.org/jira/browse/SPARK-2873
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.0.0
Reporter: guowei
Priority: Blocker





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1701) Inconsistent naming: slice or partition

2015-04-13 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493147#comment-14493147
 ] 

Nicholas Chammas commented on SPARK-1701:
-

[~tgraves] - Shouldn't this issue be assigned to [~farrellee]?

 Inconsistent naming: slice or partition
 ---

 Key: SPARK-1701
 URL: https://issues.apache.org/jira/browse/SPARK-1701
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, Spark Core
Reporter: Daniel Darabos
Assignee: Thomas Graves
Priority: Minor
  Labels: starter
 Fix For: 1.2.0


 Throughout the documentation and code, "slice" and "partition" are used 
 interchangeably. (Or so it seems to me.) It would avoid some confusion for 
 new users to settle on one name. I think "partition" is winning, since that 
 is the name of the class representing the concept.
 This should not be much more complicated to do than a search & replace. I can 
 take a stab at it, if you agree.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-4638) Spark's MLlib SVM classification to include Kernels like Gaussian / (RBF) to find non linear boundaries

2015-04-13 Thread Mandar Chandorkar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493174#comment-14493174
 ] 

Mandar Chandorkar edited comment on SPARK-4638 at 4/13/15 10:05 PM:


Patch for the kernels implementation, taken against the current branch-1.3 of 
Apache Spark.
[~amuise]: Here is the patch you requested. If there are any problems with it, or if 
you need anything else, I will do my best to supply it. Thank you.


was (Author: mandar2812):
Patch for the kernels implementation, taken against the current branch-1.3 of 
Apache Spark.

 Spark's MLlib SVM classification to include Kernels like Gaussian / (RBF) to 
 find non linear boundaries
 ---

 Key: SPARK-4638
 URL: https://issues.apache.org/jira/browse/SPARK-4638
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Reporter: madankumar s
  Labels: Gaussian, Kernels, SVM
 Attachments: kernels-1.3.patch


 SPARK MLlib Classification Module:
 Add Kernel functionalities to SVM Classifier to find non linear patterns



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6887) ColumnBuilder misses FloatType

2015-04-13 Thread Yin Huai (JIRA)
Yin Huai created SPARK-6887:
---

 Summary: ColumnBuilder misses FloatType
 Key: SPARK-6887
 URL: https://issues.apache.org/jira/browse/SPARK-6887
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Yin Huai
Assignee: Yin Huai
 Fix For: 1.4.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6887) ColumnBuilder misses FloatType

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6887:
---

Assignee: Yin Huai  (was: Apache Spark)

 ColumnBuilder misses FloatType
 --

 Key: SPARK-6887
 URL: https://issues.apache.org/jira/browse/SPARK-6887
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Yin Huai
Assignee: Yin Huai
 Fix For: 1.4.0


 To reproduce ...
 {code}
 import org.apache.spark.sql.types._
 import org.apache.spark.sql.Row
 val schema = StructType(StructField("c", FloatType, true) :: Nil)
 val rdd = sc.parallelize(1 to 100).map(i => Row(i.toFloat))
 sqlContext.createDataFrame(rdd, schema).registerTempTable("test")
 sqlContext.sql("cache table test")
 sqlContext.table("test").show
 {code}
 The exception is ...
 {code}
 15/04/13 15:00:12 INFO DAGScheduler: Job 0 failed: collect at 
 SparkPlan.scala:88, took 0.474392 s
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in 
 stage 0.0 failed 1 times, most recent failure: Lost task 5.0 in stage 0.0 
 (TID 5, localhost): java.lang.ClassCastException: 
 org.apache.spark.sql.catalyst.expressions.MutableFloat cannot be cast to 
 org.apache.spark.sql.catalyst.expressions.MutableLong
   at 
 org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setLong(SpecificMutableRow.scala:292)
   at 
 org.apache.spark.sql.columnar.compression.LongDelta$Decoder.next(compressionSchemes.scala:539)
   at 
 org.apache.spark.sql.columnar.compression.CompressibleColumnAccessor$class.extractSingle(CompressibleColumnAccessor.scala:37)
   at 
 org.apache.spark.sql.columnar.NativeColumnAccessor.extractSingle(ColumnAccessor.scala:64)
   at 
 org.apache.spark.sql.columnar.BasicColumnAccessor.extractTo(ColumnAccessor.scala:54)
   at 
 org.apache.spark.sql.columnar.NativeColumnAccessor.org$apache$spark$sql$columnar$NullableColumnAccessor$$super$extractTo(ColumnAccessor.scala:64)
   at 
 org.apache.spark.sql.columnar.NullableColumnAccessor$class.extractTo(NullableColumnAccessor.scala:52)
   at 
 org.apache.spark.sql.columnar.NativeColumnAccessor.extractTo(ColumnAccessor.scala:64)
   at 
 org.apache.spark.sql.columnar.InMemoryColumnarTableScan$$anonfun$8$$anonfun$13$$anon$2.next(InMemoryColumnarTableScan.scala:295)
   at 
 org.apache.spark.sql.columnar.InMemoryColumnarTableScan$$anonfun$8$$anonfun$13$$anon$2.next(InMemoryColumnarTableScan.scala:290)
   at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
   at 
 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:130)
   at 
 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:126)
   at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
   at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
   at org.apache.spark.scheduler.Task.run(Task.scala:64)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:210)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6887) ColumnBuilder misses FloatType

2015-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493186#comment-14493186
 ] 

Apache Spark commented on SPARK-6887:
-

User 'yhuai' has created a pull request for this issue:
https://github.com/apache/spark/pull/5499

 ColumnBuilder misses FloatType
 --

 Key: SPARK-6887
 URL: https://issues.apache.org/jira/browse/SPARK-6887
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.4.0
Reporter: Yin Huai
Assignee: Yin Huai
 Fix For: 1.4.0


 To reproduce ...
 {code}
 import org.apache.spark.sql.types._
 import org.apache.spark.sql.Row
 val schema = StructType(StructField("c", FloatType, true) :: Nil)
 val rdd = sc.parallelize(1 to 100).map(i => Row(i.toFloat))
 sqlContext.createDataFrame(rdd, schema).registerTempTable("test")
 sqlContext.sql("cache table test")
 sqlContext.table("test").show
 {code}
 The exception is ...
 {code}
 15/04/13 15:00:12 INFO DAGScheduler: Job 0 failed: collect at 
 SparkPlan.scala:88, took 0.474392 s
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in 
 stage 0.0 failed 1 times, most recent failure: Lost task 5.0 in stage 0.0 
 (TID 5, localhost): java.lang.ClassCastException: 
 org.apache.spark.sql.catalyst.expressions.MutableFloat cannot be cast to 
 org.apache.spark.sql.catalyst.expressions.MutableLong
   at 
 org.apache.spark.sql.catalyst.expressions.SpecificMutableRow.setLong(SpecificMutableRow.scala:292)
   at 
 org.apache.spark.sql.columnar.compression.LongDelta$Decoder.next(compressionSchemes.scala:539)
   at 
 org.apache.spark.sql.columnar.compression.CompressibleColumnAccessor$class.extractSingle(CompressibleColumnAccessor.scala:37)
   at 
 org.apache.spark.sql.columnar.NativeColumnAccessor.extractSingle(ColumnAccessor.scala:64)
   at 
 org.apache.spark.sql.columnar.BasicColumnAccessor.extractTo(ColumnAccessor.scala:54)
   at 
 org.apache.spark.sql.columnar.NativeColumnAccessor.org$apache$spark$sql$columnar$NullableColumnAccessor$$super$extractTo(ColumnAccessor.scala:64)
   at 
 org.apache.spark.sql.columnar.NullableColumnAccessor$class.extractTo(NullableColumnAccessor.scala:52)
   at 
 org.apache.spark.sql.columnar.NativeColumnAccessor.extractTo(ColumnAccessor.scala:64)
   at 
 org.apache.spark.sql.columnar.InMemoryColumnarTableScan$$anonfun$8$$anonfun$13$$anon$2.next(InMemoryColumnarTableScan.scala:295)
   at 
 org.apache.spark.sql.columnar.InMemoryColumnarTableScan$$anonfun$8$$anonfun$13$$anon$2.next(InMemoryColumnarTableScan.scala:290)
   at scala.collection.Iterator$$anon$13.next(Iterator.scala:372)
   at 
 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:130)
   at 
 org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:126)
   at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
   at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:640)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
   at 
 org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
   at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
   at 
 org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
   at org.apache.spark.scheduler.Task.run(Task.scala:64)
   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:210)
   at 
 java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   at 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   at java.lang.Thread.run(Thread.java:745)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6820) Convert NAs to null type in SparkR DataFrames

2015-04-13 Thread Antonio Piccolboni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493223#comment-14493223
 ] 

Antonio Piccolboni commented on SPARK-6820:
---

For the distinction between NAs and NULLs in R, see 
http://www.r-bloggers.com/r-na-vs-null/. This seems a fairly dangerous move, 
but I don't have a good alternative to suggest. This is a valid data frame:

dd <- structure(list(c.1..2..NA. = c(1, 2, NA), V2 = list(1, 2, NULL)),
                .Names = c("c.1..2..NA.", "V2"), row.names = c(NA, -3L),
                class = "data.frame")

dd[3, 1] == dd[3, 2][[1]]

How often real code relies on list columns that can contain nulls, I am not 
sure.

 Convert NAs to null type in SparkR DataFrames
 -

 Key: SPARK-6820
 URL: https://issues.apache.org/jira/browse/SPARK-6820
 Project: Spark
  Issue Type: New Feature
  Components: SparkR, SQL
Reporter: Shivaram Venkataraman

 While converting RDD or local R DataFrame to a SparkR DataFrame we need to 
 handle missing values or NAs.
 We should convert NAs to SparkSQL's null type to handle the conversion 
 correctly



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6889) Streamline contribution process with update to Contribution wiki, JIRA rules

2015-04-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493252#comment-14493252
 ] 

Sean Owen commented on SPARK-6889:
--

For those who would like to comment directly on the two documents in an 
easier-to-use interface, they are available at:

SparkProjectMechanicsChallenges
https://docs.google.com/document/d/1eV7hWvVPLuZEtvjl72_qYx1iraYKuPzoZWFxtdL99QI/edit?usp=sharing

ContributingToSpark
https://docs.google.com/document/d/1tB9-f9lmxhC32QlOo4E8Z7eGDwHx1_Q3O8uCmRXQTo8/edit?usp=sharing

But you can comment here too, and I will eventually update the PDFs if needed 
so that the latest discussion can be seen here promptly.

 Streamline contribution process with update to Contribution wiki, JIRA rules
 

 Key: SPARK-6889
 URL: https://issues.apache.org/jira/browse/SPARK-6889
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Sean Owen
Assignee: Sean Owen
 Attachments: ContributingtoSpark.pdf, 
 SparkProjectMechanicsChallenges.pdf


 From about 6 months of intimate experience with the Spark JIRA and the 
 reality of the JIRA / PR flow, I've observed some challenges, problems and 
 growing pains that have begun to encumber the project mechanics. In the 
 attached SparkProjectMechanicsChallenges.pdf document, I've collected these 
 observations and a few statistics that summarize much of what I've seen. From 
 side conversations with several of you, I think some of these will resonate. 
 (Read it first for this to make sense.)
 I'd like to improve just one aspect to start: the contribution process. A lot 
 of inbound contribution effort gets misdirected, and can burn a lot of cycles 
 for everyone, and that's a barrier to scaling up further and to general 
 happiness. I'd like to propose for discussion a change to the wiki pages, and 
 a change to some JIRA settings. 
 *Wiki*
 - Replace 
 https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark with 
 proposed text (NewContributingToSpark.pdf)
 - Delete 
 https://cwiki.apache.org/confluence/display/SPARK/Reviewing+and+Merging+Patches
  as it is subsumed by the new text
 - Move the IDE Setup section to 
 https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools
 - Delete 
 https://cwiki.apache.org/confluence/display/SPARK/Jira+Permissions+Scheme as 
 it's a  bit out of date and not all that useful
 *JIRA*
 Now: 
 Start by removing everyone from the 'Developer' role and add them to 
 'Contributor'. Right now Developer has no permission that Contributor 
 doesn't. We may reuse Developer later for some level between Committer and 
 Contributor.
 Later, with Apache admin assistance:
 - Make Component and Affects Version required for new JIRAs
 - Set default priority to Minor and type to Question for new JIRAs. If 
 defaults aren't changed, by default it can't be that important
 - Only let Committers set Target Version and Fix Version
 - Only let Committers set Blocker Priority



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6703) Provide a way to discover existing SparkContext's

2015-04-13 Thread Evan Chan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493303#comment-14493303
 ] 

Evan Chan commented on SPARK-6703:
--

Hey folks,

Thought I would just put in my 2 cents as the author of the Spark Jobserver.  

What is the envisioned way for multiple applications to share the same 
SparkContext?   Code has to be running in the same JVM, and for most 
applications there already must exist some shared knowledge of the framework or 
environment.  This will affect whether this feature is useful or not.

For example, the Spark Jobserver requires jobs to implement an interface, and 
also manages creation of the SparkContext.  That way, jobs get the SparkContext 
through a method call, and we can have other method calls to do things like 
input validation. 

What I'm saying is that this feature would have little existing value to job 
server users, as jobs in job server already have a way to discover the existing 
context, and to implement a good RESTful API, for example.

Another thing to think about is SQLContext and HiveContext. I realize 
there is the JDBC server, but in job server we have a way to pass in 
alternative forms of the contexts.  I suppose you could then add this method to 
a static SQLContext singleton as well. 

 Provide a way to discover existing SparkContext's
 -

 Key: SPARK-6703
 URL: https://issues.apache.org/jira/browse/SPARK-6703
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Patrick Wendell
Assignee: Ilya Ganelin
Priority: Critical

 Right now it is difficult to write a Spark application in a way that can be 
 run independently and also be composed with other Spark applications in an 
 environment such as the JobServer, notebook servers, etc where there is a 
 shared SparkContext.
 It would be nice to provide a rendez-vous point so that applications can 
 learn whether an existing SparkContext already exists before creating one.
 The most simple/surgical way I see to do this is to have an optional static 
 SparkContext singleton that people can be retrieved as follows:
 {code}
 val sc = SparkContext.getOrCreate(conf = new SparkConf())
 {code}
 And you could also have a setter where some outer framework/server can set it 
 for use by multiple downstream applications.
 A more advanced version of this would have some named registry or something, 
 but since we only support a single SparkContext in one JVM at this point 
 anyways, this seems sufficient and much simpler. Another advanced option 
 would be to allow plugging in some other notion of configuration you'd pass 
 when retrieving an existing context.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6881) Change the checkpoint directory name from checkpoints to checkpoint

2015-04-13 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-6881.
--
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5493
[https://github.com/apache/spark/pull/5493]

 Change the checkpoint directory name from checkpoints to checkpoint
 ---

 Key: SPARK-6881
 URL: https://issues.apache.org/jira/browse/SPARK-6881
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 1.4.0
Reporter: Hao
Priority: Trivial
 Fix For: 1.4.0


 The name "checkpoint" (rather than "checkpoints") is what is included in .gitignore.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6511) Publish hadoop provided build with instructions for different distros

2015-04-13 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493274#comment-14493274
 ] 

Patrick Wendell commented on SPARK-6511:


Can we just run HADOOP_HOME/bin/hadoop classpath and then capture the result? 
I'm wondering if there is a standard interface here we can expect most Hadoop 
distributions to have.
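For example, a minimal sketch of capturing that output from Scala (assuming HADOOP_HOME 
is set; error handling omitted):

{code}
import scala.sys.process._

// Invoke the distro's own `hadoop classpath` and capture its stdout.
val hadoopHome = sys.env("HADOOP_HOME")
val hadoopClasspath = Seq(s"$hadoopHome/bin/hadoop", "classpath").!!.trim

// The captured string could then be appended to the driver/executor classpaths at launch time.
println(hadoopClasspath)
{code}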

 Publish hadoop provided build with instructions for different distros
 ---

 Key: SPARK-6511
 URL: https://issues.apache.org/jira/browse/SPARK-6511
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Patrick Wendell

 Currently we publish a series of binaries with different Hadoop client jars. 
 This mostly works, but some users have reported compatibility issues with 
 different distributions.
 One improvement moving forward might be to publish a binary build that simply 
 asks you to set HADOOP_HOME to pick up the Hadoop client location. That way 
 it would work across multiple distributions, even if they have subtle 
 incompatibilities with upstream Hadoop.
 I think a first step for this would be to produce such a build for the 
 community and see how well it works. One potential issue is that our fancy 
 excludes and dependency re-writing won't work with the simpler approach of just 
 appending Hadoop's classpath to Spark. Also, how we deal with the Hive dependency is 
 unclear, i.e. should we continue to bundle Spark's Hive (which has some fixes 
 for dependency conflicts) or do we allow for linking against vanilla Hive at 
 runtime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6890) Local cluster mode in Mac is broken

2015-04-13 Thread Davies Liu (JIRA)
Davies Liu created SPARK-6890:
-

 Summary: Local cluster mode in Mac is broken
 Key: SPARK-6890
 URL: https://issues.apache.org/jira/browse/SPARK-6890
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Davies Liu
Assignee: Andrew Or
Priority: Blocker


The worker cannot be launched.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6151) schemaRDD to parquetfile with saveAsParquetFile control the HDFS block size

2015-04-13 Thread Littlestar (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14491881#comment-14491881
 ] 

Littlestar edited comment on SPARK-6151 at 4/14/15 1:04 AM:


The HDFS block size is set once when you first install Hadoop.
The blockSize can be overridden when a file is created, but Spark provides no way to 
change it.
FSDataOutputStream org.apache.hadoop.fs.FileSystem.create(Path f, boolean 
overwrite, int bufferSize, short replication, long blockSize) throws IOException
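To make the request concrete, here is a sketch of the kind of knob being asked for, 
assuming (and this is exactly the gap described above) that the Parquet write path 
honoured block-size settings placed on the Hadoop Configuration:

{code}
// Illustrative only; `sc` is an existing SparkContext and the paths are placeholders.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sc.hadoopConfiguration.setLong("dfs.blocksize", 256L * 1024 * 1024)       // desired HDFS block size
sc.hadoopConfiguration.setLong("parquet.block.size", 256L * 1024 * 1024)  // desired Parquet row-group size

val schemaRDD = sqlContext.parquetFile("/input/path")
schemaRDD.saveAsParquetFile("/output/path")
{code}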




was (Author: cnstar9988):
The HDFS Block Size is set once when you first install Hadoop.
blockSize can be changed when File create.
FSDataOutputStream org.apache.hadoop.fs.FileSystem.create(Path f, boolean 
overwrite, int bufferSize, short replication, long blockSize) throws IOException



 schemaRDD to parquetfile with saveAsParquetFile control the HDFS block size
 ---

 Key: SPARK-6151
 URL: https://issues.apache.org/jira/browse/SPARK-6151
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.2.1
Reporter: Littlestar
Priority: Trivial

 How can a SchemaRDD written to a Parquet file via saveAsParquetFile control the 
 HDFS block size? A Configuration option may be needed.
 Related questions from others:
 http://apache-spark-user-list.1001560.n3.nabble.com/HDFS-block-size-for-parquet-output-tt21183.html
 http://qnalist.com/questions/5054892/spark-sql-parquet-and-impala



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6877) Add code generation support for Min

2015-04-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6877?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-6877.
-
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5487
[https://github.com/apache/spark/pull/5487]

 Add code generation support for Min
 ---

 Key: SPARK-6877
 URL: https://issues.apache.org/jira/browse/SPARK-6877
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Reporter: Liang-Chi Hsieh
 Fix For: 1.4.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6890) Local cluster mode is broken

2015-04-13 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493448#comment-14493448
 ] 

Andrew Or commented on SPARK-6890:
--

I'm not actively working on this. Feel free to fix it since you and Nishkam 
have more experience in that part of the code.

 Local cluster mode is broken
 

 Key: SPARK-6890
 URL: https://issues.apache.org/jira/browse/SPARK-6890
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Davies Liu
Assignee: Andrew Or
Priority: Critical

 In master, local cluster mode is broken. If I run `bin/spark-submit --master 
 local-cluster[2,1,512]`, my executors keep failing with class not found 
 exception. It appears that the assembly jar is not added to the executors' 
 class paths. I suspect that this is caused by 
 https://github.com/apache/spark/pull/5085.
 {code}
 Exception in thread main java.lang.NoClassDefFoundError: scala/Option
   at java.lang.Class.getDeclaredMethods0(Native Method)
   at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
   at java.lang.Class.getMethod0(Class.java:2774)
   at java.lang.Class.getMethod(Class.java:1663)
   at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
   at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)
 Caused by: java.lang.ClassNotFoundException: scala.Option
   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6891) ExecutorAllocationManager will request negative number executors

2015-04-13 Thread meiyoula (JIRA)
meiyoula created SPARK-6891:
---

 Summary: ExecutorAllocationManager will request negative number 
executors
 Key: SPARK-6891
 URL: https://issues.apache.org/jira/browse/SPARK-6891
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: meiyoula
Priority: Critical


Below is the exception:
15/04/14 10:10:18 ERROR Utils: Uncaught exception in thread 
spark-dynamic-executor-allocation-0
java.lang.IllegalArgumentException: Attempted to request a negative number of 
executor(s) -1 from the cluster manager. Please specify a positive number!
at 
org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:342)
at 
org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1170)
at 
org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:294)
at 
org.apache.spark.ExecutorAllocationManager.addOrCancelExecutorRequests(ExecutorAllocationManager.scala:263)
at 
org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:230)
at 
org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply$mcV$sp(ExecutorAllocationManager.scala:189)
at 
org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189)
at 
org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1723)
at 
org.apache.spark.ExecutorAllocationManager$$anon$1.run(ExecutorAllocationManager.scala:189)
at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at 
java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)

Below are the configurations I set:
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 0
spark.dynamicAllocation.initialExecutors 3
spark.dynamicAllocation.maxExecutors 7
spark.dynamicAllocation.executorIdleTimeout 30
spark.shuffle.service.enabled true
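For what it's worth, the trace suggests the allocation manager needs a lower bound 
before it calls requestTotalExecutors. A rough sketch of the kind of clamp meant here 
(illustrative, not the actual Spark code):

{code}
// Keep the requested total inside [minExecutors, maxExecutors] so the delta can never go negative.
def boundedTarget(current: Int, desiredDelta: Int, minExecutors: Int, maxExecutors: Int): Int = {
  math.max(minExecutors, math.min(maxExecutors, current + desiredDelta))
}
{code}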



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6889) Streamline contribution process with update to Contribution wiki, JIRA rules

2015-04-13 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493254#comment-14493254
 ] 

Patrick Wendell commented on SPARK-6889:


Thanks for posting this Sean. Overall, I think this is a big improvement. Some 
comments on the proposed JIRA workflow changes:

1. I think logically Affects Version/s is required only for bugs, right? Is 
there a well defined meaning for Affects Version/s for a new feature that is 
distinct from Target Version/s?
2. I am not sure you can restrict certain priority levels to certain roles, but 
if so that would be really nice.

 Streamline contribution process with update to Contribution wiki, JIRA rules
 

 Key: SPARK-6889
 URL: https://issues.apache.org/jira/browse/SPARK-6889
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Sean Owen
Assignee: Sean Owen
 Attachments: ContributingtoSpark.pdf, 
 SparkProjectMechanicsChallenges.pdf


 From about 6 months of intimate experience with the Spark JIRA and the 
 reality of the JIRA / PR flow, I've observed some challenges, problems and 
 growing pains that have begun to encumber the project mechanics. In the 
 attached SparkProjectMechanicsChallenges.pdf document, I've collected these 
 observations and a few statistics that summarize much of what I've seen. From 
 side conversations with several of you, I think some of these will resonate. 
 (Read it first for this to make sense.)
 I'd like to improve just one aspect to start: the contribution process. A lot 
 of inbound contribution effort gets misdirected, and can burn a lot of cycles 
 for everyone, and that's a barrier to scaling up further and to general 
 happiness. I'd like to propose for discussion a change to the wiki pages, and 
 a change to some JIRA settings. 
 *Wiki*
 - Replace 
 https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark with 
 proposed text (NewContributingToSpark.pdf)
 - Delete 
 https://cwiki.apache.org/confluence/display/SPARK/Reviewing+and+Merging+Patches
  as it is subsumed by the new text
 - Move the IDE Setup section to 
 https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools
 - Delete 
 https://cwiki.apache.org/confluence/display/SPARK/Jira+Permissions+Scheme as 
 it's a  bit out of date and not all that useful
 *JIRA*
 Now: 
 Start by removing everyone from the 'Developer' role and add them to 
 'Contributor'. Right now Developer has no permission that Contributor 
 doesn't. We may reuse Developer later for some level between Committer and 
 Contributor.
 Later, with Apache admin assistance:
 - Make Component and Affects Version required for new JIRAs
 - Set default priority to Minor and type to Question for new JIRAs. If 
 defaults aren't changed, by default it can't be that important
 - Only let Committers set Target Version and Fix Version
 - Only let Committers set Blocker Priority



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5931) Use consistent naming for time properties

2015-04-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-5931:
-
Assignee: Ilya Ganelin  (was: Andrew Or)

 Use consistent naming for time properties
 -

 Key: SPARK-5931
 URL: https://issues.apache.org/jira/browse/SPARK-5931
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Andrew Or
Assignee: Ilya Ganelin
 Fix For: 1.4.0


 This is SPARK-5932's sister issue.
 The naming of existing time configs is inconsistent. We currently have the 
 following throughout the code base:
 {code}
 spark.network.timeout // seconds
 spark.executor.heartbeatInterval // milliseconds
 spark.storage.blockManagerSlaveTimeoutMs // milliseconds
 spark.yarn.scheduler.heartbeat.interval-ms // milliseconds
 {code}
 Instead, my proposal is to simplify the config name itself and make 
 everything accept time using the following format: 5s, 2ms, 100us. For 
 instance:
 {code}
 spark.network.timeout = 5s
 spark.executor.heartbeatInterval = 500ms
 spark.storage.blockManagerSlaveTimeout = 100ms
 spark.yarn.scheduler.heartbeatInterval = 400ms
 {code}
 All existing configs that are relevant will be deprecated in favor of the new 
 ones. We should do this soon before we keep introducing more time configs.
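 A minimal sketch of the parser such a format implies (the suffix set and helper 
 name are assumptions, not the eventual Spark utility):
 {code}
 // Parse strings like "5s", "500ms", "100us" into microseconds.
 def parseTimeAsUs(s: String): Long = {
   val pattern = """(\d+)(us|ms|s)""".r
   s.trim match {
     case pattern(value, "us") => value.toLong
     case pattern(value, "ms") => value.toLong * 1000L
     case pattern(value, "s")  => value.toLong * 1000L * 1000L
     case _ => throw new IllegalArgumentException(s"Invalid time string: $s")
   }
 }

 parseTimeAsUs("5s")    // 5000000
 parseTimeAsUs("500ms") // 500000
 {code}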



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5111) HiveContext and Thriftserver cannot work in secure cluster beyond hadoop2.5

2015-04-13 Thread Yu Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493270#comment-14493270
 ] 

Yu Gao commented on SPARK-5111:
---

Hi Zhan, which spark version is going to have this fix? We ran into the same 
issue with Hadoop 2.6 + Kerberos, so would like to see this fixed in Spark. 
Thanks.

 HiveContext and Thriftserver cannot work in secure cluster beyond hadoop2.5
 ---

 Key: SPARK-5111
 URL: https://issues.apache.org/jira/browse/SPARK-5111
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Zhan Zhang

 This fails due to a java.lang.NoSuchFieldError: SASL_PROPS error. We need to backport 
 some hive-0.14 fixes into Spark, since there is no ongoing effort to upgrade 
 Spark's Hive support to 0.14.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-5931) Use consistent naming for time properties

2015-04-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5931?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-5931.

   Resolution: Fixed
Fix Version/s: 1.4.0

 Use consistent naming for time properties
 -

 Key: SPARK-5931
 URL: https://issues.apache.org/jira/browse/SPARK-5931
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Andrew Or
Assignee: Ilya Ganelin
 Fix For: 1.4.0


 This is SPARK-5932's sister issue.
 The naming of existing time configs is inconsistent. We currently have the 
 following throughout the code base:
 {code}
 spark.network.timeout // seconds
 spark.executor.heartbeatInterval // milliseconds
 spark.storage.blockManagerSlaveTimeoutMs // milliseconds
 spark.yarn.scheduler.heartbeat.interval-ms // milliseconds
 {code}
 Instead, my proposal is to simplify the config name itself and make 
 everything accept time using the following format: 5s, 2ms, 100us. For 
 instance:
 {code}
 spark.network.timeout = 5s
 spark.executor.heartbeatInterval = 500ms
 spark.storage.blockManagerSlaveTimeout = 100ms
 spark.yarn.scheduler.heartbeatInterval = 400ms
 {code}
 All existing configs that are relevant will be deprecated in favor of the new 
 ones. We should do this soon before we keep introducing more time configs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6703) Provide a way to discover existing SparkContext's

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6703:
---

Assignee: Apache Spark  (was: Ilya Ganelin)

 Provide a way to discover existing SparkContext's
 -

 Key: SPARK-6703
 URL: https://issues.apache.org/jira/browse/SPARK-6703
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Patrick Wendell
Assignee: Apache Spark
Priority: Critical

 Right now it is difficult to write a Spark application in a way that can be 
 run independently and also be composed with other Spark applications in an 
 environment such as the JobServer, notebook servers, etc where there is a 
 shared SparkContext.
 It would be nice to provide a rendez-vous point so that applications can 
 learn whether an existing SparkContext already exists before creating one.
 The most simple/surgical way I see to do this is to have an optional static 
 SparkContext singleton that people can retrieve as follows:
 {code}
 val sc = SparkContext.getOrCreate(conf = new SparkConf())
 {code}
 And you could also have a setter where some outer framework/server can set it 
 for use by multiple downstream applications.
 A more advanced version of this would have some named registry or something, 
 but since we only support a single SparkContext in one JVM at this point 
 anyways, this seems sufficient and much simpler. Another advanced option 
 would be to allow plugging in some other notion of configuration you'd pass 
 when retrieving an existing context.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6703) Provide a way to discover existing SparkContext's

2015-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493306#comment-14493306
 ] 

Apache Spark commented on SPARK-6703:
-

User 'ilganeli' has created a pull request for this issue:
https://github.com/apache/spark/pull/5501

 Provide a way to discover existing SparkContext's
 -

 Key: SPARK-6703
 URL: https://issues.apache.org/jira/browse/SPARK-6703
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Patrick Wendell
Assignee: Ilya Ganelin
Priority: Critical

 Right now it is difficult to write a Spark application in a way that can be 
 run independently and also be composed with other Spark applications in an 
 environment such as the JobServer, notebook servers, etc where there is a 
 shared SparkContext.
 It would be nice to provide a rendez-vous point so that applications can 
 learn whether an existing SparkContext already exists before creating one.
 The most simple/surgical way I see to do this is to have an optional static 
 SparkContext singleton that people can retrieve as follows:
 {code}
 val sc = SparkContext.getOrCreate(conf = new SparkConf())
 {code}
 And you could also have a setter where some outer framework/server can set it 
 for use by multiple downstream applications.
 A more advanced version of this would have some named registry or something, 
 but since we only support a single SparkContext in one JVM at this point 
 anyways, this seems sufficient and much simpler. Another advanced option 
 would be to allow plugging in some other notion of configuration you'd pass 
 when retrieving an existing context.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6703) Provide a way to discover existing SparkContext's

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6703:
---

Assignee: Ilya Ganelin  (was: Apache Spark)

 Provide a way to discover existing SparkContext's
 -

 Key: SPARK-6703
 URL: https://issues.apache.org/jira/browse/SPARK-6703
 Project: Spark
  Issue Type: New Feature
  Components: Spark Core
Affects Versions: 1.3.0
Reporter: Patrick Wendell
Assignee: Ilya Ganelin
Priority: Critical

 Right now it is difficult to write a Spark application in a way that can be 
 run independently and also be composed with other Spark applications in an 
 environment such as the JobServer, notebook servers, etc where there is a 
 shared SparkContext.
 It would be nice to provide a rendez-vous point so that applications can 
 learn whether an existing SparkContext already exists before creating one.
 The most simple/surgical way I see to do this is to have an optional static 
 SparkContext singleton that people can retrieve as follows:
 {code}
 val sc = SparkContext.getOrCreate(conf = new SparkConf())
 {code}
 And you could also have a setter where some outer framework/server can set it 
 for use by multiple downstream applications.
 A more advanced version of this would have some named registry or something, 
 but since we only support a single SparkContext in one JVM at this point 
 anyways, this seems sufficient and much simpler. Another advanced option 
 would be to allow plugging in some other notion of configuration you'd pass 
 when retrieving an existing context.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6890) Local cluster mode in Mac is broken

2015-04-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-6890:
-
Affects Version/s: 1.4.0

 Local cluster mode in Mac is broken
 ---

 Key: SPARK-6890
 URL: https://issues.apache.org/jira/browse/SPARK-6890
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Davies Liu
Assignee: Andrew Or
Priority: Blocker

 The worker cannot be launched.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6890) Local cluster mode is broken

2015-04-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-6890:
-
Assignee: Marcelo Vanzin  (was: Andrew Or)

 Local cluster mode is broken
 

 Key: SPARK-6890
 URL: https://issues.apache.org/jira/browse/SPARK-6890
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Davies Liu
Assignee: Marcelo Vanzin
Priority: Critical

 In master, local cluster mode is broken. If I run `bin/spark-submit --master 
 local-cluster[2,1,512]`, my executors keep failing with class not found 
 exception. It appears that the assembly jar is not added to the executors' 
 class paths. I suspect that this is caused by 
 https://github.com/apache/spark/pull/5085.
 {code}
 Exception in thread main java.lang.NoClassDefFoundError: scala/Option
   at java.lang.Class.getDeclaredMethods0(Native Method)
   at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
   at java.lang.Class.getMethod0(Class.java:2774)
   at java.lang.Class.getMethod(Class.java:1663)
   at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
   at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)
 Caused by: java.lang.ClassNotFoundException: scala.Option
   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6890) Local cluster mode is broken

2015-04-13 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493447#comment-14493447
 ] 

Marcelo Vanzin commented on SPARK-6890:
---

Also, another possible way to fix this is to pass the location of the assembly 
jar to the {{Main}} class, instead of the current code. That was my original 
suggestion when Nishkam was working on this. It makes the code a little uglier 
(to allow for plumbing that path through the code), but it would allow 
maintaining the behavior added by that patch while probably fixing this issue.

Let me know if you're working on this, Andrew, otherwise I can do that.

 Local cluster mode is broken
 

 Key: SPARK-6890
 URL: https://issues.apache.org/jira/browse/SPARK-6890
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Davies Liu
Assignee: Andrew Or
Priority: Critical

 In master, local cluster mode is broken. If I run `bin/spark-submit --master 
 local-cluster[2,1,512]`, my executors keep failing with class not found 
 exception. It appears that the assembly jar is not added to the executors' 
 class paths. I suspect that this is caused by 
 https://github.com/apache/spark/pull/5085.
 {code}
 Exception in thread main java.lang.NoClassDefFoundError: scala/Option
   at java.lang.Class.getDeclaredMethods0(Native Method)
   at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
   at java.lang.Class.getMethod0(Class.java:2774)
   at java.lang.Class.getMethod(Class.java:1663)
   at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
   at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)
 Caused by: java.lang.ClassNotFoundException: scala.Option
   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6890) Local cluster mode is broken

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6890:
---

Assignee: Apache Spark  (was: Marcelo Vanzin)

 Local cluster mode is broken
 

 Key: SPARK-6890
 URL: https://issues.apache.org/jira/browse/SPARK-6890
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Davies Liu
Assignee: Apache Spark
Priority: Critical

 In master, local cluster mode is broken. If I run `bin/spark-submit --master 
 local-cluster[2,1,512]`, my executors keep failing with class not found 
 exception. It appears that the assembly jar is not added to the executors' 
 class paths. I suspect that this is caused by 
 https://github.com/apache/spark/pull/5085.
 {code}
 Exception in thread main java.lang.NoClassDefFoundError: scala/Option
   at java.lang.Class.getDeclaredMethods0(Native Method)
   at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
   at java.lang.Class.getMethod0(Class.java:2774)
   at java.lang.Class.getMethod(Class.java:1663)
   at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
   at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)
 Caused by: java.lang.ClassNotFoundException: scala.Option
   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6890) Local cluster mode is broken

2015-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493478#comment-14493478
 ] 

Apache Spark commented on SPARK-6890:
-

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/5504

 Local cluster mode is broken
 

 Key: SPARK-6890
 URL: https://issues.apache.org/jira/browse/SPARK-6890
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Davies Liu
Assignee: Marcelo Vanzin
Priority: Critical

 In master, local cluster mode is broken. If I run `bin/spark-submit --master 
 local-cluster[2,1,512]`, my executors keep failing with class not found 
 exception. It appears that the assembly jar is not added to the executors' 
 class paths. I suspect that this is caused by 
 https://github.com/apache/spark/pull/5085.
 {code}
 Exception in thread main java.lang.NoClassDefFoundError: scala/Option
   at java.lang.Class.getDeclaredMethods0(Native Method)
   at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
   at java.lang.Class.getMethod0(Class.java:2774)
   at java.lang.Class.getMethod(Class.java:1663)
   at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
   at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)
 Caused by: java.lang.ClassNotFoundException: scala.Option
   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-6890) Local cluster mode is broken

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-6890:
---

Assignee: Marcelo Vanzin  (was: Apache Spark)

 Local cluster mode is broken
 

 Key: SPARK-6890
 URL: https://issues.apache.org/jira/browse/SPARK-6890
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Davies Liu
Assignee: Marcelo Vanzin
Priority: Critical

 In master, local cluster mode is broken. If I run `bin/spark-submit --master 
 local-cluster[2,1,512]`, my executors keep failing with class not found 
 exception. It appears that the assembly jar is not added to the executors' 
 class paths. I suspect that this is caused by 
 https://github.com/apache/spark/pull/5085.
 {code}
 Exception in thread main java.lang.NoClassDefFoundError: scala/Option
   at java.lang.Class.getDeclaredMethods0(Native Method)
   at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
   at java.lang.Class.getMethod0(Class.java:2774)
   at java.lang.Class.getMethod(Class.java:1663)
   at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
   at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)
 Caused by: java.lang.ClassNotFoundException: scala.Option
   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5111) HiveContext and Thriftserver cannot work in secure cluster beyond hadoop2.5

2015-04-13 Thread Zhan Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493483#comment-14493483
 ] 

Zhan Zhang commented on SPARK-5111:
---

[~crystal_gaoyu] I am not sure. You could try patching Spark yourself and see 
whether it works.

 HiveContext and Thriftserver cannot work in secure cluster beyond hadoop2.5
 ---

 Key: SPARK-5111
 URL: https://issues.apache.org/jira/browse/SPARK-5111
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Zhan Zhang

 This fails due to a java.lang.NoSuchFieldError: SASL_PROPS error. We need to backport 
 some hive-0.14 fixes into Spark, since there is no ongoing effort to upgrade 
 Spark's Hive support to 0.14.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-6888) Make DriverQuirks editable

2015-04-13 Thread Rene Treffer (JIRA)
Rene Treffer created SPARK-6888:
---

 Summary: Make DriverQuirks editable
 Key: SPARK-6888
 URL: https://issues.apache.org/jira/browse/SPARK-6888
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Rene Treffer
Priority: Minor


JDBC type conversion is currently handled by spark with the help of 
DriverQuirks (org.apache.spark.sql.jdbc.DriverQuirks).

However, some cases can't be resolved, e.g. MySQL BIGINT UNSIGNED (other 
UNSIGNED conversions won't map cleanly either, but could be resolved automatically by 
using the next larger type).
An invalid type conversion (e.g. loading an unsigned bigint with the highest 
bit set as a long value) causes the JDBC driver to throw an exception.

The target type is determined automatically and bound to the resulting 
DataFrame where it's immutable.

Alternative solutions:
- Subqueries, which produce extra load on the server (a sketch of this workaround follows below)
- SQLContext / jdbc methods with schema support
- Making it possible to change the schema of data frames
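As a concrete example of the subquery workaround, a sketch that pushes the cast into a 
derived table so the driver never has to map BIGINT UNSIGNED directly (table, column and 
connection details are made up; assumes the experimental SQLContext.jdbc(url, table) method):

{code}
// Illustrative only; `sc` is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Hypothetical table `events` with an `id BIGINT UNSIGNED` column: cast it to
// DECIMAL(20,0) inside a derived table so Spark sees a type it can map safely.
val query = "(SELECT CAST(id AS DECIMAL(20,0)) AS id, payload FROM events) AS events_casted"
val df = sqlContext.jdbc("jdbc:mysql://dbhost:3306/mydb?user=u&password=p", query)
{code}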



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6872) external sort need to copy

2015-04-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-6872.
-
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 5481
[https://github.com/apache/spark/pull/5481]

 external sort need to copy
 --

 Key: SPARK-6872
 URL: https://issues.apache.org/jira/browse/SPARK-6872
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Adrian Wang
 Fix For: 1.4.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java

2015-04-13 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493218#comment-14493218
 ] 

Yu Ishikawa edited comment on SPARK-6682 at 4/14/15 12:37 AM:
--

I meant (a). I agree that we should only add a script for running all of the 
examples to make sure they run. I think adding unit test suites for the examples 
is a better way to go. Although this point may not be within the scope of this issue, 
it is good timing to add test suites along with it. Thanks.


was (Author: yuu.ishik...@gmail.com):
I meant (a). I agree with that we only add a script for running all of the 
examples to make sure they run. I think adding unit testing suites for examples 
is a better way to do. Although this point may be not the scope of this issue, 
it is a good timing to add test suites with this issue. Thanks.

 Deprecate static train and use builder instead for Scala/Java
 -

 Key: SPARK-6682
 URL: https://issues.apache.org/jira/browse/SPARK-6682
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 In MLlib, we have for some time been unofficially moving away from the old 
 static train() methods and moving towards builder patterns.  This JIRA is to 
 discuss this move and (hopefully) make it official.
 Old static train() API:
 {code}
 val myModel = NaiveBayes.train(myData, ...)
 {code}
 New builder pattern API:
 {code}
 val nb = new NaiveBayes().setLambda(0.1)
 val myModel = nb.train(myData)
 {code}
 Pros of the builder pattern:
 * Much less code when algorithms have many parameters.  Since Java does not 
 support default arguments, we required *many* duplicated static train() 
 methods (for each prefix set of arguments).
 * Helps to enforce default parameters.  Users should ideally not have to even 
 think about setting parameters if they just want to try an algorithm quickly.
 * Matches spark.ml API
 Cons of the builder pattern:
 * In Python APIs, static train methods are more Pythonic.
 Proposal:
 * Scala/Java: We should start deprecating the old static train() methods.  We 
 must keep them for API stability, but deprecating will help with API 
 consistency, making it clear that everyone should use the builder pattern.  
 As we deprecate them, we should make sure that the builder pattern supports 
 all parameters.
 * Python: Keep static train methods.
 CC: [~mengxr]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6890) Local cluster mode is broken

2015-04-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-6890:
-
Summary: Local cluster mode is broken  (was: Local cluster mode in Mac is 
broken)

 Local cluster mode is broken
 

 Key: SPARK-6890
 URL: https://issues.apache.org/jira/browse/SPARK-6890
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Davies Liu
Assignee: Andrew Or
Priority: Critical

 In master, local cluster mode is broken. If I run `bin/spark-submit --master 
 local-cluster[2,1,512]`, my executors keep failing with class not found 
 exception. It appears that the assembly jar is not added to the executors' 
 class paths. I suspect that this is caused by 
 https://github.com/apache/spark/pull/5085.
 {code}
 Exception in thread main java.lang.NoClassDefFoundError: scala/Option
   at java.lang.Class.getDeclaredMethods0(Native Method)
   at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
   at java.lang.Class.getMethod0(Class.java:2774)
   at java.lang.Class.getMethod(Class.java:1663)
   at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
   at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)
 Caused by: java.lang.ClassNotFoundException: scala.Option
   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6890) Local cluster mode in Mac is broken

2015-04-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or updated SPARK-6890:
-
Priority: Critical  (was: Blocker)

 Local cluster mode in Mac is broken
 ---

 Key: SPARK-6890
 URL: https://issues.apache.org/jira/browse/SPARK-6890
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Davies Liu
Assignee: Andrew Or
Priority: Critical

 In master, local cluster mode is broken. If I run `bin/spark-submit --master 
 local-cluster[2,1,512]`, my executors keep failing with class not found 
 exception. It appears that the assembly jar is not added to the executors' 
 class paths. I suspect that this is caused by 
 https://github.com/apache/spark/pull/5085.
 {code}
 Exception in thread main java.lang.NoClassDefFoundError: scala/Option
   at java.lang.Class.getDeclaredMethods0(Native Method)
   at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
   at java.lang.Class.getMethod0(Class.java:2774)
   at java.lang.Class.getMethod(Class.java:1663)
   at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
   at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)
 Caused by: java.lang.ClassNotFoundException: scala.Option
   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6889) Streamline contribution process with update to Contribution wiki, JIRA rules

2015-04-13 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493394#comment-14493394
 ] 

Nicholas Chammas commented on SPARK-6889:
-

Thanks for continuing to work on improving the contribution process, [~srowen].

The changes you are proposing look great to me (especially regarding JIRA 
workflow and permissions), and I wholeheartedly agree with your conclusion 
about directing non-committer attention and energy appropriately:

{quote}
I first want to figure out how to better direct the enthusiasm of new 
contributors instead, to make less wasted effort and thus less work all around.
{quote}

I'd like to add a few suggestions for everyone's consideration.

\\
1. Our contribution guide will work better when a) it is more visible and b) 
specific parts of it can be referenced easily.

  a) Visibility: Maybe it's just me, but to me the wiki feels like a dusty 
warehouse in a quiet part of town. People just don't go there often. The 
high-traffic area we already have to present contribution guidelines is [in the 
repo itself|https://github.com/apache/spark/blob/master/CONTRIBUTING.md]. I 
would favor moving the contributing guide wholesale there and reducing the wiki 
version to a link.

  b) Easy References: Our contributing guide is already quite lengthy. For the 
newcomer it will definitely be onerous to read through. This is unavoidable for 
the time being, but it does mean that people will continue (as they have 
been) to contribute without reading the whole guide, or without reading it at all.

This means we'll want to direct people to the appropriate parts of the guide 
when relevant. So I think being able to link to specific sections is very 
important.

\\
2. We need to give more importance, culturally, to the process of turning down 
or redirecting work that does not fit Spark's current roadmap. And that also 
needs to be reflected in our contribution guide.

Not having this culture, as far as I can tell, is the #2 reason we have so many 
open, stale PRs, which amount to wasted work and unhappy contributors. (The #1 
reason is that there is simply not enough committer time to go around). 

This is addressed in the proposed contributing guide under the sub-section 
Choosing What to Contribute, but I think it needs to be much more prominent 
and easily reference-able. To me, this is much more important than describing 
the mechanics of using JIRA/GitHub (though, of course, that is still necessary).

To provide a motivating example, take a look at the [contributing guide for the 
Phabricator 
project|https://secure.phabricator.com/book/phabcontrib/article/contributing_code/].
 There is a large section dedicated to explaining why a patch might be 
rejected. Furthermore, the guide gives top prominence to the importance of 
coordinating first before contributing non-trivial changes.

[Phabricator - Contributing 
Code|https://secure.phabricator.com/book/phabcontrib/article/contributing_code/]:
{quote}
h3. Coordinate First

...

h3. Rejecting Patches

If you send us a patch without coordinating it with us first, it will probably 
be immediately rejected, or sit in limbo for a long time and eventually be 
rejected. The reasons we do this vary from patch to patch, but some of the most 
common reasons are:

...
{quote}

More importantly, the Phabricator core devs back up this guide with effective 
action.

For example, take a look at [this 
exchange|https://secure.phabricator.com/D9724#79498] between Evan Priestley 
(one of the project's leads) and a contributor, where Evan gives a firm but 
appropriate no to a proposed patch.

[Phabricator - Allow searching pholio mocks by 
project|https://secure.phabricator.com/D9724#79498]:
{quote}
Phabricator moves pretty quickly, especially given how small the core team is. 
A big part of that is being aggressive about avoiding and reducing technical 
debt. This patch -- and patches like it -- add technical debt by solving a 
problem with a planned long-term solution in a short-term way.

The benefit you get from us saying no here is that the project as a whole 
moves faster.
{quote}

I would love to see more Spark committers doing this on a regular basis.

I'm sure people will at first feel uncomfortable about turning down work 
directly because it somehow feels rude, even if that work doesn't fit Spark's 
roadmap or is somehow otherwise off. But with the right communication and the 
long-term health of the project in mind, we can make it into a good habit that 
benefits both committers and contributors.

 Streamline contribution process with update to Contribution wiki, JIRA rules
 

 Key: SPARK-6889
 URL: https://issues.apache.org/jira/browse/SPARK-6889
 Project: Spark
  Issue Type: Improvement
  Components: 

[jira] [Resolved] (SPARK-6303) Remove unnecessary Average in GeneratedAggregate

2015-04-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-6303.
-
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 4996
[https://github.com/apache/spark/pull/4996]

 Remove unnecessary Average in GeneratedAggregate
 

 Key: SPARK-6303
 URL: https://issues.apache.org/jira/browse/SPARK-6303
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Liang-Chi Hsieh
Priority: Minor
 Fix For: 1.4.0


 Because {{Average}} is a {{PartialAggregate}}, we never get an {{Average}} 
 node when reaching {{HashAggregation}} to prepare {{GeneratedAggregate}}.
 That is why in SQLQuerySuite there is already a test for {{avg}} with 
 codegen. And it works.
 But we can find a case in {{GeneratedAggregate}} to deal with {{Average}}. 
 Based on the above, we actually never execute this case.
 So we can remove this case from {{GeneratedAggregate}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6511) Publish hadoop provided build with instructions for different distros

2015-04-13 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493283#comment-14493283
 ] 

Marcelo Vanzin commented on SPARK-6511:
---

I think {{hadoop classpath}} would be safer w.r.t. compatibility, if you don't 
mind the extra overhead (it launches a JVM). One thing to remember in that case 
is to use the {{--config}} parameter to point to the actual config directory 
being used.

 Publish hadoop provided build with instructions for different distros
 ---

 Key: SPARK-6511
 URL: https://issues.apache.org/jira/browse/SPARK-6511
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Patrick Wendell

 Currently we publish a series of binaries with different Hadoop client jars. 
 This mostly works, but some users have reported compatibility issues with 
 different distributions.
 One improvement moving forward might be to publish a binary build that simply 
 asks you to set HADOOP_HOME to pick up the Hadoop client location. That way 
 it would work across multiple distributions, even if they have subtle 
 incompatibilities with upstream Hadoop.
 I think a first step for this would be to produce such a build for the 
 community and see how well it works. One potential issue is that our fancy 
 excludes and dependency re-writing won't work with the simpler approach of just 
 appending Hadoop's classpath to Spark. Also, how we deal with the Hive dependency is 
 unclear, i.e. should we continue to bundle Spark's Hive (which has some fixes 
 for dependency conflicts) or do we allow for linking against vanilla Hive at 
 runtime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5888) Add OneHotEncoder as a Transformer

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5888:
---

Assignee: Apache Spark  (was: Sandy Ryza)

 Add OneHotEncoder as a Transformer
 --

 Key: SPARK-5888
 URL: https://issues.apache.org/jira/browse/SPARK-5888
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Xiangrui Meng
Assignee: Apache Spark

 `OneHotEncoder` takes a categorical column and outputs a vector column, which 
 stores the category info as binary indicator values.
 {code}
 val ohe = new OneHotEncoder()
   .setInputCol("countryIndex")
   .setOutputCol("countries")
 {code}
 It should read the category info from the metadata and assign feature names 
 properly in the output column. We need to discuss the default naming scheme 
 and whether we should let it process multiple categorical columns at the same 
 time.
 One category (the most frequent one) should be removed from the output to 
 make the output columns linearly independent. Or this could be an option turned 
 on by default.
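 To make the last point concrete, a tiny illustrative mapping: with categories 
 US, FR, DE and US the most frequent, dropping US keeps the output columns 
 linearly independent.
 {code}
 // Illustrative encoding with the most frequent category (US) used as the dropped baseline.
 val encoding = Map(
   "US" -> Array(0.0, 0.0),  // baseline category, all zeros
   "FR" -> Array(1.0, 0.0),
   "DE" -> Array(0.0, 1.0)
 )
 {code}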



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5888) Add OneHotEncoder as a Transformer

2015-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-5888:
---

Assignee: Sandy Ryza  (was: Apache Spark)

 Add OneHotEncoder as a Transformer
 --

 Key: SPARK-5888
 URL: https://issues.apache.org/jira/browse/SPARK-5888
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Xiangrui Meng
Assignee: Sandy Ryza

 `OneHotEncoder` takes a categorical column and outputs a vector column, which 
 stores the category info as binary indicator values.
 {code}
 val ohe = new OneHotEncoder()
   .setInputCol("countryIndex")
   .setOutputCol("countries")
 {code}
 It should read the category info from the metadata and assign feature names 
 properly in the output column. We need to discuss the default naming scheme 
 and whether we should let it process multiple categorical columns at the same 
 time.
 One category (the most frequent one) should be removed from the output to 
 make the output columns linearly independent. Or this could be an option turned 
 on by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5888) Add OneHotEncoder as a Transformer

2015-04-13 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5888?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493302#comment-14493302
 ] 

Apache Spark commented on SPARK-5888:
-

User 'sryza' has created a pull request for this issue:
https://github.com/apache/spark/pull/5500

 Add OneHotEncoder as a Transformer
 --

 Key: SPARK-5888
 URL: https://issues.apache.org/jira/browse/SPARK-5888
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Reporter: Xiangrui Meng
Assignee: Sandy Ryza

 `OneHotEncoder` takes a categorical column and outputs a vector column, which 
 stores the category info as binary indicator values.
 {code}
 val ohe = new OneHotEncoder()
   .setInputCol("countryIndex")
   .setOutputCol("countries")
 {code}
 It should read the category info from the metadata and assign feature names 
 properly in the output column. We need to discuss the default naming scheme 
 and whether we should let it process multiple categorical columns at the same 
 time.
 One category (the most frequent one) should be removed from the output to 
 make the output columns linearly independent. Or this could be an option turned 
 on by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4766) ML Estimator Params should subclass Transformer Params

2015-04-13 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-4766:
-
Target Version/s: 1.4.0  (was: 1.3.0)

 ML Estimator Params should subclass Transformer Params
 --

 Key: SPARK-4766
 URL: https://issues.apache.org/jira/browse/SPARK-4766
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 1.2.0
Reporter: Joseph K. Bradley

 Currently, in spark.ml, both Transformers and Estimators extend the same 
 Params classes.  There should be one Params class for the Transformer and one 
 for the Estimator, where the Estimator params class extends the Transformer 
 one.
 E.g., it is weird to be able to do:
 {code}
 val model: LogisticRegressionModel = ...
 model.getMaxIter()
 {code}
 It's also weird to be able to:
 * Wrap LogisticRegressionModel (a Transformer) with CrossValidator
 * Pass a set of ParamMaps to CrossValidator which includes parameter 
 LogisticRegressionModel.maxIter
 * (CrossValidator would try to set that parameter.)
 * I'm not sure if this would cause a failure or just be a noop.
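 A toy sketch of the proposed split (simplified types, not the actual spark.ml traits):
 {code}
 // Toy Params hierarchy for illustration only.
 trait Params

 // Parameters that make sense on the fitted model (the Transformer side).
 trait LogisticRegressionModelParams extends Params {
   var threshold: Double = 0.5
 }

 // Training-only parameters live on the Estimator side, extending the model params.
 trait LogisticRegressionParams extends LogisticRegressionModelParams {
   var maxIter: Int = 100
 }

 class LogisticRegressionModel extends LogisticRegressionModelParams  // no maxIter exposed here
 class LogisticRegression extends LogisticRegressionParams            // Estimator sees both
 {code}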



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6890) Local cluster mode is broken

2015-04-13 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493432#comment-14493432
 ] 

Marcelo Vanzin commented on SPARK-6890:
---

Do you have `SPARK_PREPEND_CLASSES` set by any chance?

BTW, personally, I was against the change that causes this failure, and again 
personally I wouldn't really be against reverting it. It seems to cause more 
issues than it solves. [/cc [~nravi]]

 Local cluster mode is broken
 

 Key: SPARK-6890
 URL: https://issues.apache.org/jira/browse/SPARK-6890
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.4.0
Reporter: Davies Liu
Assignee: Andrew Or
Priority: Critical

 In master, local cluster mode is broken. If I run `bin/spark-submit --master 
 local-cluster[2,1,512]`, my executors keep failing with class not found 
 exception. It appears that the assembly jar is not added to the executors' 
 class paths. I suspect that this is caused by 
 https://github.com/apache/spark/pull/5085.
 {code}
 Exception in thread main java.lang.NoClassDefFoundError: scala/Option
   at java.lang.Class.getDeclaredMethods0(Native Method)
   at java.lang.Class.privateGetDeclaredMethods(Class.java:2531)
   at java.lang.Class.getMethod0(Class.java:2774)
   at java.lang.Class.getMethod(Class.java:1663)
   at sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
   at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)
 Caused by: java.lang.ClassNotFoundException: scala.Option
   at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
   at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6891) ExecutorAllocationManager will request negative number executors

2015-04-13 Thread meiyoula (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

meiyoula updated SPARK-6891:

Description: 
Below is the exception:
15/04/14 10:10:18 ERROR Utils: Uncaught exception in thread spark-dynamic-executor-allocation-0
java.lang.IllegalArgumentException: Attempted to request a negative number of executor(s) -1 from the cluster manager. Please specify a positive number!
 at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:342)
 at org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1170)
 at org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:294)
 at org.apache.spark.ExecutorAllocationManager.addOrCancelExecutorRequests(ExecutorAllocationManager.scala:263)
 at org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:230)
 at org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply$mcV$sp(ExecutorAllocationManager.scala:189)
 at org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189)
 at org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189)
 at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1723)
 at org.apache.spark.ExecutorAllocationManager$$anon$1.run(ExecutorAllocationManager.scala:189)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
 at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
 at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
 at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:178)
 at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
 at java.lang.Thread.run(Thread.java:722)

Below are the configurations I set:
   spark.dynamicAllocation.enabled true
   spark.dynamicAllocation.minExecutors 0
   spark.dynamicAllocation.initialExecutors 3
   spark.dynamicAllocation.maxExecutors 7
   spark.dynamicAllocation.executorIdleTimeout 30
   spark.shuffle.service.enabled true
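
For illustration only (a hypothetical sketch, not the actual fix), the failure 
amounts to a negative total reaching requestTotalExecutors; a guard of the 
following shape would avoid it:

{code}
// Hypothetical illustration, not Spark's fix: keep the requested total
// within [minExecutors, maxExecutors] before calling the cluster manager.
val minExecutors = 0
val maxExecutors = 7
val desiredTotal = -1   // e.g. after executors were removed while the target shrank
val boundedTotal = math.max(minExecutors, math.min(desiredTotal, maxExecutors))
// requestTotalExecutors(boundedTotal) then never sees a negative value
{code}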


[jira] [Updated] (SPARK-6889) Streamline contribution process with update to Contribution wiki, JIRA rules

2015-04-13 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6889?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-6889:
-
Attachment: ContributingtoSpark.pdf
SparkProjectMechanicsChallenges.pdf

 Streamline contribution process with update to Contribution wiki, JIRA rules
 

 Key: SPARK-6889
 URL: https://issues.apache.org/jira/browse/SPARK-6889
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Sean Owen
Assignee: Sean Owen
 Attachments: ContributingtoSpark.pdf, 
 SparkProjectMechanicsChallenges.pdf


 From about 6 months of intimate experience with the Spark JIRA and the 
 reality of the JIRA / PR flow, I've observed some challenges, problems and 
 growing pains that have begun to encumber the project mechanics. In the 
 attached SparkProjectMechanicsChallenges.pdf document, I've collected these 
 observations and a few statistics that summarize much of what I've seen. From 
 side conversations with several of you, I think some of these will resonate. 
 (Read it first for this to make sense.)
 I'd like to improve just one aspect to start: the contribution process. A lot 
 of inbound contribution effort gets misdirected, and can burn a lot of cycles 
 for everyone, and that's a barrier to scaling up further and to general 
 happiness. I'd like to propose for discussion a change to the wiki pages, and 
 a change to some JIRA settings. 
 *Wiki*
 - Replace 
 https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark with 
 proposed text (NewContributingToSpark.pdf)
 - Delete 
 https://cwiki.apache.org/confluence/display/SPARK/Reviewing+and+Merging+Patches
  as it is subsumed by the new text
 - Move the IDE Setup section to 
 https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools
 - Delete 
 https://cwiki.apache.org/confluence/display/SPARK/Jira+Permissions+Scheme as 
 it's a  bit out of date and not all that useful
 *JIRA*
 Now: 
 Start by removing everyone from the 'Developer' role and add them to 
 'Contributor'. Right now Developer has no permission that Contributor 
 doesn't. We may reuse Developer later for some level between Committer and 
 Contributor.
 Later, with Apache admin assistance:
 - Make Component and Affects Version required for new JIRAs
 - Set default priority to Minor and type to Question for new JIRAs. If 
 defaults aren't changed, by default it can't be that important
 - Only let Committers set Target Version and Fix Version
 - Only let Committers set Blocker Priority



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5941) Unit Test loads the table `src` twice for leftsemijoin.q

2015-04-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5941.
-
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 4506
[https://github.com/apache/spark/pull/4506]

 Unit Test loads the table `src` twice for leftsemijoin.q
 

 Key: SPARK-5941
 URL: https://issues.apache.org/jira/browse/SPARK-5941
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Cheng Hao
 Fix For: 1.4.0


 In leftsemijoin.q, there is a data loading command for the table sales already, 
 but TestHive also creates/loads the table sales, which causes duplicate 
 records to be inserted into sales.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2015-04-13 Thread Kannan Rajah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493251#comment-14493251
 ] 

Kannan Rajah commented on SPARK-1529:
-

You can use the Compare functionality to see a single page of diffs across 
commits. Here is the link: 
https://github.com/rkannan82/spark/compare/4aaf48d46d13129f0f9bdafd771dd80fe568a7dc...rkannan82:7195353a31f7cfb087ec804b597b01fb362bc3f6

A few clarifications:
1. There are two reasons for introducing a FileSystem abstraction in Spark 
instead of directly using the Hadoop FileSystem.
  - There are Spark-shuffle-specific APIs that need abstraction. Please take 
a look at this code:
https://github.com/rkannan82/spark/blob/dfs_shuffle/core/src/main/scala/org/apache/spark/storage/FileSystem.scala

  - For local file system access, we can choose to bypass Hadoop's local file 
system implementation if it's not efficient. If you look at 
LocalFileSystem.scala, most APIs just delegate to the existing code (Spark's 
disk block manager, etc.). In fact, this single class is enough to determine 
whether the default Apache shuffle code path would see any performance 
degradation.
https://github.com/rkannan82/spark/blob/dfs_shuffle/core/src/main/scala/org/apache/spark/storage/LocalFileSystem.scala

2. During the write phase, we shuffle to HDFS instead of the local file system. 
When reading back, we don't use the Netty-based transport that the Apache 
shuffle uses; instead, a new implementation called DFSShuffleClient reads from 
HDFS. That is the main difference.
https://github.com/rkannan82/spark/blob/dfs_shuffle/network/shuffle/src/main/java/org/apache/spark/network/shuffle/DFSShuffleClient.java
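
To make the shape of that abstraction concrete, here is a hedged sketch; the 
trait and method names are illustrative assumptions, not the API used in the 
branch linked above:

{code}
import java.io.{InputStream, OutputStream}

// Illustrative shuffle file-system abstraction: one implementation can delegate
// to Spark's existing disk block manager, while another can target HDFS.
trait ShuffleFileSystem {
  def create(path: String): OutputStream   // write a shuffle output file
  def open(path: String): InputStream      // read a shuffle block back
  def delete(path: String): Boolean
  def exists(path: String): Boolean
}
{code}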

 Support setting spark.local.dirs to a hadoop FileSystem 
 

 Key: SPARK-1529
 URL: https://issues.apache.org/jira/browse/SPARK-1529
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Kannan Rajah
 Attachments: Spark Shuffle using HDFS.pdf


 In some environments, like with MapR, local volumes are accessed through the 
 Hadoop filesystem interface. We should allow setting spark.local.dir to a 
 Hadoop filesystem location. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6889) Streamline contribution process with update to Contribution wiki, JIRA rules

2015-04-13 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493299#comment-14493299
 ] 

Marcelo Vanzin commented on SPARK-6889:
---

I left a couple of comments on the docs themselves, but overall this looks like 
a good direction to follow.

I've seen some projects with public issue trackers (KDE is one, I believe 
Mozilla does the same) where all new bugs are created with status 
"unconfirmed", and only someone with appropriate permissions can transition the 
bug to "open". That is similar to Sean's suggestion of having the default be 
Question / Minor, in a way, but I wonder how much that actually helps with 
managing open issues.

 Streamline contribution process with update to Contribution wiki, JIRA rules
 

 Key: SPARK-6889
 URL: https://issues.apache.org/jira/browse/SPARK-6889
 Project: Spark
  Issue Type: Improvement
  Components: Documentation
Reporter: Sean Owen
Assignee: Sean Owen
 Attachments: ContributingtoSpark.pdf, 
 SparkProjectMechanicsChallenges.pdf


 From about 6 months of intimate experience with the Spark JIRA and the 
 reality of the JIRA / PR flow, I've observed some challenges, problems and 
 growing pains that have begun to encumber the project mechanics. In the 
 attached SparkProjectMechanicsChallenges.pdf document, I've collected these 
 observations and a few statistics that summarize much of what I've seen. From 
 side conversations with several of you, I think some of these will resonate. 
 (Read it first for this to make sense.)
 I'd like to improve just one aspect to start: the contribution process. A lot 
 of inbound contribution effort gets misdirected, and can burn a lot of cycles 
 for everyone, and that's a barrier to scaling up further and to general 
 happiness. I'd like to propose for discussion a change to the wiki pages, and 
 a change to some JIRA settings. 
 *Wiki*
 - Replace 
 https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark with 
 proposed text (NewContributingToSpark.pdf)
 - Delete 
 https://cwiki.apache.org/confluence/display/SPARK/Reviewing+and+Merging+Patches
  as it is subsumed by the new text
 - Move the IDE Setup section to 
 https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools
 - Delete 
 https://cwiki.apache.org/confluence/display/SPARK/Jira+Permissions+Scheme as 
 it's a  bit out of date and not all that useful
 *JIRA*
 Now: 
 Start by removing everyone from the 'Developer' role and add them to 
 'Contributor'. Right now Developer has no permission that Contributor 
 doesn't. We may reuse Developer later for some level between Committer and 
 Contributor.
 Later, with Apache admin assistance:
 - Make Component and Affects Version required for new JIRAs
 - Set default priority to Minor and type to Question for new JIRAs. If 
 defaults aren't changed, by default it can't be that important
 - Only let Committers set Target Version and Fix Version
 - Only let Committers set Blocker Priority



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-4848) Allow different Worker configurations in standalone cluster

2015-04-13 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4848?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or closed SPARK-4848.

  Resolution: Fixed
   Fix Version/s: 1.4.0
Assignee: Nathan Kronenfeld
Target Version/s: 1.4.0

 Allow different Worker configurations in standalone cluster
 ---

 Key: SPARK-4848
 URL: https://issues.apache.org/jira/browse/SPARK-4848
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.0.0
 Environment: stand-alone spark cluster
Reporter: Nathan Kronenfeld
Assignee: Nathan Kronenfeld
 Fix For: 1.4.0

   Original Estimate: 24h
  Remaining Estimate: 24h

 On a stand-alone Spark cluster, much of the determination of worker 
 specifics, especially when one has multiple instances per node, is done only on 
 the master.
 The master loops over instances, and starts a worker per instance on each 
 node.
 This means, if your workers have different values of SPARK_WORKER_INSTANCES 
 or SPARK_WORKER_WEBUI_PORT from each other (or from the master), all values 
 are ignored except the one on the master.
 SPARK_WORKER_PORT looks like it is unread in scripts, but read in code - I'm 
 not sure how it will behave, since all instances will read the same value 
 from the environment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-5794) add jar should return 0

2015-04-13 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-5794.
-
   Resolution: Fixed
Fix Version/s: 1.4.0

Issue resolved by pull request 4586
[https://github.com/apache/spark/pull/4586]

 add jar should return 0
 ---

 Key: SPARK-5794
 URL: https://issues.apache.org/jira/browse/SPARK-5794
 Project: Spark
  Issue Type: Bug
  Components: SQL
Reporter: Adrian Wang
Priority: Minor
 Fix For: 1.4.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3727) Trees and ensembles: More prediction functionality

2015-04-13 Thread Michael Kuhlen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493455#comment-14493455
 ] 

Michael Kuhlen commented on SPARK-3727:
---

[~josephkb] The design document is great, thanks for sharing. Looks like a 
great step forward. I'd be happy to work on either or both of the subtasks, but 
note that I'm going to have to be a weekend warrior on this stuff (busy at 
work during the week). I'm going to start by familiarizing myself with spark.ml 
and the new API, to see if and how to port over the changes I've made so far.

 Trees and ensembles: More prediction functionality
 --

 Key: SPARK-3727
 URL: https://issues.apache.org/jira/browse/SPARK-3727
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Reporter: Joseph K. Bradley

 DecisionTree and RandomForest currently predict the most likely label for 
 classification and the mean for regression.  Other info about predictions 
 would be useful.
 For classification: estimated probability of each possible label
 For regression: variance of estimate
 RandomForest could also create aggregate predictions in multiple ways:
 * Predict mean or median value for regression.
 * Compute variance of estimates (across all trees) for both classification 
 and regression.
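 As a hedged illustration of the classification case (not the MLlib 
 implementation), class probabilities could be estimated by normalizing the 
 votes of the individual trees:
 {code}
 // Illustration only: turn per-tree predicted labels into class probabilities.
 def classProbabilities(treePredictions: Seq[Int], numClasses: Int): Array[Double] = {
   val counts = new Array[Double](numClasses)
   treePredictions.foreach(label => counts(label) += 1.0)
   counts.map(_ / treePredictions.size)
 }

 // e.g. 10 trees voting on a binary problem:
 // classProbabilities(Seq(1, 1, 0, 1, 0, 1, 1, 1, 0, 1), 2)  // -> Array(0.3, 0.7)
 {code}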



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6682) Deprecate static train and use builder instead for Scala/Java

2015-04-13 Thread Yu Ishikawa (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493218#comment-14493218
 ] 

Yu Ishikawa commented on SPARK-6682:


I meant (a). I agree that we should only add a script that runs all of the 
examples to make sure they work. I think adding unit test suites for the 
examples would be a better approach; although that may be outside the scope of 
this issue, it is a good time to add such test suites here. Thanks.

 Deprecate static train and use builder instead for Scala/Java
 -

 Key: SPARK-6682
 URL: https://issues.apache.org/jira/browse/SPARK-6682
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 1.3.0
Reporter: Joseph K. Bradley

 In MLlib, we have for some time been unofficially moving away from the old 
 static train() methods and moving towards builder patterns.  This JIRA is to 
 discuss this move and (hopefully) make it official.
 Old static train() API:
 {code}
 val myModel = NaiveBayes.train(myData, ...)
 {code}
 New builder pattern API:
 {code}
 val nb = new NaiveBayes().setLambda(0.1)
 val myModel = nb.train(myData)
 {code}
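 A hedged sketch of the deprecation path (an assumed wrapper object, not the 
 actual MLlib code) is:
 {code}
 import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
 import org.apache.spark.mllib.regression.LabeledPoint
 import org.apache.spark.rdd.RDD

 object NaiveBayesCompat {
   // Hypothetical wrapper: the old static-style entry point just forwards to
   // the builder API, so existing callers can migrate gradually.
   @deprecated("Use new NaiveBayes().setLambda(lambda).run(input) instead", "1.4.0")
   def train(input: RDD[LabeledPoint], lambda: Double): NaiveBayesModel =
     new NaiveBayes().setLambda(lambda).run(input)
 }
 {code}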
 Pros of the builder pattern:
 * Much less code when algorithms have many parameters.  Since Java does not 
 support default arguments, we required *many* duplicated static train() 
 methods (for each prefix set of arguments).
 * Helps to enforce default parameters.  Users should ideally not have to even 
 think about setting parameters if they just want to try an algorithm quickly.
 * Matches spark.ml API
 Cons of the builder pattern:
 * In Python APIs, static train methods are more Pythonic.
 Proposal:
 * Scala/Java: We should start deprecating the old static train() methods.  We 
 must keep them for API stability, but deprecating will help with API 
 consistency, making it clear that everyone should use the builder pattern.  
 As we deprecate them, we should make sure that the builder pattern supports 
 all parameters.
 * Python: Keep static train methods.
 CC: [~mengxr]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6511) Publish hadoop provided build with instructions for different distros

2015-04-13 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6511?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14493217#comment-14493217
 ] 

Marcelo Vanzin commented on SPARK-6511:
---

We add a bunch of things to that variable, but the main thing is this:

{code}
  $HADOOP_HOME/client/*
{code}

I'm not sure whether other distributions place libraries in that location, or 
if that's a CDH-only thing.

 Publish hadoop provided build with instructions for different distros
 ---

 Key: SPARK-6511
 URL: https://issues.apache.org/jira/browse/SPARK-6511
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Patrick Wendell

 Currently we publish a series of binaries with different Hadoop client jars. 
 This mostly works, but some users have reported compatibility issues with 
 different distributions.
 One improvement moving forward might be to publish a binary build that simply 
 asks you to set HADOOP_HOME to pick up the Hadoop client location. That way 
 it would work across multiple distributions, even if they have subtle 
 incompatibilities with upstream Hadoop.
 I think a first step for this would be to produce such a build for the 
 community and see how well it works. One potential issue is that our fancy 
 excludes and dependency re-writing won't work with the simpler approach of 
 appending Hadoop's classpath to Spark. Also, how we deal with the Hive dependency is 
 unclear, i.e. should we continue to bundle Spark's Hive (which has some fixes 
 for dependency conflicts) or do we allow for linking against vanilla Hive at 
 runtime.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


