[jira] [Commented] (SPARK-11959) Document normal equation solver for ordinary least squares in user guide

2015-12-11 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15054080#comment-15054080
 ] 

Yanbo Liang commented on SPARK-11959:
-

OK, I can take this one.
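For context, the solver to be documented computes the ordinary least squares solution in closed form via the normal equations (the standard textbook statement, not text from the guide itself):

{code}
\hat{\beta} = (X^\top X)^{-1} X^\top y
{code}

In practice such solvers factor the Gram matrix X^T X (e.g. with a Cholesky decomposition) rather than inverting it explicitly.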

> Document normal equation solver for ordinary least squares in user guide
> 
>
> Key: SPARK-11959
> URL: https://issues.apache.org/jira/browse/SPARK-11959
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>
> Assigning since you wrote the feature, but please reassign as needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12302) Example for servlet filter used by spark.ui.filters

2015-12-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12302:


Assignee: (was: Apache Spark)

> Example for servlet filter used by spark.ui.filters
> ---
>
> Key: SPARK-12302
> URL: https://issues.apache.org/jira/browse/SPARK-12302
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 1.5.2
>Reporter: Kai Sasaki
>Priority: Trivial
>  Labels: examples, security
>
> Although the {{spark.ui.filters}} configuration uses simple servlet filters, it is 
> often difficult to understand how to write the filter code and how to integrate it 
> with actual Spark applications. 
> It would help to have examples for trying out a secure Spark cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12302) Example for servlet filter used by spark.ui.filters

2015-12-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12302:


Assignee: Apache Spark

> Example for servlet filter used by spark.ui.filters
> ---
>
> Key: SPARK-12302
> URL: https://issues.apache.org/jira/browse/SPARK-12302
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 1.5.2
>Reporter: Kai Sasaki
>Assignee: Apache Spark
>Priority: Trivial
>  Labels: examples, security
>
> Although the {{spark.ui.filters}} configuration uses simple servlet filters, it is 
> often difficult to understand how to write the filter code and how to integrate it 
> with actual Spark applications. 
> It would help to have examples for trying out a secure Spark cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12303) Configuration parameter by which we can choose whether the REPL-generated class directory name is random or fixed

2015-12-11 Thread piyush (JIRA)
piyush created SPARK-12303:
--

 Summary: Configuration parameter by which we can choose whether the 
REPL-generated class directory name is random or fixed
 Key: SPARK-12303
 URL: https://issues.apache.org/jira/browse/SPARK-12303
 Project: Spark
  Issue Type: Wish
  Components: Spark Shell
Reporter: piyush


.class files generated by the Spark REPL are stored in a temp directory with a 
random name. Add a configuration parameter by which we can choose whether the 
REPL-generated class directory name is random or fixed.
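For reference, recent Spark versions already consult a {{spark.repl.classdir}} property when choosing the root directory for REPL classes (worth verifying against the target version), so a fixed root can be requested at launch:

{code}
./bin/spark-shell --conf spark.repl.classdir=/var/tmp/spark-repl-classes
{code}

The directory created under that root is still randomly named, which is what this issue asks to make configurable.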



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11685) Find duplicate content under examples/

2015-12-11 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-11685.
-
Resolution: Fixed

> Find duplicate content under examples/
> --
>
> Key: SPARK-11685
> URL: https://issues.apache.org/jira/browse/SPARK-11685
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML, MLlib
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> Because we moved some example code from the user guide markdown to the examples/ 
> folder, there exists some duplicate content under examples/. We should 
> consider merging the example files in 1.6. Please find possibly duplicate 
> content and create sub-tasks for each of them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12302) Example for servlet filter used by spark.ui.filters

2015-12-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12302:


Assignee: (was: Apache Spark)

> Example for servlet filter used by spark.ui.filters
> ---
>
> Key: SPARK-12302
> URL: https://issues.apache.org/jira/browse/SPARK-12302
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 1.5.2
>Reporter: Kai Sasaki
>Priority: Trivial
>  Labels: examples, security
>
> Although the {{spark.ui.filters}} configuration uses simple servlet filters, it is 
> often difficult to understand how to write the filter code and how to integrate it 
> with actual Spark applications. 
> It would help to have examples for trying out a secure Spark cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12302) Example for servlet filter used by spark.ui.filters

2015-12-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12302:


Assignee: Apache Spark

> Example for servlet filter used by spark.ui.filters
> ---
>
> Key: SPARK-12302
> URL: https://issues.apache.org/jira/browse/SPARK-12302
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 1.5.2
>Reporter: Kai Sasaki
>Assignee: Apache Spark
>Priority: Trivial
>  Labels: examples, security
>
> Although the {{spark.ui.filters}} configuration uses simple servlet filters, it is 
> often difficult to understand how to write the filter code and how to integrate it 
> with actual Spark applications. 
> It would help to have examples for trying out a secure Spark cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-11685) Find duplicate content under examples/

2015-12-11 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-11685:

Comment: was deleted

(was: [~josephkb] I have checked thoroughly; there are no other examples we 
should move or remove, and we can resolve this issue.)

> Find duplicate content under examples/
> --
>
> Key: SPARK-11685
> URL: https://issues.apache.org/jira/browse/SPARK-11685
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML, MLlib
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> Because we moved some example code from the user guide markdown to the examples/ 
> folder, there exists some duplicate content under examples/. We should 
> consider merging the example files in 1.6. Please find possibly duplicate 
> content and create sub-tasks for each of them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-11685) Find duplicate content under examples/

2015-12-11 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-11685:

Comment: was deleted

(was: [~josephkb] I have checked thoroughly; there are no other examples we 
should move or remove, and we can resolve this issue.)

> Find duplicate content under examples/
> --
>
> Key: SPARK-11685
> URL: https://issues.apache.org/jira/browse/SPARK-11685
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML, MLlib
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> Because we moved some example code from the user guide markdown to the examples/ 
> folder, there exists some duplicate content under examples/. We should 
> consider merging the example files in 1.6. Please find possibly duplicate 
> content and create sub-tasks for each of them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11685) Find duplicate content under examples/

2015-12-11 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15054055#comment-15054055
 ] 

Yanbo Liang commented on SPARK-11685:
-

[~josephkb] I have checked thoroughly; there are no other examples we should 
move or remove, and we can resolve this issue.

> Find duplicate content under examples/
> --
>
> Key: SPARK-11685
> URL: https://issues.apache.org/jira/browse/SPARK-11685
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML, MLlib
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> Because we moved some example code from the user guide markdown to the examples/ 
> folder, there exists some duplicate content under examples/. We should 
> consider merging the example files in 1.6. Please find possibly duplicate 
> content and create sub-tasks for each of them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11685) Find duplicate content under examples/

2015-12-11 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15054058#comment-15054058
 ] 

Yanbo Liang commented on SPARK-11685:
-

[~josephkb] I have checked thoroughly; there are no other examples we should 
move or remove, and we can resolve this issue.

> Find duplicate content under examples/
> --
>
> Key: SPARK-11685
> URL: https://issues.apache.org/jira/browse/SPARK-11685
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML, MLlib
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> Because we moved some example code from the user guide markdown to the examples/ 
> folder, there exists some duplicate content under examples/. We should 
> consider merging the example files in 1.6. Please find possibly duplicate 
> content and create sub-tasks for each of them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11685) Find duplicate content under examples/

2015-12-11 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15054056#comment-15054056
 ] 

Yanbo Liang commented on SPARK-11685:
-

[~josephkb] I have checked thoroughly; there are no other examples we should 
move or remove, and we can resolve this issue.

> Find duplicate content under examples/
> --
>
> Key: SPARK-11685
> URL: https://issues.apache.org/jira/browse/SPARK-11685
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML, MLlib
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> Because we moved some example code from the user guide markdown to the examples/ 
> folder, there exists some duplicate content under examples/. We should 
> consider merging the example files in 1.6. Please find possibly duplicate 
> content and create sub-tasks for each of them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12158) [R] [SQL] Fix 'sample' functions that break R unit test cases

2015-12-11 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman resolved SPARK-12158.
---
   Resolution: Fixed
Fix Version/s: 2.0.0
   1.6.1

Resolved by https://github.com/apache/spark/pull/10160

> [R] [SQL] Fix 'sample' functions that break R unit test cases
> -
>
> Key: SPARK-12158
> URL: https://issues.apache.org/jira/browse/SPARK-12158
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 1.6.1, 2.0.0
>
>
> The existing sample functions miss the parameter 'seed'; however, the 
> corresponding function interface in `generics` has such a parameter.  
> This could cause SparkR unit tests to fail. For example, I hit it in one PR:
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47213/consoleFull



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12158) [R] [SQL] Fix 'sample' functions that break R unit test cases

2015-12-11 Thread Shivaram Venkataraman (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shivaram Venkataraman updated SPARK-12158:
--
Assignee: Xiao Li

> [R] [SQL] Fix 'sample' functions that break R unit test cases
> -
>
> Key: SPARK-12158
> URL: https://issues.apache.org/jira/browse/SPARK-12158
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR, SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 1.6.1, 2.0.0
>
>
> The existing sample functions miss the parameter 'seed'; however, the 
> corresponding function interface in `generics` has such a parameter.  
> This could cause SparkR unit tests to fail. For example, I hit it in one PR:
> https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47213/consoleFull



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2870) Thorough schema inference directly on RDDs of Python dictionaries

2015-12-11 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15054024#comment-15054024
 ] 

holdenk commented on SPARK-2870:


I think it will work fine (I mean, schema inference on RDDs of dicts works 
properly now). It's only with local collections that we still have the issue.
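A minimal sketch of that contrast (a hypothetical 1.6-era PySpark shell session; the outcomes in the comments follow this discussion and SPARK-12300, not a verified run):

{code}
rows = [{"a": 1}, {"b": "coffee"}]

# Distributed input: with samplingRatio=1.0, inference scans the whole RDD
# and merges the per-row schemas, so both fields 'a' and 'b' should appear.
df_rdd = sqlContext.createDataFrame(sc.parallelize(rows), samplingRatio=1.0)
print df_rdd.schema

# Local input: inference can stop as soon as the partial schema contains no
# NullTypes (see SPARK-12300), so field 'b' may be missing from the schema.
df_local = sqlContext.createDataFrame(rows)
print df_local.schema
{code}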

> Thorough schema inference directly on RDDs of Python dictionaries
> -
>
> Key: SPARK-2870
> URL: https://issues.apache.org/jira/browse/SPARK-2870
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Reporter: Nicholas Chammas
>
> h4. Background
> I love the {{SQLContext.jsonRDD()}} and {{SQLContext.jsonFile()}} methods. 
> They process JSON text directly and infer a schema that covers the entire 
> source data set. 
> This is very important with semi-structured data like JSON since individual 
> elements in the data set are free to have different structures. Matching 
> fields across elements may even have different value types.
> For example:
> {code}
> {"a": 5}
> {"a": "cow"}
> {code}
> To get a queryable schema that covers the whole data set, you need to infer a 
> schema by looking at the whole data set. The aforementioned 
> {{SQLContext.json...()}} methods do this very well. 
> h4. Feature Request
> What we need is for {{SQLContext.inferSchema()}} to do this, too. 
> Alternatively, we need a new {{SQLContext}} method that works on RDDs of 
> Python dictionaries and does something functionally equivalent to this:
> {code}
> SQLContext.jsonRDD(RDD[dict].map(lambda x: json.dumps(x)))
> {code}
> As of 1.0.2, 
> [{{inferSchema()}}|http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema]
>  just looks at the first element in the data set. This won't help much when 
> the structure of the elements in the target RDD is variable.
> h4. Example Use Case
> * You have some JSON text data that you want to analyze using Spark SQL. 
> * You would use one of the {{SQLContext.json...()}} methods, but you need to 
> do some filtering on the data first to remove bad elements--basically, some 
> minimal schema validation.
> * You deserialize the JSON objects to Python {{dict}}s and filter out the 
> bad ones. You now have an RDD of dictionaries.
> * From this RDD, you want a SchemaRDD that captures the schema for the whole 
> data set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12267) Standalone master keeps references to disassociated workers until they stop sending heartbeats

2015-12-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12267?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15054012#comment-15054012
 ] 

Apache Spark commented on SPARK-12267:
--

User 'zsxwing' has created a pull request for this issue:
https://github.com/apache/spark/pull/10261

> Standalone master keeps references to disassociated workers until they stop 
> sending heartbeats
> -
>
> Key: SPARK-12267
> URL: https://issues.apache.org/jira/browse/SPARK-12267
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Jacek Laskowski
>
> While toying with Spark Standalone I've noticed the following messages
> in the logs of the master:
> {code}
> INFO Master: Registering worker 192.168.1.6:59919 with 2 cores, 2.0 GB RAM
> INFO Master: localhost:59920 got disassociated, removing it.
> ...
> WARN Master: Removing worker-20151210090708-192.168.1.6-59919 because
> we got no heartbeat in 60 seconds
> INFO Master: Removing worker worker-20151210090708-192.168.1.6-59919
> on 192.168.1.6:59919
> {code}
> Why does the message "WARN Master: Removing
> worker-20151210090708-192.168.1.6-59919 because we got no heartbeat in
> 60 seconds" appear when the worker should've been removed already (as
> pointed out in "INFO Master: localhost:59920 got disassociated,
> removing it.")?
> Could it be that the ids are different - 192.168.1.6:59919 vs localhost:59920?
> I started master using {{./sbin/start-master.sh -h localhost}} and the
> workers {{./sbin/start-slave.sh spark://localhost:7077}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12302) Example for servlet filter used by spark.ui.filters

2015-12-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15054008#comment-15054008
 ] 

Apache Spark commented on SPARK-12302:
--

User 'Lewuathe' has created a pull request for this issue:
https://github.com/apache/spark/pull/10273

> Example for servlet filter used by spark.ui.filters
> ---
>
> Key: SPARK-12302
> URL: https://issues.apache.org/jira/browse/SPARK-12302
> Project: Spark
>  Issue Type: Improvement
>  Components: Examples
>Affects Versions: 1.5.2
>Reporter: Kai Sasaki
>Priority: Trivial
>  Labels: examples, security
>
> Although the {{spark.ui.filters}} configuration uses simple servlet filters, it is 
> often difficult to understand how to write the filter code and how to integrate it 
> with actual Spark applications. 
> It would help to have examples for trying out a secure Spark cluster.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-6612) Python KMeans parity

2015-12-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6612:
-
Comment: was deleted

(was: [~mengxr] This issue is resolved. But it seems Apache Spark made a wrong 
comment here. Could you please check it out?)

> Python KMeans parity
> 
>
> Key: SPARK-6612
> URL: https://issues.apache.org/jira/browse/SPARK-6612
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Hrishikesh
>Priority: Minor
> Fix For: 1.4.0
>
>
> This is a subtask of [SPARK-6258] for the Python API of KMeans.  These items 
> are missing:
> KMeans
> * setEpsilon
> * setInitializationSteps
> KMeansModel
> * computeCost
> * k
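For reference, the items listed above map onto PySpark roughly as follows once parity is in place (a sketch against the 1.4-era {{pyspark.mllib}} API; the toy data is made up):

{code}
from pyspark.mllib.clustering import KMeans
from pyspark.mllib.linalg import Vectors

data = sc.parallelize([Vectors.dense([0.0, 0.0]), Vectors.dense([1.0, 1.0])])

# initializationSteps and epsilon correspond to the Scala setters
# setInitializationSteps and setEpsilon.
model = KMeans.train(data, k=2, maxIterations=10,
                     initializationSteps=5, epsilon=1e-4)

print model.k                  # KMeansModel.k
print model.computeCost(data)  # KMeansModel.computeCost
{code}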



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-6612) Python KMeans parity

2015-12-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6612:
-
Comment: was deleted

(was: User 'FlytxtRnD' has created a pull request for this issue:
https://github.com/apache/spark/pull/5391)

> Python KMeans parity
> 
>
> Key: SPARK-6612
> URL: https://issues.apache.org/jira/browse/SPARK-6612
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, PySpark
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Hrishikesh
>Priority: Minor
> Fix For: 1.4.0
>
>
> This is a subtask of [SPARK-6258] for the Python API of KMeans.  These items 
> are missing:
> KMeans
> * setEpsilon
> * setInitializationSteps
> KMeansModel
> * computeCost
> * k



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4591) Algorithm/model parity in spark.ml

2015-12-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-4591:
-
Target Version/s: 2.0.0
Priority: Critical  (was: Major)
 Description: 
This is an umbrella JIRA for porting spark.mllib implementations to use the 
DataFrame-based API defined under spark.ml.  We want to achieve feature parity 
for the next release.

Create or link subtasks for:
* missing algorithms or models (However, this does NOT include stats or linear 
algebra; those will be handled separately.)
* existing algorithms or models which are missing features, params, etc.

_Note: Please search JIRA for existing issues to avoid duplicates._

  was:This is an umbrella JIRA for porting spark.mllib implementations to adapt 
the new API defined under spark.ml.

 Summary: Algorithm/model parity in spark.ml  (was: Add 
algorithm/model wrappers in spark.ml to adapt the new API)

> Algorithm/model parity in spark.ml
> --
>
> Key: SPARK-4591
> URL: https://issues.apache.org/jira/browse/SPARK-4591
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Xiangrui Meng
>Priority: Critical
>
> This is an umbrella JIRA for porting spark.mllib implementations to use the 
> DataFrame-based API defined under spark.ml.  We want to achieve feature 
> parity for the next release.
> Create or link subtasks for:
> * missing algorithms or models (However, this does NOT include stats or 
> linear algebra; those will be handled separately.)
> * existing algorithms or models which are missing features, params, etc.
> _Note: Please search JIRA for existing issues to avoid duplicates._



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12232) Consider exporting read.table in R

2015-12-11 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15054003#comment-15054003
 ] 

Felix Cheung commented on SPARK-12232:
--

agreed, `sqlTableToDF` would make sense.

> Consider exporting read.table in R
> --
>
> Key: SPARK-12232
> URL: https://issues.apache.org/jira/browse/SPARK-12232
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Felix Cheung
>Priority: Minor
>
> Since we have read.df, read.json, and read.parquet (some in pending PRs), and we 
> have table(), we should consider having read.table() for consistency and 
> R-likeness.
> However, this conflicts with utils::read.table, which returns an R data.frame.
> It seems neither table() nor read.table() is desirable in this case.
> table: https://stat.ethz.ch/R-manual/R-devel/library/base/html/table.html
> read.table: 
> https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12302) Example for servlet filter used by spark.ui.filters

2015-12-11 Thread Kai Sasaki (JIRA)
Kai Sasaki created SPARK-12302:
--

 Summary: Example for servlet filter used by spark.ui.filters
 Key: SPARK-12302
 URL: https://issues.apache.org/jira/browse/SPARK-12302
 Project: Spark
  Issue Type: Improvement
  Components: Examples
Affects Versions: 1.5.2
Reporter: Kai Sasaki
Priority: Trivial


Although the {{spark.ui.filters}} configuration uses simple servlet filters, it is 
often difficult to understand how to write the filter code and how to integrate it 
with actual Spark applications. 

It would help to have examples for trying out a secure Spark cluster.
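For reference, the wiring side is plain configuration: {{spark.ui.filters}} takes a comma-separated list of standard javax.servlet Filter class names, and per-filter parameters can be passed through a {{spark.<class name of filter>.params}} property. A sketch for spark-defaults.conf, using a hypothetical filter class:

{code}
spark.ui.filters org.example.MyAuthFilter
spark.org.example.MyAuthFilter.params param1=value1,param2=value2
{code}

The part this issue asks to demonstrate is the harder one: the filter implementation itself and how it fits into a secured cluster.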



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4591) Add algorithm/model wrappers in spark.ml to adapt the new API

2015-12-11 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15054001#comment-15054001
 ] 

Joseph K. Bradley commented on SPARK-4591:
--

Good point...maybe this can become an umbrella for collecting all remaining 
items.  I'll target it at 2.0.0 for that purpose.

> Add algorithm/model wrappers in spark.ml to adapt the new API
> -
>
> Key: SPARK-4591
> URL: https://issues.apache.org/jira/browse/SPARK-4591
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Xiangrui Meng
>
> This is an umbrella JIRA for porting spark.mllib implementations to adapt the 
> new API defined under spark.ml.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7131) Move tree,forest implementation from spark.mllib to spark.ml

2015-12-11 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15054000#comment-15054000
 ] 

Joseph K. Bradley commented on SPARK-7131:
--

Yes, I'm sorry about how long this has taken, but I have enough confidence in 
the API now to proceed.  I've created a JIRA for doing this in the next release: 
[SPARK-12301], though I may not be able to look at this issue until January.  
Please post your thoughts there, and ping in early January if there is no 
activity.  Thank you!

> Move tree,forest implementation from spark.mllib to spark.ml
> 
>
> Key: SPARK-7131
> URL: https://issues.apache.org/jira/browse/SPARK-7131
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.4.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
> Fix For: 1.5.0
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> We want to change and improve the spark.ml API for trees and ensembles, but 
> we cannot change the old API in spark.mllib.  To support the changes we want 
> to make, we should move the implementation from spark.mllib to spark.ml.  We 
> will generalize and modify it, but will also ensure that we do not change the 
> behavior of the old API.
> There are several steps to this:
> 1. Copy the implementation over to spark.ml and change the spark.ml classes 
> to use that implementation, rather than calling the spark.mllib 
> implementation.  The current spark.ml tests will ensure that the 2 
> implementations learn exactly the same models.  Note: This should include 
> performance testing to make sure the updated code does not have any 
> regressions. --> *UPDATE*: I have run tests using spark-perf, and there were 
> no regressions.
> 2. Remove the spark.mllib implementation, and make the spark.mllib APIs 
> wrappers around the spark.ml implementation.  The spark.ml tests will again 
> ensure that we do not change any behavior.
> 3. Move the unit tests to spark.ml, and change the spark.mllib unit tests to 
> verify model equivalence.
> This JIRA is now for step 1 only.  Steps 2 and 3 will be in separate JIRAs.
> After these updates, we can more safely generalize and improve the spark.ml 
> implementation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12301) Remove final from classes in spark.ml trees and ensembles where possible

2015-12-11 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-12301:
-

 Summary: Remove final from classes in spark.ml trees and ensembles 
where possible
 Key: SPARK-12301
 URL: https://issues.apache.org/jira/browse/SPARK-12301
 Project: Spark
  Issue Type: Improvement
  Components: ML
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley


There have been continuing requests (e.g., [SPARK-7131]) for allowing users to 
extend and modify MLlib models and algorithms.

I want this to happen for the next release.  For GBT, this may need to wait on 
some refactoring (to move the implementation to spark.ml).  But it could be 
done for trees already.  This will be broken into subtasks.

If you are a user who needs these changes, please comment here about what 
specifically needs to be modified for your use case.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9578) Stemmer feature transformer

2015-12-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9578:
---

Assignee: (was: Apache Spark)

> Stemmer feature transformer
> ---
>
> Key: SPARK-9578
> URL: https://issues.apache.org/jira/browse/SPARK-9578
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Transformer mentioned first in [SPARK-5571] based on suggestion from 
> [~aloknsingh].  Very standard NLP preprocessing task.
> From [~aloknsingh]:
> {quote}
> We have one Scala stemmer in scalanlp%chalk 
> https://github.com/scalanlp/chalk/tree/master/src/main/scala/chalk/text/analyze
>   which can easily be copied (as it is an Apache project) and is in Scala too.
> I think this will be a better alternative than the Lucene EnglishAnalyzer or 
> OpenNLP.
> Note: we already use scalanlp%breeze via the Maven dependency, so I think 
> adding a scalanlp%chalk dependency is also an option. But as you had said, we 
> can copy the code as it is small.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-9578) Stemmer feature transformer

2015-12-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9578?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-9578:
---

Assignee: Apache Spark

> Stemmer feature transformer
> ---
>
> Key: SPARK-9578
> URL: https://issues.apache.org/jira/browse/SPARK-9578
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Minor
>
> Transformer mentioned first in [SPARK-5571] based on suggestion from 
> [~aloknsingh].  Very standard NLP preprocessing task.
> From [~aloknsingh]:
> {quote}
> We have one Scala stemmer in scalanlp%chalk 
> https://github.com/scalanlp/chalk/tree/master/src/main/scala/chalk/text/analyze
>   which can easily be copied (as it is an Apache project) and is in Scala too.
> I think this will be a better alternative than the Lucene EnglishAnalyzer or 
> OpenNLP.
> Note: we already use scalanlp%breeze via the Maven dependency, so I think 
> adding a scalanlp%chalk dependency is also an option. But as you had said, we 
> can copy the code as it is small.
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12298) Infinite loop in DataFrame.sortWithinPartitions(String, String*)

2015-12-11 Thread Yin Huai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yin Huai resolved SPARK-12298.
--
   Resolution: Fixed
Fix Version/s: 1.6.1
   2.0.0

Issue resolved by pull request 10271
[https://github.com/apache/spark/pull/10271]

> Infinite loop in DataFrame.sortWithinPartitions(String, String*)
> 
>
> Key: SPARK-12298
> URL: https://issues.apache.org/jira/browse/SPARK-12298
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Ankur Dave
>Assignee: Ankur Dave
> Fix For: 2.0.0, 1.6.1
>
>
> The String overload of DataFrame.sortWithinPartitions calls itself when it 
> should call the Column overload, causing an infinite loop:
> {code}
> Exception in thread "main" java.lang.StackOverflowError
>   at 
> org.apache.spark.sql.DataFrame.sortWithinPartitions(DataFrame.scala:612)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12298) Infinite loop in DataFrame.sortWithinPartitions(String, String*)

2015-12-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12298:


Assignee: Ankur Dave  (was: Apache Spark)

> Infinite loop in DataFrame.sortWithinPartitions(String, String*)
> 
>
> Key: SPARK-12298
> URL: https://issues.apache.org/jira/browse/SPARK-12298
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>
> The String overload of DataFrame.sortWithinPartitions calls itself when it 
> should call the Column overload, causing an infinite loop:
> {code}
> Exception in thread "main" java.lang.StackOverflowError
>   at 
> org.apache.spark.sql.DataFrame.sortWithinPartitions(DataFrame.scala:612)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12298) Infinite loop in DataFrame.sortWithinPartitions(String, String*)

2015-12-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12298:


Assignee: Apache Spark  (was: Ankur Dave)

> Infinite loop in DataFrame.sortWithinPartitions(String, String*)
> 
>
> Key: SPARK-12298
> URL: https://issues.apache.org/jira/browse/SPARK-12298
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Ankur Dave
>Assignee: Apache Spark
>
> The String overload of DataFrame.sortWithinPartitions calls itself when it 
> should call the Column overload, causing an infinite loop:
> {code}
> Exception in thread "main" java.lang.StackOverflowError
>   at 
> org.apache.spark.sql.DataFrame.sortWithinPartitions(DataFrame.scala:612)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12298) Infinite loop in DataFrame.sortWithinPartitions(String, String*)

2015-12-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12298:


Assignee: Apache Spark  (was: Ankur Dave)

> Infinite loop in DataFrame.sortWithinPartitions(String, String*)
> 
>
> Key: SPARK-12298
> URL: https://issues.apache.org/jira/browse/SPARK-12298
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Ankur Dave
>Assignee: Apache Spark
>
> The String overload of DataFrame.sortWithinPartitions calls itself when it 
> should call the Column overload, causing an infinite loop:
> {code}
> Exception in thread "main" java.lang.StackOverflowError
>   at 
> org.apache.spark.sql.DataFrame.sortWithinPartitions(DataFrame.scala:612)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12298) Infinite loop in DataFrame.sortWithinPartitions(String, String*)

2015-12-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12298:


Assignee: Ankur Dave  (was: Apache Spark)

> Infinite loop in DataFrame.sortWithinPartitions(String, String*)
> 
>
> Key: SPARK-12298
> URL: https://issues.apache.org/jira/browse/SPARK-12298
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>
> The String overload of DataFrame.sortWithinPartitions calls itself when it 
> should call the Column overload, causing an infinite loop:
> {code}
> Exception in thread "main" java.lang.StackOverflowError
>   at 
> org.apache.spark.sql.DataFrame.sortWithinPartitions(DataFrame.scala:612)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12298) Infinite loop in DataFrame.sortWithinPartitions(String, String*)

2015-12-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15053978#comment-15053978
 ] 

Apache Spark commented on SPARK-12298:
--

User 'ankurdave' has created a pull request for this issue:
https://github.com/apache/spark/pull/10271

> Infinite loop in DataFrame.sortWithinPartitions(String, String*)
> 
>
> Key: SPARK-12298
> URL: https://issues.apache.org/jira/browse/SPARK-12298
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Ankur Dave
>Assignee: Ankur Dave
>
> The String overload of DataFrame.sortWithinPartitions calls itself when it 
> should call the Column overload, causing an infinite loop:
> {code}
> Exception in thread "main" java.lang.StackOverflowError
>   at 
> org.apache.spark.sql.DataFrame.sortWithinPartitions(DataFrame.scala:612)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2870) Thorough schema inference directly on RDDs of Python dictionaries

2015-12-11 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15053977#comment-15053977
 ] 

Nicholas Chammas commented on SPARK-2870:
-

> Do you think it's OK to close this issue?

I haven't tested 1.6 yet, but yeah, if there is a way to get the functional 
equivalent of 

{code}
SQLContext.jsonRDD(RDD[dict].map(lambda x: json.dumps(x)))
{code}

without the waste, as explained in the issue description, then I think we're 
good.

But from your most recent comment, it looks like this is not the case.
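The candidate equivalent under discussion, assuming {{samplingRatio=1.0}} forces a full scan during inference, would be a sketch like:

{code}
# Hypothetical stand-in for SQLContext.jsonRDD(rdd.map(lambda x: json.dumps(x))),
# where rdd_of_dicts is an RDD of Python dicts:
df = sqlContext.createDataFrame(rdd_of_dicts, samplingRatio=1.0)
{code}

i.e. inferring a merged schema directly from the dicts, without serializing to JSON and parsing it back.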

> Thorough schema inference directly on RDDs of Python dictionaries
> -
>
> Key: SPARK-2870
> URL: https://issues.apache.org/jira/browse/SPARK-2870
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Reporter: Nicholas Chammas
>
> h4. Background
> I love the {{SQLContext.jsonRDD()}} and {{SQLContext.jsonFile()}} methods. 
> They process JSON text directly and infer a schema that covers the entire 
> source data set. 
> This is very important with semi-structured data like JSON since individual 
> elements in the data set are free to have different structures. Matching 
> fields across elements may even have different value types.
> For example:
> {code}
> {"a": 5}
> {"a": "cow"}
> {code}
> To get a queryable schema that covers the whole data set, you need to infer a 
> schema by looking at the whole data set. The aforementioned 
> {{SQLContext.json...()}} methods do this very well. 
> h4. Feature Request
> What we need is for {{SQLContext.inferSchema()}} to do this, too. 
> Alternatively, we need a new {{SQLContext}} method that works on RDDs of 
> Python dictionaries and does something functionally equivalent to this:
> {code}
> SQLContext.jsonRDD(RDD[dict].map(lambda x: json.dumps(x)))
> {code}
> As of 1.0.2, 
> [{{inferSchema()}}|http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema]
>  just looks at the first element in the data set. This won't help much when 
> the structure of the elements in the target RDD is variable.
> h4. Example Use Case
> * You have some JSON text data that you want to analyze using Spark SQL. 
> * You would use one of the {{SQLContext.json...()}} methods, but you need to 
> do some filtering on the data first to remove bad elements--basically, some 
> minimal schema validation.
> * You deserialize the JSON objects to Python {{dict}}s and filter out the 
> bad ones. You now have an RDD of dictionaries.
> * From this RDD, you want a SchemaRDD that captures the schema for the whole 
> data set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12276) Prevent RejectedExecutionException by checking if ThreadPoolExecutor is shutdown and its capacity

2015-12-11 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh resolved SPARK-12276.
-
Resolution: Won't Fix

> Prevent RejectedExecutionException by checking if ThreadPoolExecutor is 
> shutdown and its capacity
> -
>
> Key: SPARK-12276
> URL: https://issues.apache.org/jira/browse/SPARK-12276
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Liang-Chi Hsieh
>Priority: Minor
>
> We noticed that it is possible to throw a RejectedExecutionException when 
> submitting a thread in AppClient. The error is like the following. We should add 
> some checks to prevent it.
> java.util.concurrent.RejectedExecutionException: Task 
> java.util.concurrent.FutureTask@2077082c rejected from 
> java.util.concurrent.ThreadPoolExecutor@66b9915a[Running, pool size = 1, 
> active threads = 0, queued tasks = 0, completed tasks = 1]
> at 
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048)
> at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
> at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
> at 
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:110)
> at 
> org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:96)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12062) Master rebuilding historical SparkUI should be asynchronous

2015-12-11 Thread Bryan Cutler (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15053963#comment-15053963
 ] 

Bryan Cutler commented on SPARK-12062:
--

Hi [~andrewor14], I have a PR ready for this, so I was just wondering if there 
could be any use for it if it were only enabled as a conf option?

> Master rebuilding historical SparkUI should be asynchronous
> ---
>
> Key: SPARK-12062
> URL: https://issues.apache.org/jira/browse/SPARK-12062
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Bryan Cutler
>
> When a long-running application finishes, it takes a while (sometimes 
> minutes) to rebuild the SparkUI. However, in Master.scala this is currently 
> done within the RPC event loop, which runs in only one thread. Thus, in the 
> meantime no other applications can register with this master.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11685) Find duplicate content under examples/

2015-12-11 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15053954#comment-15053954
 ] 

Joseph K. Bradley commented on SPARK-11685:
---

[~yanboliang]  Are there other examples which you and [~mengxr] planned to move 
or remove (or, if you have not checked thoroughly, could you please do so)?  Thanks!

> Find duplicate content under examples/
> --
>
> Key: SPARK-11685
> URL: https://issues.apache.org/jira/browse/SPARK-11685
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML, MLlib
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: Yanbo Liang
>
> Because we moved some example code from the user guide markdown to the examples/ 
> folder, there exists some duplicate content under examples/. We should 
> consider merging the example files in 1.6. Please find possibly duplicate 
> content and create sub-tasks for each of them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11978) Move dataset_example.py to examples/ml and rename to dataframe_example.py

2015-12-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-11978.
---
   Resolution: Fixed
Fix Version/s: 1.6.1
   2.0.0

Issue resolved by pull request 9957
[https://github.com/apache/spark/pull/9957]

> Move dataset_example.py to examples/ml and rename to dataframe_example.py
> -
>
> Key: SPARK-11978
> URL: https://issues.apache.org/jira/browse/SPARK-11978
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib, PySpark
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 2.0.0, 1.6.1
>
>
> Since `Dataset` has a new meaning in Spark 1.6, we should rename it to avoid 
> confusion.
> SPARK-11895 finished the work for the Scala example; here we focus on the Python 
> one. 
> Move dataset_example.py to examples/ml and rename to dataframe_example.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11978) Move dataset_example.py to examples/ml and rename to dataframe_example.py

2015-12-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11978?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-11978:
--
Shepherd: Joseph K. Bradley
Assignee: Yanbo Liang

> Move dataset_example.py to examples/ml and rename to dataframe_example.py
> -
>
> Key: SPARK-11978
> URL: https://issues.apache.org/jira/browse/SPARK-11978
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib, PySpark
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
>
> Since `Dataset` has a new meaning in Spark 1.6, we should rename it to avoid 
> confusion.
> SPARK-11895 finished the work for the Scala example; here we focus on the Python 
> one. 
> Move dataset_example.py to examples/ml and rename to dataframe_example.py



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10258) Add @Since annotation to ml.feature

2015-12-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-10258:
--
Target Version/s:   (was: 1.6.0)

> Add @Since annotation to ml.feature
> ---
>
> Key: SPARK-10258
> URL: https://issues.apache.org/jira/browse/SPARK-10258
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Xiangrui Meng
>Assignee: Martin Brown
>Priority: Minor
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10264) Add @Since annotation to ml.recoomendation

2015-12-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-10264:
--
Target Version/s:   (was: 1.6.0)

> Add @Since annotation to ml.recoomendation
> --
>
> Key: SPARK-10264
> URL: https://issues.apache.org/jira/browse/SPARK-10264
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Xiangrui Meng
>Assignee: Tijo Thomas
>Priority: Minor
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-10263) Add @Since annotation to ml.param and ml.*

2015-12-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-10263:
--
Target Version/s:   (was: 1.6.0)

> Add @Since annotation to ml.param and ml.*
> --
>
> Key: SPARK-10263
> URL: https://issues.apache.org/jira/browse/SPARK-10263
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Xiangrui Meng
>Assignee: Hiroshi Takahashi
>Priority: Minor
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-10285) Add @since annotation to pyspark.ml.util

2015-12-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-10285.
-
  Resolution: Not A Problem
Target Version/s:   (was: 1.6.0)

> Add @since annotation to pyspark.ml.util
> 
>
> Key: SPARK-10285
> URL: https://issues.apache.org/jira/browse/SPARK-10285
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Reporter: Xiangrui Meng
>Assignee: Yu Ishikawa
>Priority: Minor
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10285) Add @since annotation to pyspark.ml.util

2015-12-11 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15053933#comment-15053933
 ] 

Joseph K. Bradley commented on SPARK-10285:
---

I'll close the issue.  Thanks!

> Add @since annotation to pyspark.ml.util
> 
>
> Key: SPARK-10285
> URL: https://issues.apache.org/jira/browse/SPARK-10285
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Reporter: Xiangrui Meng
>Assignee: Yu Ishikawa
>Priority: Minor
>  Labels: starter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12300) Fix schema inference on local collections

2015-12-11 Thread holdenk (JIRA)
holdenk created SPARK-12300:
---

 Summary: Fix schema inference on local collections
 Key: SPARK-12300
 URL: https://issues.apache.org/jira/browse/SPARK-12300
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Reporter: holdenk
Priority: Minor


Current schema inference for local Python collections halts as soon as there 
are no NullTypes. This is different from when we specify a sampling ratio of 
1.0 on a distributed collection. This could result in incomplete schema 
information.

Repro:

{code}
input = [{"a": 1}, {"b": "coffee"}]
df = sqlContext.createDataFrame(input)
print df.schema
{code}

Discovered while looking at SPARK-2870
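For contrast, routing the same input through an RDD with a full sampling ratio should produce the complete merged schema (a sketch; the expectation follows the description above, not a verified run):

{code}
df2 = sqlContext.createDataFrame(sc.parallelize(input), samplingRatio=1.0)
print df2.schema  # expect both fields, a and b, to be present
{code}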



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6270) Standalone Master hangs when streaming job completes and event logging is enabled

2015-12-11 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15053929#comment-15053929
 ] 

Andrew Or commented on SPARK-6270:
--

[~shivaram] That may be difficult to do because different applications can 
specify different log directories, whereas now the history server reads all 
logs in the same one. Also it would add complexity to the Master process 
because now it binds to two ports instead of one. I think we should try to keep 
it lightweight in the future and simply rip out this functionality. With Spark 
2.0 I believe we're allowed to do that. :)

> Standalone Master hangs when streaming job completes and event logging is 
> enabled
> -
>
> Key: SPARK-6270
> URL: https://issues.apache.org/jira/browse/SPARK-6270
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Streaming
>Affects Versions: 1.2.0, 1.2.1, 1.3.0, 1.5.1
>Reporter: Tathagata Das
>Priority: Critical
>
> If event logging is enabled, the Spark Standalone Master tries to recreate 
> the web UI of a completed Spark application from its event logs. However, if 
> this event log is huge (e.g. for a Spark Streaming application), the master 
> hangs in its attempt to read and recreate the web UI. This hang makes the 
> whole standalone cluster unusable. 
> The workaround is to disable event logging.
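> For reference, the workaround amounts to a single setting, e.g. in 
> spark-defaults.conf (a minimal sketch):
> {code}
> spark.eventLog.enabled  false
> {code}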



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6270) Standalone Master hangs when streaming job completes and event logging is enabled

2015-12-11 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15053924#comment-15053924
 ] 

Andrew Or commented on SPARK-6270:
--

I have filed a JIRA for it: https://issues.apache.org/jira/browse/SPARK-12299

> Standalone Master hangs when streaming job completes and event logging is 
> enabled
> -
>
> Key: SPARK-6270
> URL: https://issues.apache.org/jira/browse/SPARK-6270
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Streaming
>Affects Versions: 1.2.0, 1.2.1, 1.3.0, 1.5.1
>Reporter: Tathagata Das
>Priority: Critical
>
> If event logging is enabled, the Spark Standalone Master tries to recreate 
> the web UI of a completed Spark application from its event logs. However, if 
> this event log is huge (e.g. for a Spark Streaming application), the master 
> hangs in its attempt to read and recreate the web UI. This hang makes the 
> whole standalone cluster unusable. 
> The workaround is to disable event logging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12299) Remove history serving functionality from standalone Master

2015-12-11 Thread Andrew Or (JIRA)
Andrew Or created SPARK-12299:
-

 Summary: Remove history serving functionality from standalone 
Master
 Key: SPARK-12299
 URL: https://issues.apache.org/jira/browse/SPARK-12299
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Affects Versions: 1.0.0
Reporter: Andrew Or


The standalone Master currently continues to serve the historical UIs of 
applications that have completed and enabled event logging. This poses 
problems, however, if the event log is very large, e.g. SPARK-6270. The Master 
might OOM or hang while it rebuilds the UI, rejecting applications in the 
meantime.

Personally, I have had to make modifications in the code to disable this 
myself, because I wanted to use event logging in standalone mode for 
applications that produce a lot of logging.

Removing this from the Master would simplify the process significantly. This 
issue supersedes SPARK-12062.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12062) Master rebuilding historical SparkUI should be asynchronous

2015-12-11 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15053923#comment-15053923
 ] 

Andrew Or commented on SPARK-12062:
---

Actually, I'm closing this as a Won't Fix since SPARK-12299 supersedes this.

> Master rebuilding historical SparkUI should be asynchronous
> ---
>
> Key: SPARK-12062
> URL: https://issues.apache.org/jira/browse/SPARK-12062
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Bryan Cutler
>
> When a long-running application finishes, it takes a while (sometimes 
> minutes) to rebuild the SparkUI. However, in Master.scala this is currently 
> done within the RPC event loop, which runs only in 1 thread. Thus, in the 
> meantime no other applications can register with this master.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12062) Master rebuilding historical SparkUI should be asynchronous

2015-12-11 Thread Andrew Or (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Or resolved SPARK-12062.
---
Resolution: Won't Fix

> Master rebuilding historical SparkUI should be asynchronous
> ---
>
> Key: SPARK-12062
> URL: https://issues.apache.org/jira/browse/SPARK-12062
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.0.0
>Reporter: Andrew Or
>Assignee: Bryan Cutler
>
> When a long-running application finishes, it takes a while (sometimes 
> minutes) to rebuild the SparkUI. However, in Master.scala this is currently 
> done within the RPC event loop, which runs only in 1 thread. Thus, in the 
> meantime no other applications can register with this master.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2870) Thorough schema inference directly on RDDs of Python dictionaries

2015-12-11 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15053922#comment-15053922
 ] 

holdenk commented on SPARK-2870:


Note: I discovered a related issue (namely that schema inference on a local 
collection of dictionaries only looks at the head).

> Thorough schema inference directly on RDDs of Python dictionaries
> -
>
> Key: SPARK-2870
> URL: https://issues.apache.org/jira/browse/SPARK-2870
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Reporter: Nicholas Chammas
>
> h4. Background
> I love the {{SQLContext.jsonRDD()}} and {{SQLContext.jsonFile()}} methods. 
> They process JSON text directly and infer a schema that covers the entire 
> source data set. 
> This is very important with semi-structured data like JSON since individual 
> elements in the data set are free to have different structures. Matching 
> fields across elements may even have different value types.
> For example:
> {code}
> {"a": 5}
> {"a": "cow"}
> {code}
> To get a queryable schema that covers the whole data set, you need to infer a 
> schema by looking at the whole data set. The aforementioned 
> {{SQLContext.json...()}} methods do this very well. 
> h4. Feature Request
> What we need is for {{SQlContext.inferSchema()}} to do this, too. 
> Alternatively, we need a new {{SQLContext}} method that works on RDDs of 
> Python dictionaries and does something functionally equivalent to this:
> {code}
> SQLContext.jsonRDD(RDD[dict].map(lambda x: json.dumps(x)))
> {code}
> As of 1.0.2, 
> [{{inferSchema()}}|http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema]
>  just looks at the first element in the data set. This won't help much when 
> the structure of the elements in the target RDD is variable.
> h4. Example Use Case
> * You have some JSON text data that you want to analyze using Spark SQL. 
> * You would use one of the {{SQLContext.json...()}} methods, but you need to 
> do some filtering on the data first to remove bad elements--basically, some 
> minimal schema validation.
> * You deserialize the JSON objects to Python {{dict}} s and filter out the 
> bad ones. You now have an RDD of dictionaries.
> * From this RDD, you want a SchemaRDD that captures the schema for the whole 
> data set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6270) Standalone Master hangs when streaming job completes and event logging is enabled

2015-12-11 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15053921#comment-15053921
 ] 

Shivaram Venkataraman commented on SPARK-6270:
--

Yeah, I think that's a fair solution to this problem. Could we fork the history 
server as a separate process in `start-master.sh` if event logging is enabled? 
Then we might even be able to preserve some backward compatibility without 
users having to start an additional service. 

> Standalone Master hangs when streaming job completes and event logging is 
> enabled
> -
>
> Key: SPARK-6270
> URL: https://issues.apache.org/jira/browse/SPARK-6270
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Streaming
>Affects Versions: 1.2.0, 1.2.1, 1.3.0, 1.5.1
>Reporter: Tathagata Das
>Priority: Critical
>
> If event logging is enabled, the Spark Standalone Master tries to recreate 
> the web UI of a completed Spark application from its event logs. However, if 
> this event log is huge (e.g. for a Spark Streaming application), the master 
> hangs in its attempt to read and recreate the web UI. This hang makes the 
> whole standalone cluster unusable. 
> The workaround is to disable event logging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2870) Thorough schema inference directly on RDDs of Python dictionaries

2015-12-11 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15053917#comment-15053917
 ] 

holdenk edited comment on SPARK-2870 at 12/12/15 1:27 AM:
--

So this seems to be resolved in Spark 1.6 with {code}createDataFrame{code}

e.g.:

{code}
input = [{"a": 1}, {"b": "coffee"}]
rdd = sc.parallelize(input)
df = sqlContext.createDataFrame(rdd, samplingRatio=1.0)
print df.schema
{code}
Results in 
{code}StructType(List(StructField(a,LongType,true),StructField(b,StringType,true))){code}

Do you think it's OK to close this issue?


was (Author: holdenk):
So this seems to be resolved in Spark 1.6 with `createDataFrame`

e.g.:

{code}
input = [{"a": 1}, {"b": "coffee"}]
rdd = sc.parallelize(input)
df = sqlContext.createDataFrame(rdd, samplingRatio=1.0)
print df.schema
{code}
Results in 
`StructType(List(StructField(a,LongType,true),StructField(b,StringType,true)))`

Do you think it's OK to close this issue?

> Thorough schema inference directly on RDDs of Python dictionaries
> -
>
> Key: SPARK-2870
> URL: https://issues.apache.org/jira/browse/SPARK-2870
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Reporter: Nicholas Chammas
>
> h4. Background
> I love the {{SQLContext.jsonRDD()}} and {{SQLContext.jsonFile()}} methods. 
> They process JSON text directly and infer a schema that covers the entire 
> source data set. 
> This is very important with semi-structured data like JSON since individual 
> elements in the data set are free to have different structures. Matching 
> fields across elements may even have different value types.
> For example:
> {code}
> {"a": 5}
> {"a": "cow"}
> {code}
> To get a queryable schema that covers the whole data set, you need to infer a 
> schema by looking at the whole data set. The aforementioned 
> {{SQLContext.json...()}} methods do this very well. 
> h4. Feature Request
> What we need is for {{SQlContext.inferSchema()}} to do this, too. 
> Alternatively, we need a new {{SQLContext}} method that works on RDDs of 
> Python dictionaries and does something functionally equivalent to this:
> {code}
> SQLContext.jsonRDD(RDD[dict].map(lambda x: json.dumps(x)))
> {code}
> As of 1.0.2, 
> [{{inferSchema()}}|http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema]
>  just looks at the first element in the data set. This won't help much when 
> the structure of the elements in the target RDD is variable.
> h4. Example Use Case
> * You have some JSON text data that you want to analyze using Spark SQL. 
> * You would use one of the {{SQLContext.json...()}} methods, but you need to 
> do some filtering on the data first to remove bad elements--basically, some 
> minimal schema validation.
> * You deserialize the JSON objects to Python {{dict}} s and filter out the 
> bad ones. You now have an RDD of dictionaries.
> * From this RDD, you want a SchemaRDD that captures the schema for the whole 
> data set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2870) Thorough schema inference directly on RDDs of Python dictionaries

2015-12-11 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15053917#comment-15053917
 ] 

holdenk commented on SPARK-2870:


So this seems to be resolved in Spark 1.6 with `createDataFrame`

e.g.:

{code:python}
input = [{"a": 1}, {"b": "coffee"}]
rdd = sc.parallelize(input)
df = sqlContext.createDataFrame(rdd, samplingRatio=1.0)
print df.schema
{code}
Results in 
`StructType(List(StructField(a,LongType,true),StructField(b,StringType,true)))`

Do you think it's OK to close this issue?

> Thorough schema inference directly on RDDs of Python dictionaries
> -
>
> Key: SPARK-2870
> URL: https://issues.apache.org/jira/browse/SPARK-2870
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Reporter: Nicholas Chammas
>
> h4. Background
> I love the {{SQLContext.jsonRDD()}} and {{SQLContext.jsonFile()}} methods. 
> They process JSON text directly and infer a schema that covers the entire 
> source data set. 
> This is very important with semi-structured data like JSON since individual 
> elements in the data set are free to have different structures. Matching 
> fields across elements may even have different value types.
> For example:
> {code}
> {"a": 5}
> {"a": "cow"}
> {code}
> To get a queryable schema that covers the whole data set, you need to infer a 
> schema by looking at the whole data set. The aforementioned 
> {{SQLContext.json...()}} methods do this very well. 
> h4. Feature Request
> What we need is for {{SQlContext.inferSchema()}} to do this, too. 
> Alternatively, we need a new {{SQLContext}} method that works on RDDs of 
> Python dictionaries and does something functionally equivalent to this:
> {code}
> SQLContext.jsonRDD(RDD[dict].map(lambda x: json.dumps(x)))
> {code}
> As of 1.0.2, 
> [{{inferSchema()}}|http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema]
>  just looks at the first element in the data set. This won't help much when 
> the structure of the elements in the target RDD is variable.
> h4. Example Use Case
> * You have some JSON text data that you want to analyze using Spark SQL. 
> * You would use one of the {{SQLContext.json...()}} methods, but you need to 
> do some filtering on the data first to remove bad elements--basically, some 
> minimal schema validation.
> * You deserialize the JSON objects to Python {{dict}} s and filter out the 
> bad ones. You now have an RDD of dictionaries.
> * From this RDD, you want a SchemaRDD that captures the schema for the whole 
> data set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-2870) Thorough schema inference directly on RDDs of Python dictionaries

2015-12-11 Thread holdenk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15053917#comment-15053917
 ] 

holdenk edited comment on SPARK-2870 at 12/12/15 1:26 AM:
--

So this seems to be resolved in Spark 1.6 with `createDataFrame`

e.g.:

{code}
input = [{"a": 1}, {"b": "coffee"}]
rdd = sc.parallelize(input)
df = sqlContext.createDataFrame(rdd, samplingRatio=1.0)
print df.schema
{code}
Results in 
`StructType(List(StructField(a,LongType,true),StructField(b,StringType,true)))`

Do you think it's OK to close this issue?


was (Author: holdenk):
So this seems to be resolved in Spark 1.6 with `createDataFrame`

e.g.:

{code:python}
input = [{"a": 1}, {"b": "coffee"}]
rdd = sc.parallelize(input)
df = sqlContext.createDataFrame(rdd, samplingRatio=1.0)
print df.schema
{code}
Results in 
`StructType(List(StructField(a,LongType,true),StructField(b,StringType,true)))`

Do you think it's OK to close this issue?

> Thorough schema inference directly on RDDs of Python dictionaries
> -
>
> Key: SPARK-2870
> URL: https://issues.apache.org/jira/browse/SPARK-2870
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>Reporter: Nicholas Chammas
>
> h4. Background
> I love the {{SQLContext.jsonRDD()}} and {{SQLContext.jsonFile()}} methods. 
> They process JSON text directly and infer a schema that covers the entire 
> source data set. 
> This is very important with semi-structured data like JSON since individual 
> elements in the data set are free to have different structures. Matching 
> fields across elements may even have different value types.
> For example:
> {code}
> {"a": 5}
> {"a": "cow"}
> {code}
> To get a queryable schema that covers the whole data set, you need to infer a 
> schema by looking at the whole data set. The aforementioned 
> {{SQLContext.json...()}} methods do this very well. 
> h4. Feature Request
> What we need is for {{SQlContext.inferSchema()}} to do this, too. 
> Alternatively, we need a new {{SQLContext}} method that works on RDDs of 
> Python dictionaries and does something functionally equivalent to this:
> {code}
> SQLContext.jsonRDD(RDD[dict].map(lambda x: json.dumps(x)))
> {code}
> As of 1.0.2, 
> [{{inferSchema()}}|http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema]
>  just looks at the first element in the data set. This won't help much when 
> the structure of the elements in the target RDD is variable.
> h4. Example Use Case
> * You have some JSON text data that you want to analyze using Spark SQL. 
> * You would use one of the {{SQLContext.json...()}} methods, but you need to 
> do some filtering on the data first to remove bad elements--basically, some 
> minimal schema validation.
> * You deserialize the JSON objects to Python {{dict}} s and filter out the 
> bad ones. You now have an RDD of dictionaries.
> * From this RDD, you want a SchemaRDD that captures the schema for the whole 
> data set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6270) Standalone Master hangs when streaming job completes and event logging is enabled

2015-12-11 Thread Andrew Or (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15053915#comment-15053915
 ] 

Andrew Or commented on SPARK-6270:
--

+1 to removing history serving functionality from standalone Master

> Standalone Master hangs when streaming job completes and event logging is 
> enabled
> -
>
> Key: SPARK-6270
> URL: https://issues.apache.org/jira/browse/SPARK-6270
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Streaming
>Affects Versions: 1.2.0, 1.2.1, 1.3.0, 1.5.1
>Reporter: Tathagata Das
>Priority: Critical
>
> If event logging is enabled, the Spark Standalone Master tries to recreate 
> the web UI of a completed Spark application from its event logs. However, if 
> this event log is huge (e.g. for a Spark Streaming application), the master 
> hangs in its attempt to read and recreate the web UI. This hang makes the 
> whole standalone cluster unusable. 
> The workaround is to disable event logging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12276) Prevent RejectedExecutionException by checking if ThreadPoolExecutor is shutdown and its capacity

2015-12-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12276:


Assignee: (was: Apache Spark)

> Prevent RejectedExecutionException by checking if ThreadPoolExecutor is 
> shutdown and its capacity
> -
>
> Key: SPARK-12276
> URL: https://issues.apache.org/jira/browse/SPARK-12276
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Liang-Chi Hsieh
>Priority: Minor
>
> We noticed that it is possible to throw RejectedExecutionException when 
> submitting a thread in AppClient. The error looks like the following. We 
> should add some checks to prevent it.
> java.util.concurrent.RejectedExecutionException: Task 
> java.util.concurrent.FutureTask@2077082c rejected from 
> java.util.concurrent.ThreadPoolExecutor@66b9915a[Running, pool size = 1, 
> active threads = 0, queued tasks = 0, completed tasks = 1]
> at 
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048)
> at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
> at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
> at 
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:110)
> at 
> org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:96)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12276) Prevent RejectedExecutionException by checking if ThreadPoolExecutor is shutdown and its capacity

2015-12-11 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-12276:


Assignee: Apache Spark

> Prevent RejectedExecutionException by checking if ThreadPoolExecutor is 
> shutdown and its capacity
> -
>
> Key: SPARK-12276
> URL: https://issues.apache.org/jira/browse/SPARK-12276
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Reporter: Liang-Chi Hsieh
>Assignee: Apache Spark
>Priority: Minor
>
> We noticed that it is possible to throw RejectedExecutionException when 
> submitting a thread in AppClient. The error looks like the following. We 
> should add some checks to prevent it.
> java.util.concurrent.RejectedExecutionException: Task 
> java.util.concurrent.FutureTask@2077082c rejected from 
> java.util.concurrent.ThreadPoolExecutor@66b9915a[Running, pool size = 1, 
> active threads = 0, queued tasks = 0, completed tasks = 1]
> at 
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048)
> at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
> at 
> java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
> at 
> java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:110)
> at 
> org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$tryRegisterAllMasters$1.apply(AppClient.scala:96)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12298) Infinite loop in DataFrame.sortWithinPartitions(String, String*)

2015-12-11 Thread Ankur Dave (JIRA)
Ankur Dave created SPARK-12298:
--

 Summary: Infinite loop in DataFrame.sortWithinPartitions(String, 
String*)
 Key: SPARK-12298
 URL: https://issues.apache.org/jira/browse/SPARK-12298
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 1.6.0
Reporter: Ankur Dave
Assignee: Ankur Dave


The String overload of DataFrame.sortWithinPartitions calls itself when it 
should call the Column overload, causing an infinite loop:

{code}
Exception in thread "main" java.lang.StackOverflowError
at 
org.apache.spark.sql.DataFrame.sortWithinPartitions(DataFrame.scala:612)
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12297) Add work-around for Parquet/Hive int96 timestamp bug.

2015-12-11 Thread Ryan Blue (JIRA)
Ryan Blue created SPARK-12297:
-

 Summary: Add work-around for Parquet/Hive int96 timestamp bug.
 Key: SPARK-12297
 URL: https://issues.apache.org/jira/browse/SPARK-12297
 Project: Spark
  Issue Type: Task
  Components: Spark Core
Reporter: Ryan Blue


Hive has a bug where timestamps in Parquet data are incorrectly adjusted from 
the SQL session time zone to UTC, as though they carried time zone information. 
This is incorrect behavior because these values are SQL timestamps without time 
zone and should not be changed internally.
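
To illustrate the semantics (a hypothetical sketch of the adjustment, not the 
actual Hive code path; the UTC-8 session offset is assumed for the example):

{code}
from datetime import datetime, timedelta

# Timestamp stored in Parquet: 2015-12-11 12:00:00 (no time zone attached).
written = datetime(2015, 12, 11, 12, 0, 0)

# A reader whose SQL session is at UTC-8 incorrectly treats the stored value
# as session-local and shifts it to UTC, changing what the user stored:
misread = written + timedelta(hours=8)  # 2015-12-11 20:00:00
{code}

A timestamp without time zone should round-trip unchanged regardless of the 
session's zone.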



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6523) Error when get attribute of StandardScalerModel, When use python api

2015-12-11 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15053830#comment-15053830
 ] 

Joseph K. Bradley commented on SPARK-6523:
--

You're right; sorry, I did not see that PR when it went into Spark.  I just 
made one specific to your need: [SPARK-12296]

> Error when get attribute of StandardScalerModel, When use python api
> 
>
> Key: SPARK-6523
> URL: https://issues.apache.org/jira/browse/SPARK-6523
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, MLlib, PySpark
>Affects Versions: 1.3.0
>Reporter: lee.xiaobo.2006
>
> test code
> ===
> from pyspark import SparkConf, SparkContext
> from pyspark.mllib.util import MLUtils
> from pyspark.mllib.linalg import Vectors
> from pyspark.mllib.feature import StandardScaler
> conf = SparkConf().setAppName('Test')
> sc = SparkContext(conf=conf)
> data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
> label = data.map(lambda x: x.label)
> features = data.map(lambda x: x.features)
> scaler1 = StandardScaler().fit(features)
> print scaler1.std   # error
> sc.stop()
> ---
> error:
> Traceback (most recent call last):
>   File "/data1/s/apps/spark-app/app/test_ssm.py", line 22, in 
> print scaler1.std
> AttributeError: 'StandardScalerModel' object has no attribute 'std'
> 15/03/25 12:17:28 INFO Utils: path = 
> /data1/s/apps/spark-1.4.0-SNAPSHOT/data/spark-eb1ed7c0-a5ce-4748-a817-3cb0687ee282/blockmgr-5398b477-127d-4259-a71b-608a324e1cd3,
>  already present as root for deletion.
> =
> Another question: how can one serialize or save the scaler model?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12296) Feature parity for pyspark.mllib StandardScalerModel

2015-12-11 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-12296:
-

 Summary: Feature parity for pyspark.mllib StandardScalerModel
 Key: SPARK-12296
 URL: https://issues.apache.org/jira/browse/SPARK-12296
 Project: Spark
  Issue Type: New Feature
  Components: MLlib, PySpark
Reporter: Joseph K. Bradley
Priority: Minor


Some methods are missing, such as ways to access the std, mean, etc.  This JIRA 
is for feature parity for pyspark.mllib.feature.StandardScaler & 
StandardScalerModel
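
A minimal sketch of what the missing accessors could look like on the Python 
side (hedged: it assumes the existing {{JavaModelWrapper.call}} helper, and the 
property names simply mirror the Scala model):

{code}
from pyspark.mllib.feature import JavaVectorTransformer

class StandardScalerModel(JavaVectorTransformer):
    # Illustrative parity additions: delegate to the underlying Java model.
    @property
    def std(self):
        return self.call("std")

    @property
    def mean(self):
        return self.call("mean")
{code}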



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12296) Feature parity for pyspark.mllib StandardScalerModel

2015-12-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12296?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-12296:
--
Issue Type: Sub-task  (was: New Feature)
Parent: SPARK-11937

> Feature parity for pyspark.mllib StandardScalerModel
> 
>
> Key: SPARK-12296
> URL: https://issues.apache.org/jira/browse/SPARK-12296
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib, PySpark
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Some methods are missing, such as ways to access the std, mean, etc.  This 
> JIRA is for feature parity for pyspark.mllib.feature.StandardScaler & 
> StandardScalerModel



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12272) Gradient boosted trees: too slow at the first finding of best splits

2015-12-11 Thread Wenmin Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15053827#comment-15053827
 ] 

Wenmin Wu commented on SPARK-12272:
---

I didn't run a synthetic test; I used my company's click-through data, which 
has 28905 features and 18644639 records.

I trained a GBDT model with 200 trees (equal to the number of iterations) and 
maxDepth = 7. From 'training-log1', you can see that the first split search 
takes 9.7 min. However, in xgboost's single-node implementation it takes less 
than 10 secs.

At first, I thought this was due to the statistics communication, but I looked 
into the detailed log of a single executor, as 'training-log2' shows. 
You can see that in the single executor these steps take 8 - 9 min. 

I persist all the data in memory, as shown in 'training-log3'.

I also looked into the source of the GBDT implementation in Spark and found 
that the time complexity of finding the first split is O(K * N), the same as 
the implementation in xgboost. So I am asking how I can accelerate the training 
of GBDT with Spark.
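
For reference, here is a minimal sketch of the training setup described above 
(the input path is hypothetical; parameters follow the pyspark.mllib API):

{code}
from pyspark import SparkContext, StorageLevel
from pyspark.mllib.util import MLUtils
from pyspark.mllib.tree import GradientBoostedTrees

sc = SparkContext(appName="gbdt-ctr")
# ~18.6M records with ~28.9K features, persisted in memory as described.
data = MLUtils.loadLibSVMFile(sc, "hdfs:///path/to/ctr-data")  # hypothetical path
data.persist(StorageLevel.MEMORY_ONLY)

model = GradientBoostedTrees.trainClassifier(
    data, categoricalFeaturesInfo={}, numIterations=200, maxDepth=7)
{code}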

> Gradient boosted trees: too slow at the first finding of best splits
> -
>
> Key: SPARK-12272
> URL: https://issues.apache.org/jira/browse/SPARK-12272
> Project: Spark
>  Issue Type: Request
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: Wenmin Wu
> Attachments: training-log1.png, training-log2.pnd.png, 
> training-log3.png
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12272) Gradient boosted trees: too slow at the first finding of best splits

2015-12-11 Thread Wenmin Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenmin Wu updated SPARK-12272:
--
Attachment: training-log3.png

> Gradient boosted trees: too slow at the first finding of best splits
> -
>
> Key: SPARK-12272
> URL: https://issues.apache.org/jira/browse/SPARK-12272
> Project: Spark
>  Issue Type: Request
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: Wenmin Wu
> Attachments: training-log1.png, training-log2.pnd.png, 
> training-log3.png
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12183) Remove spark.mllib tree, forest implementations and use spark.ml

2015-12-11 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15053823#comment-15053823
 ] 

Joseph K. Bradley commented on SPARK-12183:
---

Lower priority than both, really.  This is more of a clean-up task.  We could 
still improve the spark.ml code without doing this task, and GBT can be handled 
as a separate JIRA.  I'd say moving GBT code to spark.ml is higher priority 
than this since that is blocking adding more output columns to GBTs 
(rawPrediction, probability).

> Remove spark.mllib tree, forest implementations and use spark.ml
> 
>
> Key: SPARK-12183
> URL: https://issues.apache.org/jira/browse/SPARK-12183
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> This JIRA is for replacing the spark.mllib decision tree and random forest 
> implementations with the one from spark.ml.  The spark.ml one should be used 
> as a wrapper.  This should involve moving the implementation, but should 
> probably not require changing the tests (much).
> This blocks on 1 improvement to spark.mllib which needs to be ported to 
> spark.ml: [SPARK-10064]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-11937) Python API coverage check found issues for ML during 1.6 QA

2015-12-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-11937:
--
Comment: was deleted

(was: User 'holdenk' has created a pull request for this issue:
https://github.com/apache/spark/pull/10085)

> Python API coverage check found issues for ML during 1.6 QA
> ---
>
> Key: SPARK-11937
> URL: https://issues.apache.org/jira/browse/SPARK-11937
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, ML, MLlib, PySpark
>Reporter: Yanbo Liang
>
> Here is the todo list of issues found during SPARK-11604:
> Note: I did not list the SparkR-related features (such as 
> ml.feature.Interaction). We have supported RFormula as a wrapper on the 
> Python side; I think we should discuss whether it is necessary to support 
> other R-related features on the Python side.
> * Missing classes
> ** ml.attribute SPARK-8516
> ** ml.feature 
> *** QuantileDiscretizer SPARK-11922
> *** ChiSqSelector SPARK-11923
> ** ml.classification
> *** OneVsRest SPARK-7861
> ** ml.clustering 
> *** LDA SPARK-11940
> ** mllib.clustering
> *** BisectingKMeans SPARK-11944
> * Missing methods/parameters SPARK-11938
> ** ml.classification SPARK-11815 SPARK-11820 
> ** ml.feature SPARK-11925
> ** ml.clustering SPARK-11945
> ** mllib.linalg SPARK-12040 SPARK-12041
> ** mllib.stat.test.StreamingTest SPARK-12042
> * Docs:
> ** ml.classification SPARK-11875



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-6725) Model export/import for Pipeline API

2015-12-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6725:
-
Comment: was deleted

(was: User 'anabranch' has created a pull request for this issue:
https://github.com/apache/spark/pull/10179)

> Model export/import for Pipeline API
> 
>
> Key: SPARK-6725
> URL: https://issues.apache.org/jira/browse/SPARK-6725
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Affects Versions: 1.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> This is an umbrella JIRA for adding model export/import to the spark.ml API.  
> This JIRA is for adding the internal Saveable/Loadable API and Parquet-based 
> format, not for other formats like PMML.
> This will require the following steps:
> * Add export/import for all PipelineStages supported by spark.ml
> ** This will include some Transformers which are not Models.
> ** These can use almost the same format as the spark.mllib model save/load 
> functions, but the model metadata must store a different class name (marking 
> the class as a spark.ml class).
> * After all PipelineStages support save/load, add an interface which forces 
> future additions to support save/load.
> *UPDATE*: In spark.ml, we could save feature metadata using DataFrames.  
> Other libraries and formats can support this, and it would be great if we 
> could too.  We could do either of the following:
> * save() optionally takes a dataset (or schema), and load will return a 
> (model, schema) pair.
> * Models themselves save the input schema.
> Both options would mean inheriting from new Saveable, Loadable types.
> *UPDATE: DESIGN DOC*: Here's a design doc which I wrote.  If you have 
> comments about the planned implementation, please comment in this JIRA.  
> Thanks!  
> [https://docs.google.com/document/d/1RleM4QiKwdfZZHf0_G6FBNaF7_koc1Ui7qfMT1pf4IA/edit?usp=sharing]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12272) Gradient boosted trees: too slow at the first finding of best splits

2015-12-11 Thread Wenmin Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenmin Wu updated SPARK-12272:
--
Attachment: training-log2.pnd.png

> Gradient boosted trees: too slow at the first finding of best splits
> -
>
> Key: SPARK-12272
> URL: https://issues.apache.org/jira/browse/SPARK-12272
> Project: Spark
>  Issue Type: Request
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: Wenmin Wu
> Attachments: training-log1.png, training-log2.pnd.png
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12272) Gradient boosted trees: too slow at the first finding of best splits

2015-12-11 Thread Wenmin Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenmin Wu updated SPARK-12272:
--
Attachment: (was: screenshot-1.png)

> Gradient boosted trees: too slow at the first finding of best splits
> -
>
> Key: SPARK-12272
> URL: https://issues.apache.org/jira/browse/SPARK-12272
> Project: Spark
>  Issue Type: Request
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: Wenmin Wu
> Attachments: training-log1.png
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12272) Gradient boosted trees: too slow at the first finding of best splits

2015-12-11 Thread Wenmin Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenmin Wu updated SPARK-12272:
--
Attachment: screenshot-1.png

> Gradient boosted trees: too slow at the first finding of best splits
> -
>
> Key: SPARK-12272
> URL: https://issues.apache.org/jira/browse/SPARK-12272
> Project: Spark
>  Issue Type: Request
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: Wenmin Wu
> Attachments: screenshot-1.png, training-log1.png
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10931) PySpark ML Models should contain Param values

2015-12-11 Thread Evan Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15053808#comment-15053808
 ] 

Evan Chen commented on SPARK-10931:
---

Hey Joseph,

Thanks for the suggestion. 
I was wondering what model abstraction and getattr method you are referring to.
I modified every model on the Python side to reflect how it is being done on 
the Scala side. 
Let me know what you think.

> PySpark ML Models should contain Param values
> -
>
> Key: SPARK-10931
> URL: https://issues.apache.org/jira/browse/SPARK-10931
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> PySpark spark.ml Models are generally wrappers around Java objects and do not 
> even contain Param values.  This JIRA is for copying the Param values from 
> the Estimator to the model.
> This can likely be solved by modifying Estimator.fit to copy Param values, 
> but should also include proper unit tests.
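> A minimal sketch of the idea (hedged: {{_copyValues}} is the existing Params 
> helper; the exact override point is illustrative):
> {code}
> class Estimator(Params):
>     def fit(self, dataset):
>         model = self._fit(dataset)
>         # Copy Param values from the estimator onto the fitted model:
>         self._copyValues(model)
>         return model
> {code}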



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6270) Standalone Master hangs when streaming job completes and event logging is enabled

2015-12-11 Thread Josh Rosen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15053810#comment-15053810
 ] 

Josh Rosen commented on SPARK-6270:
---

While I think that we should have this discussion about UI reconstruction of 
long-running applications, I think this is orthogonal to the right solution for 
this issue (SPARK-6270). The root problem here, related to the master / cluster 
manager dying, seems to be caused by a design flaw: why is the master 
responsible for serving historical UIs? The standalone history server process 
should have that responsibility, since UI serving might need a lot of memory.

I think the right fix here is to just remove the Master's embedded history 
server; I just don't think it makes sense to assign history server 
responsibilities to the master when it's designed to be a very 
low-resource-use, high-stability, high-resiliency service.

> Standalone Master hangs when streaming job completes and event logging is 
> enabled
> -
>
> Key: SPARK-6270
> URL: https://issues.apache.org/jira/browse/SPARK-6270
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy, Streaming
>Affects Versions: 1.2.0, 1.2.1, 1.3.0, 1.5.1
>Reporter: Tathagata Das
>Priority: Critical
>
> If event logging is enabled, the Spark Standalone Master tries to recreate 
> the web UI of a completed Spark application from its event logs. However, if 
> this event log is huge (e.g. for a Spark Streaming application), the master 
> hangs in its attempt to read and recreate the web UI. This hang makes the 
> whole standalone cluster unusable. 
> The workaround is to disable event logging.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10931) PySpark ML Models should contain Param values

2015-12-11 Thread Evan Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15053808#comment-15053808
 ] 

Evan Chen edited comment on SPARK-10931 at 12/11/15 11:51 PM:
--

Hey Joseph,

Thanks for the suggestion. 
What model abstraction and getattr method are you referring to?
I modified every model on the Python side to reflect how it is being done on 
the Scala side. 
Let me know what you think.


was (Author: evanchen92):
Hey Joseph,

Thanks for the suggestion. 
I was wondering what model abstraction and getattr method you are referring to.
I modified every model on the Python side to reflect how it is being done on 
the Scala side. 
Let me know what you think.

> PySpark ML Models should contain Param values
> -
>
> Key: SPARK-10931
> URL: https://issues.apache.org/jira/browse/SPARK-10931
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> PySpark spark.ml Models are generally wrappers around Java objects and do not 
> even contain Param values.  This JIRA is for copying the Param values from 
> the Estimator to the model.
> This can likely be solved by modifying Estimator.fit to copy Param values, 
> but should also include proper unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12217) Document invalid handling for StringIndexer

2015-12-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-12217:
--
Assignee: Benjamin Fradet  (was: Apache Spark)

> Document invalid handling for StringIndexer
> ---
>
> Key: SPARK-12217
> URL: https://issues.apache.org/jira/browse/SPARK-12217
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Benjamin Fradet
>Assignee: Benjamin Fradet
>Priority: Minor
> Fix For: 1.6.1, 2.0.0
>
>
> Documentation is needed regarding the handling of invalid labels in 
> StringIndexer



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-12217) Document invalid handling for StringIndexer

2015-12-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-12217.
---
   Resolution: Fixed
Fix Version/s: 1.6.1
   2.0.0

Issue resolved by pull request 10257
[https://github.com/apache/spark/pull/10257]

> Document invalid handling for StringIndexer
> ---
>
> Key: SPARK-12217
> URL: https://issues.apache.org/jira/browse/SPARK-12217
> Project: Spark
>  Issue Type: Documentation
>  Components: ML
>Reporter: Benjamin Fradet
>Assignee: Benjamin Fradet
>Priority: Minor
> Fix For: 2.0.0, 1.6.1
>
>
> Documentation is needed regarding the handling of invalid labels in 
> StringIndexer



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11606) ML 1.6 QA: Update user guide for new APIs

2015-12-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11606?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-11606.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

> ML 1.6 QA: Update user guide for new APIs
> -
>
> Key: SPARK-11606
> URL: https://issues.apache.org/jira/browse/SPARK-11606
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
> Fix For: 1.6.0
>
>
> Check the user guide vs. a list of new APIs (classes, methods, data members) 
> to see what items require updates to the user guide.
> For each feature missing user guide doc:
> * Create a JIRA for that feature, and assign it to the author of the feature
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").
> Note: Now that we have algorithms in spark.ml which are not in spark.mllib, 
> we should make subsections for the spark.ml API as needed. We can follow the 
> structure of the spark.mllib user guide.
> * The spark.ml user guide can provide: (a) code examples and (b) info on 
> algorithms which do not exist in spark.mllib.
> * We should not duplicate info in the spark.ml guides. Since spark.mllib is 
> still the primary API, we should provide links to the corresponding 
> algorithms in the spark.mllib user guide for more info.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11606) ML 1.6 QA: Update user guide for new APIs

2015-12-11 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15053790#comment-15053790
 ] 

Joseph K. Bradley commented on SPARK-11606:
---

I'll close this now that [SPARK-12285] contains the remaining open tasks.

> ML 1.6 QA: Update user guide for new APIs
> -
>
> Key: SPARK-11606
> URL: https://issues.apache.org/jira/browse/SPARK-11606
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> Check the user guide vs. a list of new APIs (classes, methods, data members) 
> to see what items require updates to the user guide.
> For each feature missing user guide doc:
> * Create a JIRA for that feature, and assign it to the author of the feature
> * Link it to (a) the original JIRA which introduced that feature ("related 
> to") and (b) to this JIRA ("requires").
> Note: Now that we have algorithms in spark.ml which are not in spark.mllib, 
> we should make subsections for the spark.ml API as needed. We can follow the 
> structure of the spark.mllib user guide.
> * The spark.ml user guide can provide: (a) code examples and (b) info on 
> algorithms which do not exist in spark.mllib.
> * We should not duplicate info in the spark.ml guides. Since spark.mllib is 
> still the primary API, we should provide links to the corresponding 
> algorithms in the spark.mllib user guide for more info.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12295) Manage the memory used by window function

2015-12-11 Thread Davies Liu (JIRA)
Davies Liu created SPARK-12295:
--

 Summary: Manage the memory used by window function
 Key: SPARK-12295
 URL: https://issues.apache.org/jira/browse/SPARK-12295
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu


The buffered rows for a given frame should use UnsafeRow and be stored as pages.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12294) Support UnsafeRow in HiveTableScan

2015-12-11 Thread Davies Liu (JIRA)
Davies Liu created SPARK-12294:
--

 Summary: Support UnsafeRow in HiveTableScan
 Key: SPARK-12294
 URL: https://issues.apache.org/jira/browse/SPARK-12294
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12293) Support UnsafeRow in LocalTableScan

2015-12-11 Thread Davies Liu (JIRA)
Davies Liu created SPARK-12293:
--

 Summary: Support UnsafeRow in LocalTableScan
 Key: SPARK-12293
 URL: https://issues.apache.org/jira/browse/SPARK-12293
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12272) Gradient boosted trees: too slow at the first finding of best splits

2015-12-11 Thread Wenmin Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenmin Wu updated SPARK-12272:
--
Attachment: training-log1.png

> Gradient boosted trees: too slow at the first finding of best splits
> -
>
> Key: SPARK-12272
> URL: https://issues.apache.org/jira/browse/SPARK-12272
> Project: Spark
>  Issue Type: Request
>  Components: MLlib
>Affects Versions: 1.5.2
>Reporter: Wenmin Wu
> Attachments: training-log1.png
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12292) Support UnsafeRow in Generate

2015-12-11 Thread Davies Liu (JIRA)
Davies Liu created SPARK-12292:
--

 Summary: Support UnsafeRow in Generate
 Key: SPARK-12292
 URL: https://issues.apache.org/jira/browse/SPARK-12292
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12215) User guide section for KMeans in spark.ml

2015-12-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-12215:
--
Issue Type: Sub-task  (was: Documentation)
Parent: SPARK-12285

> User guide section for KMeans in spark.ml
> -
>
> Key: SPARK-12215
> URL: https://issues.apache.org/jira/browse/SPARK-12215
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>Assignee: Yu Ishikawa
>
> [~yuu.ishik...@gmail.com] Will you have time to add a user guide section for 
> this?  Thanks in advance!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12291) Support UnsafeRow in BroadcastLeftSemiJoinHash

2015-12-11 Thread Davies Liu (JIRA)
Davies Liu created SPARK-12291:
--

 Summary: Support UnsafeRow in BroadcastLeftSemiJoinHash
 Key: SPARK-12291
 URL: https://issues.apache.org/jira/browse/SPARK-12291
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12290) Change the default value in SparkPlan

2015-12-11 Thread Davies Liu (JIRA)
Davies Liu created SPARK-12290:
--

 Summary: Change the default value in SparkPlan
 Key: SPARK-12290
 URL: https://issues.apache.org/jira/browse/SPARK-12290
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu


supportUnsafeRows = true
supportSafeRows = false  //
outputUnsafeRows = true



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12289) Support UnsafeRow in TakeOrderedAndProject/Limit

2015-12-11 Thread Davies Liu (JIRA)
Davies Liu created SPARK-12289:
--

 Summary: Support UnsafeRow in TakeOrderedAndProject/Limit
 Key: SPARK-12289
 URL: https://issues.apache.org/jira/browse/SPARK-12289
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12288) Support UnsafeRow in Coalesce/Except/Intersect

2015-12-11 Thread Davies Liu (JIRA)
Davies Liu created SPARK-12288:
--

 Summary: Support UnsafeRow in Coalesce/Except/Intersect
 Key: SPARK-12288
 URL: https://issues.apache.org/jira/browse/SPARK-12288
 Project: Spark
  Issue Type: Improvement
Reporter: Davies Liu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12287) Support UnsafeRow in MapPartitions/MapGroups/CoGroup

2015-12-11 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12287?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu updated SPARK-12287:
---
Issue Type: Improvement  (was: Epic)

> Support UnsafeRow in MapPartitions/MapGroups/CoGroup
> 
>
> Key: SPARK-12287
> URL: https://issues.apache.org/jira/browse/SPARK-12287
> Project: Spark
>  Issue Type: Improvement
>Reporter: Davies Liu
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11965) Update user guide for RFormula feature interactions

2015-12-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-11965:
--
Assignee: Yanbo Liang

> Update user guide for RFormula feature interactions
> ---
>
> Key: SPARK-11965
> URL: https://issues.apache.org/jira/browse/SPARK-11965
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>
> Update the user guide for RFormula to cover feature interactions
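
For context, a minimal sketch of what the new section could show (data and
column names are illustrative; {{:}} denotes an interaction term in R formula
syntax):

{code}
import org.apache.spark.ml.feature.RFormula

val dataset = sqlContext.createDataFrame(Seq(
  (7, "US", 18, 1.0),
  (8, "CA", 12, 0.0),
  (9, "NZ", 15, 0.0)
)).toDF("id", "country", "hour", "clicked")

// "country:hour" adds the interaction of the two features;
// "country*hour" would expand to country + hour + country:hour.
val formula = new RFormula()
  .setFormula("clicked ~ country:hour")
  .setFeaturesCol("features")
  .setLabelCol("label")

formula.fit(dataset).transform(dataset).select("features", "label").show()
{code}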



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12287) Support UnsafeRow in MapPartitions/MapGroups/CoGroup

2015-12-11 Thread Davies Liu (JIRA)
Davies Liu created SPARK-12287:
--

 Summary: Support UnsafeRow in MapPartitions/MapGroups/CoGroup
 Key: SPARK-12287
 URL: https://issues.apache.org/jira/browse/SPARK-12287
 Project: Spark
  Issue Type: Epic
Reporter: Davies Liu






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-12286) Support UnsafeRow in all SparkPlan (if possible)

2015-12-11 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-12286:
--

Assignee: Davies Liu

> Support UnsafeRow in all SparkPlan (if possible)
> 
>
> Key: SPARK-12286
> URL: https://issues.apache.org/jira/browse/SPARK-12286
> Project: Spark
>  Issue Type: Epic
>  Components: SQL
>Reporter: Davies Liu
>Assignee: Davies Liu
>
> There are still some SparkPlan operators that do not support UnsafeRow (or do
> not support it well).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12286) Support UnsafeRow in all SparkPlan (if possible)

2015-12-11 Thread Davies Liu (JIRA)
Davies Liu created SPARK-12286:
--

 Summary: Support UnsafeRow in all SparkPlan (if possible)
 Key: SPARK-12286
 URL: https://issues.apache.org/jira/browse/SPARK-12286
 Project: Spark
  Issue Type: Epic
  Components: SQL
Reporter: Davies Liu


There are still some SparkPlan operators that do not support UnsafeRow (or do
not support it well).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11965) Update user guide for RFormula feature interactions

2015-12-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-11965:
--
Issue Type: Sub-task  (was: Documentation)
Parent: SPARK-12285

> Update user guide for RFormula feature interactions
> ---
>
> Key: SPARK-11965
> URL: https://issues.apache.org/jira/browse/SPARK-11965
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>
> Update the user guide for RFormula to cover feature interactions



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-6518) Add example code and user guide for bisecting k-means

2015-12-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-6518:
-
Issue Type: Sub-task  (was: Documentation)
Parent: SPARK-12285

> Add example code and user guide for bisecting k-means
> -
>
> Key: SPARK-6518
> URL: https://issues.apache.org/jira/browse/SPARK-6518
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Reporter: Yu Ishikawa
>Assignee: Yu Ishikawa
>
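
For reference, a minimal sketch of the kind of spark.mllib snippet the new
section could carry (toy data; assumes a SparkContext named {{sc}}):

{code}
import org.apache.spark.mllib.clustering.BisectingKMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.parallelize(Seq(
  Vectors.dense(0.1, 0.1), Vectors.dense(0.3, 0.3),
  Vectors.dense(10.1, 10.1), Vectors.dense(10.3, 10.3)
))

// Bisecting k-means splits clusters top-down until k leaves remain.
val bkm = new BisectingKMeans().setK(2)
val model = bkm.run(data)

println(s"Compute cost: ${model.computeCost(data)}")
model.clusterCenters.zipWithIndex.foreach { case (center, idx) =>
  println(s"Cluster center $idx: $center")
}
{code}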




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11959) Document normal equation solver for ordinary least squares in user guide

2015-12-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-11959:
--
Issue Type: Sub-task  (was: Documentation)
Parent: SPARK-12285

> Document normal equation solver for ordinary least squares in user guide
> 
>
> Key: SPARK-11959
> URL: https://issues.apache.org/jira/browse/SPARK-11959
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>
> Assigning since you wrote the feature, but please reassign as needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11959) Document normal equation solver for ordinary least squares in user guide

2015-12-11 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15053743#comment-15053743
 ] 

Joseph K. Bradley commented on SPARK-11959:
---

[~yanboliang] Will you have time to write this guide section?  If not, please 
let me know.

> Document normal equation solver for ordinary least squares in user guide
> 
>
> Key: SPARK-11959
> URL: https://issues.apache.org/jira/browse/SPARK-11959
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>
> Assigning since you wrote the feature, but please reassign as needed.
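
For reference, the core result the section needs to state, in LaTeX (the
standard ridge-regularized normal equations; whether the Spark solver also
applies instance weights or feature standardization is not covered here):

{code}
\min_w \; \|Xw - y\|_2^2 + \lambda \|w\|_2^2
\quad\Longrightarrow\quad
w = (X^\top X + \lambda I)^{-1} X^\top y
{code}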



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11529) Add section in user guide for StreamingLogisticRegressionWithSGD

2015-12-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-11529:
--
Issue Type: Sub-task  (was: Documentation)
Parent: SPARK-12285

> Add section in user guide for StreamingLogisticRegressionWithSGD
> 
>
> Key: SPARK-11529
> URL: https://issues.apache.org/jira/browse/SPARK-11529
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Reporter: Joseph K. Bradley
>
> [~freeman-lab] Would you be able to do this for 1.6?  Or if there are others 
> who can, could you please ping them?  Thanks!
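
For reference, a minimal sketch of the kind of snippet such a section could
include (the paths and {{numFeatures}} are placeholders; assumes a
StreamingContext named {{ssc}}):

{code}
import org.apache.spark.mllib.classification.StreamingLogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val numFeatures = 3
val trainingData = ssc.textFileStream("/training/data/dir").map(LabeledPoint.parse)
val testData = ssc.textFileStream("/testing/data/dir").map(LabeledPoint.parse)

// The model is updated as each training batch arrives.
val model = new StreamingLogisticRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures))

model.trainOn(trainingData)
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()

ssc.start()
ssc.awaitTermination()
{code}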



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general

2015-12-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-12247:
--
Parent Issue: SPARK-12285  (was: SPARK-8517)

> Documentation for spark.ml's ALS and collaborative filtering in general
> ---
>
> Key: SPARK-12247
> URL: https://issues.apache.org/jira/browse/SPARK-12247
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Affects Versions: 1.5.2
>Reporter: Timothy Hunter
>
> We need to add a section in the documentation about collaborative filtering
> in the DataFrame API:
>  - copy the explanations of collaborative filtering and ALS from spark.mllib
>  - provide an example with spark.ml's ALS (see the sketch below)
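
A minimal sketch of such an example (toy ratings; column names are
illustrative, and a real guide section would load ratings from a file):

{code}
import org.apache.spark.ml.recommendation.ALS

val ratings = sqlContext.createDataFrame(Seq(
  (0, 0, 4.0f), (0, 1, 2.0f), (1, 1, 3.0f), (1, 2, 4.0f), (2, 0, 5.0f)
)).toDF("userId", "movieId", "rating")

val als = new ALS()
  .setMaxIter(5)
  .setRegParam(0.01)
  .setUserCol("userId")
  .setItemCol("movieId")
  .setRatingCol("rating")

// Fit the factorization and score the training ratings.
val model = als.fit(ratings)
model.transform(ratings).show()
{code}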



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12285) MLlib user guide: umbrella for missing sections

2015-12-11 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-12285:
-

 Summary: MLlib user guide: umbrella for missing sections
 Key: SPARK-12285
 URL: https://issues.apache.org/jira/browse/SPARK-12285
 Project: Spark
  Issue Type: Umbrella
  Components: Documentation, ML, MLlib
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley


This is an umbrella for updating the MLlib user/programming guide for new APIs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


