[jira] [Resolved] (SPARK-16107) Group GLM-related methods in generated doc

2016-06-22 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-16107.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13820
[https://github.com/apache/spark/pull/13820]

> Group GLM-related methods in generated doc
> --
>
> Key: SPARK-16107
> URL: https://issues.apache.org/jira/browse/SPARK-16107
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Junyang Qian
>  Labels: starter
> Fix For: 2.0.0
>
>
> Group API docs of spark.glm, glm, predict(GLM), summary(GLM), 
> read/write.ml(GLM) under Rd spark.glm.






[jira] [Resolved] (SPARK-16118) getDropLast is missing in OneHotEncoder

2016-06-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-16118.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13821
[https://github.com/apache/spark/pull/13821]

> getDropLast is missing in OneHotEncoder
> ---
>
> Key: SPARK-16118
> URL: https://issues.apache.org/jira/browse/SPARK-16118
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.5.2, 1.6.1, 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
> Fix For: 2.0.0
>
>
> We forgot the getter of dropLast in OneHotEncoder.






[jira] [Created] (SPARK-16118) getDropLast is missing in OneHotEncoder

2016-06-21 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-16118:
-

 Summary: getDropLast is missing in OneHotEncoder
 Key: SPARK-16118
 URL: https://issues.apache.org/jira/browse/SPARK-16118
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 1.6.1, 1.5.2, 2.0.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


We forgot the getter of dropLast in OneHotEncoder.
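For context, a minimal sketch of what such a getter usually looks like under Spark ML's param pattern. This is a hypothetical illustration only (the trait name is made up and the actual change is pull request 13821 above); it assumes dropLast is an ordinary BooleanParam.

{code}
package org.apache.spark.ml.feature

import org.apache.spark.ml.param.{BooleanParam, Params}

// Hypothetical sketch: Spark ML components normally pair every Param with a
// public getter that reads the current value from the embedded param map.
private[feature] trait HasDropLastSketch extends Params {

  // Param controlling whether the last category is dropped (this part already existed).
  final val dropLast: BooleanParam =
    new BooleanParam(this, "dropLast", "whether to drop the last category")

  // The getter that was missing: look the value up with $(...).
  def getDropLast: Boolean = $(dropLast)
}
{code}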






[jira] [Created] (SPARK-16117) Hide LibSVMFileFormat in public API docs

2016-06-21 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-16117:
-

 Summary: Hide LibSVMFileFormat in public API docs
 Key: SPARK-16117
 URL: https://issues.apache.org/jira/browse/SPARK-16117
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, MLlib
Affects Versions: 2.0.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


LibSVMFileFormat implements the data source for the LIBSVM format. However, users do 
not need to call its APIs to use it, so we should hide it in the public API 
docs. The main issue is that we still need to put the documentation and example 
code somewhere. The proposal is to have a dummy object to hold the 
documentation.
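As a hypothetical sketch of the "dummy object" idea (the object name and package are 
assumptions, not the actual change): the format implementation can be hidden while a 
small public, documentation-only object carries the user-facing docs and example.

{code}
package org.apache.spark.ml.source.libsvm

/**
 * Hypothetical documentation-only holder for the "libsvm" data source. Users never
 * call the implementation class directly; they only use the data source by name:
 *
 * {{{
 *   val df = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
 * }}}
 */
object LibSVMDataSourceDocs
{code}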






[jira] [Closed] (SPARK-16113) Deprecate (or remove) multiclass APIs in ml.LogisticRegression

2016-06-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng closed SPARK-16113.
-
Resolution: Not A Problem

> Deprecate (or remove) multiclass APIs in ml.LogisticRegression
> --
>
> Key: SPARK-16113
> URL: https://issues.apache.org/jira/browse/SPARK-16113
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> Based on the discussion in SPARK-7159, we are going to create a separate 
> class for multinomial logistic regression. So we should deprecate the methods 
> in ml.LogisticRegression that were added for multiclass support.






[jira] [Commented] (SPARK-16113) Deprecate (or remove) multiclass APIs in ml.LogisticRegression

2016-06-21 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342642#comment-15342642
 ] 

Xiangrui Meng commented on SPARK-16113:
---

Just realized that `thresholds` was inherited from 
`ProbabilisticClassifierParams`. So let's keep it to be consistent with other 
classifiers.

> Deprecate (or remove) multiclass APIs in ml.LogisticRegression
> --
>
> Key: SPARK-16113
> URL: https://issues.apache.org/jira/browse/SPARK-16113
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> Based on the discussion in SPARK-7159, we are going to create a separate 
> class for multinomial logistic regression. So we should deprecate the methods 
> in ml.LogisticRegression that were added for multiclass support.






[jira] [Created] (SPARK-16113) Deprecate (or remove) multiclass APIs in ml.LogisticRegression

2016-06-21 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-16113:
-

 Summary: Deprecate (or remove) multiclass APIs in 
ml.LogisticRegression
 Key: SPARK-16113
 URL: https://issues.apache.org/jira/browse/SPARK-16113
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 2.0.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


Based on the discussion in SPARK-7159, we are going to create a separate class 
for multinomial logistic regression. So we should deprecate the methods in 
ml.LogisticRegression that were added for multiclass support.
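Purely to illustrate the mechanism (this issue was later closed as "Not A Problem", so 
the class, method, and message below are made up and do not reflect an actual Spark 
change), deprecating such a setter in Scala looks roughly like this:

{code}
// Hypothetical example of Scala deprecation; not actual Spark code.
class MulticlassSettersSketch {
  private var thresholds: Array[Double] = Array(0.5)

  @deprecated("multiclass support will move to a separate multinomial logistic " +
    "regression class", "2.0.0")
  def setThresholds(value: Array[Double]): this.type = {
    thresholds = value
    this
  }
}
{code}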






[jira] [Comment Edited] (SPARK-16111) Hide SparkOrcNewRecordReader in API docs

2016-06-21 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342498#comment-15342498
 ] 

Xiangrui Meng edited comment on SPARK-16111 at 6/21/16 7:26 PM:


Ping [~rajesh.balamohan]


was (Author: mengxr):
Ping [~rbalamohan]

> Hide SparkOrcNewRecordReader in API docs
> 
>
> Key: SPARK-16111
> URL: https://issues.apache.org/jira/browse/SPARK-16111
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Reporter: Xiangrui Meng
>Priority: Minor
>
> We should exclude SparkOrcNewRecordReader from API docs. Otherwise, it 
> appears on the top of the list in the Scala API doc.






[jira] [Comment Edited] (SPARK-16111) Hide SparkOrcNewRecordReader in API docs

2016-06-21 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342498#comment-15342498
 ] 

Xiangrui Meng edited comment on SPARK-16111 at 6/21/16 7:26 PM:


Ping [~rbalamohan]


was (Author: mengxr):
Ping [~ rbalamohan]

> Hide SparkOrcNewRecordReader in API docs
> 
>
> Key: SPARK-16111
> URL: https://issues.apache.org/jira/browse/SPARK-16111
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Reporter: Xiangrui Meng
>Priority: Minor
>
> We should exclude SparkOrcNewRecordReader from API docs. Otherwise, it 
> appears on the top of the list in the Scala API doc.






[jira] [Commented] (SPARK-16111) Hide SparkOrcNewRecordReader in API docs

2016-06-21 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342498#comment-15342498
 ] 

Xiangrui Meng commented on SPARK-16111:
---

Ping [~ rbalamohan]

> Hide SparkOrcNewRecordReader in API docs
> 
>
> Key: SPARK-16111
> URL: https://issues.apache.org/jira/browse/SPARK-16111
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, SQL
>Reporter: Xiangrui Meng
>Priority: Minor
>
> We should exclude SparkOrcNewRecordReader from API docs. Otherwise, it 
> appears on the top of the list in the Scala API doc.






[jira] [Created] (SPARK-16111) Hide SparkOrcNewRecordReader in API docs

2016-06-21 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-16111:
-

 Summary: Hide SparkOrcNewRecordReader in API docs
 Key: SPARK-16111
 URL: https://issues.apache.org/jira/browse/SPARK-16111
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, SQL
Reporter: Xiangrui Meng
Priority: Minor


We should exclude SparkOrcNewRecordReader from API docs. Otherwise, it appears 
on the top of the list in the Scala API doc.






[jira] [Updated] (SPARK-16086) Python UDF failed when there is no arguments

2016-06-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16086:
--
Fix Version/s: 1.6.2
   1.5.3

> Python UDF failed when there is no arguments
> 
>
> Key: SPARK-16086
> URL: https://issues.apache.org/jira/browse/SPARK-16086
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.2, 1.6.1
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 1.5.3, 1.6.2, 2.0.0
>
>
> {code}
> >>> sqlContext.registerFunction("f", lambda : "a")
> >>> sqlContext.sql("select f()").show()
> {code}
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 171.0 failed 4 times, most recent failure: Lost task 0.3 in stage 171.0 
> (TID 6226, ip-10-0-243-36.us-west-2.compute.internal): 
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File "/databricks/spark/python/pyspark/worker.py", line 111, in main
> process()
>   File "/databricks/spark/python/pyspark/worker.py", line 106, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/databricks/spark/python/pyspark/serializers.py", line 263, in 
> dump_stream
> vs = list(itertools.islice(iterator, batch))
>   File "/databricks/spark/python/pyspark/serializers.py", line 139, in 
> load_stream
> yield self._read_with_length(stream)
>   File "/databricks/spark/python/pyspark/serializers.py", line 164, in 
> _read_with_length
> return self.loads(obj)
>   File "/databricks/spark/python/pyspark/serializers.py", line 422, in loads
> return pickle.loads(obj)
>   File "/databricks/spark/python/pyspark/sql/types.py", line 1159, in 
> return lambda *a: dataType.fromInternal(a)
>   File "/databricks/spark/python/pyspark/sql/types.py", line 568, in 
> fromInternal
> return _create_row(self.names, values)
>   File "/databricks/spark/python/pyspark/sql/types.py", line 1163, in 
> _create_row
> row = Row(*values)
>   File "/databricks/spark/python/pyspark/sql/types.py", line 1210, in __new__
> raise ValueError("No args or kwargs")
> ValueError: (ValueError('No args or kwargs',),  at 
> 0x7f3bbc463320>, ())
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:207)
>   at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
>   at 
> org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1.apply(python.scala:405)
>   at 
> org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1.apply(python.scala:370)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:72)
>   at org.apache.spark.scheduler.Task.run(Task.scala:96)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:222)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}






[jira] [Updated] (SPARK-15741) PySpark Cleanup of _setDefault with seed=None

2016-06-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15741:
--
Assignee: Bryan Cutler

> PySpark Cleanup of _setDefault with seed=None
> -
>
> Key: SPARK-15741
> URL: https://issues.apache.org/jira/browse/SPARK-15741
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Minor
> Fix For: 2.0.0
>
>
> Several places in PySpark ML have Params._setDefault with a seed param equal 
> to {{None}}.  This is unnecessary as it will translate to a {{0}} even though 
> the param has a fixed value based on the hashed class name by default.  
> Currently, the ALS doc test output depends on this happening and would be 
> clearer and more stable if the seed was explicitly set to {{0}}.  These should be 
> cleaned up for stability and consistency.






[jira] [Updated] (SPARK-15741) PySpark Cleanup of _setDefault with seed=None

2016-06-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15741:
--
Target Version/s: 2.0.0

> PySpark Cleanup of _setDefault with seed=None
> -
>
> Key: SPARK-15741
> URL: https://issues.apache.org/jira/browse/SPARK-15741
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
>Priority: Minor
> Fix For: 2.0.0
>
>
> Several places in PySpark ML have Params._setDefault with a seed param equal 
> to {{None}}.  This is unnecessary as it will translate to a {{0}} even though 
> the param has a fixed value based on the hashed class name by default.  
> Currently, the ALS doc test output depends on this happening and would be 
> clearer and more stable if the seed was explicitly set to {{0}}.  These should be 
> cleaned up for stability and consistency.






[jira] [Resolved] (SPARK-15741) PySpark Cleanup of _setDefault with seed=None

2016-06-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-15741.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13672
[https://github.com/apache/spark/pull/13672]

> PySpark Cleanup of _setDefault with seed=None
> -
>
> Key: SPARK-15741
> URL: https://issues.apache.org/jira/browse/SPARK-15741
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Bryan Cutler
>Priority: Minor
> Fix For: 2.0.0
>
>
> Several places in PySpark ML have Params._setDefault with a seed param equal 
> to {{None}}.  This is unnecessary as it will translate to a {{0}} even though 
> the param has a fixed value based on the hashed class name by default.  
> Currently, the ALS doc test output depends on this happening and would be 
> clearer and more stable if the seed was explicitly set to {{0}}.  These should be 
> cleaned up for stability and consistency.






[jira] [Updated] (SPARK-16107) Group GLM-related methods in generated doc

2016-06-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16107:
--
Assignee: Junyang Qian

> Group GLM-related methods in generated doc
> --
>
> Key: SPARK-16107
> URL: https://issues.apache.org/jira/browse/SPARK-16107
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Junyang Qian
>
> spark.glm: spark.glm, glm, predict(GLM), summary(GLM), read/write.ml(GLM)






[jira] [Updated] (SPARK-16107) Group GLM-related methods in generated doc

2016-06-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16107:
--
Labels: starter  (was: )

> Group GLM-related methods in generated doc
> --
>
> Key: SPARK-16107
> URL: https://issues.apache.org/jira/browse/SPARK-16107
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Junyang Qian
>  Labels: starter
>
> spark.glm: spark.glm, glm, predict(GLM), summary(GLM), read/write.ml(GLM)






[jira] [Updated] (SPARK-16107) Group GLM-related methods in generated doc

2016-06-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16107:
--
Description: Group API docs of spark.glm, glm, predict(GLM), summary(GLM), 
read/write.ml(GLM) under Rd spark.glm.  (was: spark.glm: spark.glm, glm, 
predict(GLM), summary(GLM), read/write.ml(GLM))

> Group GLM-related methods in generated doc
> --
>
> Key: SPARK-16107
> URL: https://issues.apache.org/jira/browse/SPARK-16107
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Junyang Qian
>  Labels: starter
>
> Group API docs of spark.glm, glm, predict(GLM), summary(GLM), 
> read/write.ml(GLM) under Rd spark.glm.






[jira] [Updated] (SPARK-16107) Group GLM-related methods in generated doc

2016-06-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16107:
--
Description: spark.glm: spark.glm, glm, predict(GLM), summary(GLM), 
read/write.ml(GLM)

> Group GLM-related methods in generated doc
> --
>
> Key: SPARK-16107
> URL: https://issues.apache.org/jira/browse/SPARK-16107
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>
> spark.glm: spark.glm, glm, predict(GLM), summary(GLM), read/write.ml(GLM)






[jira] [Commented] (SPARK-16107) Group GLM-related methods in generated doc

2016-06-21 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342178#comment-15342178
 ] 

Xiangrui Meng commented on SPARK-16107:
---

ping [~junyangq]

> Group GLM-related methods in generated doc
> --
>
> Key: SPARK-16107
> URL: https://issues.apache.org/jira/browse/SPARK-16107
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>
> spark.glm: spark.glm, glm, predict(GLM), summary(GLM), read/write.ml(GLM)






[jira] [Created] (SPARK-16107) Group GLM-related methods in generated doc

2016-06-21 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-16107:
-

 Summary: Group GLM-related methods in generated doc
 Key: SPARK-16107
 URL: https://issues.apache.org/jira/browse/SPARK-16107
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, SparkR
Affects Versions: 2.0.0
Reporter: Xiangrui Meng









[jira] [Commented] (SPARK-16090) Improve method grouping in SparkR generated docs

2016-06-21 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342173#comment-15342173
 ] 

Xiangrui Meng commented on SPARK-16090:
---

I changed the issue type to umbrella since there could be well-separated 
sub-tasks.

> Improve method grouping in SparkR generated docs
> 
>
> Key: SPARK-16090
> URL: https://issues.apache.org/jira/browse/SPARK-16090
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> This JIRA follows the discussion on 
> https://github.com/apache/spark/pull/13109 to improve method grouping in 
> SparkR generated docs. Having one method per doc page is not an R convention. 
> However, having many methods per doc page would hurt the readability. So a 
> proper grouping would help. Since we use roxygen2 instead of writing Rd files 
> directly, we should consider smaller groups to avoid confusion. 






[jira] [Updated] (SPARK-16090) Improve method grouping in SparkR generated docs

2016-06-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16090:
--
Issue Type: Umbrella  (was: Improvement)

> Improve method grouping in SparkR generated docs
> 
>
> Key: SPARK-16090
> URL: https://issues.apache.org/jira/browse/SPARK-16090
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> This JIRA follows the discussion on 
> https://github.com/apache/spark/pull/13109 to improve method grouping in 
> SparkR generated docs. Having one method per doc page is not an R convention. 
> However, having many methods per doc page would hurt the readability. So a 
> proper grouping would help. Since we use roxygen2 instead of writing Rd files 
> directly, we should consider smaller groups to avoid confusion. 






[jira] [Updated] (SPARK-15177) SparkR 2.0 QA: make SparkR model params and default values consistent with MLlib

2016-06-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15177:
--
Summary: SparkR 2.0 QA: make SparkR model params and default values 
consistent with MLlib  (was: SparkR 2.0 QA: New R APIs and API docs for mllib.R)

> SparkR 2.0 QA: make SparkR model params and default values consistent with 
> MLlib
> 
>
> Key: SPARK-15177
> URL: https://issues.apache.org/jira/browse/SPARK-15177
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Blocker
> Fix For: 2.0.0
>
>
> Audit new public R APIs in mllib.R






[jira] [Updated] (SPARK-15177) SparkR 2.0 QA: New R APIs and API docs for mllib.R

2016-06-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15177:
--
Description: Audit new public R APIs in mllib.R  (was: Audit new public R 
APIs in mllib.R.)

> SparkR 2.0 QA: New R APIs and API docs for mllib.R
> --
>
> Key: SPARK-15177
> URL: https://issues.apache.org/jira/browse/SPARK-15177
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Blocker
> Fix For: 2.0.0
>
>
> Audit new public R APIs in mllib.R






[jira] [Updated] (SPARK-15177) SparkR 2.0 QA: make SparkR model params and default values consistent with MLlib

2016-06-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15177:
--
Description: Make SparkR model params and default values consistent with 
MLlib  (was: Audit new public R APIs in mllib.R)

> SparkR 2.0 QA: make SparkR model params and default values consistent with 
> MLlib
> 
>
> Key: SPARK-15177
> URL: https://issues.apache.org/jira/browse/SPARK-15177
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Blocker
> Fix For: 2.0.0
>
>
> Make SparkR model params and default values consistent with MLlib






[jira] [Updated] (SPARK-15177) SparkR 2.0 QA: New R APIs and API docs for mllib.R

2016-06-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15177:
--
Shepherd: Xiangrui Meng

> SparkR 2.0 QA: New R APIs and API docs for mllib.R
> --
>
> Key: SPARK-15177
> URL: https://issues.apache.org/jira/browse/SPARK-15177
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Blocker
> Fix For: 2.0.0
>
>
> Audit new public R APIs in mllib.R.






[jira] [Resolved] (SPARK-15177) SparkR 2.0 QA: New R APIs and API docs for mllib.R

2016-06-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-15177.
---
  Resolution: Fixed
   Fix Version/s: 2.0.0
Target Version/s: 2.0.0

I marked this JIRA as resolved. The API doc changes would be merged into 
SPARK-16090, which affects how we write API docs.

> SparkR 2.0 QA: New R APIs and API docs for mllib.R
> --
>
> Key: SPARK-15177
> URL: https://issues.apache.org/jira/browse/SPARK-15177
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Blocker
> Fix For: 2.0.0
>
>
> Audit new public R APIs in mllib.R.






[jira] [Updated] (SPARK-15177) SparkR 2.0 QA: New R APIs and API docs for mllib.R

2016-06-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15177:
--
Assignee: Yanbo Liang

> SparkR 2.0 QA: New R APIs and API docs for mllib.R
> --
>
> Key: SPARK-15177
> URL: https://issues.apache.org/jira/browse/SPARK-15177
> Project: Spark
>  Issue Type: Documentation
>  Components: ML, SparkR
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public R APIs in mllib.R.






[jira] [Commented] (SPARK-16071) Not sufficient array size checks to avoid integer overflows in Tungsten

2016-06-21 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15341330#comment-15341330
 ] 

Xiangrui Meng commented on SPARK-16071:
---

[~ding] This JIRA is not only about solving this particular issue. It would be nice 
to make a pass over the implementation of UnsafeArrayData and relevant classes 
to find issues like this and add the size checks early. It would also be nice to 
mention the size limit in the user guide and in error messages.
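A minimal sketch of the kind of early check meant here, with made-up names and limits 
(the real checks live in BufferHolder/UnsafeArrayData and may differ):

{code}
// Hypothetical sketch of an early size check; constants and messages are assumptions.
object ArraySizeCheck {
  // JVM arrays are indexed by Int and carry a small header, so stay a bit under Int.MaxValue.
  private val MaxBytes: Long = Int.MaxValue - 8

  def requiredBytes(numElements: Long, bytesPerElement: Int): Int = {
    val required = numElements * bytesPerElement.toLong
    if (numElements < 0 || required > MaxBytes) {
      // Fail fast with a clear message instead of letting the size overflow into a
      // negative Int and surface later as NegativeArraySizeException.
      throw new UnsupportedOperationException(
        s"Cannot allocate $required bytes for $numElements elements; " +
          s"the maximum supported size is $MaxBytes bytes")
    }
    required.toInt
  }
}
{code}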

> Not sufficient array size checks to avoid integer overflows in Tungsten
> ---
>
> Key: SPARK-16071
> URL: https://issues.apache.org/jira/browse/SPARK-16071
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> Several bugs caused by integer overflows in Tungsten have been found. This 
> JIRA is for taking a final pass before the 2.0 release to reduce potential bugs 
> and issues. We should do at least the following:
> * Raise an exception early instead of throwing NegativeArraySizeException later (which 
> is slow and might cause silent errors)
> * Clearly document the largest array size we support in DataFrames.
> To reproduce one of the issues:
> {code}
> val n = 1e8.toInt // try 2e8, 3e8
> sc.parallelize(0 until 1, 1).map(i => new 
> Array[Int](n)).toDS.map(_.size).show()
> {code}
> Result:
> * n=1e8: correct but slow (see SPARK-16043)
> * n=2e8: NegativeArraySize exception
> {code:none}
> java.lang.NegativeArraySizeException
>   at 
> org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:61)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$1$$anonfun$apply$3.apply(ExistingRDD.scala:123)
>   at 
> org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$1$$anonfun$apply$3.apply(ExistingRDD.scala:121)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
>   at org.apache.spark.scheduler.Task.run(Task.scala:85)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}
> * n=3e8: NegativeArraySize exception but raised at a different location
> {code:none}
> java.lang.RuntimeException: Error while encoding: 
> java.lang.NegativeArraySizeException
> newInstance(class org.apache.spark.sql.catalyst.util.GenericArrayData) AS 
> value#108
> +- newInstance(class org.apache.spark.sql.catalyst.util.GenericArrayData)
>+- input[0, [I, true]
>   at 
> org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:257)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:430)
>   at 
> org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:430)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
>   at 
> 

[jira] [Resolved] (SPARK-16045) Spark 2.0 ML.feature: doc update for stopwords and binarizer

2016-06-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-16045.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13375
[https://github.com/apache/spark/pull/13375]

> Spark 2.0 ML.feature: doc update for stopwords and binarizer
> 
>
> Key: SPARK-16045
> URL: https://issues.apache.org/jira/browse/SPARK-16045
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: yuhao yang
>Priority: Minor
> Fix For: 2.0.0
>
>
> 2.0 Audit: Update the documentation for StopWordsRemover (loading stop words) and 
> Binarizer (Vector support)






[jira] [Updated] (SPARK-16045) Spark 2.0 ML.feature: doc update for stopwords and binarizer

2016-06-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16045:
--
Assignee: yuhao yang

> Spark 2.0 ML.feature: doc update for stopwords and binarizer
> 
>
> Key: SPARK-16045
> URL: https://issues.apache.org/jira/browse/SPARK-16045
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: yuhao yang
>Assignee: yuhao yang
>Priority: Minor
> Fix For: 2.0.0
>
>
> 2.0 Audit: Update the documentation for StopWordsRemover (loading stop words) and 
> Binarizer (Vector support)






[jira] [Updated] (SPARK-16045) Spark 2.0 ML.feature: doc update for stopwords and binarizer

2016-06-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16045:
--
Affects Version/s: 2.0.0
 Target Version/s: 2.0.0

> Spark 2.0 ML.feature: doc update for stopwords and binarizer
> 
>
> Key: SPARK-16045
> URL: https://issues.apache.org/jira/browse/SPARK-16045
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.0
>Reporter: yuhao yang
>Assignee: yuhao yang
>Priority: Minor
> Fix For: 2.0.0
>
>
> 2.0 Audit: Update the documentation for StopWordsRemover (loading stop words) and 
> Binarizer (Vector support)






[jira] [Resolved] (SPARK-7751) Add @Since annotation to stable and experimental methods in MLlib

2016-06-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-7751.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

Mark this umbrella as resolved since all sub-tasks are done. Thanks everyone 
for contributing!!

> Add @Since annotation to stable and experimental methods in MLlib
> -
>
> Key: SPARK-7751
> URL: https://issues.apache.org/jira/browse/SPARK-7751
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Minor
>  Labels: starter
> Fix For: 2.0.0
>
>
> This is useful to check whether a feature exists in some version of Spark. 
> This is an umbrella JIRA to track the progress. We want to have -@since tag- 
> @Since annotation for both stable (those without any 
> Experimental/DeveloperApi/AlphaComponent annotations) and experimental 
> methods in MLlib:
> (Do NOT tag private or package private classes or methods, nor local 
> variables and methods.)
> * an example PR for Scala: https://github.com/apache/spark/pull/8309
> We need to dig through the git history to figure out the Spark version in 
> which a method was first introduced. Take `NaiveBayes.setModelType` as 
> an example. We can grep `def setModelType` at the git tags of different versions.
> {code}
> meng@xm:~/src/spark
> $ git show 
> v1.3.0:mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala
>  | grep "def setModelType"
> meng@xm:~/src/spark
> $ git show 
> v1.4.0:mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala
>  | grep "def setModelType"
>   def setModelType(modelType: String): NaiveBayes = {
> {code}
> If there are better ways, please let us know.
> We cannot add all -@since tags- @Since annotation in a single PR, which is 
> hard to review. So we made some subtasks for each package, for example 
> `org.apache.spark.classification`. Feel free to add more sub-tasks for Python 
> and the `spark.ml` package.
> Plan:
> 1. In 1.5, we try to add @Since annotation to all stable/experimental methods 
> under `spark.mllib`.
> 2. Starting from 1.6, we require @Since annotation in all new PRs.
> 3. In 1.6, we try to add @Since annotation to all stable/experimental methods 
> under `spark.ml`, `pyspark.mllib`, and `pyspark.ml`.
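As a concrete illustration of the end result (the class body below is hypothetical; the 
version string comes from the git-tag check shown above, where the method first appears 
in the v1.4.0 tag):

{code}
import org.apache.spark.annotation.Since

// Hypothetical sketch: setModelType first shows up in the v1.4.0 tag, so it gets
// @Since("1.4.0"); stable and experimental methods carry the annotation the same way.
class NaiveBayesSketch {
  private var modelType: String = "multinomial"

  @Since("1.4.0")
  def setModelType(value: String): this.type = {
    modelType = value
    this
  }
}
{code}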






[jira] [Updated] (SPARK-10258) Add @Since annotation to ml.feature

2016-06-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-10258:
--
Shepherd: Nick Pentreath

> Add @Since annotation to ml.feature
> ---
>
> Key: SPARK-10258
> URL: https://issues.apache.org/jira/browse/SPARK-10258
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Xiangrui Meng
>Assignee: Martin Brown
>Priority: Minor
>  Labels: starter
> Fix For: 2.0.0
>
>







[jira] [Resolved] (SPARK-10258) Add @Since annotation to ml.feature

2016-06-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-10258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-10258.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13641
[https://github.com/apache/spark/pull/13641]

> Add @Since annotation to ml.feature
> ---
>
> Key: SPARK-10258
> URL: https://issues.apache.org/jira/browse/SPARK-10258
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML
>Reporter: Xiangrui Meng
>Assignee: Martin Brown
>Priority: Minor
>  Labels: starter
> Fix For: 2.0.0
>
>







[jira] [Commented] (SPARK-16086) Python UDF failed when there is no arguments

2016-06-21 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15341297#comment-15341297
 ] 

Xiangrui Meng commented on SPARK-16086:
---

Reverted the changes in master and branch-2.0 since they broke the builds:

* 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.0-test-sbt-hadoop-2.2/240/console
* 
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.2/1212/consoleFull

> Python UDF failed when there is no arguments
> 
>
> Key: SPARK-16086
> URL: https://issues.apache.org/jira/browse/SPARK-16086
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.2, 1.6.1
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 1.5.3, 1.6.2
>
>
> {code}
> >>> sqlContext.registerFunction("f", lambda : "a")
> >>> sqlContext.sql("select f()").show()
> {code}
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 171.0 failed 4 times, most recent failure: Lost task 0.3 in stage 171.0 
> (TID 6226, ip-10-0-243-36.us-west-2.compute.internal): 
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File "/databricks/spark/python/pyspark/worker.py", line 111, in main
> process()
>   File "/databricks/spark/python/pyspark/worker.py", line 106, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/databricks/spark/python/pyspark/serializers.py", line 263, in 
> dump_stream
> vs = list(itertools.islice(iterator, batch))
>   File "/databricks/spark/python/pyspark/serializers.py", line 139, in 
> load_stream
> yield self._read_with_length(stream)
>   File "/databricks/spark/python/pyspark/serializers.py", line 164, in 
> _read_with_length
> return self.loads(obj)
>   File "/databricks/spark/python/pyspark/serializers.py", line 422, in loads
> return pickle.loads(obj)
>   File "/databricks/spark/python/pyspark/sql/types.py", line 1159, in 
> return lambda *a: dataType.fromInternal(a)
>   File "/databricks/spark/python/pyspark/sql/types.py", line 568, in 
> fromInternal
> return _create_row(self.names, values)
>   File "/databricks/spark/python/pyspark/sql/types.py", line 1163, in 
> _create_row
> row = Row(*values)
>   File "/databricks/spark/python/pyspark/sql/types.py", line 1210, in __new__
> raise ValueError("No args or kwargs")
> ValueError: (ValueError('No args or kwargs',),  at 
> 0x7f3bbc463320>, ())
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:207)
>   at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
>   at 
> org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1.apply(python.scala:405)
>   at 
> org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1.apply(python.scala:370)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:72)
>   at org.apache.spark.scheduler.Task.run(Task.scala:96)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:222)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}




[jira] [Reopened] (SPARK-16086) Python UDF failed when there is no arguments

2016-06-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reopened SPARK-16086:
---

> Python UDF failed when there is no arguments
> 
>
> Key: SPARK-16086
> URL: https://issues.apache.org/jira/browse/SPARK-16086
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.2, 1.6.1
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 1.5.3, 1.6.2
>
>
> {code}
> >>> sqlContext.registerFunction("f", lambda : "a")
> >>> sqlContext.sql("select f()").show()
> {code}
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 171.0 failed 4 times, most recent failure: Lost task 0.3 in stage 171.0 
> (TID 6226, ip-10-0-243-36.us-west-2.compute.internal): 
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File "/databricks/spark/python/pyspark/worker.py", line 111, in main
> process()
>   File "/databricks/spark/python/pyspark/worker.py", line 106, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/databricks/spark/python/pyspark/serializers.py", line 263, in 
> dump_stream
> vs = list(itertools.islice(iterator, batch))
>   File "/databricks/spark/python/pyspark/serializers.py", line 139, in 
> load_stream
> yield self._read_with_length(stream)
>   File "/databricks/spark/python/pyspark/serializers.py", line 164, in 
> _read_with_length
> return self.loads(obj)
>   File "/databricks/spark/python/pyspark/serializers.py", line 422, in loads
> return pickle.loads(obj)
>   File "/databricks/spark/python/pyspark/sql/types.py", line 1159, in 
> return lambda *a: dataType.fromInternal(a)
>   File "/databricks/spark/python/pyspark/sql/types.py", line 568, in 
> fromInternal
> return _create_row(self.names, values)
>   File "/databricks/spark/python/pyspark/sql/types.py", line 1163, in 
> _create_row
> row = Row(*values)
>   File "/databricks/spark/python/pyspark/sql/types.py", line 1210, in __new__
> raise ValueError("No args or kwargs")
> ValueError: (ValueError('No args or kwargs',),  at 
> 0x7f3bbc463320>, ())
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:207)
>   at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
>   at 
> org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1.apply(python.scala:405)
>   at 
> org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1.apply(python.scala:370)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:72)
>   at org.apache.spark.scheduler.Task.run(Task.scala:96)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:222)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}






[jira] [Updated] (SPARK-16086) Python UDF failed when there is no arguments

2016-06-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16086:
--
Fix Version/s: (was: 2.0.0)

> Python UDF failed when there is no arguments
> 
>
> Key: SPARK-16086
> URL: https://issues.apache.org/jira/browse/SPARK-16086
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.5.2, 1.6.1
>Reporter: Davies Liu
>Assignee: Davies Liu
> Fix For: 1.5.3, 1.6.2
>
>
> {code}
> >>> sqlContext.registerFunction("f", lambda : "a")
> >>> sqlContext.sql("select f()").show()
> {code}
> {code}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
> stage 171.0 failed 4 times, most recent failure: Lost task 0.3 in stage 171.0 
> (TID 6226, ip-10-0-243-36.us-west-2.compute.internal): 
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File "/databricks/spark/python/pyspark/worker.py", line 111, in main
> process()
>   File "/databricks/spark/python/pyspark/worker.py", line 106, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/databricks/spark/python/pyspark/serializers.py", line 263, in 
> dump_stream
> vs = list(itertools.islice(iterator, batch))
>   File "/databricks/spark/python/pyspark/serializers.py", line 139, in 
> load_stream
> yield self._read_with_length(stream)
>   File "/databricks/spark/python/pyspark/serializers.py", line 164, in 
> _read_with_length
> return self.loads(obj)
>   File "/databricks/spark/python/pyspark/serializers.py", line 422, in loads
> return pickle.loads(obj)
>   File "/databricks/spark/python/pyspark/sql/types.py", line 1159, in 
> return lambda *a: dataType.fromInternal(a)
>   File "/databricks/spark/python/pyspark/sql/types.py", line 568, in 
> fromInternal
> return _create_row(self.names, values)
>   File "/databricks/spark/python/pyspark/sql/types.py", line 1163, in 
> _create_row
> row = Row(*values)
>   File "/databricks/spark/python/pyspark/sql/types.py", line 1210, in __new__
> raise ValueError("No args or kwargs")
> ValueError: (ValueError('No args or kwargs',),  at 
> 0x7f3bbc463320>, ())
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
>   at 
> org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:207)
>   at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
>   at 
> org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1.apply(python.scala:405)
>   at 
> org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1.apply(python.scala:370)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
>   at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
>   at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:72)
>   at org.apache.spark.scheduler.Task.run(Task.scala:96)
>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:222)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)
> {code}






[jira] [Commented] (SPARK-16090) Improve method grouping in SparkR generated docs

2016-06-21 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15341258#comment-15341258
 ] 

Xiangrui Meng commented on SPARK-16090:
---

For ML methods, I'd like to propose the following grouping:

* spark.glm: spark.glm, glm, predict(GLM), summary(GLM), read/write.ml(GLM)
* spark.naiveBayes: spark.naiveBayes, predict(NB), summary(NB), 
read/write.ml(NB)
* spark.kmeans: spark.kmeans, predict(KM), summary(KM), read/write.ml(KM)
* spark.survreg: spark.survreg, predict(SR), summary(SR), read/write.ml(SR)

Then add a separate doc page for each generic method (predict, summary, 
read/write.ml) and link to the doc pages above via "See also".

> Improve method grouping in SparkR generated docs
> 
>
> Key: SPARK-16090
> URL: https://issues.apache.org/jira/browse/SPARK-16090
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> This JIRA follows the discussion on 
> https://github.com/apache/spark/pull/13109 to improve method grouping in 
> SparkR generated docs. Having one method per doc page is not an R convention. 
> However, having many methods per doc page would hurt the readability. So a 
> proper grouping would help. Since we use roxygen2 instead of writing Rd files 
> directly, we should consider smaller groups to avoid confusion. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16090) Improve method grouping in SparkR generated docs

2016-06-21 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16090:
--
Description: This JIRA follows the discussion on 
https://github.com/apache/spark/pull/13109 to improve method grouping in SparkR 
generated docs. Having one method per doc page is not an R convention. However, 
having many methods per doc page would hurt the readability. So a proper 
grouping would help. Since we use roxygen2 instead of writing Rd files 
directly, we should consider smaller groups to avoid confusion. 

> Improve method grouping in SparkR generated docs
> 
>
> Key: SPARK-16090
> URL: https://issues.apache.org/jira/browse/SPARK-16090
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, SparkR
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> This JIRA follows the discussion on 
> https://github.com/apache/spark/pull/13109 to improve method grouping in 
> SparkR generated docs. Having one method per doc page is not an R convention. 
> However, having many methods per doc page would hurt the readability. So a 
> proper grouping would help. Since we use roxygen2 instead of writing Rd files 
> directly, we should consider smaller groups to avoid confusion. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16090) Improve method grouping in SparkR generated docs

2016-06-21 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-16090:
-

 Summary: Improve method grouping in SparkR generated docs
 Key: SPARK-16090
 URL: https://issues.apache.org/jira/browse/SPARK-16090
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, SparkR
Affects Versions: 2.0.0
Reporter: Xiangrui Meng
Priority: Critical






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16079) PySpark ML classification missing import of DecisionTreeRegressionModel for GBTClassificationModel

2016-06-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-16079.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13787
[https://github.com/apache/spark/pull/13787]

> PySpark ML classification missing import of DecisionTreeRegressionModel for 
> GBTClassificationModel
> --
>
> Key: SPARK-16079
> URL: https://issues.apache.org/jira/browse/SPARK-16079
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Bryan Cutler
> Fix For: 2.0.0
>
>
> In GBTClassificationModel, the overloaded method {{trees}} casts the 
> DecisionTree to a DecisionTreeRegressionModel; however, the import for this 
> class is missing and leads to a {{NameError}}
> {noformat}
> spark/python/pyspark/ml/classification.pyc in trees(self)
> 888 def trees(self):
> 889 """Trees in this ensemble. Warning: These have null parent 
> Estimators."""
> --> 890 return [DecisionTreeRegressionModel(m) for m in 
> list(self._call_java("trees"))]
> 891 
> 892 
> NameError: global name 'DecisionTreeRegressionModel' is not defined
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16079) PySpark ML classification missing import of DecisionTreeRegressionModel for GBTClassificationModel

2016-06-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16079:
--
Assignee: Bryan Cutler

> PySpark ML classification missing import of DecisionTreeRegressionModel for 
> GBTClassificationModel
> --
>
> Key: SPARK-16079
> URL: https://issues.apache.org/jira/browse/SPARK-16079
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Bryan Cutler
>Assignee: Bryan Cutler
> Fix For: 2.0.0
>
>
> In GBTClassificationModel, the overloaded method {{trees}} casts the 
> DecisionTree to a DecisionTreeRegressionModel; however, the import for this 
> class is missing and leads to a {{NameError}}
> {noformat}
> spark/python/pyspark/ml/classification.pyc in trees(self)
> 888 def trees(self):
> 889 """Trees in this ensemble. Warning: These have null parent 
> Estimators."""
> --> 890 return [DecisionTreeRegressionModel(m) for m in 
> list(self._call_java("trees"))]
> 891 
> 892 
> NameError: global name 'DecisionTreeRegressionModel' is not defined
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16079) PySpark ML classification missing import of DecisionTreeRegressionModel for GBTClassificationModel

2016-06-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16079:
--
Affects Version/s: 2.0.0

> PySpark ML classification missing import of DecisionTreeRegressionModel for 
> GBTClassificationModel
> --
>
> Key: SPARK-16079
> URL: https://issues.apache.org/jira/browse/SPARK-16079
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Bryan Cutler
>
> In GBTClassificationModel, the overloaded method {{trees}} casts the 
> DecisionTree to a DecisionTreeRegressionModel; however, the import for this 
> class is missing and leads to a {{NameError}}
> {noformat}
> spark/python/pyspark/ml/classification.pyc in trees(self)
> 888 def trees(self):
> 889 """Trees in this ensemble. Warning: These have null parent 
> Estimators."""
> --> 890 return [DecisionTreeRegressionModel(m) for m in 
> list(self._call_java("trees"))]
> 891 
> 892 
> NameError: global name 'DecisionTreeRegressionModel' is not defined
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16074) Expose VectorUDT/MatrixUDT in a public API

2016-06-20 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15340642#comment-15340642
 ] 

Xiangrui Meng commented on SPARK-16074:
---

Picked option 2) because we don't have any Java source code in MLlib. The 
overhead for Java users is the extra `()`.
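
For illustration, here is a minimal sketch of what option 2 could look like 
(names are hypothetical, not a final API): publish the UDT instances behind 
defs typed as DataType, so callers never reference the private 
VectorUDT/MatrixUDT classes directly.

{code}
// Hypothetical sketch of option 2; assumes this object lives in a package
// where the package-private VectorUDT/MatrixUDT classes are visible.
import org.apache.spark.sql.types.{DataType, StructField, StructType}

object MLDataTypes {
  // only DataType leaks out; the UDT classes themselves stay private
  def vectorType: DataType = new VectorUDT()
  def matrixType: DataType = new MatrixUDT()
}

// A custom transformer's transformSchema could then avoid reflection:
def transformSchema(schema: StructType): StructType =
  StructType(schema.fields :+
    StructField("features", MLDataTypes.vectorType, nullable = false))
{code}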

> Expose VectorUDT/MatrixUDT in a public API
> --
>
> Key: SPARK-16074
> URL: https://issues.apache.org/jira/browse/SPARK-16074
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> Both VectorUDT and MatrixUDT are private APIs, because UserDefinedType itself 
> is private in Spark. However, in order to let developers implement their own 
> transformers and estimators, we should expose both types in a public API to 
> simply the implementation of transformSchema, transform, etc. Otherwise, they 
> need to get the data types using reflection.
> Note that this doesn't mean to expose VectorUDT/MatrixUDT classes. We can 
> just have a method or a static value that returns VectorUDT/MatrixUDT 
> instance with DataType as the return type. There are two ways to implement 
> this:
> 1. Follow DataTypes.java in SQL, so Java users don't need the extra "()".
> 2. Define DataTypes in Scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-16074) Expose VectorUDT/MatrixUDT in a public API

2016-06-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-16074:
-

Assignee: Xiangrui Meng

> Expose VectorUDT/MatrixUDT in a public API
> --
>
> Key: SPARK-16074
> URL: https://issues.apache.org/jira/browse/SPARK-16074
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>Priority: Critical
>
> Both VectorUDT and MatrixUDT are private APIs, because UserDefinedType itself 
> is private in Spark. However, in order to let developers implement their own 
> transformers and estimators, we should expose both types in a public API to 
> simplify the implementation of transformSchema, transform, etc. Otherwise, they 
> need to get the data types using reflection.
> Note that this doesn't mean to expose VectorUDT/MatrixUDT classes. We can 
> just have a method or a static value that returns VectorUDT/MatrixUDT 
> instance with DataType as the return type. There are two ways to implement 
> this:
> 1. Follow DataTypes.java in SQL, so Java users don't need the extra "()".
> 2. Define DataTypes in Scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16075) Make VectorUDT/MatrixUDT singleton under spark.ml package

2016-06-20 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15340419#comment-15340419
 ] 

Xiangrui Meng commented on SPARK-16075:
---

I'm not sure whether we should make this change in 2.0. It is not a trivial 
change, though it brings some benefits.

> Make VectorUDT/MatrixUDT singleton under spark.ml package
> -
>
> Key: SPARK-16075
> URL: https://issues.apache.org/jira/browse/SPARK-16075
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> Both VectorUDT and MatrixUDT are implemented as normal classes, so there 
> could be multiple instances of each, which makes equality checking and 
> pattern matching harder to implement. Even though the APIs are private, 
> switching to a singleton pattern could simplify development.
> Required changes:
> * singleton VectorUDT/MatrixUDT (created by VectorUDT.getOrCreate)
> * update UDTRegistration
> * update code generation to support singleton UDTs
> * update existing code to use getOrCreate



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16075) Make VectorUDT/MatrixUDT singleton under spark.ml package

2016-06-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16075:
--
Description: 
Both VectorUDT and MatrixUDT are implemented as normal classes, so there could 
be multiple instances of each, which makes equality checking and pattern 
matching harder to implement. Even though the APIs are private, switching to a 
singleton pattern could simplify development.

Required changes:
* singleton VectorUDT/MatrixUDT (created by VectorUDT.getOrCreate)
* update UDTRegistration
* update code generation to support singleton UDTs
* update existing code to use getOrCreate
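
For reference, a minimal self-contained sketch of the singleton/getOrCreate 
pattern described above (class and member names are illustrative only; the 
real VectorUDT/MatrixUDT also carry sqlType/serialize/deserialize logic):

{code}
// Illustrative sketch of the singleton-with-getOrCreate idea.
class ExampleUDT private () {
  // UDT-specific members (sqlType, serialize, deserialize, ...) would live here.
}

object ExampleUDT {
  private lazy val instance = new ExampleUDT()
  // Always returns the same object, so equality checks and pattern matching
  // can rely on reference equality (`eq`) instead of structural comparison.
  def getOrCreate: ExampleUDT = instance
}

// Usage: the two lookups are reference-equal.
// ExampleUDT.getOrCreate eq ExampleUDT.getOrCreate  // true
{code}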

  was:
Both VectorUDT and MatrixUDT are implemented as normal classes, so there could 
be multiple instances of each, which makes equality checking and pattern 
matching harder to implement. Even though the APIs are private, switching to a 
singleton pattern could simplify development.

Required changes:
* singleton VectorUDT/MatrixUDT
* add UDTFactory trait with getOrCreate to return the singleton instance
* update UDTRegistration
* update code generation to support UDTFactory


> Make VectorUDT/MatrixUDT singleton under spark.ml package
> -
>
> Key: SPARK-16075
> URL: https://issues.apache.org/jira/browse/SPARK-16075
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> Both VectorUDT and MatrixUDT are implemented as normal classes, so there 
> could be multiple instances of each, which makes equality checking and 
> pattern matching harder to implement. Even though the APIs are private, 
> switching to a singleton pattern could simplify development.
> Required changes:
> * singleton VectorUDT/MatrixUDT (created by VectorUDT.getOrCreate)
> * update UDTRegistration
> * update code generation to support singleton UDTs
> * update existing code to use getOrCreate



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16074) Expose VectorUDT/MatrixUDT in a public API

2016-06-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16074:
--
Description: 
Both VectorUDT and MatrixUDT are private APIs, because UserDefinedType itself 
is private in Spark. However, in order to let developers implement their own 
transformers and estimators, we should expose both types in a public API to 
simplify the implementation of transformSchema, transform, etc. Otherwise, they 
need to get the data types using reflection.

Note that this doesn't mean to expose VectorUDT/MatrixUDT classes. We can just 
have a method or a static value that returns VectorUDT/MatrixUDT instance with 
DataType as the return type. There are two ways to implement this:
1. Follow DataTypes.java in SQL, so Java users don't need the extra "()".
2. Define DataTypes in Scala.

  was:
Both VectorUDT and MatrixUDT are private APIs, because UserDefinedType itself 
is private in Spark. However, in order to let developers implement their own 
transformers and estimators, we should expose both types in a public API to 
simplify the implementation of transformSchema, transform, etc. Otherwise, they 
need to get the data types using reflection.

Note that this doesn't mean to expose VectorUDT/MatrixUDT classes. We can just 
have a method or a static value that returns VectorUDT/MatrixUDT instance with 
DataType as the return type.


> Expose VectorUDT/MatrixUDT in a public API
> --
>
> Key: SPARK-16074
> URL: https://issues.apache.org/jira/browse/SPARK-16074
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> Both VectorUDT and MatrixUDT are private APIs, because UserDefinedType itself 
> is private in Spark. However, in order to let developers implement their own 
> transformers and estimators, we should expose both types in a public API to 
> simplify the implementation of transformSchema, transform, etc. Otherwise, they 
> need to get the data types using reflection.
> Note that this doesn't mean to expose VectorUDT/MatrixUDT classes. We can 
> just have a method or a static value that returns VectorUDT/MatrixUDT 
> instance with DataType as the return type. There are two ways to implement 
> this:
> 1. Follow DataTypes.java in SQL, so Java users don't need the extra "()".
> 2. Define DataTypes in Scala.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16075) Make VectorUDT/MatrixUDT singleton under spark.ml package

2016-06-20 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-16075:
-

 Summary: Make VectorUDT/MatrixUDT singleton under spark.ml package
 Key: SPARK-16075
 URL: https://issues.apache.org/jira/browse/SPARK-16075
 Project: Spark
  Issue Type: Improvement
  Components: MLlib
Affects Versions: 2.0.0
Reporter: Xiangrui Meng
Assignee: Xiangrui Meng


Both VectorUDT and MatrixUDT are implemented as normal classes, so there could 
be multiple instances of each, which makes equality checking and pattern 
matching harder to implement. Even though the APIs are private, switching to a 
singleton pattern could simplify development.

Required changes:
* singleton VectorUDT/MatrixUDT
* add UDTFactory trait with getOrCreate to return the singleton instance
* update UDTRegistration
* update code generation to support UDTFactory



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16074) Expose VectorUDT/MatrixUDT in a public API

2016-06-20 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-16074:
-

 Summary: Expose VectorUDT/MatrixUDT in a public API
 Key: SPARK-16074
 URL: https://issues.apache.org/jira/browse/SPARK-16074
 Project: Spark
  Issue Type: New Feature
  Components: MLlib
Affects Versions: 2.0.0
Reporter: Xiangrui Meng
Priority: Critical


Both VectorUDT and MatrixUDT are private APIs, because UserDefinedType itself 
is private in Spark. However, in order to let developers implement their own 
transformers and estimators, we should expose both types in a public API to 
simplify the implementation of transformSchema, transform, etc. Otherwise, they 
need to get the data types using reflection.

Note that this doesn't mean to expose VectorUDT/MatrixUDT classes. We can just 
have a method or a static value that returns VectorUDT/MatrixUDT instance with 
DataType as the return type.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16073) Performance of Parquet encodings on saving primitive arrays

2016-06-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16073:
--
Description: 
Spark supports both uncompressed and compressed (snappy, gzip, lzo) Parquet 
data. However, Parquet also has its own encodings to compress columns/arrays, 
e.g., dictionary encoding: 
https://github.com/apache/parquet-format/blob/master/Encodings.md.

It might be worth checking the performance overhead of Parquet encodings on 
saving large primitive arrays, which is a machine learning use case. If the 
overhead is significant, we should expose a configuration in Spark to control 
the encoding levels.

Note that this shouldn't be tested under Spark until SPARK-16043 is fixed.
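
As a starting point, here is a rough timing sketch (not from this JIRA) for 
comparing Parquet writes of array-heavy data. It assumes a Spark 2.0 shell 
where {{spark}} and {{sc}} are in scope and that the standard Parquet property 
{{parquet.enable.dictionary}} is honored for toggling dictionary encoding.

{code}
// Rough benchmark sketch; assumptions noted above.
import spark.implicits._

// DataFrame whose single column holds large primitive arrays.
val df = sc.parallelize(0 until 100, 10)
  .map(i => Array.fill(1000000)(i))
  .toDF("values")

def timeWrite(path: String): Long = {
  val start = System.nanoTime()
  df.write.mode("overwrite").parquet(path)
  (System.nanoTime() - start) / 1000000L  // elapsed millis
}

// Isolate the encoding cost from general-purpose compression.
spark.conf.set("spark.sql.parquet.compression.codec", "uncompressed")

sc.hadoopConfiguration.setBoolean("parquet.enable.dictionary", false)
val plainMs = timeWrite("/tmp/arrays_plain")

sc.hadoopConfiguration.setBoolean("parquet.enable.dictionary", true)
val dictMs = timeWrite("/tmp/arrays_dict")

println(s"plain encoding: $plainMs ms, dictionary encoding: $dictMs ms")
{code}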

  was:
Spark supports both uncompressed and compressed (snappy, gzip, lzo) Parquet 
data. However, Parquet also has its own encodings to compress columns/arrays, 
e.g., dictionary encoding: 
https://github.com/apache/parquet-format/blob/master/Encodings.md.

It might be worth checking the performance overhead of Parquet encodings for 
saving primitive arrays, which is a machine learning use case. Note that this 
shouldn't be tested under Spark until SPARK-16043 is fixed.


> Performance of Parquet encodings on saving primitive arrays
> ---
>
> Key: SPARK-16073
> URL: https://issues.apache.org/jira/browse/SPARK-16073
> Project: Spark
>  Issue Type: Task
>  Components: MLlib, SQL
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>
> Spark supports both uncompressed and compressed (snappy, gzip, lzo) Parquet 
> data. However, Parquet also has its own encodings to compress columns/arrays, 
> e.g., dictionary encoding: 
> https://github.com/apache/parquet-format/blob/master/Encodings.md.
> It might be worth checking the performance overhead of Parquet encodings on 
> saving large primitive arrays, which is a machine learning use case. If the 
> overhead is significant, we should expose a configuration in Spark to control 
> the encoding levels.
> Note that this shouldn't be tested under Spark until SPARK-16043 is fixed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16073) Performance of Parquet encodings on saving primitive arrays

2016-06-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16073:
--
Description: 
Spark supports both uncompressed and compressed (snappy, gzip, lzo) Parquet 
data. However, Parquet also has its own encodings to compress columns/arrays, 
e.g., dictionary encoding: 
https://github.com/apache/parquet-format/blob/master/Encodings.md.

It might be worth checking the performance overhead of Parquet encodings for 
saving primitive arrays, which is a machine learning use case. Note that this 
shouldn't be tested under Spark until SPARK-16043 is fixed.

> Performance of Parquet encodings on saving primitive arrays
> ---
>
> Key: SPARK-16073
> URL: https://issues.apache.org/jira/browse/SPARK-16073
> Project: Spark
>  Issue Type: Task
>  Components: MLlib, SQL
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>
> Spark supports both uncompressed and compressed (snappy, gzip, lzo) Parquet 
> data. However, Parquet also has its own encodings to compress columns/arrays, 
> e.g., dictionary encoding: 
> https://github.com/apache/parquet-format/blob/master/Encodings.md.
> It might be worth checking the performance overhead of Parquet encodings for 
> saving primitive arrays, which is a machine learning use case. Note that this 
> shouldn't be tested under Spark until SPARK-16043 is fixed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16073) Performance of Parquet encodings on saving primitive arrays

2016-06-20 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-16073:
-

 Summary: Performance of Parquet encodings on saving primitive 
arrays
 Key: SPARK-16073
 URL: https://issues.apache.org/jira/browse/SPARK-16073
 Project: Spark
  Issue Type: Task
  Components: MLlib, SQL
Affects Versions: 2.0.0
Reporter: Xiangrui Meng






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16070) DataFrame/Parquet issues with primitive arrays

2016-06-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16070:
--
Description: 
I created this umbrella JIRA to track DataFrame/Parquet issues with primitive 
arrays. This is mostly related to machine learning use cases, where feature 
indices/values are stored as (usually large) primitive arrays.

Issues:
* SPARK-16043: Tungsten array data is not specialized for primitive types
* SPARK-16071: Not sufficient array size checks 
([NegativeArraySizeException|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20NegativeArraySizeException]
 or silent errors)
* SPARK-16073: Performance of Parquet encodings on saving primitive arrays

  was:
I created this umbrella JIRA to track DataFrame/Parquet issues with primitive 
arrays. This is mostly related to machine learning use cases, where feature 
indices/values are stored as (usually large) primitive arrays.

Issues:
* SPARK-16043: Tungsten array data is not specialized for primitive types
* SPARK-16071: Not sufficient array size checks 
([NegativeArraySizeException|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20NegativeArraySizeException]
 or silent errors)
* Performance of Parquet encodings on saving primitive arrays


> DataFrame/Parquet issues with primitive arrays
> --
>
> Key: SPARK-16070
> URL: https://issues.apache.org/jira/browse/SPARK-16070
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib, SQL
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>
> I created this umbrella JIRA to track DataFrame/Parquet issues with primitive 
> arrays. This is mostly related to machine learning use cases, where feature 
> indices/values are stored as (usually large) primitive arrays.
> Issues:
> * SPARK-16043: Tungsten array data is not specialized for primitive types
> * SPARK-16071: Not sufficient array size checks 
> ([NegativeArraySizeException|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20NegativeArraySizeException]
>  or silent errors)
> * SPARK-16073: Performance of Parquet encodings on saving primitive arrays



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16071) Not sufficient array size checks to avoid integer overflows in Tungsten

2016-06-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16071:
--
Description: 
Several bugs have been found caused by integer overflows in Tungsten. This JIRA 
is for taking a final pass before 2.0 release to reduce potential bugs and 
issues. We should do at least the following:

* Raise exception early instead of later throwing NegativeArraySize (which is 
slow and might cause silent errors)
* Document clearly the largest array size we support in DataFrames.

To reproduce one of the issues:

{code}
val n = 1e8.toInt // try 2e8, 3e8
sc.parallelize(0 until 1, 1).map(i => new Array[Int](n)).toDS.map(_.size).show()
{code}

Result:
* n=1e8: correct but slow (see SPARK-16043)
* n=2e8: NegativeArraySize exception

{code:none}
java.lang.NegativeArraySizeException
at 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:61)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$1$$anonfun$apply$3.apply(ExistingRDD.scala:123)
at 
org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$1$$anonfun$apply$3.apply(ExistingRDD.scala:121)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

* n=3e8: NegativeArraySize exception but raised at a different location

{code:none}
java.lang.RuntimeException: Error while encoding: 
java.lang.NegativeArraySizeException
newInstance(class org.apache.spark.sql.catalyst.util.GenericArrayData) AS 
value#108
+- newInstance(class org.apache.spark.sql.catalyst.util.GenericArrayData)
   +- input[0, [I, true]

at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:257)
at 
org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:430)
at 
org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:430)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 

[jira] [Updated] (SPARK-16070) DataFrame/Parquet issues with primitive arrays

2016-06-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16070:
--
Description: 
I created this umbrella JIRA to track DataFrame/Parquet issues with primitive 
arrays. This is mostly related to machine learning use cases, where feature 
indices/values are stored as (usually large) primitive arrays.

Issues:
* SPARK-16043: Tungsten array data is not specialized for primitive types
* SPARK-16071: Not sufficient array size checks 
([NegativeArraySizeException|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20NegativeArraySizeException]
 or silent errors)
* Performance of Parquet encodings on saving primitive arrays

  was:
I created this umbrella JIRA to track DataFrame/Parquet issues with primitive 
arrays. This is mostly related to machine learning use cases, where feature 
indices/values are stored as (usually large) primitive arrays.

Issues:
* SPARK-16043: Tungsten array data is not specialized for primitive types
* Not sufficient array size checks 
([NegativeArraySizeException|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20NegativeArraySizeException]
 or silent errors)
* Performance of Parquet encodings on saving primitive arrays


> DataFrame/Parquet issues with primitive arrays
> --
>
> Key: SPARK-16070
> URL: https://issues.apache.org/jira/browse/SPARK-16070
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib, SQL
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>
> I created this umbrella JIRA to track DataFrame/Parquet issues with primitive 
> arrays. This is mostly related to machine learning use cases, where feature 
> indices/values are stored as (usually large) primitive arrays.
> Issues:
> * SPARK-16043: Tungsten array data is not specialized for primitive types
> * SPARK-16071: Not sufficient array size checks 
> ([NegativeArraySizeException|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20NegativeArraySizeException]
>  or silent errors)
> * Performance of Parquet encodings on saving primitive arrays



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16071) Not sufficient array size checks to avoid integer overflows in Tungsten

2016-06-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16071:
--
Description: 
Several bugs have been found caused by integer overflows in Tungsten. This JIRA 
is for taking a final pass before 2.0 release to reduce potential bugs and 
issues. We should do at least the following:

* Raise exception early instead of later throwing NegativeArraySize (which is 
slow and might cause silent errors)
* Document clearly the largest array size we support in DataFrames.

To reproduce one of the issues:

{code}
val n = 1e8.toInt // try 2e8, 3e8
sc.parallelize(0 until 1, 1).map(i => new Array[Int](n)).toDS.map(_.size).show()
{code}

Result:
* n=1e8: correct but slow (see SPARK-16043)
* n=2e8: NegativeArraySize exception

{code:none}
java.lang.NegativeArraySizeException
at 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:61)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$1$$anonfun$apply$3.apply(ExistingRDD.scala:123)
at 
org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$1$$anonfun$apply$3.apply(ExistingRDD.scala:121)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

* n=3e8: NegativeArraySize exception but raised at a different location

{code:none}
java.lang.RuntimeException: Error while encoding: 
java.lang.NegativeArraySizeException
newInstance(class org.apache.spark.sql.catalyst.util.GenericArrayData) AS 
value#108
+- newInstance(class org.apache.spark.sql.catalyst.util.GenericArrayData)
   +- input[0, [I, true]

at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:257)
at 
org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:430)
at 
org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:430)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 

[jira] [Updated] (SPARK-16071) Not sufficient array size checks to avoid integer overflows in Tungsten

2016-06-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16071:
--
Description: 
Several bugs have been found caused by integer overflows in Tungsten. This JIRA 
is for taking a final pass before 2.0 release to reduce potential bugs and 
issues. We should do at least the following:

* Raise exception early instead of later throwing NegativeArraySize (which is 
slow and might cause silent errors)
* Document clearly the largest array size we support in DataFrames.

To reproduce one of the issues:

{code}
val n = 1e8.toInt // try 2e8, 3e8
sc.parallelize(0 until 1, 1).map(i => new Array[Int](n)).toDS.map(_.size).show()
{code}

Result:
* n=1e8: correct but slow (see SPARK-16043)
* n=2e8: NegativeArraySize exception

{code:none}
java.lang.NegativeArraySizeException
at 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:61)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$1$$anonfun$apply$3.apply(ExistingRDD.scala:123)
at 
org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$1$$anonfun$apply$3.apply(ExistingRDD.scala:121)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

* n=3e8: NegativeArraySize exception but at a different location

{code:none}
java.lang.RuntimeException: Error while encoding: 
java.lang.NegativeArraySizeException
newInstance(class org.apache.spark.sql.catalyst.util.GenericArrayData) AS 
value#108
+- newInstance(class org.apache.spark.sql.catalyst.util.GenericArrayData)
   +- input[0, [I, true]

at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:257)
at 
org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:430)
at 
org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:430)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 

[jira] [Updated] (SPARK-16071) Not sufficient array size checks to avoid integer overflows in Tungsten

2016-06-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16071:
--
Description: 
Several bugs have been found caused by integer overflows in Tungsten. This JIRA 
is for taking a final pass before 2.0 release to reduce potential bugs and 
issues. We should do at least the following:

* Raise exception early instead of NegativeArraySize
* Document clearly the largest array size we support in DataFrames.

To reproduce one of the issues:

{code}
val n = 1e8.toInt // try 2e8, 3e8
sc.parallelize(0 until 1, 1).map(i => new Array[Int](n)).toDS.map(_.size).show()
{code}

Result:
* n=1e8: correct but slow
* n=2e8: NegativeArraySize exception

{code:none}
java.lang.NegativeArraySizeException
at 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:61)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$1$$anonfun$apply$3.apply(ExistingRDD.scala:123)
at 
org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$1$$anonfun$apply$3.apply(ExistingRDD.scala:121)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

* n=3e8: NegativeArraySize exception but at a different location

{code:none}
java.lang.RuntimeException: Error while encoding: 
java.lang.NegativeArraySizeException
newInstance(class org.apache.spark.sql.catalyst.util.GenericArrayData) AS 
value#108
+- newInstance(class org.apache.spark.sql.catalyst.util.GenericArrayData)
   +- input[0, [I, true]

at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:257)
at 
org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:430)
at 
org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:430)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

  was:
Several 

[jira] [Updated] (SPARK-16071) Not sufficient array size checks to avoid integer overflows in Tungsten

2016-06-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16071:
--
Description: 
Several bugs have been found caused by integer overflows in Tungsten. This JIRA 
is for taking a final pass before 2.0 release to reduce potential bugs and 
issues. We should do at least the following:

* Raise exception early instead of NegativeArraySize
* Document clearly the largest array size we support in DataFrames.

To reproduce one of the issues:

{code}
val n = 1e8.toInt // try 2e8, 3e8
sc.parallelize(0 until 1, 1).map(i => new Array[Int](n)).toDS.map(_.size).show()
{code}

Result:
* n=1e8: correct but slow (see SPARK-16043)
* n=2e8: NegativeArraySize exception

{code:none}
java.lang.NegativeArraySizeException
at 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:61)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$1$$anonfun$apply$3.apply(ExistingRDD.scala:123)
at 
org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$1$$anonfun$apply$3.apply(ExistingRDD.scala:121)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

* n=3e8: NegativeArraySize exception but at a different location

{code:none}
java.lang.RuntimeException: Error while encoding: 
java.lang.NegativeArraySizeException
newInstance(class org.apache.spark.sql.catalyst.util.GenericArrayData) AS 
value#108
+- newInstance(class org.apache.spark.sql.catalyst.util.GenericArrayData)
   +- input[0, [I, true]

at 
org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:257)
at 
org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:430)
at 
org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:430)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

[jira] [Updated] (SPARK-16071) Not sufficient array size checks to avoid integer overflows in Tungsten

2016-06-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16071:
--
Description: 
Several bugs have been found caused by integer overflows in Tungsten. This JIRA 
is for taking a final pass before 2.0 release to reduce potential bugs and 
issues. We should do at least the following:

* Raise exception early instead of NegativeArraySize
* Document clearly the largest array size we support in DataFrames.

To reproduce one of the issues:

{code}
val n = 1e8.toInt // try 2e8, 3e8
sc.parallelize(0 until 1, 1).map(i => new Array[Int](n)).toDS.map(_.size).show()
{code}

Result:
* n=1e8: correct but slow
* n=2e8: NegativeArraySize exception

{code:none}
java.lang.NegativeArraySizeException
at 
org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:61)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$1$$anonfun$apply$3.apply(ExistingRDD.scala:123)
at 
org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$1$$anonfun$apply$3.apply(ExistingRDD.scala:121)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
 Source)
at 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at 
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{code}

* n=3e8: NegativeArraySize exception but at a different location
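
For context, a back-of-the-envelope check of where the overflow comes from (a 
sketch only, assuming the unspecialized Tungsten array layout spends roughly 8 
bytes per element and that the write buffer roughly doubles when it grows):

{code}
// Sketch only; assumes ~8 bytes per element and a doubling grow() strategy.
val bytesPerElement = 8L
def bufferBytes(n: Long): Long = n * bytesPerElement

bufferBytes(100000000L)                      // 8e8 bytes: fits in an Int-sized buffer
bufferBytes(200000000L) * 2 > Int.MaxValue   // true: doubling during grow() overflows
bufferBytes(300000000L) > Int.MaxValue       // true: overflows before any doubling
{code}

This is consistent with the observed behavior: n=2e8 fails inside 
BufferHolder.grow, while n=3e8 fails earlier, in the encoder.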

  was:
Several bugs have been found caused by integer overflows in Tungsten. This JIRA 
is for taking a final pass before 2.0 release to reduce potential bugs and 
issues. We should do at least the following:

* Raise exception early instead of NegativeArraySize
* Document clearly the largest array size we support in DataFrames.

To reproduce one of the issues:

{code}
val n = 1e8.toInt // try 2e8, 3e8
sc.parallelize(0 until 1, 1).map(i => new Array[Int](n)).toDS.map(_.size).show()
{code}

Result:
* n=1e8: correct but slow
* n=2e8: NegativeArraySize exception
* n=3e8: NegativeArraySize exception but at a different location


> Not sufficient array size checks to avoid integer overflows in Tungsten
> ---
>
> Key: SPARK-16071
> URL: https://issues.apache.org/jira/browse/SPARK-16071
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Yin Huai
>Priority: Critical
>
> Several bugs have been found caused by integer overflows in Tungsten. This 
> JIRA is for taking a final pass before 2.0 release to reduce potential bugs 
> and issues. We should do at least the following:
> * Raise exception early instead of NegativeArraySize
> * Document clearly the largest array size we support in DataFrames.
> To reproduce one of the issues:
> {code}
> val n = 1e8.toInt // try 2e8, 3e8
> sc.parallelize(0 until 1, 1).map(i => new 
> Array[Int](n)).toDS.map(_.size).show()
> {code}
> Result:
> * n=1e8: correct but slow
> * n=2e8: NegativeArraySize exception
> {code:none}
> java.lang.NegativeArraySizeException
>   at org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:61)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
>   at org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$1$$anonfun$apply$3.apply(ExistingRDD.scala:123)
>   at 
> 

[jira] [Updated] (SPARK-16071) Not sufficient array size checks to avoid integer overflows in Tungsten

2016-06-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16071:
--
Description: 
Several bugs have been found caused by integer overflows in Tungsten. This JIRA 
is for taking a final pass before 2.0 release to reduce potential bugs and 
issues. We should do at least the following:

* Raise exception early instead of NegativeArraySize
* Document clearly the largest array size we support in DataFrames.

To reproduce one of the issues:

{code}
val n = 1e8.toInt // try 2e8, 3e8
sc.parallelize(0 until 1, 1).map(i => new Array[Int](n)).toDS.map(_.size).show()
{code}

Result:
* n=1e8: correct but slow
* n=2e8: NegativeArraySize exception
* n=3e8: NegativeArraySize exception but at a different location

  was:
Several bugs have been found caused by integer overflows in Tungsten. This JIRA 
is for taking a final pass before 2.0 release to reduce potential bugs and 
issues. We should do at least the following:

* Raise exception early instead of NegativeArraySize
* Document clearly the largest array size we support in DataFrames.

To reproduce one of the issues:

{code}
val n = 1e8.toInt // try 2e8, 3e8
sc.parallelize(0 until 1, 1).map(i => new Array[Int](n)).toDS.map(_.size).show()
{code}

Result:
* n=1e8: correct but slow
* n=2e8: NegativeArraySize exception


> Not sufficient array size checks to avoid integer overflows in Tungsten
> ---
>
> Key: SPARK-16071
> URL: https://issues.apache.org/jira/browse/SPARK-16071
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Yin Huai
>Priority: Critical
>
> Several bugs have been found caused by integer overflows in Tungsten. This 
> JIRA is for taking a final pass before 2.0 release to reduce potential bugs 
> and issues. We should do at least the following:
> * Raise exception early instead of NegativeArraySize
> * Document clearly the largest array size we support in DataFrames.
> To reproduce one of the issues:
> {code}
> val n = 1e8.toInt // try 2e8, 3e8
> sc.parallelize(0 until 1, 1).map(i => new 
> Array[Int](n)).toDS.map(_.size).show()
> {code}
> Result:
> * n=1e8: correct but slow
> * n=2e8: NegativeArraySize exception
> * n=3e8: NegativeArraySize exception but at a different location



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
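
The "raise exception early" item in the description above can be sketched as a small size guard that validates the requested length before any allocation. This is an illustrative snippet, not Spark's actual BufferHolder code, and the names are made up:

{code}
// Hypothetical sketch: fail fast with a clear message instead of letting an
// Int overflow surface later as a NegativeArraySizeException.
object ArraySizeGuard {
  // JVM arrays are Int-indexed; keep some headroom below Int.MaxValue.
  val MaxArraySize: Int = Int.MaxValue - 8

  def checkedIntSize(requested: Long): Int = {
    if (requested < 0L || requested > MaxArraySize) {
      throw new IllegalArgumentException(
        s"Cannot allocate an array of $requested elements; " +
          s"the largest supported array size is $MaxArraySize.")
    }
    requested.toInt
  }
}

// Usage sketch: growing a buffer to rows * bytesPerRow bytes.
// val newSize = ArraySizeGuard.checkedIntSize(rows.toLong * bytesPerRow)
// val buffer = new Array[Byte](newSize)
{code}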



[jira] [Updated] (SPARK-16071) Not sufficient array size checks to avoid integer overflows in Tungsten

2016-06-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16071:
--
Assignee: Yin Huai

> Not sufficient array size checks to avoid integer overflows in Tungsten
> ---
>
> Key: SPARK-16071
> URL: https://issues.apache.org/jira/browse/SPARK-16071
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Yin Huai
>Priority: Critical
>
> Several bugs have been found caused by integer overflows in Tungsten. This 
> JIRA is for taking a final pass before 2.0 release to reduce potential bugs 
> and issues. We should do at least the following:
> * Raise exception early instead of NegativeArraySize
> * Document clearly the largest array size we support in DataFrames.
> To reproduce one of the issues:
> {code}
> val n = 1e8.toInt // try 2e8, 3e8
> sc.parallelize(0 until 1, 1).map(i => new 
> Array[Int](n)).toDS.map(_.size).show()
> {code}
> Result:
> * n=1e8: correct but slow
> * n=2e8: NegativeArraySize exception



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16071) Not sufficient array size checks to avoid integer overflows in Tungsten

2016-06-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16071:
--
Description: 
Several bugs have been found caused by integer overflows in Tungsten. This JIRA 
is for taking a final pass before 2.0 release to reduce potential bugs and 
issues. We should do at least the following:

* Raise exception early instead of NegativeArraySize
* Document clearly the largest array size we support in DataFrames.

To reproduce one of the issues:

{code}
val n = 1e8.toInt // try 2e8, 3e8
sc.parallelize(0 until 1, 1).map(i => new Array[Int](n)).toDS.map(_.size).show()
{code}

Result:
* n=1e8: correct but slow
* n=2e8: NegativeArraySize exception

  was:
Several bugs have been found caused by integer overflows in Tungsten. This JIRA 
is for taking a final pass before 2.0 release to reduce potential bugs and 
issues. We should do at least the following:

* Raise exception early instead of NegativeArraySize
* Document clearly the largest array size we support in DataFrames.


> Not sufficient array size checks to avoid integer overflows in Tungsten
> ---
>
> Key: SPARK-16071
> URL: https://issues.apache.org/jira/browse/SPARK-16071
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> Several bugs have been found caused by integer overflows in Tungsten. This 
> JIRA is for taking a final pass before 2.0 release to reduce potential bugs 
> and issues. We should do at least the following:
> * Raise exception early instead of NegativeArraySize
> * Document clearly the largest array size we support in DataFrames.
> To reproduce one of the issues:
> {code}
> val n = 1e8.toInt // try 2e8, 3e8
> sc.parallelize(0 until 1, 1).map(i => new 
> Array[Int](n)).toDS.map(_.size).show()
> {code}
> Result:
> * n=1e8: correct but slow
> * n=2e8: NegativeArraySize exception



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16071) Not sufficient array size checks to avoid integer overflows in Tungsten

2016-06-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16071:
--
Summary: Not sufficient array size checks to avoid integer overflows in 
Tungsten  (was: Not sufficient size checks to avoid integer overflows in 
Tungsten)

> Not sufficient array size checks to avoid integer overflows in Tungsten
> ---
>
> Key: SPARK-16071
> URL: https://issues.apache.org/jira/browse/SPARK-16071
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Priority: Critical
>
> Several bugs have been found caused by integer overflows in Tungsten. This 
> JIRA is for taking a final pass before 2.0 release to reduce potential bugs 
> and issues. We should do at least the following:
> * Raise exception early instead of NegativeArraySize
> * Document clearly the largest array size we support in DataFrames.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16071) Not sufficient size checks to avoid integer overflows in Tungsten

2016-06-20 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-16071:
-

 Summary: Not sufficient size checks to avoid integer overflows in 
Tungsten
 Key: SPARK-16071
 URL: https://issues.apache.org/jira/browse/SPARK-16071
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.0.0
Reporter: Xiangrui Meng
Priority: Critical


Several bugs have been found caused by integer overflows in Tungsten. This JIRA 
is for taking a final pass before 2.0 release to reduce potential bugs and 
issues. We should do at least the following:

* Raise exception early instead of NegativeArraySize
* Document clearly the largest array size we support in DataFrames.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16070) DataFrame/Parquet issues with primitive arrays

2016-06-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16070:
--
Description: 
I created this umbrella JIRA to track DataFrame/Parquet issues with primitive 
arrays. This is mostly related to machine learning use cases, where feature 
indices/values are stored as (usually large) primitive arrays.

Issues:
* SPARK-16043: Tungsten array data is not specialized for primitive types
* Not sufficient array size checks 
([NegativeArraySizeException|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20NegativeArraySizeException]
 or silent errors)
** There 
* Performance of Parquet encodings on saving primitive arrays

  was:
I created this umbrella JIRA to track DataFrame/Parquet issues with primitive 
arrays. This is mostly related to machine learning use cases, where feature 
indices/values are stored as (usually large) primitive arrays.

Issues:
* SPARK-16043: Tungsten array data is not specialized for primitive types
* Not sufficient array size checks 
([[https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20NegativeArraySizeException
 NegativeArraySizeException]] or silent errors)
** There 
* Performance of Parquet encodings on saving primitive arrays


> DataFrame/Parquet issues with primitive arrays
> --
>
> Key: SPARK-16070
> URL: https://issues.apache.org/jira/browse/SPARK-16070
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib, SQL
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>
> I created this umbrella JIRA to track DataFrame/Parquet issues with primitive 
> arrays. This is mostly related to machine learning use cases, where feature 
> indices/values are stored as (usually large) primitive arrays.
> Issues:
> * SPARK-16043: Tungsten array data is not specialized for primitive types
> * Not sufficient array size checks 
> ([NegativeArraySizeException|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20NegativeArraySizeException]
>  or silent errors)
> ** There 
> * Performance of Parquet encodings on saving primitive arrays



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
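
For context, a minimal sketch of the workload shape this umbrella is about: rows carrying large primitive arrays, written to and read back from Parquet. It assumes a spark-shell session where spark is the SparkSession; the sizes and the output path are made up:

{code}
import spark.implicits._   // assumes the spark-shell SparkSession named spark

// 1,000 rows, each holding a primitive Array[Double] of 100,000 values,
// roughly the large feature-array shape used in ML pipelines.
val df = spark.range(1000)
  .map(i => Array.fill(100000)(i.toDouble))
  .toDF("features")

// Write speed and encoding of the array column are what this umbrella tracks.
df.write.mode("overwrite").parquet("/tmp/primitive-arrays")

val loaded = spark.read.parquet("/tmp/primitive-arrays")
loaded.selectExpr("size(features)").show(3)
{code}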



[jira] [Updated] (SPARK-16070) DataFrame/Parquet issues with primitive arrays

2016-06-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16070:
--
Description: 
I created this umbrella JIRA to track DataFrame/Parquet issues with primitive 
arrays. This is mostly related to machine learning use cases, where feature 
indices/values are stored as (usually large) primitive arrays.

Issues:
* SPARK-16043: Tungsten array data is not specialized for primitive types
* Not sufficient array size checks 
([NegativeArraySizeException|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20NegativeArraySizeException]
 or silent errors)
* Performance of Parquet encodings on saving primitive arrays

  was:
I created this umbrella JIRA to track DataFrame/Parquet issues with primitive 
arrays. This is mostly related to machine learning use cases, where feature 
indices/values are stored as (usually large) primitive arrays.

Issues:
* SPARK-16043: Tungsten array data is not specialized for primitive types
* Not sufficient array size checks 
([NegativeArraySizeException|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20NegativeArraySizeException]
 or silent errors)
** There 
* Performance of Parquet encodings on saving primitive arrays


> DataFrame/Parquet issues with primitive arrays
> --
>
> Key: SPARK-16070
> URL: https://issues.apache.org/jira/browse/SPARK-16070
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib, SQL
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>
> I created this umbrella JIRA to track DataFrame/Parquet issues with primitive 
> arrays. This is mostly related to machine learning use cases, where feature 
> indices/values are stored as (usually large) primitive arrays.
> Issues:
> * SPARK-16043: Tungsten array data is not specialized for primitive types
> * Not sufficient array size checks 
> ([NegativeArraySizeException|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20NegativeArraySizeException]
>  or silent errors)
> * Performance of Parquet encodings on saving primitive arrays



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16070) DataFrame/Parquet issues with primitive arrays

2016-06-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16070:
--
Description: 
I created this umbrella JIRA to track DataFrame/Parquet issues with primitive 
arrays. This is mostly related to machine learning use cases, where feature 
indices/values are stored as (usually large) primitive arrays.

Issues:
* SPARK-16043: Tungsten array data is not specialized for primitive types
* Not sufficient array size checks 
([[https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20NegativeArraySizeException
 NegativeArraySizeException]] or silent errors)
** There 
* Performance of Parquet encodings on saving primitive arrays

  was:
I created this umbrella JIRA to track DataFrame/Parquet issues with primitive 
arrays. This is mostly related to machine learning use cases, where feature 
indices/values are stored as (usually large) primitive arrays.

Issues:
* SPARK-16043: Tungsten array data is not specialized for primitive types
* Not sufficient array size checks 
([NegativeArraySizeException](https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20NegativeArraySizeException)
 or silent errors)
** There 
* Performance of Parquet encodings on saving primitive arrays


> DataFrame/Parquet issues with primitive arrays
> --
>
> Key: SPARK-16070
> URL: https://issues.apache.org/jira/browse/SPARK-16070
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib, SQL
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>
> I created this umbrella JIRA to track DataFrame/Parquet issues with primitive 
> arrays. This is mostly related to machine learning use cases, where feature 
> indices/values are stored as (usually large) primitive arrays.
> Issues:
> * SPARK-16043: Tungsten array data is not specialized for primitive types
> * Not sufficient array size checks 
> ([[https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20NegativeArraySizeException
>  NegativeArraySizeException]] or silent errors)
> ** There 
> * Performance of Parquet encodings on saving primitive arrays



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16070) DataFrame/Parquet issues with primitive arrays

2016-06-20 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16070:
--
Description: 
I created this umbrella JIRA to track DataFrame/Parquet issues with primitive 
arrays. This is mostly related to machine learning use cases, where feature 
indices/values are stored as (usually large) primitive arrays.

Issues:
* SPARK-16043: Tungsten array data is not specialized for primitive types
* Not sufficient array size checks 
([NegativeArraySizeException](https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20NegativeArraySizeException)
 or silent errors)
** There 
* Performance of Parquet encodings on saving primitive arrays

  was:I created this umbrella JIRA to track DataFrame/Parquet issues with 
primitive arrays. This is mostly related to machine learning use cases, where 
feature indices/values are stored as primitive arrays.


> DataFrame/Parquet issues with primitive arrays
> --
>
> Key: SPARK-16070
> URL: https://issues.apache.org/jira/browse/SPARK-16070
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib, SQL
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>
> I created this umbrella JIRA to track DataFrame/Parquet issues with primitive 
> arrays. This is mostly related to machine learning use cases, where feature 
> indices/values are stored as (usually large) primitive arrays.
> Issues:
> * SPARK-16043: Tungsten array data is not specialized for primitive types
> * Not sufficient array size checks 
> ([NegativeArraySizeException](https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20NegativeArraySizeException)
>  or silent errors)
> ** There 
> * Performance of Parquet encodings on saving primitive arrays



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16070) DataFrame/Parquet issues with primitive arrays

2016-06-20 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-16070:
-

 Summary: DataFrame/Parquet issues with primitive arrays
 Key: SPARK-16070
 URL: https://issues.apache.org/jira/browse/SPARK-16070
 Project: Spark
  Issue Type: Umbrella
  Components: MLlib, SQL
Affects Versions: 2.0.0
Reporter: Xiangrui Meng


I created this umbrella JIRA to track DataFrame/Parquet issues with primitive 
arrays. This is mostly related to machine learning use cases, where feature 
indices/values are stored as primitive arrays.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16035) The SparseVector parser fails checking for valid end parenthesis

2016-06-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16035:
--
Assignee: Andrea Pasqua

> The SparseVector parser fails checking for valid end parenthesis
> 
>
> Key: SPARK-16035
> URL: https://issues.apache.org/jira/browse/SPARK-16035
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Andrea Pasqua
>Assignee: Andrea Pasqua
>Priority: Minor
> Fix For: 1.6.2, 2.0.0
>
>
> Running
>   SparseVector.parse(' (4, [0,1 ],[ 4.0,5.0] ')
> will not raise an exception as expected, although it parses it as if there 
> was an end parenthesis.
> This can be fixed by replacing
>   if start == -1:
>       raise ValueError("Tuple should end with ')'")
> with
>   if end == -1:
>       raise ValueError("Tuple should end with ')'")
> Please see posted PR



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-16035) The SparseVector parser fails checking for valid end parenthesis

2016-06-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-16035.
---
   Resolution: Fixed
Fix Version/s: 1.6.2
   2.0.0

Issue resolved by pull request 13750
[https://github.com/apache/spark/pull/13750]

> The SparseVector parser fails checking for valid end parenthesis
> 
>
> Key: SPARK-16035
> URL: https://issues.apache.org/jira/browse/SPARK-16035
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Andrea Pasqua
>Priority: Minor
> Fix For: 2.0.0, 1.6.2
>
>
> Running
>   SparseVector.parse(' (4, [0,1 ],[ 4.0,5.0] ')
> will not raise an exception as expected, although it parses it as if there 
> was an end parenthesis.
> This can be fixed by replacing
>   if start == -1:
>       raise ValueError("Tuple should end with ')'")
> with
>   if end == -1:
>       raise ValueError("Tuple should end with ')'")
> Please see posted PR



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15129) Clarify conventions for calling Spark and MLlib from R

2016-06-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-15129.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13285
[https://github.com/apache/spark/pull/13285]

> Clarify conventions for calling Spark and MLlib from R
> --
>
> Key: SPARK-15129
> URL: https://issues.apache.org/jira/browse/SPARK-15129
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML, SparkR
>Reporter: Joseph K. Bradley
>Assignee: Gayathri Murali
>Priority: Blocker
> Fix For: 2.0.0
>
>
> Since some R API modifications happened in 2.0, we need to make the new 
> standards clear in the user guide.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15892) Incorrectly merged AFTAggregator with zero total count

2016-06-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15892:
--
Fix Version/s: 2.0.0

> Incorrectly merged AFTAggregator with zero total count
> --
>
> Key: SPARK-15892
> URL: https://issues.apache.org/jira/browse/SPARK-15892
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, ML, PySpark
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>Assignee: Hyukjin Kwon
> Fix For: 1.6.2, 2.0.0
>
>
> Running the example (after the fix in 
> [https://github.com/apache/spark/pull/13393]) causes this failure:
> {code}
> Traceback (most recent call last):
>   
>   File 
> "/Users/josephkb/spark/examples/src/main/python/ml/aft_survival_regression.py",
>  line 49, in 
> model = aft.fit(training)
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/base.py", 
> line 64, in fit
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", 
> line 213, in _fit
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", 
> line 210, in _fit_java
>   File 
> "/Users/josephkb/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", 
> line 933, in __call__
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", 
> line 79, in deco
> pyspark.sql.utils.IllegalArgumentException: u'requirement failed: The number 
> of instances should be greater than 0.0, but got 0.'
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15892) Incorrectly merged AFTAggregator with zero total count

2016-06-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-15892.
---
   Resolution: Fixed
Fix Version/s: (was: 2.0.0)
   1.6.2

Issue resolved by pull request 13725
[https://github.com/apache/spark/pull/13725]

> Incorrectly merged AFTAggregator with zero total count
> --
>
> Key: SPARK-15892
> URL: https://issues.apache.org/jira/browse/SPARK-15892
> Project: Spark
>  Issue Type: Bug
>  Components: Examples, ML, PySpark
>Affects Versions: 1.6.1, 2.0.0
>Reporter: Joseph K. Bradley
>Assignee: Hyukjin Kwon
> Fix For: 1.6.2
>
>
> Running the example (after the fix in 
> [https://github.com/apache/spark/pull/13393]) causes this failure:
> {code}
> Traceback (most recent call last):
>   
>   File 
> "/Users/josephkb/spark/examples/src/main/python/ml/aft_survival_regression.py",
>  line 49, in 
> model = aft.fit(training)
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/base.py", 
> line 64, in fit
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", 
> line 213, in _fit
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", 
> line 210, in _fit_java
>   File 
> "/Users/josephkb/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", 
> line 933, in __call__
>   File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", 
> line 79, in deco
> pyspark.sql.utils.IllegalArgumentException: u'requirement failed: The number 
> of instances should be greater than 0.0, but got 0.'
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
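
The failure mode reported here, a partial aggregator from an empty partition poisoning the merged result, can be illustrated with a toy aggregator. This is a sketch of the general pattern, not Spark's actual AFTAggregator:

{code}
// Toy sketch: merge() must skip partial aggregates that saw no instances,
// otherwise a requirement such as "count > 0" can fail after merging even
// though the non-empty partitions contributed plenty of data.
class PartialAgg(var count: Long = 0L, var sum: Double = 0.0) extends Serializable {
  def add(x: Double): this.type = { count += 1L; sum += x; this }

  def merge(other: PartialAgg): this.type = {
    if (other.count != 0L) {        // ignore empty partials instead of failing
      count += other.count
      sum += other.sum
    }
    this
  }

  def mean: Double = {
    require(count > 0L, s"The number of instances should be greater than 0, but got $count.")
    sum / count
  }
}
{code}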



[jira] [Resolved] (SPARK-15603) Replace SQLContext with SparkSession in ML/MLLib

2016-06-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-15603.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

> Replace SQLContext with SparkSession in ML/MLLib
> 
>
> Key: SPARK-15603
> URL: https://issues.apache.org/jira/browse/SPARK-15603
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 2.0.0
>Reporter: Dongjoon Hyun
>Assignee: Dongjoon Hyun
> Fix For: 2.0.0
>
>
> This issue replaces all deprecated `SQLContext` occurrences with 
> `SparkSession` in `ML/MLLib` module except the following two classes. These 
> two classes use `SQLContext` as their function arguments.
> - ReadWrite.scala
> - TreeModels.scala



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
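
A minimal before/after sketch of the replacement described above; the application name and data are illustrative:

{code}
import org.apache.spark.sql.SparkSession

// Spark 2.0 style: one SparkSession entry point instead of SQLContext.
val spark = SparkSession.builder()
  .appName("mllib-example")
  .getOrCreate()

// Formerly: val sqlContext = new org.apache.spark.sql.SQLContext(sc)
//           sqlContext.createDataFrame(...)
val df = spark.createDataFrame(Seq((0L, 1.0), (1L, 2.0))).toDF("id", "value")
df.show()
{code}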



[jira] [Resolved] (SPARK-16008) ML Logistic Regression aggregator serializes unnecessary data

2016-06-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-16008.
---
   Resolution: Fixed
Fix Version/s: 2.0.0

Issue resolved by pull request 13729
[https://github.com/apache/spark/pull/13729]

> ML Logistic Regression aggregator serializes unnecessary data
> -
>
> Key: SPARK-16008
> URL: https://issues.apache.org/jira/browse/SPARK-16008
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
> Fix For: 2.0.0
>
>
> The LogisticRegressionAggregator class is used to collect gradient updates in 
> the ML logistic regression algorithm. The class stores a reference to the 
> coefficients array (length equal to the number of features) and a reference 
> to an array of standard deviations, also of length numFeatures. When a task 
> completes, the class is serialized, which also serializes a copy of the two 
> arrays. These arrays don't need to be serialized (only the gradient updates 
> are being aggregated). This causes performance issues when the number of 
> features is large and can trigger excess garbage collection when the 
> executor doesn't have much spare memory. 
> This results in serializing 2*numFeatures excess data. When multiclass 
> logistic regression is implemented, the excess will be numFeatures + 
> numClasses * numFeatures.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
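
One common way to avoid shipping large read-only arrays back with every task is to put them in a broadcast variable and keep only the mutable gradient inside the serialized aggregator. The sketch below shows that general pattern with made-up names; it is not the patch from the pull request above:

{code}
import org.apache.spark.broadcast.Broadcast

// Sketch: coefficients (and feature std) are read-only during a job, so they
// can live in a Broadcast variable instead of being serialized out with the
// closure and back again with every task's aggregator.
class GradientAgg(numFeatures: Int) extends Serializable {
  val gradient = new Array[Double](numFeatures)   // the only state worth shipping back
  var count = 0L

  def add(bcCoef: Broadcast[Array[Double]], features: Array[Double], label: Double): this.type = {
    val coef = bcCoef.value                       // resolved locally on the executor
    var dot = 0.0
    var i = 0
    while (i < numFeatures) { dot += coef(i) * features(i); i += 1 }
    val err = dot - label                         // plain least-squares residual, for illustration
    i = 0
    while (i < numFeatures) { gradient(i) += err * features(i); i += 1 }
    count += 1L
    this
  }

  def merge(other: GradientAgg): this.type = {
    var i = 0
    while (i < numFeatures) { gradient(i) += other.gradient(i); i += 1 }
    count += other.count
    this
  }
}
{code}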



[jira] [Updated] (SPARK-16000) Make model loading backward compatible with saved models using old vector columns

2016-06-17 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16000:
--
Assignee: yuhao yang

> Make model loading backward compatible with saved models using old vector 
> columns
> -
>
> Key: SPARK-16000
> URL: https://issues.apache.org/jira/browse/SPARK-16000
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: yuhao yang
>
> To help users migrate from Spark 1.6 to 2.0, we should make model loading 
> backward compatible with models saved in 1.6. The main incompatibility is the 
> vector column type change.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-16000) Make model loading backward compatible with saved models using old vector columns

2016-06-17 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15336446#comment-15336446
 ] 

Xiangrui Meng commented on SPARK-16000:
---

That's great! Please let me know if you want to split the task into smaller 
ones. This is a little time sensitive because RC1 might come soon.

> Make model loading backward compatible with saved models using old vector 
> columns
> -
>
> Key: SPARK-16000
> URL: https://issues.apache.org/jira/browse/SPARK-16000
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Xiangrui Meng
>
> To help users migrate from Spark 1.6 to 2.0, we should make model loading 
> backward compatible with models saved in 1.6. The main incompatibility is the 
> vector column type change.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-15947) Make pipeline components backward compatible with old vector columns in Scala/Java

2016-06-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng closed SPARK-15947.
-

> Make pipeline components backward compatible with old vector columns in 
> Scala/Java
> --
>
> Key: SPARK-15947
> URL: https://issues.apache.org/jira/browse/SPARK-15947
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> After SPARK-15945, we should make ALL pipeline components accept old vector 
> columns as input and do the conversion automatically (probably with a warning 
> message), in order to smooth the migration to 2.0. 
> --Note that this includes loading old saved models.-- SPARK-16000 handles 
> backward compatibility in model loading.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15946) Wrap the conversion utils in Python

2016-06-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-15946:
-

Assignee: Xiangrui Meng

> Wrap the conversion utils in Python
> ---
>
> Key: SPARK-15946
> URL: https://issues.apache.org/jira/browse/SPARK-15946
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> This is to wrap SPARK-15945 in Python. So Python users can use it to convert 
> DataFrames with vector columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15947) Make pipeline components backward compatible with old vector columns in Scala/Java

2016-06-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15947:
--
Summary: Make pipeline components backward compatible with old vector 
columns in Scala/Java  (was: Make pipeline components backward compatible with 
old vector columns)

> Make pipeline components backward compatible with old vector columns in 
> Scala/Java
> --
>
> Key: SPARK-15947
> URL: https://issues.apache.org/jira/browse/SPARK-15947
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> After SPARK-15945, we should make ALL pipeline components accept old vector 
> columns as input and do the conversion automatically (probably with a warning 
> message), in order to smooth the migration to 2.0. 
> --Note that this includes loading old saved models.-- SPARK-16000 handles 
> backward compatibility in model loading.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-15948) Make pipeline components backward compatible with old vector columns in Python

2016-06-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng closed SPARK-15948.
-
Resolution: Won't Fix

> Make pipeline components backward compatible with old vector columns in Python
> --
>
> Key: SPARK-15948
> URL: https://issues.apache.org/jira/browse/SPARK-15948
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Xiangrui Meng
>
> Same as SPARK-15947 but for Python.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-15947) Make pipeline components backward compatible with old vector columns in Scala/Java

2016-06-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng resolved SPARK-15947.
---
Resolution: Won't Fix

> Make pipeline components backward compatible with old vector columns in 
> Scala/Java
> --
>
> Key: SPARK-15947
> URL: https://issues.apache.org/jira/browse/SPARK-15947
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> After SPARK-15945, we should make ALL pipeline components accept old vector 
> columns as input and do the conversion automatically (probably with a warning 
> message), in order to smooth the migration to 2.0. 
> --Note that this includes loading old saved models.-- SPARK-16000 handles 
> backward compatibility in model loading.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15948) Make pipeline components backward compatible with old vector columns in Python

2016-06-16 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15334732#comment-15334732
 ] 

Xiangrui Meng commented on SPARK-15948:
---

Marked this as "Won't Do". See SPARK-15947 for reasons.

> Make pipeline components backward compatible with old vector columns in Python
> --
>
> Key: SPARK-15948
> URL: https://issues.apache.org/jira/browse/SPARK-15948
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Xiangrui Meng
>
> Same as SPARK-15947 but for Python.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15643) ML 2.0 QA: migration guide update

2016-06-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15643:
--
Assignee: Yanbo Liang

> ML 2.0 QA: migration guide update
> -
>
> Key: SPARK-15643
> URL: https://issues.apache.org/jira/browse/SPARK-15643
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML, MLlib
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Update spark.ml and spark.mllib migration guide from 1.6 to 2.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15643) ML 2.0 QA: migration guide update

2016-06-16 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15334729#comment-15334729
 ] 

Xiangrui Meng commented on SPARK-15643:
---

[~yanboliang] Please include a paragraph to help users convert vector columns. 
See https://issues.apache.org/jira/browse/SPARK-15947.

> ML 2.0 QA: migration guide update
> -
>
> Key: SPARK-15643
> URL: https://issues.apache.org/jira/browse/SPARK-15643
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML, MLlib
>Reporter: Yanbo Liang
>Priority: Blocker
>
> Update spark.ml and spark.mllib migration guide from 1.6 to 2.0.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-15947) Make pipeline components backward compatible with old vector columns

2016-06-16 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15334725#comment-15334725
 ] 

Xiangrui Meng edited comment on SPARK-15947 at 6/16/16 9:30 PM:


Had an offline discussion with [~josephkb]. There would be a lot of work to 
implement this feature and its tests. A simpler choice is to ask users to 
manually convert the DataFrames at the beginning of the pipeline with the tools 
implemented in SPARK-15945. Then we can update the migration guide (SPARK-15643) 
to include the error message and put this workaround there, so users can search 
on Google and find the solution.

I'm closing this ticket.


was (Author: mengxr):
Had an offline discussion with [~josephkb]. There would be a lot of work to 
implement this feature and its tests. A simpler choice is to ask users to 
manually convert the DataFrames at the beginning of the pipeline with the tools 
implemented in SPARK-15945. Then we can update the migration guide to include 
the error message and put this workaround there, so users can search on Google 
and find the solution.

I'm closing this ticket.

> Make pipeline components backward compatible with old vector columns
> 
>
> Key: SPARK-15947
> URL: https://issues.apache.org/jira/browse/SPARK-15947
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> After SPARK-15945, we should make ALL pipeline components accept old vector 
> columns as input and do the conversion automatically (probably with a warning 
> message), in order to smooth the migration to 2.0. 
> --Note that this includes loading old saved models.-- SPARK-16000 handles 
> backward compatibility in model loading.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15947) Make pipeline components backward compatible with old vector columns

2016-06-16 Thread Xiangrui Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15334725#comment-15334725
 ] 

Xiangrui Meng commented on SPARK-15947:
---

Had an offline discussion with [~josephkb]. There would be a lot of work to 
implement this feature and its tests. A simpler choice is to ask users to 
manually convert the DataFrames at the beginning of the pipeline with the tools 
implemented in SPARK-15945. Then we can update the migration guide to include 
the error message and put this workaround there, so users can search on Google 
and find the solution.

I'm closing this ticket.

> Make pipeline components backward compatible with old vector columns
> 
>
> Key: SPARK-15947
> URL: https://issues.apache.org/jira/browse/SPARK-15947
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> After SPARK-15945, we should make ALL pipeline components accept old vector 
> columns as input and do the conversion automatically (probably with a warning 
> message), in order to smooth the migration to 2.0. 
> --Note that this includes loading old saved models.-- SPARK-16000 handles 
> backward compatibility in model loading.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
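
The manual conversion mentioned in the comment would look roughly like the following, assuming the SPARK-15945 helpers are exposed as MLUtils.convertVectorColumnsToML / convertVectorColumnsFromML (treat the exact names as an assumption) and that df is a placeholder DataFrame with an old-style vector column named "features":

{code}
import org.apache.spark.mllib.util.MLUtils

// Convert old mllib Vector columns to the new ml Vector type before feeding
// the DataFrame into a 2.0 Pipeline.
val converted = MLUtils.convertVectorColumnsToML(df, "features")

// And back, if an old API still needs the mllib type:
// val downgraded = MLUtils.convertVectorColumnsFromML(converted, "features")
{code}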



[jira] [Updated] (SPARK-15947) Make pipeline components backward compatible with old vector columns

2016-06-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15947:
--
Description: 
After SPARK-15945, we should make ALL pipeline components accept old vector 
columns as input and do the conversion automatically (probably with a warning 
message), in order to smooth the migration to 2.0. 

--Note that this includes loading old saved models.-- SPARK-16000 handles 
backward compatibility in model loading.

  was:
After SPARK-15945, we should make ALL pipeline components accept old vector 
columns as input and do the conversion automatically (probably with a warning 
message), in order to smooth the migration to 2.0. 

--Note that this includes loading old saved models.-- SPARK-15948 handles 
backward compatibility in model loading.


> Make pipeline components backward compatible with old vector columns
> 
>
> Key: SPARK-15947
> URL: https://issues.apache.org/jira/browse/SPARK-15947
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> After SPARK-15945, we should make ALL pipeline components accept old vector 
> columns as input and do the conversion automatically (probably with a warning 
> message), in order to smooth the migration to 2.0. 
> --Note that this includes loading old saved models.-- SPARK-16000 handles 
> backward compatibility in model loading.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16000) Make model loading backward compatible with saved models using old vector columns

2016-06-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16000:
--
Description: To help users migrate from Spark 1.6 to 2.0, we should make 
model loading backward compatible with models saved in 1.6. The main 
incompatibility is the vector column type change.

> Make model loading backward compatible with saved models using old vector 
> columns
> -
>
> Key: SPARK-16000
> URL: https://issues.apache.org/jira/browse/SPARK-16000
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Xiangrui Meng
>
> To help users migrate from Spark 1.6 to 2.0, we should make model loading 
> backward compatible with models saved in 1.6. The main incompatibility is the 
> vector column type change.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16000) Make model loading backward compatible with saved models using old vector columns

2016-06-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16000:
--
Summary: Make model loading backward compatible with saved models using old 
vector columns  (was: Make model loading backward compatible with saved models 
using old vector columns in Scala/Java)

> Make model loading backward compatible with saved models using old vector 
> columns
> -
>
> Key: SPARK-16000
> URL: https://issues.apache.org/jira/browse/SPARK-16000
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Xiangrui Meng
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15947) Make pipeline components backward compatible with old vector columns

2016-06-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15947:
--
Description: 
After SPARK-15945, we should make ALL pipeline components accept old vector 
columns as input and do the conversion automatically (probably with a warning 
message), in order to smooth the migration to 2.0. 

--Note that this includes loading old saved models.-- SPARK-15948 handles 
backward compatibility in model loading.

  was:After SPARK-15945, we should make ALL pipeline components accept old 
vector columns as input and do the conversion automatically (probably with a 
warning message), in order to smooth the migration to 2.0. Note that this 
includes loading old saved models.


> Make pipeline components backward compatible with old vector columns
> 
>
> Key: SPARK-15947
> URL: https://issues.apache.org/jira/browse/SPARK-15947
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> After SPARK-15945, we should make ALL pipeline components accept old vector 
> columns as input and do the conversion automatically (probably with a warning 
> message), in order to smooth the migration to 2.0. 
> --Note that this includes loading old saved models.-- SPARK-15948 handles 
> backward compatibility in model loading.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15947) Make pipeline components backward compatible with old vector columns

2016-06-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-15947:
--
Summary: Make pipeline components backward compatible with old vector 
columns  (was: Make pipeline components backward compatible with old vector 
columns in Scala/Java)

> Make pipeline components backward compatible with old vector columns
> 
>
> Key: SPARK-15947
> URL: https://issues.apache.org/jira/browse/SPARK-15947
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> After SPARK-15945, we should make ALL pipeline components accept old vector 
> columns as input and do the conversion automatically (probably with a warning 
> message), in order to smooth the migration to 2.0. Note that this includes 
> loading old saved models.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16000) Make model loading backward compatible with saved models using old vector columns

2016-06-16 Thread Xiangrui Meng (JIRA)
Xiangrui Meng created SPARK-16000:
-

 Summary: Make model loading backward compatible with saved models 
using old vector columns
 Key: SPARK-16000
 URL: https://issues.apache.org/jira/browse/SPARK-16000
 Project: Spark
  Issue Type: Sub-task
  Components: ML, MLlib
Reporter: Xiangrui Meng






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16000) Make model loading backward compatible with saved models using old vector columns in Scala/Java

2016-06-16 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng updated SPARK-16000:
--
Summary: Make model loading backward compatible with saved models using old 
vector columns in Scala/Java  (was: Make model loading backward compatible with 
saved models using old vector columns)

> Make model loading backward compatible with saved models using old vector 
> columns in Scala/Java
> ---
>
> Key: SPARK-16000
> URL: https://issues.apache.org/jira/browse/SPARK-16000
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Reporter: Xiangrui Meng
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-15947) Make pipeline components backward compatible with old vector columns in Scala/Java

2016-06-15 Thread Xiangrui Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiangrui Meng reassigned SPARK-15947:
-

Assignee: Xiangrui Meng

> Make pipeline components backward compatible with old vector columns in 
> Scala/Java
> --
>
> Key: SPARK-15947
> URL: https://issues.apache.org/jira/browse/SPARK-15947
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Affects Versions: 2.0.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> After SPARK-15945, we should make ALL pipeline components accept old vector 
> columns as input and do the conversion automatically (probably with a warning 
> message), in order to smooth the migration to 2.0. Note that this includes 
> loading old saved models.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


