[jira] [Resolved] (SPARK-16107) Group GLM-related methods in generated doc
[ https://issues.apache.org/jira/browse/SPARK-16107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-16107. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13820 [https://github.com/apache/spark/pull/13820] > Group GLM-related methods in generated doc > -- > > Key: SPARK-16107 > URL: https://issues.apache.org/jira/browse/SPARK-16107 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Junyang Qian > Labels: starter > Fix For: 2.0.0 > > > Group API docs of spark.glm, glm, predict(GLM), summary(GLM), > read/write.ml(GLM) under Rd spark.glm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16118) getDropLast is missing in OneHotEncoder
[ https://issues.apache.org/jira/browse/SPARK-16118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-16118. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13821 [https://github.com/apache/spark/pull/13821] > getDropLast is missing in OneHotEncoder > --- > > Key: SPARK-16118 > URL: https://issues.apache.org/jira/browse/SPARK-16118 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.5.2, 1.6.1, 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > Fix For: 2.0.0 > > > We forgot the getter of dropLast in OneHotEncoder.
[jira] [Created] (SPARK-16118) getDropLast is missing in OneHotEncoder
Xiangrui Meng created SPARK-16118: - Summary: getDropLast is missing in OneHotEncoder Key: SPARK-16118 URL: https://issues.apache.org/jira/browse/SPARK-16118 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.6.1, 1.5.2, 2.0.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng We forgot the getter of dropLast in OneHotEncoder.
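The fix is mechanical under the ml Params convention: every setFoo should have a matching getFoo. A minimal Python sketch of that pattern, using a simplified stand-in rather than the real pyspark.ml.feature.OneHotEncoder (which stores params in a param map, not a plain attribute):

```python
# Sketch of the SPARK-16118 fix: a Params-style transformer that had a
# setter for dropLast but no matching getter. Simplified stand-in for
# pyspark.ml.feature.OneHotEncoder, not the real class.
class OneHotEncoder:
    def __init__(self, dropLast=True):
        self._dropLast = dropLast

    def setDropLast(self, value):
        # Setters return self so calls can be chained, as in pyspark.ml.
        self._dropLast = value
        return self

    def getDropLast(self):
        # The getter that was forgotten: expose the current param value.
        return self._dropLast
```

With the getter in place, users can inspect the param instead of having to remember what they set: `OneHotEncoder().setDropLast(False).getDropLast()` returns `False`.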
[jira] [Created] (SPARK-16117) Hide LibSVMFileFormat in public API docs
Xiangrui Meng created SPARK-16117: - Summary: Hide LibSVMFileFormat in public API docs Key: SPARK-16117 URL: https://issues.apache.org/jira/browse/SPARK-16117 Project: Spark Issue Type: Improvement Components: Documentation, MLlib Affects Versions: 2.0.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng LibSVMFileFormat implements data source for LIBSVM format. However, users do not need to call its APIs to use it. So we should hide it in the public API docs. The main issue is that we still need to put the documentation and example code somewhere. The proposal is to have a dummy object to hold the documentation.
[jira] [Closed] (SPARK-16113) Deprecate (or remove) multiclass APIs in ml.LogisticRegression
[ https://issues.apache.org/jira/browse/SPARK-16113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng closed SPARK-16113. - Resolution: Not A Problem > Deprecate (or remove) multiclass APIs in ml.LogisticRegression > -- > > Key: SPARK-16113 > URL: https://issues.apache.org/jira/browse/SPARK-16113 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > Based on the discussion in SPARK-7159, we are going to create a separate > class for multinomial logistic regression. So we should deprecate the methods > in ml.LogisticRegression that were made for multiclass support.
[jira] [Commented] (SPARK-16113) Deprecate (or remove) multiclass APIs in ml.LogisticRegression
[ https://issues.apache.org/jira/browse/SPARK-16113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15342642#comment-15342642 ] Xiangrui Meng commented on SPARK-16113: --- Just realized that `thresholds` was inherited from `ProbabilisticClassifierParams`. So let's keep it to be consistent with other classifiers. > Deprecate (or remove) multiclass APIs in ml.LogisticRegression > -- > > Key: SPARK-16113 > URL: https://issues.apache.org/jira/browse/SPARK-16113 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > Based on the discussion in SPARK-7159, we are going to create a separate > class for multinomial logistic regression. So we should deprecate the methods > in ml.LogisticRegression that were made for multiclass support.
[jira] [Created] (SPARK-16113) Deprecate (or remove) multiclass APIs in ml.LogisticRegression
Xiangrui Meng created SPARK-16113: - Summary: Deprecate (or remove) multiclass APIs in ml.LogisticRegression Key: SPARK-16113 URL: https://issues.apache.org/jira/browse/SPARK-16113 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 2.0.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Based on the discussion in SPARK-7159, we are going to create a separate class for multinomial logistic regression. So we should deprecate the methods in ml.LogisticRegression that were made for multiclass support.
[jira] [Comment Edited] (SPARK-16111) Hide SparkOrcNewRecordReader in API docs
[ https://issues.apache.org/jira/browse/SPARK-16111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15342498#comment-15342498 ] Xiangrui Meng edited comment on SPARK-16111 at 6/21/16 7:26 PM: Ping [~rajesh.balamohan] was (Author: mengxr): Ping [~rbalamohan] > Hide SparkOrcNewRecordReader in API docs > > > Key: SPARK-16111 > URL: https://issues.apache.org/jira/browse/SPARK-16111 > Project: Spark > Issue Type: Documentation > Components: Documentation, SQL >Reporter: Xiangrui Meng >Priority: Minor > > We should exclude SparkOrcNewRecordReader from API docs. Otherwise, it > appears on the top of the list in the Scala API doc.
[jira] [Comment Edited] (SPARK-16111) Hide SparkOrcNewRecordReader in API docs
[ https://issues.apache.org/jira/browse/SPARK-16111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15342498#comment-15342498 ] Xiangrui Meng edited comment on SPARK-16111 at 6/21/16 7:26 PM: Ping [~rbalamohan] was (Author: mengxr): Ping [~ rbalamohan] > Hide SparkOrcNewRecordReader in API docs > > > Key: SPARK-16111 > URL: https://issues.apache.org/jira/browse/SPARK-16111 > Project: Spark > Issue Type: Documentation > Components: Documentation, SQL >Reporter: Xiangrui Meng >Priority: Minor > > We should exclude SparkOrcNewRecordReader from API docs. Otherwise, it > appears on the top of the list in the Scala API doc.
[jira] [Commented] (SPARK-16111) Hide SparkOrcNewRecordReader in API docs
[ https://issues.apache.org/jira/browse/SPARK-16111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15342498#comment-15342498 ] Xiangrui Meng commented on SPARK-16111: --- Ping [~ rbalamohan] > Hide SparkOrcNewRecordReader in API docs > > > Key: SPARK-16111 > URL: https://issues.apache.org/jira/browse/SPARK-16111 > Project: Spark > Issue Type: Documentation > Components: Documentation, SQL >Reporter: Xiangrui Meng >Priority: Minor > > We should exclude SparkOrcNewRecordReader from API docs. Otherwise, it > appears on the top of the list in the Scala API doc.
[jira] [Created] (SPARK-16111) Hide SparkOrcNewRecordReader in API docs
Xiangrui Meng created SPARK-16111: - Summary: Hide SparkOrcNewRecordReader in API docs Key: SPARK-16111 URL: https://issues.apache.org/jira/browse/SPARK-16111 Project: Spark Issue Type: Documentation Components: Documentation, SQL Reporter: Xiangrui Meng Priority: Minor We should exclude SparkOrcNewRecordReader from API docs. Otherwise, it appears on the top of the list in the Scala API doc.
[jira] [Updated] (SPARK-16086) Python UDF failed when there is no arguments
[ https://issues.apache.org/jira/browse/SPARK-16086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16086: -- Fix Version/s: 1.6.2 1.5.3 > Python UDF failed when there is no arguments > > > Key: SPARK-16086 > URL: https://issues.apache.org/jira/browse/SPARK-16086 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.5.2, 1.6.1 >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 1.5.3, 1.6.2, 2.0.0 > > > {code} > >>> sqlContext.registerFunction("f", lambda : "a") > >>> sqlContext.sql("select f()").show() > {code} > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 171.0 failed 4 times, most recent failure: Lost task 0.3 in stage 171.0 > (TID 6226, ip-10-0-243-36.us-west-2.compute.internal): > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File "/databricks/spark/python/pyspark/worker.py", line 111, in main > process() > File "/databricks/spark/python/pyspark/worker.py", line 106, in process > serializer.dump_stream(func(split_index, iterator), outfile) > File "/databricks/spark/python/pyspark/serializers.py", line 263, in > dump_stream > vs = list(itertools.islice(iterator, batch)) > File "/databricks/spark/python/pyspark/serializers.py", line 139, in > load_stream > yield self._read_with_length(stream) > File "/databricks/spark/python/pyspark/serializers.py", line 164, in > _read_with_length > return self.loads(obj) > File "/databricks/spark/python/pyspark/serializers.py", line 422, in loads > return pickle.loads(obj) > File "/databricks/spark/python/pyspark/sql/types.py", line 1159, in <lambda> > return lambda *a: dataType.fromInternal(a) > File "/databricks/spark/python/pyspark/sql/types.py", line 568, in > fromInternal > return _create_row(self.names, values) > File "/databricks/spark/python/pyspark/sql/types.py", line 1163, in > _create_row > row = Row(*values) > File "/databricks/spark/python/pyspark/sql/types.py", line 1210, in __new__ > raise ValueError("No args or kwargs") > ValueError: (ValueError('No args or kwargs',), <function <lambda> at > 0x7f3bbc463320>, ()) > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166) > at > org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207) > at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125) > at > org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1.apply(python.scala:405) > at > org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1.apply(python.scala:370) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:72) > at org.apache.spark.scheduler.Task.run(Task.scala:96) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:222) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code}
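The bottom of the Python traceback shows where the zero-argument case breaks: records are rebuilt via `Row(*values)`, and Row refuses an empty argument list. A standalone sketch of that failure mode, using a simplified stand-in for pyspark.sql.types.Row rather than the real class:

```python
# Sketch of the SPARK-16086 failure mode outside Spark. A zero-argument
# UDF produces a record with an empty tuple of field values, and rebuilding
# it as Row(*values) raises ValueError("No args or kwargs"), exactly the
# error at the bottom of the traceback. Simplified stand-in for
# pyspark.sql.types.Row.
class Row(tuple):
    def __new__(cls, *args, **kwargs):
        if args:
            return tuple.__new__(cls, args)
        raise ValueError("No args or kwargs")

def from_internal(values):
    # Mirrors the "return lambda *a: dataType.fromInternal(a)" frame:
    # each deserialized record is rebuilt as a Row from its field values.
    return Row(*values)

# A one-field record round-trips fine; a zero-field one does not.
ok = from_internal(("a",))
try:
    from_internal(())  # what "select f()" with a zero-arg UDF produced
    failed = False
except ValueError:
    failed = True
```

This is why the bug only surfaces for UDFs registered with no arguments; any UDF taking at least one column never hits the empty-tuple path.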
[jira] [Updated] (SPARK-15741) PySpark Cleanup of _setDefault with seed=None
[ https://issues.apache.org/jira/browse/SPARK-15741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15741: -- Assignee: Bryan Cutler > PySpark Cleanup of _setDefault with seed=None > - > > Key: SPARK-15741 > URL: https://issues.apache.org/jira/browse/SPARK-15741 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Minor > Fix For: 2.0.0 > > > Several places in PySpark ML have Params._setDefault with a seed param equal > to {{None}}. This is unnecessary as it will translate to a {{0}} even though > the param has a fixed value based on the hashed classname by default. > Currently, the ALS doc test output depends on this happening and would be > more clear and stable if it was explicitly set to {{0}}. These should be > cleaned up for stability and consistency.
[jira] [Updated] (SPARK-15741) PySpark Cleanup of _setDefault with seed=None
[ https://issues.apache.org/jira/browse/SPARK-15741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15741: -- Target Version/s: 2.0.0 > PySpark Cleanup of _setDefault with seed=None > - > > Key: SPARK-15741 > URL: https://issues.apache.org/jira/browse/SPARK-15741 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Bryan Cutler >Assignee: Bryan Cutler >Priority: Minor > Fix For: 2.0.0 > > > Several places in PySpark ML have Params._setDefault with a seed param equal > to {{None}}. This is unnecessary as it will translate to a {{0}} even though > the param has a fixed value based on the hashed classname by default. > Currently, the ALS doc test output depends on this happening and would be > more clear and stable if it was explicitly set to {{0}}. These should be > cleaned up for stability and consistency.
[jira] [Resolved] (SPARK-15741) PySpark Cleanup of _setDefault with seed=None
[ https://issues.apache.org/jira/browse/SPARK-15741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-15741. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13672 [https://github.com/apache/spark/pull/13672] > PySpark Cleanup of _setDefault with seed=None > - > > Key: SPARK-15741 > URL: https://issues.apache.org/jira/browse/SPARK-15741 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Bryan Cutler >Priority: Minor > Fix For: 2.0.0 > > > Several places in PySpark ML have Params._setDefault with a seed param equal > to {{None}}. This is unnecessary as it will translate to a {{0}} even though > the param has a fixed value based on the hashed classname by default. > Currently, the ALS doc test output depends on this happening and would be > more clear and stable if it was explicitly set to {{0}}. These should be > cleaned up for stability and consistency.
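For context, a hedged sketch of what the cleanup looks like. `Params` and `SomeModel` here are hypothetical minimal stand-ins for pyspark.ml.param.Params and an estimator that takes a seed; per the description above, declaring `seed=None` was redundant because the effective value ends up 0, so the default is set to 0 explicitly to keep doctest output stable:

```python
# Hypothetical minimal Params mixin (not the real pyspark.ml.param.Params)
# illustrating the SPARK-15741 cleanup: replace a redundant seed=None
# default with an explicit seed=0.
class Params:
    def __init__(self):
        self._defaultParamMap = {}

    def _setDefault(self, **kwargs):
        # Record default values for params; returns self for chaining.
        self._defaultParamMap.update(kwargs)
        return self

class SomeModel(Params):  # hypothetical estimator with a seed param
    def __init__(self):
        super().__init__()
        self._setDefault(seed=0)  # was: self._setDefault(seed=None)
```

The explicit 0 makes the default visible at the declaration site instead of relying on a downstream None-to-0 translation.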
[jira] [Updated] (SPARK-16107) Group GLM-related methods in generated doc
[ https://issues.apache.org/jira/browse/SPARK-16107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16107: -- Assignee: Junyang Qian > Group GLM-related methods in generated doc > -- > > Key: SPARK-16107 > URL: https://issues.apache.org/jira/browse/SPARK-16107 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Junyang Qian > > spark.glm: spark.glm, glm, predict(GLM), summary(GLM), read/write.ml(GLM)
[jira] [Updated] (SPARK-16107) Group GLM-related methods in generated doc
[ https://issues.apache.org/jira/browse/SPARK-16107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16107: -- Labels: starter (was: ) > Group GLM-related methods in generated doc > -- > > Key: SPARK-16107 > URL: https://issues.apache.org/jira/browse/SPARK-16107 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Junyang Qian > Labels: starter > > spark.glm: spark.glm, glm, predict(GLM), summary(GLM), read/write.ml(GLM)
[jira] [Updated] (SPARK-16107) Group GLM-related methods in generated doc
[ https://issues.apache.org/jira/browse/SPARK-16107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16107: -- Description: Group API docs of spark.glm, glm, predict(GLM), summary(GLM), read/write.ml(GLM) under Rd spark.glm. (was: spark.glm: spark.glm, glm, predict(GLM), summary(GLM), read/write.ml(GLM)) > Group GLM-related methods in generated doc > -- > > Key: SPARK-16107 > URL: https://issues.apache.org/jira/browse/SPARK-16107 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Junyang Qian > Labels: starter > > Group API docs of spark.glm, glm, predict(GLM), summary(GLM), > read/write.ml(GLM) under Rd spark.glm.
[jira] [Updated] (SPARK-16107) Group GLM-related methods in generated doc
[ https://issues.apache.org/jira/browse/SPARK-16107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16107: -- Description: spark.glm: spark.glm, glm, predict(GLM), summary(GLM), read/write.ml(GLM) > Group GLM-related methods in generated doc > -- > > Key: SPARK-16107 > URL: https://issues.apache.org/jira/browse/SPARK-16107 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng > > spark.glm: spark.glm, glm, predict(GLM), summary(GLM), read/write.ml(GLM)
[jira] [Commented] (SPARK-16107) Group GLM-related methods in generated doc
[ https://issues.apache.org/jira/browse/SPARK-16107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15342178#comment-15342178 ] Xiangrui Meng commented on SPARK-16107: --- ping [~junyangq] > Group GLM-related methods in generated doc > -- > > Key: SPARK-16107 > URL: https://issues.apache.org/jira/browse/SPARK-16107 > Project: Spark > Issue Type: Sub-task > Components: Documentation, SparkR >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng > > spark.glm: spark.glm, glm, predict(GLM), summary(GLM), read/write.ml(GLM)
[jira] [Created] (SPARK-16107) Group GLM-related methods in generated doc
Xiangrui Meng created SPARK-16107: - Summary: Group GLM-related methods in generated doc Key: SPARK-16107 URL: https://issues.apache.org/jira/browse/SPARK-16107 Project: Spark Issue Type: Sub-task Components: Documentation, SparkR Affects Versions: 2.0.0 Reporter: Xiangrui Meng
[jira] [Commented] (SPARK-16090) Improve method grouping in SparkR generated docs
[ https://issues.apache.org/jira/browse/SPARK-16090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15342173#comment-15342173 ] Xiangrui Meng commented on SPARK-16090: --- I changed the issue type to umbrella since there could be well-separated sub-tasks. > Improve method grouping in SparkR generated docs > > > Key: SPARK-16090 > URL: https://issues.apache.org/jira/browse/SPARK-16090 > Project: Spark > Issue Type: Umbrella > Components: Documentation, SparkR >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Priority: Critical > > This JIRA follows the discussion on > https://github.com/apache/spark/pull/13109 to improve method grouping in > SparkR generated docs. Having one method per doc page is not an R convention. > However, having many methods per doc page would hurt the readability. So a > proper grouping would help. Since we use roxygen2 instead of writing Rd files > directly, we should consider smaller groups to avoid confusion.
[jira] [Updated] (SPARK-16090) Improve method grouping in SparkR generated docs
[ https://issues.apache.org/jira/browse/SPARK-16090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16090: -- Issue Type: Umbrella (was: Improvement) > Improve method grouping in SparkR generated docs > > > Key: SPARK-16090 > URL: https://issues.apache.org/jira/browse/SPARK-16090 > Project: Spark > Issue Type: Umbrella > Components: Documentation, SparkR >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Priority: Critical > > This JIRA follows the discussion on > https://github.com/apache/spark/pull/13109 to improve method grouping in > SparkR generated docs. Having one method per doc page is not an R convention. > However, having many methods per doc page would hurt the readability. So a > proper grouping would help. Since we use roxygen2 instead of writing Rd files > directly, we should consider smaller groups to avoid confusion.
[jira] [Updated] (SPARK-15177) SparkR 2.0 QA: make SparkR model params and default values consistent with MLlib
[ https://issues.apache.org/jira/browse/SPARK-15177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15177: -- Summary: SparkR 2.0 QA: make SparkR model params and default values consistent with MLlib (was: SparkR 2.0 QA: New R APIs and API docs for mllib.R) > SparkR 2.0 QA: make SparkR model params and default values consistent with > MLlib > > > Key: SPARK-15177 > URL: https://issues.apache.org/jira/browse/SPARK-15177 > Project: Spark > Issue Type: Documentation > Components: ML, SparkR >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Blocker > Fix For: 2.0.0 > > > Audit new public R APIs in mllib.R
[jira] [Updated] (SPARK-15177) SparkR 2.0 QA: New R APIs and API docs for mllib.R
[ https://issues.apache.org/jira/browse/SPARK-15177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15177: -- Description: Audit new public R APIs in mllib.R (was: Audit new public R APIs in mllib.R.) > SparkR 2.0 QA: New R APIs and API docs for mllib.R > -- > > Key: SPARK-15177 > URL: https://issues.apache.org/jira/browse/SPARK-15177 > Project: Spark > Issue Type: Documentation > Components: ML, SparkR >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Blocker > Fix For: 2.0.0 > > > Audit new public R APIs in mllib.R
[jira] [Updated] (SPARK-15177) SparkR 2.0 QA: make SparkR model params and default values consistent with MLlib
[ https://issues.apache.org/jira/browse/SPARK-15177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15177: -- Description: Make SparkR model params and default values consistent with MLlib (was: Audit new public R APIs in mllib.R) > SparkR 2.0 QA: make SparkR model params and default values consistent with > MLlib > > > Key: SPARK-15177 > URL: https://issues.apache.org/jira/browse/SPARK-15177 > Project: Spark > Issue Type: Documentation > Components: ML, SparkR >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Blocker > Fix For: 2.0.0 > > > Make SparkR model params and default values consistent with MLlib
[jira] [Updated] (SPARK-15177) SparkR 2.0 QA: New R APIs and API docs for mllib.R
[ https://issues.apache.org/jira/browse/SPARK-15177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15177: -- Shepherd: Xiangrui Meng > SparkR 2.0 QA: New R APIs and API docs for mllib.R > -- > > Key: SPARK-15177 > URL: https://issues.apache.org/jira/browse/SPARK-15177 > Project: Spark > Issue Type: Documentation > Components: ML, SparkR >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Blocker > Fix For: 2.0.0 > > > Audit new public R APIs in mllib.R.
[jira] [Resolved] (SPARK-15177) SparkR 2.0 QA: New R APIs and API docs for mllib.R
[ https://issues.apache.org/jira/browse/SPARK-15177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-15177. --- Resolution: Fixed Fix Version/s: 2.0.0 Target Version/s: 2.0.0 I marked this JIRA as resolved. The API doc changes would be merged into SPARK-16090, which affects how we write API docs. > SparkR 2.0 QA: New R APIs and API docs for mllib.R > -- > > Key: SPARK-15177 > URL: https://issues.apache.org/jira/browse/SPARK-15177 > Project: Spark > Issue Type: Documentation > Components: ML, SparkR >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Blocker > Fix For: 2.0.0 > > > Audit new public R APIs in mllib.R.
[jira] [Updated] (SPARK-15177) SparkR 2.0 QA: New R APIs and API docs for mllib.R
[ https://issues.apache.org/jira/browse/SPARK-15177?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15177: -- Assignee: Yanbo Liang > SparkR 2.0 QA: New R APIs and API docs for mllib.R > -- > > Key: SPARK-15177 > URL: https://issues.apache.org/jira/browse/SPARK-15177 > Project: Spark > Issue Type: Documentation > Components: ML, SparkR >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Blocker > > Audit new public R APIs in mllib.R.
[jira] [Commented] (SPARK-16071) Not sufficient array size checks to avoid integer overflows in Tungsten
[ https://issues.apache.org/jira/browse/SPARK-16071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15341330#comment-15341330 ] Xiangrui Meng commented on SPARK-16071: --- [~ding] This JIRA is not to solve this particular issue only. It would be nice to make a pass over the implementation of UnsafeArrayData and relevant classes to find issues like this and put the size check early. It would be nice to mention the size limit in the user guide and error messages too. > Not sufficient array size checks to avoid integer overflows in Tungsten > --- > > Key: SPARK-16071 > URL: https://issues.apache.org/jira/browse/SPARK-16071 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Priority: Critical > > Several bugs have been found caused by integer overflows in Tungsten. This > JIRA is for taking a final pass before 2.0 release to reduce potential bugs > and issues. We should do at least the following: > * Raise exception early instead of later throwing NegativeArraySize (which is > slow and might cause silent errors) > * Document clearly the largest array size we support in DataFrames. 
> To reproduce one of the issues: > {code} > val n = 1e8.toInt // try 2e8, 3e8 > sc.parallelize(0 until 1, 1).map(i => new > Array[Int](n)).toDS.map(_.size).show() > {code} > Result: > * n=1e8: correct but slow (see SPARK-16043) > * n=2e8: NegativeArraySize exception > {code:none} > java.lang.NegativeArraySizeException > at > org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:61) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$1$$anonfun$apply$3.apply(ExistingRDD.scala:123) > at > org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$1$$anonfun$apply$3.apply(ExistingRDD.scala:121) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:85) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > * n=3e8: NegativeArraySize exception but raised at a different location > {code:none} > java.lang.RuntimeException: Error while encoding: > java.lang.NegativeArraySizeException > newInstance(class org.apache.spark.sql.catalyst.util.GenericArrayData) AS > value#108 > +- newInstance(class org.apache.spark.sql.catalyst.util.GenericArrayData) >+- input[0, [I, true] > at > org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:257) > at > org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:430) > at > org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:430) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) > at >
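The NegativeArraySizeException in the traces above is the signature of 32-bit overflow: the required buffer size is computed in a Java int, wraps negative for large inputs, and only fails later at allocation time. The early check the comment asks for can be sketched as follows; this is an illustrative model, not Spark's actual code, and the element width and doubling factor are assumptions:

```python
MAX_JVM_ARRAY_LENGTH = 2**31 - 1  # Java arrays are indexed by a signed int


def to_int32(x):
    """Emulate Java 32-bit signed int wraparound."""
    x &= 0xFFFFFFFF
    return x - 2**32 if x >= 2**31 else x


def unsafe_size_32bit(num_elements, bytes_per_element=4):
    # Doing the size arithmetic in a 32-bit int (as Java int math does)
    # wraps negative for large arrays, so the eventual allocation sees a
    # negative length and throws NegativeArraySizeException.
    return to_int32(num_elements * bytes_per_element * 2)  # assumed grow() doubling


def checked_size(num_elements, bytes_per_element=4):
    # "Check early": do the arithmetic in 64 bits and fail fast with a
    # clear message that states the documented limit.
    size = num_elements * bytes_per_element * 2
    if size > MAX_JVM_ARRAY_LENGTH:
        raise ValueError(
            f"Cannot allocate {size} bytes; the maximum supported array "
            f"size is {MAX_JVM_ARRAY_LENGTH} bytes")
    return size
```

With n = 3e8 the 32-bit computation goes negative, matching the n=3e8 failure in the repro, while the checked version raises a descriptive error up front.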
[jira] [Resolved] (SPARK-16045) Spark 2.0 ML.feature: doc update for stopwords and binarizer
[ https://issues.apache.org/jira/browse/SPARK-16045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-16045. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13375 [https://github.com/apache/spark/pull/13375] > Spark 2.0 ML.feature: doc update for stopwords and binarizer > > > Key: SPARK-16045 > URL: https://issues.apache.org/jira/browse/SPARK-16045 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.0 >Reporter: yuhao yang >Priority: Minor > Fix For: 2.0.0 > > > 2.0 Audit: Update document for StopWordsRemover (load stop words) and > Binarizer (support of Vector) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16045) Spark 2.0 ML.feature: doc update for stopwords and binarizer
[ https://issues.apache.org/jira/browse/SPARK-16045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16045: -- Assignee: yuhao yang > Spark 2.0 ML.feature: doc update for stopwords and binarizer > > > Key: SPARK-16045 > URL: https://issues.apache.org/jira/browse/SPARK-16045 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.0 >Reporter: yuhao yang >Assignee: yuhao yang >Priority: Minor > Fix For: 2.0.0 > > > 2.0 Audit: Update document for StopWordsRemover (load stop words) and > Binarizer (support of Vector) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16045) Spark 2.0 ML.feature: doc update for stopwords and binarizer
[ https://issues.apache.org/jira/browse/SPARK-16045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16045: -- Affects Version/s: 2.0.0 Target Version/s: 2.0.0 > Spark 2.0 ML.feature: doc update for stopwords and binarizer > > > Key: SPARK-16045 > URL: https://issues.apache.org/jira/browse/SPARK-16045 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.0.0 >Reporter: yuhao yang >Assignee: yuhao yang >Priority: Minor > Fix For: 2.0.0 > > > 2.0 Audit: Update document for StopWordsRemover (load stop words) and > Binarizer (support of Vector) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7751) Add @Since annotation to stable and experimental methods in MLlib
[ https://issues.apache.org/jira/browse/SPARK-7751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-7751. -- Resolution: Fixed Fix Version/s: 2.0.0 Marking this umbrella as resolved since all sub-tasks are done. Thanks everyone for contributing!! > Add @Since annotation to stable and experimental methods in MLlib > - > > Key: SPARK-7751 > URL: https://issues.apache.org/jira/browse/SPARK-7751 > Project: Spark > Issue Type: Umbrella > Components: Documentation, MLlib >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Minor > Labels: starter > Fix For: 2.0.0 > > > This is useful to check whether a feature exists in some version of Spark. > This is an umbrella JIRA to track the progress. We want to have -@since tag- > @Since annotations for both stable (those without any > Experimental/DeveloperApi/AlphaComponent annotations) and experimental > methods in MLlib: > (Do NOT tag private or package private classes or methods, nor local > variables and methods.) > * an example PR for Scala: https://github.com/apache/spark/pull/8309 > We need to dig into the git commit history to figure out the Spark > version in which a method was first introduced. Take `NaiveBayes.setModelType` as > an example. We can grep for `def setModelType` at different version git tags. > {code} > meng@xm:~/src/spark > $ git show > v1.3.0:mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala > | grep "def setModelType" > meng@xm:~/src/spark > $ git show > v1.4.0:mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala > | grep "def setModelType" > def setModelType(modelType: String): NaiveBayes = { > {code} > If there are better ways, please let us know. > We cannot add all -@since tags- @Since annotations in a single PR, which would be > hard to review. So we made some subtasks for each package, for example > `org.apache.spark.classification`. 
Feel free to add more sub-tasks for Python > and the `spark.ml` package. > Plan: > 1. In 1.5, we try to add @Since annotation to all stable/experimental methods > under `spark.mllib`. > 2. Starting from 1.6, we require @Since annotation in all new PRs. > 3. In 1.6, we try to add @Since annotation to all stable/experimental methods > under `spark.ml`, `pyspark.mllib`, and `pyspark.ml`. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
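The manual `git show <tag>:<path> | grep` loop described above is easy to script. A sketch (not an existing Spark tool) that, given file contents fetched once per release tag, picks the earliest version containing a definition, i.e. the value to put in the @Since annotation:

```python
def first_version_with(pattern, sources_by_version):
    """Return the earliest version tag whose source contains `pattern`.

    `sources_by_version` maps a tag like "v1.4.0" to the file contents at
    that tag (e.g. the output of `git show <tag>:<path>`). Versions are
    compared numerically, not as strings, so "v1.10.0" sorts after "v1.9.0".
    Returns None if no version contains the pattern.
    """
    def key(version):
        return tuple(int(part) for part in version.lstrip("v").split("."))

    for version in sorted(sources_by_version, key=key):
        if pattern in sources_by_version[version]:
            return version
    return None
```

For the `NaiveBayes.setModelType` example, the function returns "v1.4.0" because the grep is empty at v1.3.0 and matches at v1.4.0.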
[jira] [Updated] (SPARK-10258) Add @Since annotation to ml.feature
[ https://issues.apache.org/jira/browse/SPARK-10258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-10258: -- Shepherd: Nick Pentreath > Add @Since annotation to ml.feature > --- > > Key: SPARK-10258 > URL: https://issues.apache.org/jira/browse/SPARK-10258 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML >Reporter: Xiangrui Meng >Assignee: Martin Brown >Priority: Minor > Labels: starter > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-10258) Add @Since annotation to ml.feature
[ https://issues.apache.org/jira/browse/SPARK-10258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-10258. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13641 [https://github.com/apache/spark/pull/13641] > Add @Since annotation to ml.feature > --- > > Key: SPARK-10258 > URL: https://issues.apache.org/jira/browse/SPARK-10258 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML >Reporter: Xiangrui Meng >Assignee: Martin Brown >Priority: Minor > Labels: starter > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16086) Python UDF failed when there is no arguments
[ https://issues.apache.org/jira/browse/SPARK-16086?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15341297#comment-15341297 ] Xiangrui Meng commented on SPARK-16086: --- Reverted the changes in master and branch-2.0 since they broke the builds: * https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-branch-2.0-test-sbt-hadoop-2.2/240/console * https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-sbt-hadoop-2.2/1212/consoleFull > Python UDF failed when there is no arguments > > > Key: SPARK-16086 > URL: https://issues.apache.org/jira/browse/SPARK-16086 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.5.2, 1.6.1 >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 1.5.3, 1.6.2 > > > {code} > >>> sqlContext.registerFunction("f", lambda : "a") > >>> sqlContext.sql("select f()").show() > {code} > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 171.0 failed 4 times, most recent failure: Lost task 0.3 in stage 171.0 > (TID 6226, ip-10-0-243-36.us-west-2.compute.internal): > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File "/databricks/spark/python/pyspark/worker.py", line 111, in main > process() > File "/databricks/spark/python/pyspark/worker.py", line 106, in process > serializer.dump_stream(func(split_index, iterator), outfile) > File "/databricks/spark/python/pyspark/serializers.py", line 263, in > dump_stream > vs = list(itertools.islice(iterator, batch)) > File "/databricks/spark/python/pyspark/serializers.py", line 139, in > load_stream > yield self._read_with_length(stream) > File "/databricks/spark/python/pyspark/serializers.py", line 164, in > _read_with_length > return self.loads(obj) > File "/databricks/spark/python/pyspark/serializers.py", line 422, in loads > return pickle.loads(obj) > File "/databricks/spark/python/pyspark/sql/types.py", line 1159, in > return 
lambda *a: dataType.fromInternal(a) > File "/databricks/spark/python/pyspark/sql/types.py", line 568, in > fromInternal > return _create_row(self.names, values) > File "/databricks/spark/python/pyspark/sql/types.py", line 1163, in > _create_row > row = Row(*values) > File "/databricks/spark/python/pyspark/sql/types.py", line 1210, in __new__ > raise ValueError("No args or kwargs") > ValueError: (ValueError('No args or kwargs',), <function <lambda> at > 0x7f3bbc463320>, ()) > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166) > at > org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207) > at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125) > at > org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1.apply(python.scala:405) > at > org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1.apply(python.scala:370) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:72) > at 
org.apache.spark.scheduler.Task.run(Task.scala:96) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:222) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail:
[jira] [Reopened] (SPARK-16086) Python UDF failed when there is no arguments
[ https://issues.apache.org/jira/browse/SPARK-16086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reopened SPARK-16086: --- > Python UDF failed when there is no arguments > > > Key: SPARK-16086 > URL: https://issues.apache.org/jira/browse/SPARK-16086 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.5.2, 1.6.1 >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 1.5.3, 1.6.2 > > > {code} > >>> sqlContext.registerFunction("f", lambda : "a") > >>> sqlContext.sql("select f()").show() > {code} > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 171.0 failed 4 times, most recent failure: Lost task 0.3 in stage 171.0 > (TID 6226, ip-10-0-243-36.us-west-2.compute.internal): > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File "/databricks/spark/python/pyspark/worker.py", line 111, in main > process() > File "/databricks/spark/python/pyspark/worker.py", line 106, in process > serializer.dump_stream(func(split_index, iterator), outfile) > File "/databricks/spark/python/pyspark/serializers.py", line 263, in > dump_stream > vs = list(itertools.islice(iterator, batch)) > File "/databricks/spark/python/pyspark/serializers.py", line 139, in > load_stream > yield self._read_with_length(stream) > File "/databricks/spark/python/pyspark/serializers.py", line 164, in > _read_with_length > return self.loads(obj) > File "/databricks/spark/python/pyspark/serializers.py", line 422, in loads > return pickle.loads(obj) > File "/databricks/spark/python/pyspark/sql/types.py", line 1159, in > return lambda *a: dataType.fromInternal(a) > File "/databricks/spark/python/pyspark/sql/types.py", line 568, in > fromInternal > return _create_row(self.names, values) > File "/databricks/spark/python/pyspark/sql/types.py", line 1163, in > _create_row > row = Row(*values) > File "/databricks/spark/python/pyspark/sql/types.py", line 1210, in __new__ > 
raise ValueError("No args or kwargs") > ValueError: (ValueError('No args or kwargs',), <function <lambda> at > 0x7f3bbc463320>, ()) > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166) > at > org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207) > at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125) > at > org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1.apply(python.scala:405) > at > org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1.apply(python.scala:370) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:72) > at org.apache.spark.scheduler.Task.run(Task.scala:96) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:222) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} -- This 
message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
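The root cause in the traceback above is generic: with a zero-argument UDF, `fromInternal` builds a Row from the UDF's empty argument tuple, and Row's constructor rejects the empty case. A minimal stand-in for that logic (this mimics, but is not, pyspark's `Row.__new__`):

```python
def make_row(*args, **kwargs):
    # Mimics the check hit at types.py line 1210 in the traceback: a
    # zero-argument UDF leads to Row(*()) and lands in the error branch,
    # even though an empty row would be a perfectly valid result here.
    if not args and not kwargs:
        raise ValueError("No args or kwargs")
    return kwargs or tuple(args)


make_row(1, 2)  # fine: positional field values
try:
    make_row()  # the no-argument UDF case from `select f()`
except ValueError as err:
    print(err)  # No args or kwargs
```

This is why `sqlContext.sql("select f()")` fails while any UDF taking at least one column works.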
[jira] [Updated] (SPARK-16086) Python UDF failed when there is no arguments
[ https://issues.apache.org/jira/browse/SPARK-16086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16086: -- Fix Version/s: (was: 2.0.0) > Python UDF failed when there is no arguments > > > Key: SPARK-16086 > URL: https://issues.apache.org/jira/browse/SPARK-16086 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.5.2, 1.6.1 >Reporter: Davies Liu >Assignee: Davies Liu > Fix For: 1.5.3, 1.6.2 > > > {code} > >>> sqlContext.registerFunction("f", lambda : "a") > >>> sqlContext.sql("select f()").show() > {code} > {code} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in > stage 171.0 failed 4 times, most recent failure: Lost task 0.3 in stage 171.0 > (TID 6226, ip-10-0-243-36.us-west-2.compute.internal): > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File "/databricks/spark/python/pyspark/worker.py", line 111, in main > process() > File "/databricks/spark/python/pyspark/worker.py", line 106, in process > serializer.dump_stream(func(split_index, iterator), outfile) > File "/databricks/spark/python/pyspark/serializers.py", line 263, in > dump_stream > vs = list(itertools.islice(iterator, batch)) > File "/databricks/spark/python/pyspark/serializers.py", line 139, in > load_stream > yield self._read_with_length(stream) > File "/databricks/spark/python/pyspark/serializers.py", line 164, in > _read_with_length > return self.loads(obj) > File "/databricks/spark/python/pyspark/serializers.py", line 422, in loads > return pickle.loads(obj) > File "/databricks/spark/python/pyspark/sql/types.py", line 1159, in > return lambda *a: dataType.fromInternal(a) > File "/databricks/spark/python/pyspark/sql/types.py", line 568, in > fromInternal > return _create_row(self.names, values) > File "/databricks/spark/python/pyspark/sql/types.py", line 1163, in > _create_row > row = Row(*values) > File "/databricks/spark/python/pyspark/sql/types.py", 
line 1210, in __new__ > raise ValueError("No args or kwargs") > ValueError: (ValueError('No args or kwargs',), <function <lambda> at > 0x7f3bbc463320>, ()) > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166) > at > org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207) > at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125) > at > org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1.apply(python.scala:405) > at > org.apache.spark.sql.execution.BatchPythonEvaluation$$anonfun$doExecute$1.apply(python.scala:370) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) > at > org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$20.apply(RDD.scala:710) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:72) > at org.apache.spark.scheduler.Task.run(Task.scala:96) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:222) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at 
java.lang.Thread.run(Thread.java:745) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16090) Improve method grouping in SparkR generated docs
[ https://issues.apache.org/jira/browse/SPARK-16090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15341258#comment-15341258 ] Xiangrui Meng commented on SPARK-16090: --- For ML methods, I'd like to propose the following grouping: * spark.glm: spark.glm, glm, predict(GLM), summary(GLM), read/write.ml(GLM) * spark.naiveBayes: spark.naiveBayes, predict(NB), summary(NB), read/write.ml(NB) * spark.kmeans: spark.kmeans, predict(KM), summary(KM), read/write.ml(KM) * spark.survreg: spark.survreg, predict(SR), summary(SR), read/write.ml(SR) Then create a separate doc page for each generic method, including predict, summary, and read/write.ml, and link to the doc pages above using "See also" links. > Improve method grouping in SparkR generated docs > > > Key: SPARK-16090 > URL: https://issues.apache.org/jira/browse/SPARK-16090 > Project: Spark > Issue Type: Improvement > Components: Documentation, SparkR >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Priority: Critical > > This JIRA follows the discussion on > https://github.com/apache/spark/pull/13109 to improve method grouping in > SparkR generated docs. Having one method per doc page is not an R convention. > However, having many methods per doc page would hurt readability. So a > proper grouping would help. Since we use roxygen2 instead of writing Rd files > directly, we should consider smaller groups to avoid confusion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16090) Improve method grouping in SparkR generated docs
[ https://issues.apache.org/jira/browse/SPARK-16090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16090: -- Description: This JIRA follows the discussion on https://github.com/apache/spark/pull/13109 to improve method grouping in SparkR generated docs. Having one method per doc page is not an R convention. However, having many methods per doc page would hurt the readability. So a proper grouping would help. Since we use roxygen2 instead of writing Rd files directly, we should consider smaller groups to avoid confusion. > Improve method grouping in SparkR generated docs > > > Key: SPARK-16090 > URL: https://issues.apache.org/jira/browse/SPARK-16090 > Project: Spark > Issue Type: Improvement > Components: Documentation, SparkR >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Priority: Critical > > This JIRA follows the discussion on > https://github.com/apache/spark/pull/13109 to improve method grouping in > SparkR generated docs. Having one method per doc page is not an R convention. > However, having many methods per doc page would hurt the readability. So a > proper grouping would help. Since we use roxygen2 instead of writing Rd files > directly, we should consider smaller groups to avoid confusion. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16090) Improve method grouping in SparkR generated docs
Xiangrui Meng created SPARK-16090: - Summary: Improve method grouping in SparkR generated docs Key: SPARK-16090 URL: https://issues.apache.org/jira/browse/SPARK-16090 Project: Spark Issue Type: Improvement Components: Documentation, SparkR Affects Versions: 2.0.0 Reporter: Xiangrui Meng Priority: Critical -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16079) PySpark ML classification missing import of DecisionTreeRegressionModel for GBTClassificationModel
[ https://issues.apache.org/jira/browse/SPARK-16079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-16079. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13787 [https://github.com/apache/spark/pull/13787] > PySpark ML classification missing import of DecisionTreeRegressionModel for > GBTClassificationModel > -- > > Key: SPARK-16079 > URL: https://issues.apache.org/jira/browse/SPARK-16079 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Bryan Cutler > Fix For: 2.0.0 > > > In GBTClassificationModel, the overloaded method {{trees}} casts the > DecisionTree to a DecisionTreeRegressionModel, however, the import for this > class is missing and leads to a {{NameError}} > {noformat} > spark/python/pyspark/ml/classification.pyc in trees(self) > 888 def trees(self): > 889 """Trees in this ensemble. Warning: These have null parent > Estimators.""" > --> 890 return [DecisionTreeRegressionModel(m) for m in > list(self._call_java("trees"))] > 891 > 892 > NameError: global name 'DecisionTreeRegressionModel' is not defined > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
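The failure mode here is generic Python behavior worth noting: names inside a method body are resolved at call time, not at definition time, so a missing module-level import surfaces as a NameError only when the method actually runs. A minimal sketch with hypothetical names (not the real pyspark classes):

```python
class GBTModelSketch:
    """Hypothetical stand-in for GBTClassificationModel, not pyspark code."""

    def __init__(self, java_trees):
        self._java_trees = java_trees

    def trees(self):
        # DecisionTreeRegressionModelSketch is never imported or defined,
        # mirroring the missing import in pyspark/ml/classification.py:
        # the class definition loads fine, and the NameError appears only
        # when trees() is called.
        return [DecisionTreeRegressionModelSketch(m) for m in self._java_trees]


try:
    GBTModelSketch(["tree1"]).trees()
except NameError as err:
    print(err)  # name 'DecisionTreeRegressionModelSketch' is not defined
```

This is also why the bug slipped through: importing the module succeeds, and only a test that calls `trees()` on a fitted model would catch it.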
[jira] [Updated] (SPARK-16079) PySpark ML classification missing import of DecisionTreeRegressionModel for GBTClassificationModel
[ https://issues.apache.org/jira/browse/SPARK-16079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16079: -- Assignee: Bryan Cutler > PySpark ML classification missing import of DecisionTreeRegressionModel for > GBTClassificationModel > -- > > Key: SPARK-16079 > URL: https://issues.apache.org/jira/browse/SPARK-16079 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Bryan Cutler >Assignee: Bryan Cutler > Fix For: 2.0.0 > > > In GBTClassificationModel, the overloaded method {{trees}} casts the > DecisionTree to a DecisionTreeRegressionModel, however, the import for this > class is missing and leads to a {{NameError}} > {noformat} > spark/python/pyspark/ml/classification.pyc in trees(self) > 888 def trees(self): > 889 """Trees in this ensemble. Warning: These have null parent > Estimators.""" > --> 890 return [DecisionTreeRegressionModel(m) for m in > list(self._call_java("trees"))] > 891 > 892 > NameError: global name 'DecisionTreeRegressionModel' is not defined > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16079) PySpark ML classification missing import of DecisionTreeRegressionModel for GBTClassificationModel
[ https://issues.apache.org/jira/browse/SPARK-16079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16079: -- Affects Version/s: 2.0.0 > PySpark ML classification missing import of DecisionTreeRegressionModel for > GBTClassificationModel > -- > > Key: SPARK-16079 > URL: https://issues.apache.org/jira/browse/SPARK-16079 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.0.0 >Reporter: Bryan Cutler > > In GBTClassificationModel, the overloaded method {{trees}} casts the > DecisionTree to a DecisionTreeRegressionModel, however, the import for this > class is missing and leads to a {{NameError}} > {noformat} > spark/python/pyspark/ml/classification.pyc in trees(self) > 888 def trees(self): > 889 """Trees in this ensemble. Warning: These have null parent > Estimators.""" > --> 890 return [DecisionTreeRegressionModel(m) for m in > list(self._call_java("trees"))] > 891 > 892 > NameError: global name 'DecisionTreeRegressionModel' is not defined > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16074) Expose VectorUDT/MatrixUDT in a public API
[ https://issues.apache.org/jira/browse/SPARK-16074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15340642#comment-15340642 ] Xiangrui Meng commented on SPARK-16074: --- Picked option 2) because we don't have any Java source code in MLlib. The overhead for Java users is the extra `()`. > Expose VectorUDT/MatrixUDT in a public API > -- > > Key: SPARK-16074 > URL: https://issues.apache.org/jira/browse/SPARK-16074 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Critical > > Both VectorUDT and MatrixUDT are private APIs, because UserDefinedType itself > is private in Spark. However, in order to let developers implement their own > transformers and estimators, we should expose both types in a public API to > simplify the implementation of transformSchema, transform, etc. Otherwise, they > need to get the data types using reflection. > Note that this doesn't mean exposing the VectorUDT/MatrixUDT classes. We can > just have a method or a static value that returns a VectorUDT/MatrixUDT > instance with DataType as the return type. There are two ways to implement > this: > 1. Following DataTypes.java in SQL, so Java users don't need the extra "()". > 2. Define DataTypes in Scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-16074) Expose VectorUDT/MatrixUDT in a public API
[ https://issues.apache.org/jira/browse/SPARK-16074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-16074: - Assignee: Xiangrui Meng > Expose VectorUDT/MatrixUDT in a public API > -- > > Key: SPARK-16074 > URL: https://issues.apache.org/jira/browse/SPARK-16074 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng >Priority: Critical > > Both VectorUDT and MatrixUDT are private APIs, because UserDefinedType itself > is private in Spark. However, in order to let developers implement their own > transformers and estimators, we should expose both types in a public API to > simplify the implementation of transformSchema, transform, etc. Otherwise, they > need to get the data types using reflection. > Note that this doesn't mean exposing the VectorUDT/MatrixUDT classes. We can > just have a method or a static value that returns a VectorUDT/MatrixUDT > instance with DataType as the return type. There are two ways to implement > this: > 1. Following DataTypes.java in SQL, so Java users don't need the extra "()". > 2. Define DataTypes in Scala. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16075) Make VectorUDT/MatrixUDT singleton under spark.ml package
[ https://issues.apache.org/jira/browse/SPARK-16075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15340419#comment-15340419 ] Xiangrui Meng commented on SPARK-16075: --- I'm not sure whether we should make this change in 2.0. It is not a trivial change, though it brings some benefits. > Make VectorUDT/MatrixUDT singleton under spark.ml package > - > > Key: SPARK-16075 > URL: https://issues.apache.org/jira/browse/SPARK-16075 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > Both VectorUDT and MatrixUDT are implemented as normal classes and there > could be multiple instances of them, which makes equality checking and > pattern matching harder to implement. Even though the APIs are private, switching to > a singleton pattern could simplify development. > Required changes: > * singleton VectorUDT/MatrixUDT (created by VectorUDT.getOrCreate) > * update UDTRegistration > * update code generation to support singleton UDTs > * update existing code to use getOrCreate
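The getOrCreate shape proposed in the description can be sketched with a stand-in class (illustrative only, not the actual spark.ml code):

```scala
// Hedged sketch of the proposed singleton pattern; this class is a stand-in
// for spark.ml's VectorUDT, not the real implementation.
class VectorUDT private () {
  // With exactly one instance, equality checks reduce to reference equality.
}

object VectorUDT {
  private lazy val instance = new VectorUDT
  // getOrCreate always hands back the same instance.
  def getOrCreate: VectorUDT = instance
}

// Pattern matching no longer needs to handle distinct-but-equal copies:
def describe(dt: Any): String = dt match {
  case v: VectorUDT if v eq VectorUDT.getOrCreate => "the singleton VectorUDT"
  case _                                          => "something else"
}
```

The private constructor is what enforces the invariant: all construction paths, including code generation, have to go through `getOrCreate`.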
[jira] [Updated] (SPARK-16075) Make VectorUDT/MatrixUDT singleton under spark.ml package
[ https://issues.apache.org/jira/browse/SPARK-16075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16075: -- Description: Both VectorUDT and MatrixUDT are implemented as normal classes and there could be multiple instances of them, which makes equality checking and pattern matching harder to implement. Even though the APIs are private, switching to a singleton pattern could simplify development. Required changes: * singleton VectorUDT/MatrixUDT (created by VectorUDT.getOrCreate) * update UDTRegistration * update code generation to support singleton UDTs * update existing code to use getOrCreate was: Both VectorUDT and MatrixUDT are implemented as normal classes and there could be multiple instances of them, which makes equality checking and pattern matching harder to implement. Even though the APIs are private, switching to a singleton pattern could simplify development. Required changes: * singleton VectorUDT/MatrixUDT * add UDTFactory trait with getOrCreate to return the singleton instance * update UDTRegistration * update code generation to support UDTFactory > Make VectorUDT/MatrixUDT singleton under spark.ml package > - > > Key: SPARK-16075 > URL: https://issues.apache.org/jira/browse/SPARK-16075 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > Both VectorUDT and MatrixUDT are implemented as normal classes and there > could be multiple instances of them, which makes equality checking and > pattern matching harder to implement. Even though the APIs are private, switching to > a singleton pattern could simplify development. 
> Required changes: > * singleton VectorUDT/MatrixUDT (created by VectorUDT.getOrCreate) > * update UDTRegistration > * update code generation to support singleton UDTs > * update existing code to use getOrCreate
[jira] [Updated] (SPARK-16074) Expose VectorUDT/MatrixUDT in a public API
[ https://issues.apache.org/jira/browse/SPARK-16074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16074: -- Description: Both VectorUDT and MatrixUDT are private APIs, because UserDefinedType itself is private in Spark. However, in order to let developers implement their own transformers and estimators, we should expose both types in a public API to simplify the implementation of transformSchema, transform, etc. Otherwise, they need to get the data types using reflection. Note that this doesn't mean to expose the VectorUDT/MatrixUDT classes. We can just have a method or a static value that returns a VectorUDT/MatrixUDT instance with DataType as the return type. There are two ways to implement this: 1. Following DataTypes.java in SQL, so Java users don't need the extra "()". 2. Defining DataTypes in Scala. was: Both VectorUDT and MatrixUDT are private APIs, because UserDefinedType itself is private in Spark. However, in order to let developers implement their own transformers and estimators, we should expose both types in a public API to simplify the implementation of transformSchema, transform, etc. Otherwise, they need to get the data types using reflection. Note that this doesn't mean to expose the VectorUDT/MatrixUDT classes. We can just have a method or a static value that returns a VectorUDT/MatrixUDT instance with DataType as the return type. > Expose VectorUDT/MatrixUDT in a public API > -- > > Key: SPARK-16074 > URL: https://issues.apache.org/jira/browse/SPARK-16074 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Priority: Critical > > Both VectorUDT and MatrixUDT are private APIs, because UserDefinedType itself > is private in Spark. However, in order to let developers implement their own > transformers and estimators, we should expose both types in a public API to > simplify the implementation of transformSchema, transform, etc. 
Otherwise, they > need to get the data types using reflection. > Note that this doesn't mean to expose the VectorUDT/MatrixUDT classes. We can > just have a method or a static value that returns a VectorUDT/MatrixUDT > instance with DataType as the return type. There are two ways to implement > this: > 1. Following DataTypes.java in SQL, so Java users don't need the extra "()". > 2. Defining DataTypes in Scala.
[jira] [Created] (SPARK-16075) Make VectorUDT/MatrixUDT singleton under spark.ml package
Xiangrui Meng created SPARK-16075: - Summary: Make VectorUDT/MatrixUDT singleton under spark.ml package Key: SPARK-16075 URL: https://issues.apache.org/jira/browse/SPARK-16075 Project: Spark Issue Type: Improvement Components: MLlib Affects Versions: 2.0.0 Reporter: Xiangrui Meng Assignee: Xiangrui Meng Both VectorUDT and MatrixUDT are implemented as normal classes and there could be multiple instances of them, which makes equality checking and pattern matching harder to implement. Even though the APIs are private, switching to a singleton pattern could simplify development. Required changes: * singleton VectorUDT/MatrixUDT * add UDTFactory trait with getOrCreate to return the singleton instance * update UDTRegistration * update code generation to support UDTFactory
[jira] [Created] (SPARK-16074) Expose VectorUDT/MatrixUDT in a public API
Xiangrui Meng created SPARK-16074: - Summary: Expose VectorUDT/MatrixUDT in a public API Key: SPARK-16074 URL: https://issues.apache.org/jira/browse/SPARK-16074 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 2.0.0 Reporter: Xiangrui Meng Priority: Critical Both VectorUDT and MatrixUDT are private APIs, because UserDefinedType itself is private in Spark. However, in order to let developers implement their own transformers and estimators, we should expose both types in a public API to simplify the implementation of transformSchema, transform, etc. Otherwise, they need to get the data types using reflection. Note that this doesn't mean to expose the VectorUDT/MatrixUDT classes. We can just have a method or a static value that returns a VectorUDT/MatrixUDT instance with DataType as the return type.
[jira] [Updated] (SPARK-16073) Performance of Parquet encodings on saving primitive arrays
[ https://issues.apache.org/jira/browse/SPARK-16073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16073: -- Description: Spark supports both uncompressed and compressed (snappy, gzip, lzo) Parquet data. However, Parquet also has its own encodings to compress columns/arrays, e.g., dictionary encoding: https://github.com/apache/parquet-format/blob/master/Encodings.md. It might be worth checking the performance overhead of Parquet encodings on saving large primitive arrays, which is a machine learning use case. If the overhead is significant, we should expose a configuration in Spark to control the encoding levels. Note that this shouldn't be tested under Spark until SPARK-16043 is fixed. was: Spark supports both uncompressed and compressed (snappy, gzip, lzo) Parquet data. However, Parquet also has its own encodings to compress columns/arrays, e.g., dictionary encoding: https://github.com/apache/parquet-format/blob/master/Encodings.md. It might be worth checking the performance overhead of Parquet encodings for saving primitive arrays, which is a machine learning use case. Note that this shouldn't be tested under Spark until SPARK-16043 is fixed. > Performance of Parquet encodings on saving primitive arrays > --- > > Key: SPARK-16073 > URL: https://issues.apache.org/jira/browse/SPARK-16073 > Project: Spark > Issue Type: Task > Components: MLlib, SQL >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng > > Spark supports both uncompressed and compressed (snappy, gzip, lzo) Parquet > data. However, Parquet also has its own encodings to compress columns/arrays, > e.g., dictionary encoding: > https://github.com/apache/parquet-format/blob/master/Encodings.md. > It might be worth checking the performance overhead of Parquet encodings on > saving large primitive arrays, which is a machine learning use case. If the > overhead is significant, we should expose a configuration in Spark to control > the encoding levels. 
> Note that this shouldn't be tested under Spark until SPARK-16043 is fixed.
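For the benchmark itself, the two compression layers could be toggled roughly as follows. This is an untested sketch: it assumes an existing SparkSession named `spark` and a DataFrame `df` holding the primitive-array data, and the dictionary option name comes from parquet-mr, not from Spark:

```scala
// Hedged sketch, not a tested benchmark harness. `spark` and `df` are
// assumed to exist already.

// Spark-level block compression (snappy/gzip/lzo/uncompressed):
spark.conf.set("spark.sql.parquet.compression.codec", "uncompressed")

// Parquet's own column encodings are configured through parquet-mr settings
// on the Hadoop configuration, e.g. disabling dictionary encoding:
spark.sparkContext.hadoopConfiguration.set("parquet.enable.dictionary", "false")

// Time this write against the same write with default settings:
df.write.parquet("/tmp/arrays_no_dictionary")
```

Comparing wall time and output size across the two runs would show whether the encoding overhead is significant enough to justify a dedicated Spark configuration.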
[jira] [Updated] (SPARK-16073) Performance of Parquet encodings on saving primitive arrays
[ https://issues.apache.org/jira/browse/SPARK-16073?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16073: -- Description: Spark supports both uncompressed and compressed (snappy, gzip, lzo) Parquet data. However, Parquet also has its own encodings to compress columns/arrays, e.g., dictionary encoding: https://github.com/apache/parquet-format/blob/master/Encodings.md. It might be worth checking the performance overhead of Parquet encodings for saving primitive arrays, which is a machine learning use case. Note that this shouldn't be tested under Spark until SPARK-16043 is fixed. > Performance of Parquet encodings on saving primitive arrays > --- > > Key: SPARK-16073 > URL: https://issues.apache.org/jira/browse/SPARK-16073 > Project: Spark > Issue Type: Task > Components: MLlib, SQL >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng > > Spark supports both uncompressed and compressed (snappy, gzip, lzo) Parquet > data. However, Parquet also has its own encodings to compress columns/arrays, > e.g., dictionary encoding: > https://github.com/apache/parquet-format/blob/master/Encodings.md. > It might be worth checking the performance overhead of Parquet encodings for > saving primitive arrays, which is a machine learning use case. Note that this > shouldn't be tested under Spark until SPARK-16043 is fixed.
[jira] [Created] (SPARK-16073) Performance of Parquet encodings on saving primitive arrays
Xiangrui Meng created SPARK-16073: - Summary: Performance of Parquet encodings on saving primitive arrays Key: SPARK-16073 URL: https://issues.apache.org/jira/browse/SPARK-16073 Project: Spark Issue Type: Task Components: MLlib, SQL Affects Versions: 2.0.0 Reporter: Xiangrui Meng
[jira] [Updated] (SPARK-16070) DataFrame/Parquet issues with primitive arrays
[ https://issues.apache.org/jira/browse/SPARK-16070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16070: -- Description: I created this umbrella JIRA to track DataFrame/Parquet issues with primitive arrays. This is mostly related to machine learning use cases, where feature indices/values are stored as (usually large) primitive arrays. Issues: * SPARK-16043: Tungsten array data is not specialized for primitive types * SPARK-16071: Not sufficient array size checks ([NegativeArraySizeException|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20NegativeArraySizeException] or silent errors) * SPARK-16073: Performance of Parquet encodings on saving primitive arrays was: I created this umbrella JIRA to track DataFrame/Parquet issues with primitive arrays. This is mostly related to machine learning use cases, where feature indices/values are stored as (usually large) primitive arrays. Issues: * SPARK-16043: Tungsten array data is not specialized for primitive types * SPARK-16071: Not sufficient array size checks ([NegativeArraySizeException|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20NegativeArraySizeException] or silent errors) * Performance of Parquet encodings on saving primitive arrays > DataFrame/Parquet issues with primitive arrays > -- > > Key: SPARK-16070 > URL: https://issues.apache.org/jira/browse/SPARK-16070 > Project: Spark > Issue Type: Umbrella > Components: MLlib, SQL >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng > > I created this umbrella JIRA to track DataFrame/Parquet issues with primitive > arrays. This is mostly related to machine learning use cases, where feature > indices/values are stored as (usually large) primitive arrays. 
> Issues: > * SPARK-16043: Tungsten array data is not specialized for primitive types > * SPARK-16071: Not sufficient array size checks > ([NegativeArraySizeException|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20NegativeArraySizeException] > or silent errors) > * SPARK-16073: Performance of Parquet encodings on saving primitive arrays
[jira] [Updated] (SPARK-16071) Not sufficient array size checks to avoid integer overflows in Tungsten
[ https://issues.apache.org/jira/browse/SPARK-16071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16071: -- Description: Several bugs have been found caused by integer overflows in Tungsten. This JIRA is for taking a final pass before 2.0 release to reduce potential bugs and issues. We should do at least the following: * Raise exception early instead of later throwing NegativeArraySize (which is slow and might cause silent errors) * Document clearly the largest array size we support in DataFrames. To reproduce one of the issues: {code} val n = 1e8.toInt // try 2e8, 3e8 sc.parallelize(0 until 1, 1).map(i => new Array[Int](n)).toDS.map(_.size).show() {code} Result: * n=1e8: correct but slow (see SPARK-16043) * n=2e8: NegativeArraySize exception {code:none} java.lang.NegativeArraySizeException at org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:61) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$1$$anonfun$apply$3.apply(ExistingRDD.scala:123) at org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$1$$anonfun$apply$3.apply(ExistingRDD.scala:121) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} * n=3e8: NegativeArraySize exception but raised at a different location {code:none} java.lang.RuntimeException: Error while encoding: java.lang.NegativeArraySizeException newInstance(class org.apache.spark.sql.catalyst.util.GenericArrayData) AS value#108 +- newInstance(class org.apache.spark.sql.catalyst.util.GenericArrayData) +- input[0, [I, true] at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:257) at org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:430) at org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:430) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at
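The failure mode in the reproduction above comes down to Int arithmetic on byte sizes. A self-contained illustration (the sizes here are illustrative, not traced through BufferHolder line by line):

```scala
// Hedged, self-contained illustration of the integer-overflow class of bug.
// 200M elements at 8 bytes per word is still a valid Int...
val bytesForElements = 2e8.toInt * 8 // 1,600,000,000

// ...but doubling the buffer during growth wraps around to a negative Int.
val doubled = bytesForElements * 2

// Allocating an array with that negative size is what surfaces as
// java.lang.NegativeArraySizeException deep inside generated code.

// "Raise exception early" can be as simple as exact arithmetic, which fails
// loudly at the bad computation instead of at the allocation site:
def grownSize(current: Int): Int = Math.multiplyExact(current, 2)
```

`Math.multiplyExact` throws `ArithmeticException` on overflow, turning a confusing codegen stack trace into an error at the point where the size went wrong.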
[jira] [Updated] (SPARK-16070) DataFrame/Parquet issues with primitive arrays
[ https://issues.apache.org/jira/browse/SPARK-16070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16070: -- Description: I created this umbrella JIRA to track DataFrame/Parquet issues with primitive arrays. This is mostly related to machine learning use cases, where feature indices/values are stored as (usually large) primitive arrays. Issues: * SPARK-16043: Tungsten array data is not specialized for primitive types * SPARK-16071: Not sufficient array size checks ([NegativeArraySizeException|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20NegativeArraySizeException] or silent errors) * Performance of Parquet encodings on saving primitive arrays was: I created this umbrella JIRA to track DataFrame/Parquet issues with primitive arrays. This is mostly related to machine learning use cases, where feature indices/values are stored as (usually large) primitive arrays. Issues: * SPARK-16043: Tungsten array data is not specialized for primitive types * Not sufficient array size checks ([NegativeArraySizeException|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20NegativeArraySizeException] or silent errors) * Performance of Parquet encodings on saving primitive arrays > DataFrame/Parquet issues with primitive arrays > -- > > Key: SPARK-16070 > URL: https://issues.apache.org/jira/browse/SPARK-16070 > Project: Spark > Issue Type: Umbrella > Components: MLlib, SQL >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng > > I created this umbrella JIRA to track DataFrame/Parquet issues with primitive > arrays. This is mostly related to machine learning use cases, where feature > indices/values are stored as (usually large) primitive arrays. 
> Issues: > * SPARK-16043: Tungsten array data is not specialized for primitive types > * SPARK-16071: Not sufficient array size checks > ([NegativeArraySizeException|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20NegativeArraySizeException] > or silent errors) > * Performance of Parquet encodings on saving primitive arrays
[jira] [Updated] (SPARK-16071) Not sufficient array size checks to avoid integer overflows in Tungsten
[ https://issues.apache.org/jira/browse/SPARK-16071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16071: -- Description: Several bugs have been found caused by integer overflows in Tungsten. This JIRA is for taking a final pass before 2.0 release to reduce potential bugs and issues. We should do at least the following: * Raise exception early instead of later throwing NegativeArraySize (which is slow and might cause silent errors) * Document clearly the largest array size we support in DataFrames. To reproduce one of the issues: {code} val n = 1e8.toInt // try 2e8, 3e8 sc.parallelize(0 until 1, 1).map(i => new Array[Int](n)).toDS.map(_.size).show() {code} Result: * n=1e8: correct but slow (see SPARK-16043) * n=2e8: NegativeArraySize exception {code:none} java.lang.NegativeArraySizeException at org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:61) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$1$$anonfun$apply$3.apply(ExistingRDD.scala:123) at org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$1$$anonfun$apply$3.apply(ExistingRDD.scala:121) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} * n=3e8: NegativeArraySize exception but raised at a different location {code:none} java.lang.RuntimeException: Error while encoding: java.lang.NegativeArraySizeException newInstance(class org.apache.spark.sql.catalyst.util.GenericArrayData) AS value#108 +- newInstance(class org.apache.spark.sql.catalyst.util.GenericArrayData) +- input[0, [I, true] at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:257) at org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:430) at org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:430) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at
[jira] [Updated] (SPARK-16071) Not sufficient array size checks to avoid integer overflows in Tungsten
[ https://issues.apache.org/jira/browse/SPARK-16071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16071: -- Description: Several bugs have been found caused by integer overflows in Tungsten. This JIRA is for taking a final pass before 2.0 release to reduce potential bugs and issues. We should do at least the following: * Raise exception early instead of later throwing NegativeArraySize (which is slow and might cause silent errors) * Document clearly the largest array size we support in DataFrames. To reproduce one of the issues: {code} val n = 1e8.toInt // try 2e8, 3e8 sc.parallelize(0 until 1, 1).map(i => new Array[Int](n)).toDS.map(_.size).show() {code} Result: * n=1e8: correct but slow (see SPARK-16043) * n=2e8: NegativeArraySize exception {code:none} java.lang.NegativeArraySizeException at org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:61) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$1$$anonfun$apply$3.apply(ExistingRDD.scala:123) at org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$1$$anonfun$apply$3.apply(ExistingRDD.scala:121) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} * n=3e8: NegativeArraySize exception but at a different location {code:none} java.lang.RuntimeException: Error while encoding: java.lang.NegativeArraySizeException newInstance(class org.apache.spark.sql.catalyst.util.GenericArrayData) AS value#108 +- newInstance(class org.apache.spark.sql.catalyst.util.GenericArrayData) +- input[0, [I, true] at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:257) at org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:430) at org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:430) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at
[jira] [Updated] (SPARK-16071) Not sufficient array size checks to avoid integer overflows in Tungsten
[ https://issues.apache.org/jira/browse/SPARK-16071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16071: -- Description: Several bugs have been found caused by integer overflows in Tungsten. This JIRA is for taking a final pass before 2.0 release to reduce potential bugs and issues. We should do at least the following: * Raise exception early instead of NegativeArraySize * Document clearly the largest array size we support in DataFrames. To reproduce one of the issues: {code} val n = 1e8.toInt // try 2e8, 3e8 sc.parallelize(0 until 1, 1).map(i => new Array[Int](n)).toDS.map(_.size).show() {code} Result: * n=1e8: correct but slow * n=2e8: NegativeArraySize exception {code:none} java.lang.NegativeArraySizeException at org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:61) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$1$$anonfun$apply$3.apply(ExistingRDD.scala:123) at org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$1$$anonfun$apply$3.apply(ExistingRDD.scala:121) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} * n=3e8: NegativeArraySize exception but at a different location {code:none} java.lang.RuntimeException: Error while encoding: java.lang.NegativeArraySizeException newInstance(class org.apache.spark.sql.catalyst.util.GenericArrayData) AS value#108 +- newInstance(class org.apache.spark.sql.catalyst.util.GenericArrayData) +- input[0, [I, true] at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:257) at org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:430) at org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:430) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} was: Several
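The NegativeArraySizeException traces above are the classic signature of a byte-size computation overflowing a 32-bit signed integer. A short Python sketch can emulate Java `int` arithmetic to show how a valid element count wraps to a negative size, and what the "raise exception early" item from the description could look like (the 16-bytes-per-element figure is purely illustrative, not Spark's actual row layout):

```python
INT32_MAX = 2**31 - 1

def as_int32(x):
    """Wrap a Python int the way Java's 32-bit signed `int` arithmetic would."""
    return (x + 2**31) % 2**32 - 2**31

def required_bytes(num_elements, bytes_per_element=16):
    # Naive size computation done in 32-bit arithmetic, as a Java `int` would.
    return as_int32(num_elements * bytes_per_element)

def checked_bytes(num_elements, bytes_per_element=16):
    # "Raise exception early instead of NegativeArraySize": validate the size
    # in wide arithmetic before it has a chance to wrap around.
    total = num_elements * bytes_per_element
    if total > INT32_MAX:
        raise ValueError(
            f"{num_elements} elements need {total} bytes, over the int32 limit")
    return total

print(required_bytes(10**8))      # 1600000000: still fits in an int32
print(required_bytes(2 * 10**8))  # negative: the overflow behind the exception
```

With these illustrative numbers, n=1e8 stays under the int32 limit while n=2e8 wraps negative, matching the pattern in the reproduction above where larger arrays fail while smaller ones merely run slowly.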
[jira] [Updated] (SPARK-16071) Not sufficient array size checks to avoid integer overflows in Tungsten
[ https://issues.apache.org/jira/browse/SPARK-16071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16071: -- Description: Several bugs have been found caused by integer overflows in Tungsten. This JIRA is for taking a final pass before 2.0 release to reduce potential bugs and issues. We should do at least the following: * Raise exception early instead of NegativeArraySize * Document clearly the largest array size we support in DataFrames. To reproduce one of the issues: {code} val n = 1e8.toInt // try 2e8, 3e8 sc.parallelize(0 until 1, 1).map(i => new Array[Int](n)).toDS.map(_.size).show() {code} Result: * n=1e8: correct but with slow (see SPARK-16043) * n=2e8: NegativeArraySize exception {code:none} java.lang.NegativeArraySizeException at org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:61) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$1$$anonfun$apply$3.apply(ExistingRDD.scala:123) at org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$1$$anonfun$apply$3.apply(ExistingRDD.scala:121) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} * n=3e8: NegativeArraySize exception but at a different location {code:none} java.lang.RuntimeException: Error while encoding: java.lang.NegativeArraySizeException newInstance(class org.apache.spark.sql.catalyst.util.GenericArrayData) AS value#108 +- newInstance(class org.apache.spark.sql.catalyst.util.GenericArrayData) +- input[0, [I, true] at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:257) at org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:430) at org.apache.spark.sql.SparkSession$$anonfun$4.apply(SparkSession.scala:430) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code}
[jira] [Updated] (SPARK-16071) Not sufficient array size checks to avoid integer overflows in Tungsten
[ https://issues.apache.org/jira/browse/SPARK-16071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16071: -- Description: Several bugs have been found caused by integer overflows in Tungsten. This JIRA is for taking a final pass before 2.0 release to reduce potential bugs and issues. We should do at least the following: * Raise exception early instead of NegativeArraySize * Document clearly the largest array size we support in DataFrames. To reproduce one of the issues: {code} val n = 1e8.toInt // try 2e8, 3e8 sc.parallelize(0 until 1, 1).map(i => new Array[Int](n)).toDS.map(_.size).show() {code} Result: * n=1e8: correct but with slow * n=2e8: NegativeArraySize exception {code:none} java.lang.NegativeArraySizeException at org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:61) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) at org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$1$$anonfun$apply$3.apply(ExistingRDD.scala:123) at org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$1$$anonfun$apply$3.apply(ExistingRDD.scala:121) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246) at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780) at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319) at org.apache.spark.rdd.RDD.iterator(RDD.scala:283) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) {code} * n=3e8: NegativeArraySize exception but at a different location was: Several bugs have been found caused by integer overflows in Tungsten. This JIRA is for taking a final pass before 2.0 release to reduce potential bugs and issues. We should do at least the following: * Raise exception early instead of NegativeArraySize * Document clearly the largest array size we support in DataFrames. To reproduce one of the issues: {code} val n = 1e8.toInt // try 2e8, 3e8 sc.parallelize(0 until 1, 1).map(i => new Array[Int](n)).toDS.map(_.size).show() {code} Result: * n=1e8: correct but with slow * n=2e8: NegativeArraySize exception * n=3e8: NegativeArraySize exception but at a different location > Not sufficient array size checks to avoid integer overflows in Tungsten > --- > > Key: SPARK-16071 > URL: https://issues.apache.org/jira/browse/SPARK-16071 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Yin Huai >Priority: Critical > > Several bugs have been found caused by integer overflows in Tungsten. This > JIRA is for taking a final pass before 2.0 release to reduce potential bugs > and issues. We should do at least the following: > * Raise exception early instead of NegativeArraySize > * Document clearly the largest array size we support in DataFrames. 
> To reproduce one of the issues: > {code} > val n = 1e8.toInt // try 2e8, 3e8 > sc.parallelize(0 until 1, 1).map(i => new > Array[Int](n)).toDS.map(_.size).show() > {code} > Result: > * n=1e8: correct but slow > * n=2e8: NegativeArraySize exception > {code:none} > java.lang.NegativeArraySizeException > at > org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder.grow(BufferHolder.java:61) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown > Source) > at > org.apache.spark.sql.execution.RDDScanExec$$anonfun$doExecute$1$$anonfun$apply$3.apply(ExistingRDD.scala:123) > at >
[jira] [Updated] (SPARK-16071) Not sufficient array size checks to avoid integer overflows in Tungsten
[ https://issues.apache.org/jira/browse/SPARK-16071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16071: -- Description: Several bugs have been found caused by integer overflows in Tungsten. This JIRA is for taking a final pass before 2.0 release to reduce potential bugs and issues. We should do at least the following: * Raise exception early instead of NegativeArraySize * Document clearly the largest array size we support in DataFrames. To reproduce one of the issues: {code} val n = 1e8.toInt // try 2e8, 3e8 sc.parallelize(0 until 1, 1).map(i => new Array[Int](n)).toDS.map(_.size).show() {code} Result: * n=1e8: correct but with slow * n=2e8: NegativeArraySize exception * n=3e8: NegativeArraySize exception but at a different location was: Several bugs have been found caused by integer overflows in Tungsten. This JIRA is for taking a final pass before 2.0 release to reduce potential bugs and issues. We should do at least the following: * Raise exception early instead of NegativeArraySize * Document clearly the largest array size we support in DataFrames. To reproduce one of the issues: {code} val n = 1e8.toInt // try 2e8, 3e8 sc.parallelize(0 until 1, 1).map(i => new Array[Int](n)).toDS.map(_.size).show() {code} Result: * n=1e8: correct but with slow * n=2e8: NegativeArraySize exception > Not sufficient array size checks to avoid integer overflows in Tungsten > --- > > Key: SPARK-16071 > URL: https://issues.apache.org/jira/browse/SPARK-16071 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Yin Huai >Priority: Critical > > Several bugs have been found caused by integer overflows in Tungsten. This > JIRA is for taking a final pass before 2.0 release to reduce potential bugs > and issues. We should do at least the following: > * Raise exception early instead of NegativeArraySize > * Document clearly the largest array size we support in DataFrames. 
> To reproduce one of the issues: > {code} > val n = 1e8.toInt // try 2e8, 3e8 > sc.parallelize(0 until 1, 1).map(i => new > Array[Int](n)).toDS.map(_.size).show() > {code} > Result: > * n=1e8: correct but slow > * n=2e8: NegativeArraySize exception > * n=3e8: NegativeArraySize exception but at a different location
[jira] [Updated] (SPARK-16071) Not sufficient array size checks to avoid integer overflows in Tungsten
[ https://issues.apache.org/jira/browse/SPARK-16071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16071: -- Assignee: Yin Huai > Not sufficient array size checks to avoid integer overflows in Tungsten > --- > > Key: SPARK-16071 > URL: https://issues.apache.org/jira/browse/SPARK-16071 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Yin Huai >Priority: Critical > > Several bugs have been found caused by integer overflows in Tungsten. This > JIRA is for taking a final pass before 2.0 release to reduce potential bugs > and issues. We should do at least the following: > * Raise exception early instead of NegativeArraySize > * Document clearly the largest array size we support in DataFrames. > To reproduce one of the issues: > {code} > val n = 1e8.toInt // try 2e8, 3e8 > sc.parallelize(0 until 1, 1).map(i => new > Array[Int](n)).toDS.map(_.size).show() > {code} > Result: > * n=1e8: correct but slow > * n=2e8: NegativeArraySize exception
[jira] [Updated] (SPARK-16071) Not sufficient array size checks to avoid integer overflows in Tungsten
[ https://issues.apache.org/jira/browse/SPARK-16071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16071: -- Description: Several bugs have been found caused by integer overflows in Tungsten. This JIRA is for taking a final pass before 2.0 release to reduce potential bugs and issues. We should do at least the following: * Raise exception early instead of NegativeArraySize * Document clearly the largest array size we support in DataFrames. To reproduce one of the issues: {code} val n = 1e8.toInt // try 2e8, 3e8 sc.parallelize(0 until 1, 1).map(i => new Array[Int](n)).toDS.map(_.size).show() {code} Result: * n=1e8: correct but slow * n=2e8: NegativeArraySize exception was: Several bugs have been found caused by integer overflows in Tungsten. This JIRA is for taking a final pass before 2.0 release to reduce potential bugs and issues. We should do at least the following: * Raise exception early instead of NegativeArraySize * Document clearly the largest array size we support in DataFrames. > Not sufficient array size checks to avoid integer overflows in Tungsten > --- > > Key: SPARK-16071 > URL: https://issues.apache.org/jira/browse/SPARK-16071 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Priority: Critical > > Several bugs have been found caused by integer overflows in Tungsten. This > JIRA is for taking a final pass before 2.0 release to reduce potential bugs > and issues. We should do at least the following: > * Raise exception early instead of NegativeArraySize > * Document clearly the largest array size we support in DataFrames. 
> To reproduce one of the issues: > {code} > val n = 1e8.toInt // try 2e8, 3e8 > sc.parallelize(0 until 1, 1).map(i => new > Array[Int](n)).toDS.map(_.size).show() > {code} > Result: > * n=1e8: correct but slow > * n=2e8: NegativeArraySize exception
[jira] [Updated] (SPARK-16071) Not sufficient array size checks to avoid integer overflows in Tungsten
[ https://issues.apache.org/jira/browse/SPARK-16071?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16071: -- Summary: Not sufficient array size checks to avoid integer overflows in Tungsten (was: Not sufficient size checks to avoid integer overflows in Tungsten) > Not sufficient array size checks to avoid integer overflows in Tungsten > --- > > Key: SPARK-16071 > URL: https://issues.apache.org/jira/browse/SPARK-16071 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Priority: Critical > > Several bugs have been found caused by integer overflows in Tungsten. This > JIRA is for taking a final pass before 2.0 release to reduce potential bugs > and issues. We should do at least the following: > * Raise exception early instead of NegativeArraySize > * Document clearly the largest array size we support in DataFrames.
[jira] [Created] (SPARK-16071) Not sufficient size checks to avoid integer overflows in Tungsten
Xiangrui Meng created SPARK-16071: - Summary: Not sufficient size checks to avoid integer overflows in Tungsten Key: SPARK-16071 URL: https://issues.apache.org/jira/browse/SPARK-16071 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Xiangrui Meng Priority: Critical Several bugs have been found caused by integer overflows in Tungsten. This JIRA is for taking a final pass before 2.0 release to reduce potential bugs and issues. We should do at least the following: * Raise exception early instead of NegativeArraySize * Document clearly the largest array size we support in DataFrames.
[jira] [Updated] (SPARK-16070) DataFrame/Parquet issues with primitive arrays
[ https://issues.apache.org/jira/browse/SPARK-16070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16070: -- Description: I created this umbrella JIRA to track DataFrame/Parquet issues with primitive arrays. This is mostly related to machine learning use cases, where feature indices/values are stored as (usually large) primitive arrays. Issues: * SPARK-16043: Tungsten array data is not specialized for primitive types * Not sufficient array size checks ([NegativeArraySizeException|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20NegativeArraySizeException] or silent errors) ** There * Performance of Parquet encodings on saving primitive arrays was: I created this umbrella JIRA to track DataFrame/Parquet issues with primitive arrays. This is mostly related to machine learning use cases, where feature indices/values are stored as (usually large) primitive arrays. Issues: * SPARK-16043: Tungsten array data is not specialized for primitive types * Not sufficient array size checks ([[https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20NegativeArraySizeException NegativeArraySizeException]] or silent errors) ** There * Performance of Parquet encodings on saving primitive arrays > DataFrame/Parquet issues with primitive arrays > -- > > Key: SPARK-16070 > URL: https://issues.apache.org/jira/browse/SPARK-16070 > Project: Spark > Issue Type: Umbrella > Components: MLlib, SQL >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng > > I created this umbrella JIRA to track DataFrame/Parquet issues with primitive > arrays. This is mostly related to machine learning use cases, where feature > indices/values are stored as (usually large) primitive arrays. 
> Issues: > * SPARK-16043: Tungsten array data is not specialized for primitive types > * Not sufficient array size checks > ([NegativeArraySizeException|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20NegativeArraySizeException] > or silent errors) > ** There > * Performance of Parquet encodings on saving primitive arrays
[jira] [Updated] (SPARK-16070) DataFrame/Parquet issues with primitive arrays
[ https://issues.apache.org/jira/browse/SPARK-16070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16070: -- Description: I created this umbrella JIRA to track DataFrame/Parquet issues with primitive arrays. This is mostly related to machine learning use cases, where feature indices/values are stored as (usually large) primitive arrays. Issues: * SPARK-16043: Tungsten array data is not specialized for primitive types * Not sufficient array size checks ([NegativeArraySizeException|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20NegativeArraySizeException] or silent errors) * Performance of Parquet encodings on saving primitive arrays was: I created this umbrella JIRA to track DataFrame/Parquet issues with primitive arrays. This is mostly related to machine learning use cases, where feature indices/values are stored as (usually large) primitive arrays. Issues: * SPARK-16043: Tungsten array data is not specialized for primitive types * Not sufficient array size checks ([NegativeArraySizeException|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20NegativeArraySizeException] or silent errors) ** There * Performance of Parquet encodings on saving primitive arrays > DataFrame/Parquet issues with primitive arrays > -- > > Key: SPARK-16070 > URL: https://issues.apache.org/jira/browse/SPARK-16070 > Project: Spark > Issue Type: Umbrella > Components: MLlib, SQL >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng > > I created this umbrella JIRA to track DataFrame/Parquet issues with primitive > arrays. This is mostly related to machine learning use cases, where feature > indices/values are stored as (usually large) primitive arrays. 
> Issues: > * SPARK-16043: Tungsten array data is not specialized for primitive types > * Not sufficient array size checks > ([NegativeArraySizeException|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20NegativeArraySizeException] > or silent errors) > * Performance of Parquet encodings on saving primitive arrays
[jira] [Updated] (SPARK-16070) DataFrame/Parquet issues with primitive arrays
[ https://issues.apache.org/jira/browse/SPARK-16070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16070: -- Description: I created this umbrella JIRA to track DataFrame/Parquet issues with primitive arrays. This is mostly related to machine learning use cases, where feature indices/values are stored as (usually large) primitive arrays. Issues: * SPARK-16043: Tungsten array data is not specialized for primitive types * Not sufficient array size checks ([[https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20NegativeArraySizeException NegativeArraySizeException]] or silent errors) ** There * Performance of Parquet encodings on saving primitive arrays was: I created this umbrella JIRA to track DataFrame/Parquet issues with primitive arrays. This is mostly related to machine learning use cases, where feature indices/values are stored as (usually large) primitive arrays. Issues: * SPARK-16043: Tungsten array data is not specialized for primitive types * Not sufficient array size checks ([NegativeArraySizeException](https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20NegativeArraySizeException) or silent errors) ** There * Performance of Parquet encodings on saving primitive arrays > DataFrame/Parquet issues with primitive arrays > -- > > Key: SPARK-16070 > URL: https://issues.apache.org/jira/browse/SPARK-16070 > Project: Spark > Issue Type: Umbrella > Components: MLlib, SQL >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng > > I created this umbrella JIRA to track DataFrame/Parquet issues with primitive > arrays. This is mostly related to machine learning use cases, where feature > indices/values are stored as (usually large) primitive arrays. 
> Issues: > * SPARK-16043: Tungsten array data is not specialized for primitive types > * Not sufficient array size checks > ([[https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20NegativeArraySizeException > NegativeArraySizeException]] or silent errors) > ** There > * Performance of Parquet encodings on saving primitive arrays
[jira] [Updated] (SPARK-16070) DataFrame/Parquet issues with primitive arrays
[ https://issues.apache.org/jira/browse/SPARK-16070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16070: -- Description: I created this umbrella JIRA to track DataFrame/Parquet issues with primitive arrays. This is mostly related to machine learning use cases, where feature indices/values are stored as (usually large) primitive arrays. Issues: * SPARK-16043: Tungsten array data is not specialized for primitive types * Not sufficient array size checks ([NegativeArraySizeException](https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20NegativeArraySizeException) or silent errors) ** There * Performance of Parquet encodings on saving primitive arrays was:I created this umbrella JIRA to track DataFrame/Parquet issues with primitive arrays. This is mostly related to machine learning use cases, where feature indices/values are stored as primitive arrays. > DataFrame/Parquet issues with primitive arrays > -- > > Key: SPARK-16070 > URL: https://issues.apache.org/jira/browse/SPARK-16070 > Project: Spark > Issue Type: Umbrella > Components: MLlib, SQL >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng > > I created this umbrella JIRA to track DataFrame/Parquet issues with primitive > arrays. This is mostly related to machine learning use cases, where feature > indices/values are stored as (usually large) primitive arrays. > Issues: > * SPARK-16043: Tungsten array data is not specialized for primitive types > * Not sufficient array size checks > ([NegativeArraySizeException](https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20text%20~%20NegativeArraySizeException) > or silent errors) > ** There > * Performance of Parquet encodings on saving primitive arrays
[jira] [Created] (SPARK-16070) DataFrame/Parquet issues with primitive arrays
Xiangrui Meng created SPARK-16070: - Summary: DataFrame/Parquet issues with primitive arrays Key: SPARK-16070 URL: https://issues.apache.org/jira/browse/SPARK-16070 Project: Spark Issue Type: Umbrella Components: MLlib, SQL Affects Versions: 2.0.0 Reporter: Xiangrui Meng I created this umbrella JIRA to track DataFrame/Parquet issues with primitive arrays. This is mostly related to machine learning use cases, where feature indices/values are stored as primitive arrays.
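For context on the first issue tracked by this umbrella (SPARK-16043, Tungsten array data not specialized for primitive types), the cost of boxed element storage can be sketched in plain Python. This is a general illustration of why specialization matters, not Spark's actual Tungsten memory layout:

```python
import sys
from array import array

n = 100_000
# Boxed storage: a list holds pointers, and each element is a full heap object.
boxed = [float(i) for i in range(n)]
# Primitive storage: packed 8-byte doubles in one contiguous buffer.
primitive = array('d', (float(i) for i in range(n)))

boxed_bytes = sys.getsizeof(boxed) + sum(sys.getsizeof(x) for x in boxed)
primitive_bytes = sys.getsizeof(primitive)
print(boxed_bytes / primitive_bytes)  # boxed storage is several times larger
```

On CPython the boxed representation typically needs several times the memory of the packed array, which is the kind of overhead that matters when feature vectors are large primitive arrays.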
[jira] [Updated] (SPARK-16035) The SparseVector parser fails checking for valid end parenthesis
[ https://issues.apache.org/jira/browse/SPARK-16035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16035: -- Assignee: Andrea Pasqua > The SparseVector parser fails checking for valid end parenthesis > > > Key: SPARK-16035 > URL: https://issues.apache.org/jira/browse/SPARK-16035 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.6.1, 2.0.0 >Reporter: Andrea Pasqua >Assignee: Andrea Pasqua >Priority: Minor > Fix For: 1.6.2, 2.0.0 > > > Running > SparseVector.parse(' (4, [0,1 ],[ 4.0,5.0] ') > will not raise an exception as expected, although it parses it as if there > was an end parenthesis. > This can be fixed by replacing > if start == -1: >raise ValueError("Tuple should end with ')'") > with > if end == -1: >raise ValueError("Tuple should end with ')'") > Please see posted PR
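The one-character nature of the fix described above is easiest to see in a stripped-down stand-in for the parser (a simplified sketch, not PySpark's actual `SparseVector.parse` implementation): the end-parenthesis guard re-tested the `start` index, so a missing `)` slipped through.

```python
def find_tuple_bounds_buggy(s):
    start = s.find('(')
    end = s.rfind(')')
    if start == -1:
        raise ValueError("Tuple should start with '('")
    if start == -1:  # BUG: re-tests `start`, so a missing ')' is not caught
        raise ValueError("Tuple should end with ')'")
    return start, end

def find_tuple_bounds_fixed(s):
    start = s.find('(')
    end = s.rfind(')')
    if start == -1:
        raise ValueError("Tuple should start with '('")
    if end == -1:    # the fix: actually check the end parenthesis
        raise ValueError("Tuple should end with ')'")
    return start, end

s = " (4, [0,1 ],[ 4.0,5.0] "   # no closing ')', as in the report
print(find_tuple_bounds_buggy(s))  # "succeeds" silently, with end == -1
```

With the fixed guard, the malformed input from the report raises `ValueError("Tuple should end with ')'")` as expected, while well-formed input still parses.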
[jira] [Resolved] (SPARK-16035) The SparseVector parser fails checking for valid end parenthesis
[ https://issues.apache.org/jira/browse/SPARK-16035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-16035. --- Resolution: Fixed Fix Version/s: 1.6.2 2.0.0 Issue resolved by pull request 13750 [https://github.com/apache/spark/pull/13750] > The SparseVector parser fails checking for valid end parenthesis > > > Key: SPARK-16035 > URL: https://issues.apache.org/jira/browse/SPARK-16035 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 1.6.1, 2.0.0 >Reporter: Andrea Pasqua >Priority: Minor > Fix For: 2.0.0, 1.6.2 > > > Running > SparseVector.parse(' (4, [0,1 ],[ 4.0,5.0] ') > will not raise an exception as expected, although it parses it as if there > was an end parenthesis. > This can be fixed by replacing > if start == -1: >raise ValueError("Tuple should end with ')'") > with > if end == -1: >raise ValueError("Tuple should end with ')'") > Please see posted PR
[jira] [Resolved] (SPARK-15129) Clarify conventions for calling Spark and MLlib from R
[ https://issues.apache.org/jira/browse/SPARK-15129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-15129. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13285 [https://github.com/apache/spark/pull/13285] > Clarify conventions for calling Spark and MLlib from R > -- > > Key: SPARK-15129 > URL: https://issues.apache.org/jira/browse/SPARK-15129 > Project: Spark > Issue Type: Documentation > Components: Documentation, ML, SparkR >Reporter: Joseph K. Bradley >Assignee: Gayathri Murali >Priority: Blocker > Fix For: 2.0.0 > > > Since some R API modifications happened in 2.0, we need to make the new > standards clear in the user guide. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15892) Incorrectly merged AFTAggregator with zero total count
[ https://issues.apache.org/jira/browse/SPARK-15892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15892: -- Fix Version/s: 2.0.0 > Incorrectly merged AFTAggregator with zero total count > -- > > Key: SPARK-15892 > URL: https://issues.apache.org/jira/browse/SPARK-15892 > Project: Spark > Issue Type: Bug > Components: Examples, ML, PySpark >Affects Versions: 1.6.1, 2.0.0 >Reporter: Joseph K. Bradley >Assignee: Hyukjin Kwon > Fix For: 1.6.2, 2.0.0 > > > Running the example (after the fix in > [https://github.com/apache/spark/pull/13393]) causes this failure: > {code} > Traceback (most recent call last): > > File > "/Users/josephkb/spark/examples/src/main/python/ml/aft_survival_regression.py", > line 49, in > model = aft.fit(training) > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/base.py", > line 64, in fit > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", > line 213, in _fit > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", > line 210, in _fit_java > File > "/Users/josephkb/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", > line 79, in deco > pyspark.sql.utils.IllegalArgumentException: u'requirement failed: The number > of instances should be greater than 0.0, but got 0.' > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15892) Incorrectly merged AFTAggregator with zero total count
[ https://issues.apache.org/jira/browse/SPARK-15892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-15892. --- Resolution: Fixed Fix Version/s: (was: 2.0.0) 1.6.2 Issue resolved by pull request 13725 [https://github.com/apache/spark/pull/13725] > Incorrectly merged AFTAggregator with zero total count > -- > > Key: SPARK-15892 > URL: https://issues.apache.org/jira/browse/SPARK-15892 > Project: Spark > Issue Type: Bug > Components: Examples, ML, PySpark >Affects Versions: 1.6.1, 2.0.0 >Reporter: Joseph K. Bradley >Assignee: Hyukjin Kwon > Fix For: 1.6.2 > > > Running the example (after the fix in > [https://github.com/apache/spark/pull/13393]) causes this failure: > {code} > Traceback (most recent call last): > > File > "/Users/josephkb/spark/examples/src/main/python/ml/aft_survival_regression.py", > line 49, in > model = aft.fit(training) > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/base.py", > line 64, in fit > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", > line 213, in _fit > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", > line 210, in _fit_java > File > "/Users/josephkb/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > File "/Users/josephkb/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", > line 79, in deco > pyspark.sql.utils.IllegalArgumentException: u'requirement failed: The number > of instances should be greater than 0.0, but got 0.' > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
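The failure above comes from merging an aggregator built on a partition that contained no instances. A common guard, sketched here in plain Python (this is an illustrative `RunningSum` class, not the Scala AFTAggregator), is to treat an empty partner as a no-op in `merge` so zero-count aggregators cannot corrupt or fail the combined statistics.

```python
# Sketch of the tree-aggregation merge guard: an aggregator created for an
# empty partition has count == 0 and must be skipped when merging.

class RunningSum:
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def add(self, x):
        self.count += 1
        self.total += x
        return self

    def merge(self, other):
        if other.count == 0:  # guard: ignore aggregators that saw no data
            return self
        self.count += other.count
        self.total += other.total
        return self

# Partitions with data merge normally; empty partitions are no-ops.
a = RunningSum().add(1.0).add(2.0)
b = RunningSum()          # stands in for an empty partition
a.merge(b)
print(a.count, a.total)   # 2 3.0
```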
[jira] [Resolved] (SPARK-15603) Replace SQLContext with SparkSession in ML/MLLib
[ https://issues.apache.org/jira/browse/SPARK-15603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-15603. --- Resolution: Fixed Fix Version/s: 2.0.0 > Replace SQLContext with SparkSession in ML/MLLib > > > Key: SPARK-15603 > URL: https://issues.apache.org/jira/browse/SPARK-15603 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.0.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun > Fix For: 2.0.0 > > > This issue replaces all deprecated `SQLContext` occurrences with > `SparkSession` in `ML/MLLib` module except the following two classes. These > two classes use `SQLContext` as their function arguments. > - ReadWrite.scala > - TreeModels.scala -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-16008) ML Logistic Regression aggregator serializes unnecessary data
[ https://issues.apache.org/jira/browse/SPARK-16008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-16008. --- Resolution: Fixed Fix Version/s: 2.0.0 Issue resolved by pull request 13729 [https://github.com/apache/spark/pull/13729] > ML Logistic Regression aggregator serializes unnecessary data > - > > Key: SPARK-16008 > URL: https://issues.apache.org/jira/browse/SPARK-16008 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Seth Hendrickson >Assignee: Seth Hendrickson > Fix For: 2.0.0 > > > LogisticRegressionAggregator class is used to collect gradient updates in ML > logistic regression algorithm. The class stores a reference to the > coefficients array of length equal to the number of features. It also stores > a reference to an array of standard deviations which is length numFeatures > also. When a task is completed it serializes the class which also serializes > a copy of the two arrays. These arrays don't need to be serialized (only the > gradient updates are being aggregated). This causes performance issues > when the number of features is large and can trigger excess garbage > collection when the executor doesn't have much excess memory. > This results in serializing 2*numFeatures excess data. When multiclass > logistic regression is implemented, the excess will be numFeatures + > numClasses * numFeatures. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
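The waste described above can be sketched in plain Python (a hypothetical analogue; the Scala fix keeps the read-only arrays out of the serialized task result rather than using `__getstate__`): only the aggregated gradient buffer needs to travel back with each task, not the `2*numFeatures` of read-only inputs every executor already has.

```python
# Sketch: exclude large read-only arrays from what gets pickled with each
# task result, shipping only the gradient buffer that is actually aggregated.
import pickle

class GradAggregator:
    def __init__(self, coefficients, feature_std):
        self.coefficients = coefficients          # numFeatures values, read-only
        self.feature_std = feature_std            # numFeatures values, read-only
        self.gradient = [0.0] * len(coefficients) # the only state being aggregated

    def __getstate__(self):
        # Serialize only the aggregated state, dropping 2*numFeatures of
        # read-only inputs from the payload.
        return {"gradient": self.gradient}

agg = GradAggregator([0.1] * 1000, [1.0] * 1000)
slim = pickle.dumps(agg)
full = pickle.dumps({"gradient": agg.gradient,
                     "coefficients": agg.coefficients,
                     "feature_std": agg.feature_std})
print(len(slim) < len(full))  # True
```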
[jira] [Updated] (SPARK-16000) Make model loading backward compatible with saved models using old vector columns
[ https://issues.apache.org/jira/browse/SPARK-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16000: -- Assignee: yuhao yang > Make model loading backward compatible with saved models using old vector > columns > - > > Key: SPARK-16000 > URL: https://issues.apache.org/jira/browse/SPARK-16000 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Xiangrui Meng >Assignee: yuhao yang > > To help users migrate from Spark 1.6 to 2.0, we should make model loading > backward compatible with models saved in 1.6. The main incompatibility is the > vector column type change. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16000) Make model loading backward compatible with saved models using old vector columns
[ https://issues.apache.org/jira/browse/SPARK-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15336446#comment-15336446 ] Xiangrui Meng commented on SPARK-16000: --- That's great! Please let me know if you want to split the task into smaller ones. This is a little time-sensitive because RC1 might come soon. > Make model loading backward compatible with saved models using old vector > columns > - > > Key: SPARK-16000 > URL: https://issues.apache.org/jira/browse/SPARK-16000 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Xiangrui Meng > > To help users migrate from Spark 1.6 to 2.0, we should make model loading > backward compatible with models saved in 1.6. The main incompatibility is the > vector column type change. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-15947) Make pipeline components backward compatible with old vector columns in Scala/Java
[ https://issues.apache.org/jira/browse/SPARK-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng closed SPARK-15947. - > Make pipeline components backward compatible with old vector columns in > Scala/Java > -- > > Key: SPARK-15947 > URL: https://issues.apache.org/jira/browse/SPARK-15947 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > After SPARK-15945, we should make ALL pipeline components accept old vector > columns as input and do the conversion automatically (probably with a warning > message), in order to smooth the migration to 2.0. > --Note that this includes loading old saved models.-- SPARK-16000 handles > backward compatibility in model loading. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15946) Wrap the conversion utils in Python
[ https://issues.apache.org/jira/browse/SPARK-15946?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-15946: - Assignee: Xiangrui Meng > Wrap the conversion utils in Python > --- > > Key: SPARK-15946 > URL: https://issues.apache.org/jira/browse/SPARK-15946 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > This is to wrap SPARK-15945 in Python. So Python users can use it to convert > DataFrames with vector columns. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15947) Make pipeline components backward compatible with old vector columns in Scala/Java
[ https://issues.apache.org/jira/browse/SPARK-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15947: -- Summary: Make pipeline components backward compatible with old vector columns in Scala/Java (was: Make pipeline components backward compatible with old vector columns) > Make pipeline components backward compatible with old vector columns in > Scala/Java > -- > > Key: SPARK-15947 > URL: https://issues.apache.org/jira/browse/SPARK-15947 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > After SPARK-15945, we should make ALL pipeline components accept old vector > columns as input and do the conversion automatically (probably with a warning > message), in order to smooth the migration to 2.0. > --Note that this includes loading old saved models.-- SPARK-16000 handles > backward compatibility in model loading. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-15948) Make pipeline components backward compatible with old vector columns in Python
[ https://issues.apache.org/jira/browse/SPARK-15948?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng closed SPARK-15948. - Resolution: Won't Fix > Make pipeline components backward compatible with old vector columns in Python > -- > > Key: SPARK-15948 > URL: https://issues.apache.org/jira/browse/SPARK-15948 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Xiangrui Meng > > Same as SPARK-15947 but for Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15947) Make pipeline components backward compatible with old vector columns in Scala/Java
[ https://issues.apache.org/jira/browse/SPARK-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng resolved SPARK-15947. --- Resolution: Won't Fix > Make pipeline components backward compatible with old vector columns in > Scala/Java > -- > > Key: SPARK-15947 > URL: https://issues.apache.org/jira/browse/SPARK-15947 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > After SPARK-15945, we should make ALL pipeline components accept old vector > columns as input and do the conversion automatically (probably with a warning > message), in order to smooth the migration to 2.0. > --Note that this includes loading old saved models.-- SPARK-16000 handles > backward compatibility in model loading. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15948) Make pipeline components backward compatible with old vector columns in Python
[ https://issues.apache.org/jira/browse/SPARK-15948?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15334732#comment-15334732 ] Xiangrui Meng commented on SPARK-15948: --- Marked this as "Won't Do". See SPARK-15947 for reasons. > Make pipeline components backward compatible with old vector columns in Python > -- > > Key: SPARK-15948 > URL: https://issues.apache.org/jira/browse/SPARK-15948 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Xiangrui Meng > > Same as SPARK-15947 but for Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15643) ML 2.0 QA: migration guide update
[ https://issues.apache.org/jira/browse/SPARK-15643?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15643: -- Assignee: Yanbo Liang > ML 2.0 QA: migration guide update > - > > Key: SPARK-15643 > URL: https://issues.apache.org/jira/browse/SPARK-15643 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML, MLlib >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Blocker > > Update spark.ml and spark.mllib migration guide from 1.6 to 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15643) ML 2.0 QA: migration guide update
[ https://issues.apache.org/jira/browse/SPARK-15643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15334729#comment-15334729 ] Xiangrui Meng commented on SPARK-15643: --- [~yanboliang] Please include a paragraph to help users convert vector columns. See https://issues.apache.org/jira/browse/SPARK-15947. > ML 2.0 QA: migration guide update > - > > Key: SPARK-15643 > URL: https://issues.apache.org/jira/browse/SPARK-15643 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML, MLlib >Reporter: Yanbo Liang >Priority: Blocker > > Update spark.ml and spark.mllib migration guide from 1.6 to 2.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-15947) Make pipeline components backward compatible with old vector columns
[ https://issues.apache.org/jira/browse/SPARK-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15334725#comment-15334725 ] Xiangrui Meng edited comment on SPARK-15947 at 6/16/16 9:30 PM: Had an offline discussion with [~josephkb]. There would be a lot of work to implement this feature and tests. A simpler choice is to ask users to manually convert the DataFrames at the beginning of the pipeline with tools implemented in SPARK-15945. Then we can update the migration guide (SPARK-15643) to include the error message and put this workaround there. So users can search on Google and find the solution. I'm closing this ticket. was (Author: mengxr): Had an offline discussion with [~josephkb]. There would be a lot of work to implement this feature and tests. A simpler choice is to ask users to manually convert the DataFrames at the beginning of the pipeline with tools implemented in SPARK-15945. Then we can update the migration guide to include the error message and put this workaround there. So users can search on Google and find the solution. I'm closing this ticket. > Make pipeline components backward compatible with old vector columns > > > Key: SPARK-15947 > URL: https://issues.apache.org/jira/browse/SPARK-15947 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > After SPARK-15945, we should make ALL pipeline components accept old vector > columns as input and do the conversion automatically (probably with a warning > message), in order to smooth the migration to 2.0. > --Note that this includes loading old saved models.-- SPARK-16000 handles > backward compatibility in model loading. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-15947) Make pipeline components backward compatible with old vector columns
[ https://issues.apache.org/jira/browse/SPARK-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15334725#comment-15334725 ] Xiangrui Meng commented on SPARK-15947: --- Had an offline discussion with [~josephkb]. There would be a lot of work to implement this feature and tests. A simpler choice is to ask users to manually convert the DataFrames at the beginning of the pipeline with tools implemented in SPARK-15945. Then we can update the migration guide to include the error message and put this workaround there. So users can search on Google and find the solution. I'm closing this ticket. > Make pipeline components backward compatible with old vector columns > > > Key: SPARK-15947 > URL: https://issues.apache.org/jira/browse/SPARK-15947 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > After SPARK-15945, we should make ALL pipeline components accept old vector > columns as input and do the conversion automatically (probably with a warning > message), in order to smooth the migration to 2.0. > --Note that this includes loading old saved models.-- SPARK-16000 handles > backward compatibility in model loading. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15947) Make pipeline components backward compatible with old vector columns
[ https://issues.apache.org/jira/browse/SPARK-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15947: -- Description: After SPARK-15945, we should make ALL pipeline components accept old vector columns as input and do the conversion automatically (probably with a warning message), in order to smooth the migration to 2.0. --Note that this includes loading old saved models.-- SPARK-16000 handles backward compatibility in model loading. was: After SPARK-15945, we should make ALL pipeline components accept old vector columns as input and do the conversion automatically (probably with a warning message), in order to smooth the migration to 2.0. --Note that this includes loading old saved models.-- SPARK-15948 handles backward compatibility in model loading. > Make pipeline components backward compatible with old vector columns > > > Key: SPARK-15947 > URL: https://issues.apache.org/jira/browse/SPARK-15947 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > After SPARK-15945, we should make ALL pipeline components accept old vector > columns as input and do the conversion automatically (probably with a warning > message), in order to smooth the migration to 2.0. > --Note that this includes loading old saved models.-- SPARK-16000 handles > backward compatibility in model loading. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16000) Make model loading backward compatible with saved models using old vector columns
[ https://issues.apache.org/jira/browse/SPARK-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16000: -- Description: To help users migrate from Spark 1.6 to 2.0, we should make model loading backward compatible with models saved in 1.6. The main incompatibility is the vector column type change. > Make model loading backward compatible with saved models using old vector > columns > - > > Key: SPARK-16000 > URL: https://issues.apache.org/jira/browse/SPARK-16000 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Xiangrui Meng > > To help users migrate from Spark 1.6 to 2.0, we should make model loading > backward compatible with models saved in 1.6. The main incompatibility is the > vector column type change. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16000) Make model loading backward compatible with saved models using old vector columns
[ https://issues.apache.org/jira/browse/SPARK-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16000: -- Summary: Make model loading backward compatible with saved models using old vector columns (was: Make model loading backward compatible with saved models using old vector columns in Scala/Java) > Make model loading backward compatible with saved models using old vector > columns > - > > Key: SPARK-16000 > URL: https://issues.apache.org/jira/browse/SPARK-16000 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Xiangrui Meng > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15947) Make pipeline components backward compatible with old vector columns
[ https://issues.apache.org/jira/browse/SPARK-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15947: -- Description: After SPARK-15945, we should make ALL pipeline components accept old vector columns as input and do the conversion automatically (probably with a warning message), in order to smooth the migration to 2.0. --Note that this includes loading old saved models.-- SPARK-15948 handles backward compatibility in model loading. was:After SPARK-15945, we should make ALL pipeline components accept old vector columns as input and do the conversion automatically (probably with a warning message), in order to smooth the migration to 2.0. Note that this includes loading old saved models. > Make pipeline components backward compatible with old vector columns > > > Key: SPARK-15947 > URL: https://issues.apache.org/jira/browse/SPARK-15947 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > After SPARK-15945, we should make ALL pipeline components accept old vector > columns as input and do the conversion automatically (probably with a warning > message), in order to smooth the migration to 2.0. > --Note that this includes loading old saved models.-- SPARK-15948 handles > backward compatibility in model loading. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15947) Make pipeline components backward compatible with old vector columns
[ https://issues.apache.org/jira/browse/SPARK-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-15947: -- Summary: Make pipeline components backward compatible with old vector columns (was: Make pipeline components backward compatible with old vector columns in Scala/Java) > Make pipeline components backward compatible with old vector columns > > > Key: SPARK-15947 > URL: https://issues.apache.org/jira/browse/SPARK-15947 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > After SPARK-15945, we should make ALL pipeline components accept old vector > columns as input and do the conversion automatically (probably with a warning > message), in order to smooth the migration to 2.0. Note that this includes > loading old saved models. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-16000) Make model loading backward compatible with saved models using old vector columns
Xiangrui Meng created SPARK-16000: - Summary: Make model loading backward compatible with saved models using old vector columns Key: SPARK-16000 URL: https://issues.apache.org/jira/browse/SPARK-16000 Project: Spark Issue Type: Sub-task Components: ML, MLlib Reporter: Xiangrui Meng -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16000) Make model loading backward compatible with saved models using old vector columns in Scala/Java
[ https://issues.apache.org/jira/browse/SPARK-16000?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng updated SPARK-16000: -- Summary: Make model loading backward compatible with saved models using old vector columns in Scala/Java (was: Make model loading backward compatible with saved models using old vector columns) > Make model loading backward compatible with saved models using old vector > columns in Scala/Java > --- > > Key: SPARK-16000 > URL: https://issues.apache.org/jira/browse/SPARK-16000 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Reporter: Xiangrui Meng > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15947) Make pipeline components backward compatible with old vector columns in Scala/Java
[ https://issues.apache.org/jira/browse/SPARK-15947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiangrui Meng reassigned SPARK-15947: - Assignee: Xiangrui Meng > Make pipeline components backward compatible with old vector columns in > Scala/Java > -- > > Key: SPARK-15947 > URL: https://issues.apache.org/jira/browse/SPARK-15947 > Project: Spark > Issue Type: Sub-task > Components: ML, MLlib >Affects Versions: 2.0.0 >Reporter: Xiangrui Meng >Assignee: Xiangrui Meng > > After SPARK-15945, we should make ALL pipeline components accept old vector > columns as input and do the conversion automatically (probably with a warning > message), in order to smooth the migration to 2.0. Note that this includes > loading old saved models. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org