[jira] [Created] (SPARK-29692) SparkContext.defaultParallelism should reflect resource limits when resource limits are set
Bago Amirbekian created SPARK-29692: --- Summary: SparkContext.defaultParallelism should reflect resource limits when resource limits are set Key: SPARK-29692 URL: https://issues.apache.org/jira/browse/SPARK-29692 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: Bago Amirbekian With the new GPU/FPGA support in Spark, defaultParallelism may not be computed correctly. Specifically, defaultParallelism may be much higher than the total number of possible concurrent tasks when, for example, workers have many more cores than GPUs. Steps to reproduce: start a cluster with spark.executor.resource.gpu.amount < cores per executor, set spark.task.resource.gpu.amount = 1, and keep cores per task at 1. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
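The reported mismatch can be illustrated with a small stand-alone sketch (plain Python, not Spark's actual scheduler code; the function name and argument layout are invented for this example). defaultParallelism today tracks total cores, while the number of tasks that can actually run concurrently is capped by the scarcest resource:

```python
# Hypothetical model of per-executor task slots: every declared resource
# (cores, GPUs, FPGAs, ...) caps concurrency, and the smallest cap wins.
def max_concurrent_tasks(executor_cores, cores_per_task,
                         executor_resources, task_resources):
    """executor_resources / task_resources are dicts like {"gpu": 2}."""
    slots = executor_cores // cores_per_task
    for name, per_task in task_resources.items():
        if per_task > 0:
            slots = min(slots, executor_resources.get(name, 0) // per_task)
    return slots

# 16 cores but only 2 GPUs per executor; 1 core and 1 GPU per task.
# Core-based parallelism says 16 slots, but only 2 tasks can run at once.
cores_only = max_concurrent_tasks(16, 1, {}, {})
with_gpus = max_concurrent_tasks(16, 1, {"gpu": 2}, {"gpu": 1})
assert (cores_only, with_gpus) == (16, 2)
```

Under the repro settings above, a defaultParallelism derived only from cores overstates the achievable concurrency by a factor of cores per executor divided by GPUs per executor.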
[jira] [Created] (SPARK-27446) RBackend always uses default values for spark confs
Bago Amirbekian created SPARK-27446: --- Summary: RBackend always uses default values for spark confs Key: SPARK-27446 URL: https://issues.apache.org/jira/browse/SPARK-27446 Project: Spark Issue Type: Bug Components: SparkR Affects Versions: 2.4.1 Reporter: Bago Amirbekian The RBackend and RBackendHandler create new conf objects that don't pick up conf values from the existing SparkSession and therefore always use the default conf values instead of values specified by the user.
[jira] [Commented] (SPARK-25921) Python worker reuse causes Barrier tasks to run without BarrierTaskContext
[ https://issues.apache.org/jira/browse/SPARK-25921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672293#comment-16672293 ] Bago Amirbekian commented on SPARK-25921: - [~mengxr] [~jiangxb1987] Could you have a look? > Python worker reuse causes Barrier tasks to run without BarrierTaskContext > -- > > Key: SPARK-25921 > URL: https://issues.apache.org/jira/browse/SPARK-25921 > Project: Spark > Issue Type: Bug > Components: PySpark, Spark Core >Affects Versions: 2.4.0 >Reporter: Bago Amirbekian >Priority: Major > > Running a barrier job after a normal spark job causes the barrier job to run > without a BarrierTaskContext. Here is some code to reproduce. > > {code:java} > def task(*args): > from pyspark import BarrierTaskContext > context = BarrierTaskContext.get() > context.barrier() > print("in barrier phase") > context.barrier() > return [] > a = sc.parallelize(list(range(4))).map(lambda x: x ** 2).collect() > assert a == [0, 1, 4, 9] > b = sc.parallelize(list(range(4)), 4).barrier().mapPartitions(task).collect() > {code} > > Here is some of the trace > {code:java} > Py4JJavaError: An error occurred while calling > z:org.apache.spark.api.python.PythonRDD.collectAndServe. > : org.apache.spark.SparkException: Job aborted due to stage failure: Could > not recover from a failed barrier ResultStage. Most recent failure reason: > Stage failed because barrier task ResultTask(14, 0) finished unsuccessfully. 
> org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File "/databricks/spark/python/pyspark/worker.py", line 372, in main > process() > File "/databricks/spark/python/pyspark/worker.py", line 367, in process > serializer.dump_stream(func(split_index, iterator), outfile) > File "/databricks/spark/python/pyspark/rdd.py", line 2482, in func > return f(iterator) > File "", line 4, in task > AttributeError: 'TaskContext' object has no attribute 'barrier' > {code}
[jira] [Updated] (SPARK-25921) Python worker reuse causes Barrier tasks to run without BarrierTaskContext
[ https://issues.apache.org/jira/browse/SPARK-25921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bago Amirbekian updated SPARK-25921: Description: Running a barrier job after a normal spark job causes the barrier job to run without a BarrierTaskContext. Here is some code to reproduce. {code:java} def task(*args): from pyspark import BarrierTaskContext context = BarrierTaskContext.get() context.barrier() print("in barrier phase") context.barrier() return [] a = sc.parallelize(list(range(4))).map(lambda x: x ** 2).collect() assert a == [0, 1, 4, 9] b = sc.parallelize(list(range(4)), 4).barrier().mapPartitions(task).collect() {code} Here is some of the trace {code:java} Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : org.apache.spark.SparkException: Job aborted due to stage failure: Could not recover from a failed barrier ResultStage. Most recent failure reason: Stage failed because barrier task ResultTask(14, 0) finished unsuccessfully. org.apache.spark.api.python.PythonException: Traceback (most recent call last): File "/databricks/spark/python/pyspark/worker.py", line 372, in main process() File "/databricks/spark/python/pyspark/worker.py", line 367, in process serializer.dump_stream(func(split_index, iterator), outfile) File "/databricks/spark/python/pyspark/rdd.py", line 2482, in func return f(iterator) File "", line 4, in task AttributeError: 'TaskContext' object has no attribute 'barrier' {code} was: Running a barrier job after a normal spark job causes the barrier job to run without a BarrierTaskContext. Here is some code to reproduce. 
{code:java} def task(*args): from pyspark import BarrierTaskContext context = BarrierTaskContext.get() context.barrier() print("in barrier phase") context.barrier() return [] a = sc.parallelize(list(range(4))).map(lambda x: x ** 2).collect() assert a == [0, 1, 4, 9] b = sc.parallelize(list(range(4)), 4).barrier().mapPartitions(task).collect() {code}
[jira] [Updated] (SPARK-25921) Python worker reuse causes Barrier tasks to run without BarrierTaskContext
[ https://issues.apache.org/jira/browse/SPARK-25921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bago Amirbekian updated SPARK-25921: Description: Running a barrier job after a normal spark job causes the barrier job to run without a BarrierTaskContext. Here is some code to reproduce. {code:java} def task(*args): from pyspark import BarrierTaskContext context = BarrierTaskContext.get() context.barrier() print("in barrier phase") context.barrier() return [] a = sc.parallelize(list(range(4))).map(lambda x: x ** 2).collect() assert a == [0, 1, 4, 9] b = sc.parallelize(list(range(4)), 4).barrier().mapPartitions(task).collect() {code} was: Running a barrier job after a normal spark job causes the barrier job to run without a BarrierTaskContext. Here is some code to reproduce. {code:java} def task(*args): from pyspark import BarrierTaskContext context = BarrierTaskContext.get() context.barrier() print("in barrier phase") context.barrier() return [] a = sc.parallelize(list(range(4))).map(lambda x: x ** 2).collect() assert a == [0, 1, 4, 9] b = sc.parallelize(list(range(4)), 4).barrier().mapPartitions(task).collect() {code}
[jira] [Created] (SPARK-25921) Python worker reuse causes Barrier tasks to run without BarrierTaskContext
Bago Amirbekian created SPARK-25921: --- Summary: Python worker reuse causes Barrier tasks to run without BarrierTaskContext Key: SPARK-25921 URL: https://issues.apache.org/jira/browse/SPARK-25921 Project: Spark Issue Type: Bug Components: PySpark, Spark Core Affects Versions: 2.4.0 Reporter: Bago Amirbekian Running a barrier job after a normal Spark job causes the barrier job to run without a BarrierTaskContext. Here is some code to reproduce. {code:java} def task(*args): from pyspark import BarrierTaskContext context = BarrierTaskContext.get() context.barrier() print("in barrier phase") context.barrier() return [] a = sc.parallelize(list(range(4))).map(lambda x: x ** 2).collect() assert a == [0, 1, 4, 9] b = sc.parallelize(list(range(4)), 4).barrier().mapPartitions(task).collect() {code}
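The failure mode can be sketched outside Spark as a cached singleton (a simplified stand-in, not PySpark's actual worker code): a reused worker already holds a plain TaskContext from the previous job, so the barrier task gets that stale object back.

```python
# Simplified stand-ins for pyspark's TaskContext / BarrierTaskContext.
class TaskContext:
    _active = None  # cached per worker process; survives worker reuse

    @classmethod
    def get(cls):
        if TaskContext._active is None:
            TaskContext._active = cls()
        return TaskContext._active

class BarrierTaskContext(TaskContext):
    def barrier(self):
        pass  # the real method synchronizes all tasks in the stage

# 1) A normal job runs first and caches a plain TaskContext.
TaskContext.get()

# 2) The worker is reused for a barrier task; the stale cache is returned.
ctx = BarrierTaskContext.get()
assert type(ctx) is TaskContext        # not a BarrierTaskContext!
assert not hasattr(ctx, "barrier")     # -> AttributeError in the real job
```

This reproduces the `'TaskContext' object has no attribute 'barrier'` shape of the reported trace: the cached object is the wrong class, so `barrier()` is missing.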
[jira] [Created] (SPARK-25268) runParallelPersonalizedPageRank throws serialization Exception
Bago Amirbekian created SPARK-25268: --- Summary: runParallelPersonalizedPageRank throws serialization Exception Key: SPARK-25268 URL: https://issues.apache.org/jira/browse/SPARK-25268 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 2.4.0 Reporter: Bago Amirbekian A recent change to PageRank introduced a bug in the ParallelPersonalizedPageRank implementation. The change prevents serialization of a Map which needs to be broadcast to all workers. The issue is on this line: [https://github.com/apache/spark/blob/6c5cb85856235efd464b109558896f81ae2c4c75/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala#L201] Because GraphX unit tests run in local mode, the serialization issue is not caught. {code:java} [info] - Star example parallel personalized PageRank *** FAILED *** (2 seconds, 160 milliseconds) [info] java.io.NotSerializableException: scala.collection.immutable.MapLike$$anon$2 [info] Serialization stack: [info] - object not serializable (class: scala.collection.immutable.MapLike$$anon$2, value: Map(1 -> SparseVector(3)((0,1.0)), 2 -> SparseVector(3)((1,1.0)), 3 -> SparseVector(3)((2,1.0 [info] at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40) [info] at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46) [info] at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:291) [info] at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:291) [info] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1348) [info] at org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:292) [info] at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:127) [info] at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:88) [info] at 
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34) [info] at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62) [info] at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1489) [info] at org.apache.spark.graphx.lib.PageRank$.runParallelPersonalizedPageRank(PageRank.scala:205) [info] at org.apache.spark.graphx.lib.GraphXHelpers$.runParallelPersonalizedPageRank(GraphXHelpers.scala:31) [info] at org.graphframes.lib.ParallelPersonalizedPageRank$.run(ParallelPersonalizedPageRank.scala:115) [info] at org.graphframes.lib.ParallelPersonalizedPageRank.run(ParallelPersonalizedPageRank.scala:84) [info] at org.graphframes.lib.ParallelPersonalizedPageRankSuite$$anonfun$2.apply$mcV$sp(ParallelPersonalizedPageRankSuite.scala:62) [info] at org.graphframes.lib.ParallelPersonalizedPageRankSuite$$anonfun$2.apply(ParallelPersonalizedPageRankSuite.scala:51) [info] at org.graphframes.lib.ParallelPersonalizedPageRankSuite$$anonfun$2.apply(ParallelPersonalizedPageRankSuite.scala:51) [info] at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22) [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166) [info] at org.graphframes.SparkFunSuite.withFixture(SparkFunSuite.scala:40) [info] at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163) [info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) [info] at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175) [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) [info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175) [info] at 
org.scalatest.FunSuite.runTest(FunSuite.scala:1555) [info] at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) [info] at org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208) [info] at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413) [info] at org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401) [info] at scala.collection.immutable.List.foreach(List.scala:383) [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) [info] at org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396) [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483) [info] at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208) [info] at org.scalatest.FunSuite.runTests(FunSuite.scala:1555) [info] at org.scalatest.Suite$class.run(Suite.sca
[jira] [Updated] (SPARK-25149) Personalized Page Rank raises an error if vertexIDs are > MaxInt
[ https://issues.apache.org/jira/browse/SPARK-25149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bago Amirbekian updated SPARK-25149: Summary: Personalized Page Rank raises an error if vertexIDs are > MaxInt (was: ParallelPersonalizedPageRank raises an error if vertexIDs are > MaxInt) > Personalized Page Rank raises an error if vertexIDs are > MaxInt > > > Key: SPARK-25149 > URL: https://issues.apache.org/jira/browse/SPARK-25149 > Project: Spark > Issue Type: Bug > Components: GraphX >Affects Versions: 2.3.1 >Reporter: Bago Amirbekian >Priority: Major > Fix For: 2.4.0 > > > Looking at the implementation I think we don't actually need this check. The > sparse vectors used and returned by the method are indexed by the _position_ > of the vertexId in `sources`, not by the vertex ID itself. We should remove this check > and add a test to confirm the implementation works with large vertex IDs.
[jira] [Created] (SPARK-25149) ParallelPersonalizedPageRank raises an error if vertexIDs are > MaxInt
Bago Amirbekian created SPARK-25149: --- Summary: ParallelPersonalizedPageRank raises an error if vertexIDs are > MaxInt Key: SPARK-25149 URL: https://issues.apache.org/jira/browse/SPARK-25149 Project: Spark Issue Type: Bug Components: GraphX Affects Versions: 2.3.1 Reporter: Bago Amirbekian Fix For: 2.4.0 Looking at the implementation I think we don't actually need this check. The sparse vectors used and returned by the method are indexed by the _position_ of the vertexId in `sources`, not by the vertex ID itself. We should remove this check and add a test to confirm the implementation works with large vertex IDs.
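The position-based indexing claim can be checked with a small stand-alone sketch (plain Python, not the GraphX implementation): the sparse-vector index for each source is its position in `sources`, so it stays small even when the vertex IDs themselves exceed MaxInt.

```python
# Vertex IDs in GraphX are Longs; sparse-vector indices are Ints.
MAX_INT = 2**31 - 1

# Source vertex IDs that are all larger than MaxInt.
sources = [5_000_000_000, 7_000_000_000, 9_000_000_000]

# The result vectors are indexed by each source's *position*, not its ID.
index_of = {vid: pos for pos, vid in enumerate(sources)}

assert all(vid > MAX_INT for vid in sources)       # IDs overflow an Int...
assert max(index_of.values()) == len(sources) - 1  # ...but indices don't
assert index_of[7_000_000_000] == 1
```

Since only `len(sources)` positions are ever used, no index can exceed MaxInt unless the number of sources itself does, which is why the check looks unnecessary.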
[jira] [Created] (SPARK-24852) Have spark.ml training use updated `Instrumentation` APIs.
Bago Amirbekian created SPARK-24852: --- Summary: Have spark.ml training use updated `Instrumentation` APIs. Key: SPARK-24852 URL: https://issues.apache.org/jira/browse/SPARK-24852 Project: Spark Issue Type: Story Components: ML Affects Versions: 2.4.0 Reporter: Bago Amirbekian Port spark.ml code to use the new methods on the `Instrumentation` class and remove the old methods & constructor.
[jira] [Created] (SPARK-24747) Make spark.ml.util.Instrumentation class more flexible
Bago Amirbekian created SPARK-24747: --- Summary: Make spark.ml.util.Instrumentation class more flexible Key: SPARK-24747 URL: https://issues.apache.org/jira/browse/SPARK-24747 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.3.1 Reporter: Bago Amirbekian The Instrumentation class (which is an internal private class) is somewhat limited by its current APIs. The class requires an estimator and dataset be passed to the constructor, which limits how it can be used. Furthermore, the current APIs make it hard to intercept failures and record anything related to those failures. The following changes could make the Instrumentation class easier to work with. All these changes are to private APIs and should not be visible to users.
{code}
// New no-argument constructor:
Instrumentation()

// New API to log the previous constructor arguments.
logTrainingContext(estimator: Estimator[_], dataset: Dataset[_])

logFailure(e: Throwable): Unit

// Log success with no arguments
logSuccess(): Unit

// Log result model explicitly instead of passing to logSuccess
logModel(model: Model[_]): Unit

// On companion object
Instrumentation.instrumented[T](body: (Instrumentation => T)): T

// The above API will allow us to write instrumented methods more clearly
// and handle logging success and failure automatically:
def someMethod(...): T = instrumented { instr =>
  instr.logNamedValue(name, value)
  // more code here
  instr.logModel(model)
}
{code}
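A rough Python analogue of the proposed `instrumented` helper (names follow the Scala sketch above; this is illustrative, not Spark code) shows how success and failure logging can be handled automatically:

```python
import contextlib

class Instrumentation:
    """Toy stand-in for spark.ml's internal Instrumentation class."""
    def __init__(self):
        self.events = []
    def logNamedValue(self, name, value):
        self.events.append(("value", name, value))
    def logFailure(self, e):
        self.events.append(("failure", type(e).__name__))
    def logSuccess(self):
        self.events.append(("success",))

@contextlib.contextmanager
def instrumented():
    """Run the body with an Instrumentation, logging success/failure."""
    instr = Instrumentation()
    try:
        yield instr
        instr.logSuccess()       # success logged automatically
    except Exception as e:
        instr.logFailure(e)      # failures intercepted automatically
        raise

# Success is recorded without any explicit logSuccess call in the body.
with instrumented() as instr:
    instr.logNamedValue("numExamples", 100)
assert instr.events == [("value", "numExamples", 100), ("success",)]
```

The point of the design is the same in both languages: callers write only the body, and the wrapper guarantees that exactly one of success or failure is logged.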
[jira] [Created] (SPARK-23686) Make better usage of org.apache.spark.ml.util.Instrumentation
Bago Amirbekian created SPARK-23686: --- Summary: Make better usage of org.apache.spark.ml.util.Instrumentation Key: SPARK-23686 URL: https://issues.apache.org/jira/browse/SPARK-23686 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.3.0 Reporter: Bago Amirbekian This Jira is a bit high level and might require subtasks or other jiras for more specific tasks. I've noticed that we don't make the best usage of the instrumentation class. Specifically, sometimes we bypass the instrumentation class and use the debugger instead. For example: [https://github.com/apache/spark/blob/9b9827759af2ca3eea146a6032f9165f640ce152/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L143] Also, there are some things that might be useful to log in the instrumentation class that we currently don't, for example the number of training examples and the mean/variance of the label (regression). I know computing these things can be expensive in some cases, but especially when this data is already available we can log it for free. For example, the Logistic Regression Summarizer computes some useful data, including numRows, that we don't log.
[jira] [Created] (SPARK-23562) RFormula handleInvalid should handle invalid values in non-string columns.
Bago Amirbekian created SPARK-23562: --- Summary: RFormula handleInvalid should handle invalid values in non-string columns. Key: SPARK-23562 URL: https://issues.apache.org/jira/browse/SPARK-23562 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.3.0 Reporter: Bago Amirbekian Currently, when handleInvalid is set to 'keep' or 'skip', this only applies to String fields. Numeric fields that are null will either cause the transformer to fail or might be null in the resulting label column. I'm not sure what the semantics of 'keep' might be for numeric columns with null values, but we should be able to at least support 'skip' for these types.
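What 'skip' could mean for numeric columns can be sketched in plain Python (illustrative only, not RFormula's implementation): rows with a null in any numeric input or label column are dropped rather than failing the transform.

```python
# Toy rows standing in for a DataFrame with numeric feature/label columns.
rows = [
    {"x": 1.0, "label": 0.0},
    {"x": None, "label": 1.0},   # null numeric feature
    {"x": 3.0, "label": None},   # null numeric label
]

def skip_invalid(rows, cols):
    """Drop any row with a null in one of `cols` ('skip' semantics)."""
    return [r for r in rows if all(r[c] is not None for c in cols)]

assert skip_invalid(rows, ["x", "label"]) == [{"x": 1.0, "label": 0.0}]
```

'keep' is harder to define here because, unlike an unseen string category, a null number has no natural extra index to map to, which is why the issue only asks for 'skip'.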
[jira] [Comment Edited] (SPARK-23333) SparkML VectorAssembler.transform slow when needing to invoke .first() on sorted DataFrame
[ https://issues.apache.org/jira/browse/SPARK-23333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379296#comment-16379296 ] Bago Amirbekian edited comment on SPARK-23333 at 2/27/18 9:24 PM: -- [~MBALearnsToCode] you can use a `VectorSizeHint` transformer to include `numAttributes` in the dataframe column metadata and avoid the call to `first`. https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorSizeHint.scala#L42 was (Author: bago.amirbekian): [~MBALearnsToCode] you can use a `VectorSizeHint` transformer to include `numAttributes` in the dataframe column metadata and avoid the call to `first`. https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorSizeHint.scala > SparkML VectorAssembler.transform slow when needing to invoke .first() on > sorted DataFrame > -- > > Key: SPARK-23333 > URL: https://issues.apache.org/jira/browse/SPARK-23333 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, SQL >Affects Versions: 2.2.1 >Reporter: V Luong >Priority: Minor > > Under certain circumstances, newDF = vectorAssembler.transform(oldDF) invokes > oldDF.first() in order to establish some metadata/attributes: > [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L88.] > When oldDF is sorted, the above triggering of oldDF.first() can be very slow. > For the purpose of establishing metadata, taking an arbitrary row from oldDF > will be just as good as taking oldDF.first(). Is there hence a way we can > speed up a great deal by somehow grabbing a random row, instead of relying on > oldDF.first()?
[jira] [Commented] (SPARK-23333) SparkML VectorAssembler.transform slow when needing to invoke .first() on sorted DataFrame
[ https://issues.apache.org/jira/browse/SPARK-23333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379296#comment-16379296 ] Bago Amirbekian commented on SPARK-23333: - [~MBALearnsToCode] you can use a `VectorSizeHint` transformer to include `numAttributes` in the dataframe column metadata and avoid the call to `first`. https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorSizeHint.scala > SparkML VectorAssembler.transform slow when needing to invoke .first() on > sorted DataFrame > -- > > Key: SPARK-23333 > URL: https://issues.apache.org/jira/browse/SPARK-23333 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib, SQL >Affects Versions: 2.2.1 >Reporter: V Luong >Priority: Minor > > Under certain circumstances, newDF = vectorAssembler.transform(oldDF) invokes > oldDF.first() in order to establish some metadata/attributes: > [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L88.] > When oldDF is sorted, the above triggering of oldDF.first() can be very slow. > For the purpose of establishing metadata, taking an arbitrary row from oldDF > will be just as good as taking oldDF.first(). Is there hence a way we can > speed up a great deal by somehow grabbing a random row, instead of relying on > oldDF.first()?
[jira] [Commented] (SPARK-19947) RFormulaModel always throws Exception on transforming data with NULL or Unseen labels
[ https://issues.apache.org/jira/browse/SPARK-19947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379282#comment-16379282 ] Bago Amirbekian commented on SPARK-19947: - I think this was resolved by [https://github.com/apache/spark/pull/18496] and [https://github.com/apache/spark/pull/18613]. > RFormulaModel always throws Exception on transforming data with NULL or > Unseen labels > - > > Key: SPARK-19947 > URL: https://issues.apache.org/jira/browse/SPARK-19947 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.0 >Reporter: Andrey Yatsuk >Priority: Major > > I have a trained ML model and a big data table in parquet. I want to add a new column > to this table with predicted values. I can't lose any data, but it can have > null values in it. > The RFormulaModel.fit() method creates a new StringIndexer with the default > (handleInvalid="error") parameter. Also, VectorAssembler throws an Exception on NULL > values, so I must call df.na.drop() to transform this DataFrame, > and I don't want to do this. > We need to add to RFormula a new parameter like handleInvalid in StringIndexer, > or add a transform(Seq): Vector method which the user can use as a UDF method > in df.withColumn("predicted", functions.callUDF(rFormulaModel::transform, > Seq))
[jira] [Comment Edited] (SPARK-23471) RandomForestClassificationModel save() - incorrect metadata
[ https://issues.apache.org/jira/browse/SPARK-23471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379202#comment-16379202 ] Bago Amirbekian edited comment on SPARK-23471 at 2/27/18 8:07 PM: -- [~Keepun], `train` is a protected API, it's called by `Predictor.fit` which also copies the values of Params to the newly created Model instance, [here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118]. Do you get this same issue if you use `RandomForestClassifier.fit`? was (Author: bago.amirbekian): [~Keepun] {noformat} train {noformat} is a protected API, it's called by {Predictor.fit} which also copies the values of Params to the newly created Model instance, [here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118]. Do you get this same issue if you use {RandomForestClassifier.fit}? > RandomForestClassificationModel save() - incorrect metadata > --- > > Key: SPARK-23471 > URL: https://issues.apache.org/jira/browse/SPARK-23471 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.1 >Reporter: Keepun >Priority: Major > > RandomForestClassificationMode.load() does not work after save() > {code:java} > RandomForestClassifier rf = new RandomForestClassifier() > .setFeaturesCol("features") > .setLabelCol("result") > .setNumTrees(100) > .setMaxDepth(30) > .setMinInstancesPerNode(1) > //.setCacheNodeIds(true) > .setMaxMemoryInMB(500) > .setSeed(System.currentTimeMillis() + System.nanoTime()); > RandomForestClassificationModel rfmodel = rf.train(data); >try { > rfmodel.save(args[2] + "." 
+ System.currentTimeMillis()); >} catch (IOException e) { > LOG.error(e.getMessage(), e); > e.printStackTrace(); >} > {code} > File metadata\part-0: > {code:java} > {"class":"org.apache.spark.ml.classification.RandomForestClassificationModel", > "timestamp":1519136783983,"sparkVersion":"2.2.1","uid":"rfc_7c7e84ce7488", > "paramMap":{"featureSubsetStrategy":"auto","cacheNodeIds":false,"impurity":"gini", > "checkpointInterval":10, > "numTrees":20,"maxDepth":5, > "probabilityCol":"probability","labelCol":"label","featuresCol":"features", > "maxMemoryInMB":256,"minInstancesPerNode":1,"subsamplingRate":1.0, > "rawPredictionCol":"rawPrediction","predictionCol":"prediction","maxBins":32, > "minInfoGain":0.0,"seed":-491520797},"numFeatures":1354,"numClasses":2, > "numTrees":20} > {code} > should be: > {code:java} > "numTrees":100,"maxDepth":30,{code} >
[jira] [Comment Edited] (SPARK-23471) RandomForestClassificationModel save() - incorrect metadata
[ https://issues.apache.org/jira/browse/SPARK-23471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379202#comment-16379202 ] Bago Amirbekian edited comment on SPARK-23471 at 2/27/18 8:06 PM: -- [~Keepun] {code:java} train {code} is a protected API, it's called by {Predictor.fit} which also copies the values of Params to the newly created Model instance, [here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118]. Do you get this same issue if you use {RandomForestClassifier.fit}? was (Author: bago.amirbekian): [~Keepun] {train} is a protected API, it's called by {Predictor.fit} which also copies the values of Params to the newly created Model instance, [here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118]. Do you get this same issue if you use {RandomForestClassifier.fit}? > RandomForestClassificationModel save() - incorrect metadata > --- > > Key: SPARK-23471 > URL: https://issues.apache.org/jira/browse/SPARK-23471 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.1 >Reporter: Keepun >Priority: Major > > RandomForestClassificationMode.load() does not work after save() > {code:java} > RandomForestClassifier rf = new RandomForestClassifier() > .setFeaturesCol("features") > .setLabelCol("result") > .setNumTrees(100) > .setMaxDepth(30) > .setMinInstancesPerNode(1) > //.setCacheNodeIds(true) > .setMaxMemoryInMB(500) > .setSeed(System.currentTimeMillis() + System.nanoTime()); > RandomForestClassificationModel rfmodel = rf.train(data); >try { > rfmodel.save(args[2] + "." 
+ System.currentTimeMillis()); >} catch (IOException e) { > LOG.error(e.getMessage(), e); > e.printStackTrace(); >} > {code} > File metadata\part-0: > {code:java} > {"class":"org.apache.spark.ml.classification.RandomForestClassificationModel", > "timestamp":1519136783983,"sparkVersion":"2.2.1","uid":"rfc_7c7e84ce7488", > "paramMap":{"featureSubsetStrategy":"auto","cacheNodeIds":false,"impurity":"gini", > "checkpointInterval":10, > "numTrees":20,"maxDepth":5, > "probabilityCol":"probability","labelCol":"label","featuresCol":"features", > "maxMemoryInMB":256,"minInstancesPerNode":1,"subsamplingRate":1.0, > "rawPredictionCol":"rawPrediction","predictionCol":"prediction","maxBins":32, > "minInfoGain":0.0,"seed":-491520797},"numFeatures":1354,"numClasses":2, > "numTrees":20} > {code} > should be: > {code:java} > "numTrees":100,"maxDepth":30,{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23471) RandomForestClassificationModel save() - incorrect metadata
[ https://issues.apache.org/jira/browse/SPARK-23471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379202#comment-16379202 ] Bago Amirbekian edited comment on SPARK-23471 at 2/27/18 8:06 PM: -- [~Keepun] {noformat} train {noformat} is a protected API, it's called by {Predictor.fit} which also copies the values of Params to the newly created Model instance, [here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118]. Do you get this same issue if you use {RandomForestClassifier.fit}? was (Author: bago.amirbekian): [~Keepun] {code:java} train {code} is a protected API, it's called by {Predictor.fit} which also copies the values of Params to the newly created Model instance, [here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118]. Do you get this same issue if you use {RandomForestClassifier.fit}? > RandomForestClassificationModel save() - incorrect metadata > --- > > Key: SPARK-23471 > URL: https://issues.apache.org/jira/browse/SPARK-23471 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.1 >Reporter: Keepun >Priority: Major > > RandomForestClassificationMode.load() does not work after save() > {code:java} > RandomForestClassifier rf = new RandomForestClassifier() > .setFeaturesCol("features") > .setLabelCol("result") > .setNumTrees(100) > .setMaxDepth(30) > .setMinInstancesPerNode(1) > //.setCacheNodeIds(true) > .setMaxMemoryInMB(500) > .setSeed(System.currentTimeMillis() + System.nanoTime()); > RandomForestClassificationModel rfmodel = rf.train(data); >try { > rfmodel.save(args[2] + "." 
+ System.currentTimeMillis()); >} catch (IOException e) { > LOG.error(e.getMessage(), e); > e.printStackTrace(); >} > {code} > File metadata\part-0: > {code:java} > {"class":"org.apache.spark.ml.classification.RandomForestClassificationModel", > "timestamp":1519136783983,"sparkVersion":"2.2.1","uid":"rfc_7c7e84ce7488", > "paramMap":{"featureSubsetStrategy":"auto","cacheNodeIds":false,"impurity":"gini", > "checkpointInterval":10, > "numTrees":20,"maxDepth":5, > "probabilityCol":"probability","labelCol":"label","featuresCol":"features", > "maxMemoryInMB":256,"minInstancesPerNode":1,"subsamplingRate":1.0, > "rawPredictionCol":"rawPrediction","predictionCol":"prediction","maxBins":32, > "minInfoGain":0.0,"seed":-491520797},"numFeatures":1354,"numClasses":2, > "numTrees":20} > {code} > should be: > {code:java} > "numTrees":100,"maxDepth":30,{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23471) RandomForestClassificationModel save() - incorrect metadata
[ https://issues.apache.org/jira/browse/SPARK-23471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379202#comment-16379202 ] Bago Amirbekian edited comment on SPARK-23471 at 2/27/18 8:04 PM: -- [~Keepun] {train} is a protected API, it's called by {Predictor.fit} which also copies the values of Params to the newly created Model instance, [here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118]. Do you get this same issue if you use {RandomForestClassifier.fit|? was (Author: bago.amirbekian): [~Keepun] `train` is a protected API, it's called by `Predictor.fit` which also copies the values of Params to the newly created Model instance, [here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118]. Do you get this same issue if you use `RandomForestClassifier.fit`? > RandomForestClassificationModel save() - incorrect metadata > --- > > Key: SPARK-23471 > URL: https://issues.apache.org/jira/browse/SPARK-23471 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.1 >Reporter: Keepun >Priority: Major > > RandomForestClassificationMode.load() does not work after save() > {code:java} > RandomForestClassifier rf = new RandomForestClassifier() > .setFeaturesCol("features") > .setLabelCol("result") > .setNumTrees(100) > .setMaxDepth(30) > .setMinInstancesPerNode(1) > //.setCacheNodeIds(true) > .setMaxMemoryInMB(500) > .setSeed(System.currentTimeMillis() + System.nanoTime()); > RandomForestClassificationModel rfmodel = rf.train(data); >try { > rfmodel.save(args[2] + "." 
+ System.currentTimeMillis()); >} catch (IOException e) { > LOG.error(e.getMessage(), e); > e.printStackTrace(); >} > {code} > File metadata\part-0: > {code:java} > {"class":"org.apache.spark.ml.classification.RandomForestClassificationModel", > "timestamp":1519136783983,"sparkVersion":"2.2.1","uid":"rfc_7c7e84ce7488", > "paramMap":{"featureSubsetStrategy":"auto","cacheNodeIds":false,"impurity":"gini", > "checkpointInterval":10, > "numTrees":20,"maxDepth":5, > "probabilityCol":"probability","labelCol":"label","featuresCol":"features", > "maxMemoryInMB":256,"minInstancesPerNode":1,"subsamplingRate":1.0, > "rawPredictionCol":"rawPrediction","predictionCol":"prediction","maxBins":32, > "minInfoGain":0.0,"seed":-491520797},"numFeatures":1354,"numClasses":2, > "numTrees":20} > {code} > should be: > {code:java} > "numTrees":100,"maxDepth":30,{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23471) RandomForestClassificationModel save() - incorrect metadata
[ https://issues.apache.org/jira/browse/SPARK-23471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379202#comment-16379202 ] Bago Amirbekian edited comment on SPARK-23471 at 2/27/18 8:04 PM: -- [~Keepun] {train} is a protected API, it's called by {Predictor.fit} which also copies the values of Params to the newly created Model instance, [here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118]. Do you get this same issue if you use {RandomForestClassifier.fit}? was (Author: bago.amirbekian): [~Keepun] {train} is a protected API, it's called by {Predictor.fit} which also copies the values of Params to the newly created Model instance, [here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118]. Do you get this same issue if you use {RandomForestClassifier.fit|? > RandomForestClassificationModel save() - incorrect metadata > --- > > Key: SPARK-23471 > URL: https://issues.apache.org/jira/browse/SPARK-23471 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.1 >Reporter: Keepun >Priority: Major > > RandomForestClassificationMode.load() does not work after save() > {code:java} > RandomForestClassifier rf = new RandomForestClassifier() > .setFeaturesCol("features") > .setLabelCol("result") > .setNumTrees(100) > .setMaxDepth(30) > .setMinInstancesPerNode(1) > //.setCacheNodeIds(true) > .setMaxMemoryInMB(500) > .setSeed(System.currentTimeMillis() + System.nanoTime()); > RandomForestClassificationModel rfmodel = rf.train(data); >try { > rfmodel.save(args[2] + "." 
+ System.currentTimeMillis()); >} catch (IOException e) { > LOG.error(e.getMessage(), e); > e.printStackTrace(); >} > {code} > File metadata\part-0: > {code:java} > {"class":"org.apache.spark.ml.classification.RandomForestClassificationModel", > "timestamp":1519136783983,"sparkVersion":"2.2.1","uid":"rfc_7c7e84ce7488", > "paramMap":{"featureSubsetStrategy":"auto","cacheNodeIds":false,"impurity":"gini", > "checkpointInterval":10, > "numTrees":20,"maxDepth":5, > "probabilityCol":"probability","labelCol":"label","featuresCol":"features", > "maxMemoryInMB":256,"minInstancesPerNode":1,"subsamplingRate":1.0, > "rawPredictionCol":"rawPrediction","predictionCol":"prediction","maxBins":32, > "minInfoGain":0.0,"seed":-491520797},"numFeatures":1354,"numClasses":2, > "numTrees":20} > {code} > should be: > {code:java} > "numTrees":100,"maxDepth":30,{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23471) RandomForestClassificationModel save() - incorrect metadata
[ https://issues.apache.org/jira/browse/SPARK-23471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379202#comment-16379202 ] Bago Amirbekian commented on SPARK-23471: - [~Keepun] `train` is a protected API, it's called by `Predictor.fit` which also copies the values of Params to the newly created Model instance, [here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118]. Do you get this same issue if you use `RandomForestClassifier.fit`? > RandomForestClassificationModel save() - incorrect metadata > --- > > Key: SPARK-23471 > URL: https://issues.apache.org/jira/browse/SPARK-23471 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.2.1 >Reporter: Keepun >Priority: Major > > RandomForestClassificationMode.load() does not work after save() > {code:java} > RandomForestClassifier rf = new RandomForestClassifier() > .setFeaturesCol("features") > .setLabelCol("result") > .setNumTrees(100) > .setMaxDepth(30) > .setMinInstancesPerNode(1) > //.setCacheNodeIds(true) > .setMaxMemoryInMB(500) > .setSeed(System.currentTimeMillis() + System.nanoTime()); > RandomForestClassificationModel rfmodel = rf.train(data); >try { > rfmodel.save(args[2] + "." 
+ System.currentTimeMillis()); >} catch (IOException e) { > LOG.error(e.getMessage(), e); > e.printStackTrace(); >} > {code} > File metadata\part-0: > {code:java} > {"class":"org.apache.spark.ml.classification.RandomForestClassificationModel", > "timestamp":1519136783983,"sparkVersion":"2.2.1","uid":"rfc_7c7e84ce7488", > "paramMap":{"featureSubsetStrategy":"auto","cacheNodeIds":false,"impurity":"gini", > "checkpointInterval":10, > "numTrees":20,"maxDepth":5, > "probabilityCol":"probability","labelCol":"label","featuresCol":"features", > "maxMemoryInMB":256,"minInstancesPerNode":1,"subsamplingRate":1.0, > "rawPredictionCol":"rawPrediction","predictionCol":"prediction","maxBins":32, > "minInfoGain":0.0,"seed":-491520797},"numFeatures":1354,"numClasses":2, > "numTrees":20} > {code} > should be: > {code:java} > "numTrees":100,"maxDepth":30,{code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
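The distinction drawn in the comment above — {Predictor.fit} copies the user-set Param values onto the newly created Model, while calling the protected {train} directly skips that copy, leaving the saved metadata with default values — can be sketched in a few lines of plain Python. Class and method names here are simplified stand-ins for illustration, not Spark's actual API:

```python
# Illustrative stand-ins for Spark ML's Estimator/Model pair.
class Model:
    def __init__(self, params):
        self.param_map = params


class Estimator:
    DEFAULTS = {"numTrees": 20, "maxDepth": 5}

    def __init__(self):
        self.param_map = dict(self.DEFAULTS)

    def set(self, name, value):
        self.param_map[name] = value
        return self

    def _train(self, data):
        # Like the protected train(): builds a model but leaves it
        # carrying only the default Param values.
        return Model(dict(self.DEFAULTS))

    def fit(self, data):
        # Like Predictor.fit: train, then copy user-set Params onto the model.
        model = self._train(data)
        model.param_map = dict(self.param_map)
        return model


est = Estimator().set("numTrees", 100).set("maxDepth", 30)
direct = est._train(None)  # bypasses the param copy: defaults end up in metadata
proper = est.fit(None)     # user-set values survive
```

Under this sketch, `direct.param_map` still holds the defaults (numTrees=20, maxDepth=5) while `proper.param_map` holds the values the user set — matching the numTrees=20/maxDepth=5 metadata the reporter observed after calling `train` directly.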
[jira] [Commented] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-23265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366277#comment-16366277 ] Bago Amirbekian commented on SPARK-23265: - What's the status of this? Will this be a change in behaviour? > Update multi-column error handling logic in QuantileDiscretizer > --- > > Key: SPARK-23265 > URL: https://issues.apache.org/jira/browse/SPARK-23265 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Nick Pentreath >Priority: Major > > SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If > both single- and multi-column params are set (specifically {{inputCol}} / > {{inputCols}}) an error is thrown. > However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. > The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that > for this transformer, it is acceptable to set the single-column param for > {{numBuckets}} when transforming multiple columns, since that is then applied > to all columns. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
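The note above, that a single-column {{numBuckets}} is acceptable alongside {{inputCols}} because it is applied to every column, can be sketched with a hypothetical helper (not QuantileDiscretizer's actual code):

```python
def resolve_num_buckets(input_cols, num_buckets=2, num_buckets_array=None):
    # Sketch of the rule described above: the single-column numBuckets
    # broadcasts across all input columns, unless an explicit per-column
    # array is provided, in which case the array wins.
    if num_buckets_array is not None:
        if len(num_buckets_array) != len(input_cols):
            raise ValueError("numBucketsArray must match inputCols in length")
        return dict(zip(input_cols, num_buckets_array))
    return {col: num_buckets for col in input_cols}


broadcast = resolve_num_buckets(["f1", "f2", "f3"], num_buckets=4)
per_column = resolve_num_buckets(["f1", "f2"], num_buckets_array=[3, 5])
```

Here `broadcast` maps every column to 4 buckets, while `per_column` gives each column its own bucket count — the asymmetry that makes the single-column param safe for {{numBuckets}} but not for {{inputCol}}/{{outputCol}}.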
[jira] [Updated] (SPARK-23377) Bucketizer with multiple columns persistence bug
[ https://issues.apache.org/jira/browse/SPARK-23377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bago Amirbekian updated SPARK-23377: Description: A Bucketizer with multiple input/output columns get "inputCol" set to the default value on write -> read which causes it to throw an error on transform. Here's an example. {code:java} import org.apache.spark.ml.feature._ val splits = Array(Double.NegativeInfinity, 0, 10, 100, Double.PositiveInfinity) val bucketizer = new Bucketizer() .setSplitsArray(Array(splits, splits)) .setInputCols(Array("foo1", "foo2")) .setOutputCols(Array("bar1", "bar2")) val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2") bucketizer.transform(data) val path = "/temp/bucketrizer-persist-test" bucketizer.write.overwrite.save(path) val bucketizerAfterRead = Bucketizer.read.load(path) println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol)) // This line throws an error because "outputCol" is set bucketizerAfterRead.transform(data) {code} And the trace: {code:java} java.lang.IllegalArgumentException: Bucketizer bucketizer_6f0acc3341f7 has the inputCols Param set for multi-column transform. The following Params are not applicable and should not be set: outputCol. at org.apache.spark.ml.param.ParamValidators$.checkExclusiveParams$1(params.scala:300) at org.apache.spark.ml.param.ParamValidators$.checkSingleVsMultiColumnParams(params.scala:314) at org.apache.spark.ml.feature.Bucketizer.transformSchema(Bucketizer.scala:189) at org.apache.spark.ml.feature.Bucketizer.transform(Bucketizer.scala:141) at line251821108a8a433da484ee31f166c83725.$read$$iw$$iw$$iw$$iw$$iw$$iw.(command-6079631:17) {code} was: A Bucketizer with multiple input/output columns get "inputCol" set to the default value on write -> read which causes it to throw an error on transform. Here's an example. 
{code:java} import org.apache.spark.ml.feature._ val splits = Array(Double.NegativeInfinity, 0, 10, 100, Double.PositiveInfinity) val bucketizer = new Bucketizer() .setSplitsArray(Array(splits, splits)) .setInputCols(Array("foo1", "foo2")) .setOutputCols(Array("bar1", "bar2")) val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2") bucketizer.transform(data) val path = "/temp/bucketrizer-persist-test" bucketizer.write.overwrite.save(path) val bucketizerAfterRead = Bucketizer.read.load(path) println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol)) // This line throws an error because "outputCol" is set bucketizerAfterRead.transform(data) {code} > Bucketizer with multiple columns persistence bug > > > Key: SPARK-23377 > URL: https://issues.apache.org/jira/browse/SPARK-23377 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.0 >Reporter: Bago Amirbekian >Priority: Major > > A Bucketizer with multiple input/output columns get "inputCol" set to the > default value on write -> read which causes it to throw an error on > transform. Here's an example. 
> {code:java} > import org.apache.spark.ml.feature._ > val splits = Array(Double.NegativeInfinity, 0, 10, 100, > Double.PositiveInfinity) > val bucketizer = new Bucketizer() > .setSplitsArray(Array(splits, splits)) > .setInputCols(Array("foo1", "foo2")) > .setOutputCols(Array("bar1", "bar2")) > val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2") > bucketizer.transform(data) > val path = "/temp/bucketrizer-persist-test" > bucketizer.write.overwrite.save(path) > val bucketizerAfterRead = Bucketizer.read.load(path) > println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol)) > // This line throws an error because "outputCol" is set > bucketizerAfterRead.transform(data) > {code} > And the trace: > {code:java} > java.lang.IllegalArgumentException: Bucketizer bucketizer_6f0acc3341f7 has > the inputCols Param set for multi-column transform. The following Params are > not applicable and should not be set: outputCol. > at > org.apache.spark.ml.param.ParamValidators$.checkExclusiveParams$1(params.scala:300) > at > org.apache.spark.ml.param.ParamValidators$.checkSingleVsMultiColumnParams(params.scala:314) > at > org.apache.spark.ml.feature.Bucketizer.transformSchema(Bucketizer.scala:189) > at > org.apache.spark.ml.feature.Bucketizer.transform(Bucketizer.scala:141) > at > line251821108a8a433da484ee31f166c83725.$read$$iw$$iw$$iw$$iw$$iw$$iw.(command-6079631:17) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
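The trace above comes down to the single- vs multi-column exclusivity check: after write -> read, a single-column default reappears alongside {{inputCols}}, so transform fails validation. A rough Python sketch of that check (the function and message are simplified stand-ins for ParamValidators.checkSingleVsMultiColumnParams, and "output" is an assumed default value):

```python
def check_single_vs_multi_column_params(params, multi_col_param, single_col_params):
    # Once the multi-column param (e.g. inputCols) is set, none of the
    # single-column variants (inputCol, outputCol) may be set at all.
    if multi_col_param in params:
        offending = [p for p in single_col_params if p in params]
        if offending:
            raise ValueError(
                "has the %s Param set for multi-column transform. The following "
                "Params are not applicable and should not be set: %s"
                % (multi_col_param, ", ".join(offending)))


# What the Bucketizer wrote out: multi-column params only -> check passes.
saved = {"inputCols": ["foo1", "foo2"], "outputCols": ["bar1", "bar2"]}
check_single_vs_multi_column_params(saved, "inputCols", ["inputCol", "outputCol"])

# What comes back from read: a single-column default sneaks back in,
# so the same check now fails on transform().
loaded = dict(saved, outputCol="output")
try:
    check_single_vs_multi_column_params(loaded, "inputCols", ["inputCol", "outputCol"])
    error_message = None
except ValueError as e:
    error_message = str(e)
```

The fix direction implied by the report is on the persistence side: read back exactly the params that were explicitly set, rather than materializing defaults for the single-column variants.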
[jira] [Created] (SPARK-23377) Bucketizer with multiple columns persistence bug
Bago Amirbekian created SPARK-23377: --- Summary: Bucketizer with multiple columns persistence bug Key: SPARK-23377 URL: https://issues.apache.org/jira/browse/SPARK-23377 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.3.0 Reporter: Bago Amirbekian A Bucketizer with multiple input/output columns gets "inputCol" set to the default value on write -> read, which causes it to throw an error on transform. Here's an example. {code:java} import org.apache.spark.ml.feature._ val splits = Array(Double.NegativeInfinity, 0, 10, 100, Double.PositiveInfinity) val bucketizer = new Bucketizer() .setSplitsArray(Array(splits, splits)) .setInputCols(Array("foo1", "foo2")) .setOutputCols(Array("bar1", "bar2")) val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2") bucketizer.transform(data) val path = "/temp/bucketrizer-persist-test" bucketizer.write.overwrite.save(path) val bucketizerAfterRead = Bucketizer.read.load(path) println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol)) // This line throws an error because "outputCol" is set bucketizerAfterRead.transform(data) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23105) Spark MLlib, GraphX 2.3 QA umbrella
[ https://issues.apache.org/jira/browse/SPARK-23105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340448#comment-16340448 ] Bago Amirbekian commented on SPARK-23105: - [~mlnick] We can update the sub tasks to target 2.3 if you think it's appropriate. I don't know how involved Joseph can be for this release, so we might need another committer to shepherd these tasks; I can take on some of it. > Spark MLlib, GraphX 2.3 QA umbrella > --- > > Key: SPARK-23105 > URL: https://issues.apache.org/jira/browse/SPARK-23105 > Project: Spark > Issue Type: Umbrella > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley >Priority: Critical > > This JIRA lists tasks for the next Spark release's QA period for MLlib and > GraphX. *SparkR is separate: SPARK-23114.* > The list below gives an overview of what is involved, and the corresponding > JIRA issues are linked below that. > h2. API > * Check binary API compatibility for Scala/Java > * Audit new public APIs (from the generated html doc) > ** Scala > ** Java compatibility > ** Python coverage > * Check Experimental, DeveloperApi tags > h2. Algorithms and performance > * Performance tests > h2. Documentation and example code > * For new algorithms, create JIRAs for updating the user guide sections & > examples > * Update Programming Guide > * Update website -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23109) ML 2.3 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-23109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340399#comment-16340399 ] Bago Amirbekian commented on SPARK-23109: - [~bryanc] One reason the Python API might be different is that in Python we can use `imageRow.height` in place of `getHeight(imageRow)`, so the getters don't add much value. Also, `toNDArray` doesn't make sense in Scala. I think we should add `columnSchema` to the Python API, but it doesn't need to block the release IMHO. > ML 2.3 QA: API: Python API coverage > --- > > Key: SPARK-23109 > URL: https://issues.apache.org/jira/browse/SPARK-23109 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, PySpark >Reporter: Joseph K. Bradley >Priority: Blocker > > For new public APIs added to MLlib ({{spark.ml}} only), we need to check the > generated HTML doc and compare the Scala & Python versions. > * *GOAL*: Audit and create JIRAs to fix in the next release. > * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues. > We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally either > necessary (intentional) or accidental. These must be recorded and added in > the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > *Please use a _separate_ JIRA (linked below as "requires") for this list of > to-do items.* -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
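The point about getters can be illustrated with a plain-Python stand-in for an image Row. The field names below are assumptions for illustration, not the exact image schema:

```python
from collections import namedtuple

# Hypothetical stand-in for a Spark SQL Row carrying image fields.
ImageRow = namedtuple("ImageRow", ["height", "width", "nChannels"])


def get_height(row):
    # A Scala-style getter; in Python, plain attribute access on the Row
    # does the same job, which is why the getters add little value there.
    return row.height


row = ImageRow(height=480, width=640, nChannels=3)
```

Both `row.height` and `get_height(row)` return the same value, so the Pythonic spelling makes a dedicated getter redundant.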
[jira] [Comment Edited] (SPARK-23106) ML, Graph 2.3 QA: API: Binary incompatible changes
[ https://issues.apache.org/jira/browse/SPARK-23106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340234#comment-16340234 ] Bago Amirbekian edited comment on SPARK-23106 at 1/25/18 10:49 PM: --- I ran MiMa in branch-2.3 and got the following output: {code} [info] spark-graphx: found 0 potential binary incompatibilities while checking against org.apache.spark:spark-graphx_2.11:2.2.0 (filtered 1) [info] spark-streaming-kafka-0-10-assembly: previous-artifact not set, not analyzing binary compatibility [info] spark-streaming-kafka-0-10: found 0 potential binary incompatibilities while checking against org.apache.spark:spark-streaming-kafka-0-10_2.11:2.2.0 (filtered 3) [info] spark-catalyst: previous-artifact not set, not analyzing binary compatibility [info] spark-streaming: found 0 potential binary incompatibilities while checking against org.apache.spark:spark-streaming_2.11:2.2.0 (filtered 25) [info] spark-sql-kafka-0-10: previous-artifact not set, not analyzing binary compatibility [info] spark-hive: previous-artifact not set, not analyzing binary compatibility [info] spark-repl: previous-artifact not set, not analyzing binary compatibility [info] spark-assembly: previous-artifact not set, not analyzing binary compatibility [info] spark-examples: previous-artifact not set, not analyzing binary compatibility [info] spark-core: found 0 potential binary incompatibilities while checking against org.apache.spark:spark-core_2.11:2.2.0 (filtered 1112) [info] spark-mllib: found 0 potential binary incompatibilities while checking against org.apache.spark:spark-mllib_2.11:2.2.0 (filtered 143) {code} I don't think I can assign this task to myself or change its status. 
was (Author: bago.amirbekian): I ran MiMa in branch-2.3 and got the following output: {code} [info] spark-graphx: found 0 potential binary incompatibilities while checking against org.apache.spark:spark-graphx_2.11:2.2.0 (filtered 1) [info] spark-streaming-kafka-0-10-assembly: previous-artifact not set, not analyzing binary compatibility [info] spark-streaming-kafka-0-10: found 0 potential binary incompatibilities while checking against org.apache.spark:spark-streaming-kafka-0-10_2.11:2.2.0 (filtered 3) [info] spark-catalyst: previous-artifact not set, not analyzing binary compatibility [info] spark-streaming: found 0 potential binary incompatibilities while checking against org.apache.spark:spark-streaming_2.11:2.2.0 (filtered 25) [info] spark-sql-kafka-0-10: previous-artifact not set, not analyzing binary compatibility [info] spark-hive: previous-artifact not set, not analyzing binary compatibility [info] spark-repl: previous-artifact not set, not analyzing binary compatibility [info] spark-assembly: previous-artifact not set, not analyzing binary compatibility [info] spark-examples: previous-artifact not set, not analyzing binary compatibility [info] spark-core: found 0 potential binary incompatibilities while checking against org.apache.spark:spark-core_2.11:2.2.0 (filtered 1112) [info] spark-mllib: found 0 potential binary incompatibilities while checking against org.apache.spark:spark-mllib_2.11:2.2.0 (filtered 143) {code} I don't think I can assign this task to myself or change its status. > ML, Graph 2.3 QA: API: Binary incompatible changes > -- > > Key: SPARK-23106 > URL: https://issues.apache.org/jira/browse/SPARK-23106 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Priority: Blocker > > Generate a list of binary incompatible changes using MiMa and create new > JIRAs for issues found. Filter out false positives as needed. 
> If you want to take this task, look at the analogous task from the previous > release QA, and ping the Assignee for advice. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23106) ML, Graph 2.3 QA: API: Binary incompatible changes
[ https://issues.apache.org/jira/browse/SPARK-23106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bago Amirbekian resolved SPARK-23106. - Resolution: Resolved > ML, Graph 2.3 QA: API: Binary incompatible changes > -- > > Key: SPARK-23106 > URL: https://issues.apache.org/jira/browse/SPARK-23106 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Priority: Blocker > > Generate a list of binary incompatible changes using MiMa and create new > JIRAs for issues found. Filter out false positives as needed. > If you want to take this task, look at the analogous task from the previous > release QA, and ping the Assignee for advice. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23106) ML, Graph 2.3 QA: API: Binary incompatible changes
[ https://issues.apache.org/jira/browse/SPARK-23106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340234#comment-16340234 ] Bago Amirbekian edited comment on SPARK-23106 at 1/25/18 10:49 PM: --- I ran MiMa in branch-2.3 and got the following output: {code} [info] spark-graphx: found 0 potential binary incompatibilities while checking against org.apache.spark:spark-graphx_2.11:2.2.0 (filtered 1) [info] spark-streaming-kafka-0-10-assembly: previous-artifact not set, not analyzing binary compatibility [info] spark-streaming-kafka-0-10: found 0 potential binary incompatibilities while checking against org.apache.spark:spark-streaming-kafka-0-10_2.11:2.2.0 (filtered 3) [info] spark-catalyst: previous-artifact not set, not analyzing binary compatibility [info] spark-streaming: found 0 potential binary incompatibilities while checking against org.apache.spark:spark-streaming_2.11:2.2.0 (filtered 25) [info] spark-sql-kafka-0-10: previous-artifact not set, not analyzing binary compatibility [info] spark-hive: previous-artifact not set, not analyzing binary compatibility [info] spark-repl: previous-artifact not set, not analyzing binary compatibility [info] spark-assembly: previous-artifact not set, not analyzing binary compatibility [info] spark-examples: previous-artifact not set, not analyzing binary compatibility [info] spark-core: found 0 potential binary incompatibilities while checking against org.apache.spark:spark-core_2.11:2.2.0 (filtered 1112) [info] spark-mllib: found 0 potential binary incompatibilities while checking against org.apache.spark:spark-mllib_2.11:2.2.0 (filtered 143) {code} I don't think I can assign this task to myself or change its status. 
was (Author: bago.amirbekian): I ran MiMa in branch-2.3 and got the following output: {{ [info] spark-graphx: found 0 potential binary incompatibilities while checking against org.apache.spark:spark-graphx_2.11:2.2.0 (filtered 1) [info] spark-streaming-kafka-0-10-assembly: previous-artifact not set, not analyzing binary compatibility [info] spark-streaming-kafka-0-10: found 0 potential binary incompatibilities while checking against org.apache.spark:spark-streaming-kafka-0-10_2.11:2.2.0 (filtered 3) [info] spark-catalyst: previous-artifact not set, not analyzing binary compatibility [info] spark-streaming: found 0 potential binary incompatibilities while checking against org.apache.spark:spark-streaming_2.11:2.2.0 (filtered 25) [info] spark-sql-kafka-0-10: previous-artifact not set, not analyzing binary compatibility [info] spark-hive: previous-artifact not set, not analyzing binary compatibility [info] spark-repl: previous-artifact not set, not analyzing binary compatibility [info] spark-assembly: previous-artifact not set, not analyzing binary compatibility [info] spark-examples: previous-artifact not set, not analyzing binary compatibility [info] spark-core: found 0 potential binary incompatibilities while checking against org.apache.spark:spark-core_2.11:2.2.0 (filtered 1112) [info] spark-mllib: found 0 potential binary incompatibilities while checking against org.apache.spark:spark-mllib_2.11:2.2.0 (filtered 143) }} I don't think I can assign this task to myself or change its status. > ML, Graph 2.3 QA: API: Binary incompatible changes > -- > > Key: SPARK-23106 > URL: https://issues.apache.org/jira/browse/SPARK-23106 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Priority: Blocker > > Generate a list of binary incompatible changes using MiMa and create new > JIRAs for issues found. Filter out false positives as needed. 
> If you want to take this task, look at the analogous task from the previous > release QA, and ping the Assignee for advice. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23106) ML, Graph 2.3 QA: API: Binary incompatible changes
[ https://issues.apache.org/jira/browse/SPARK-23106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340234#comment-16340234 ] Bago Amirbekian edited comment on SPARK-23106 at 1/25/18 10:49 PM: --- I ran MiMa in branch-2.3 and got the following output: {code} [info] spark-graphx: found 0 potential binary incompatibilities while checking against org.apache.spark:spark-graphx_2.11:2.2.0 (filtered 1) [info] spark-streaming-kafka-0-10-assembly: previous-artifact not set, not analyzing binary compatibility [info] spark-streaming-kafka-0-10: found 0 potential binary incompatibilities while checking against org.apache.spark:spark-streaming-kafka-0-10_2.11:2.2.0 (filtered 3) [info] spark-catalyst: previous-artifact not set, not analyzing binary compatibility [info] spark-streaming: found 0 potential binary incompatibilities while checking against org.apache.spark:spark-streaming_2.11:2.2.0 (filtered 25) [info] spark-sql-kafka-0-10: previous-artifact not set, not analyzing binary compatibility [info] spark-hive: previous-artifact not set, not analyzing binary compatibility [info] spark-repl: previous-artifact not set, not analyzing binary compatibility [info] spark-assembly: previous-artifact not set, not analyzing binary compatibility [info] spark-examples: previous-artifact not set, not analyzing binary compatibility [info] spark-core: found 0 potential binary incompatibilities while checking against org.apache.spark:spark-core_2.11:2.2.0 (filtered 1112) [info] spark-mllib: found 0 potential binary incompatibilities while checking against org.apache.spark:spark-mllib_2.11:2.2.0 (filtered 143) {code} I don't think I can assign this task to myself. 
was (Author: bago.amirbekian): I ran MiMa in branch-2.3 and got the following output: {code} [info] spark-graphx: found 0 potential binary incompatibilities while checking against org.apache.spark:spark-graphx_2.11:2.2.0 (filtered 1) [info] spark-streaming-kafka-0-10-assembly: previous-artifact not set, not analyzing binary compatibility [info] spark-streaming-kafka-0-10: found 0 potential binary incompatibilities while checking against org.apache.spark:spark-streaming-kafka-0-10_2.11:2.2.0 (filtered 3) [info] spark-catalyst: previous-artifact not set, not analyzing binary compatibility [info] spark-streaming: found 0 potential binary incompatibilities while checking against org.apache.spark:spark-streaming_2.11:2.2.0 (filtered 25) [info] spark-sql-kafka-0-10: previous-artifact not set, not analyzing binary compatibility [info] spark-hive: previous-artifact not set, not analyzing binary compatibility [info] spark-repl: previous-artifact not set, not analyzing binary compatibility [info] spark-assembly: previous-artifact not set, not analyzing binary compatibility [info] spark-examples: previous-artifact not set, not analyzing binary compatibility [info] spark-core: found 0 potential binary incompatibilities while checking against org.apache.spark:spark-core_2.11:2.2.0 (filtered 1112) [info] spark-mllib: found 0 potential binary incompatibilities while checking against org.apache.spark:spark-mllib_2.11:2.2.0 (filtered 143) {code} I don't think I can assign this task to myself or change its status. > ML, Graph 2.3 QA: API: Binary incompatible changes > -- > > Key: SPARK-23106 > URL: https://issues.apache.org/jira/browse/SPARK-23106 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. Bradley >Priority: Blocker > > Generate a list of binary incompatible changes using MiMa and create new > JIRAs for issues found. Filter out false positives as needed. 
> If you want to take this task, look at the analogous task from the previous > release QA, and ping the Assignee for advice. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23106) ML, Graph 2.3 QA: API: Binary incompatible changes
[ https://issues.apache.org/jira/browse/SPARK-23106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340234#comment-16340234 ] Bago Amirbekian commented on SPARK-23106: - I ran MiMa in branch-2.3 and got the following output: {{ [info] spark-graphx: found 0 potential binary incompatibilities while checking against org.apache.spark:spark-graphx_2.11:2.2.0 (filtered 1) [info] spark-streaming-kafka-0-10-assembly: previous-artifact not set, not analyzing binary compatibility [info] spark-streaming-kafka-0-10: found 0 potential binary incompatibilities while checking against org.apache.spark:spark-streaming-kafka-0-10_2.11:2.2.0 (filtered 3) [info] spark-catalyst: previous-artifact not set, not analyzing binary compatibility [info] spark-streaming: found 0 potential binary incompatibilities while checking against org.apache.spark:spark-streaming_2.11:2.2.0 (filtered 25) [info] spark-sql-kafka-0-10: previous-artifact not set, not analyzing binary compatibility [info] spark-hive: previous-artifact not set, not analyzing binary compatibility [info] spark-repl: previous-artifact not set, not analyzing binary compatibility [info] spark-assembly: previous-artifact not set, not analyzing binary compatibility [info] spark-examples: previous-artifact not set, not analyzing binary compatibility [info] spark-core: found 0 potential binary incompatibilities while checking against org.apache.spark:spark-core_2.11:2.2.0 (filtered 1112) [info] spark-mllib: found 0 potential binary incompatibilities while checking against org.apache.spark:spark-mllib_2.11:2.2.0 (filtered 143) }} I don't think I can assign this task to myself or change its status. > ML, Graph 2.3 QA: API: Binary incompatible changes > -- > > Key: SPARK-23106 > URL: https://issues.apache.org/jira/browse/SPARK-23106 > Project: Spark > Issue Type: Sub-task > Components: Documentation, GraphX, ML, MLlib >Reporter: Joseph K. 
Bradley >Priority: Blocker > > Generate a list of binary incompatible changes using MiMa and create new > JIRAs for issues found. Filter out false positives as needed. > If you want to take this task, look at the analogous task from the previous > release QA, and ping the Assignee for advice. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23048) Update mllib docs to replace OneHotEncoder with OneHotEncoderEstimator
Bago Amirbekian created SPARK-23048: --- Summary: Update mllib docs to replace OneHotEncoder with OneHotEncoderEstimator Key: SPARK-23048 URL: https://issues.apache.org/jira/browse/SPARK-23048 Project: Spark Issue Type: Documentation Components: Documentation, ML Affects Versions: 2.3.0 Reporter: Bago Amirbekian Since we're deprecating OneHotEncoder, we should update the docs to reference its replacement, OneHotEncoderEstimator. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23046) Have RFormula include VectorSizeHint in pipeline
Bago Amirbekian created SPARK-23046: --- Summary: Have RFormula include VectorSizeHint in pipeline Key: SPARK-23046 URL: https://issues.apache.org/jira/browse/SPARK-23046 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 2.3.0 Reporter: Bago Amirbekian -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23045) Have RFormula use OneHotEncoderEstimator
[ https://issues.apache.org/jira/browse/SPARK-23045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bago Amirbekian updated SPARK-23045: Summary: Have RFormula use OneHotEncoderEstimator (was: Have RFormula use OneHotEstimator) > Have RFormula use OneHotEncoderEstimator > --- > > Key: SPARK-23045 > URL: https://issues.apache.org/jira/browse/SPARK-23045 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.3.0 >Reporter: Bago Amirbekian > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23037) RFormula should not use deprecated OneHotEncoder and should include VectorSizeHint in pipeline
[ https://issues.apache.org/jira/browse/SPARK-23037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bago Amirbekian updated SPARK-23037: Affects Version/s: (was: 2.2.0) 2.3.0 > RFormula should not use deprecated OneHotEncoder and should include > VectorSizeHint in pipeline > -- > > Key: SPARK-23037 > URL: https://issues.apache.org/jira/browse/SPARK-23037 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.3.0 >Reporter: Bago Amirbekian > -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23045) Have RFormula use OneHotEstimator
Bago Amirbekian created SPARK-23045: --- Summary: Have RFormula use OneHotEstimator Key: SPARK-23045 URL: https://issues.apache.org/jira/browse/SPARK-23045 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 2.3.0 Reporter: Bago Amirbekian -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23037) RFormula should not use deprecated OneHotEncoder and should include VectorSizeHint in pipeline
Bago Amirbekian created SPARK-23037: --- Summary: RFormula should not use deprecated OneHotEncoder and should include VectorSizeHint in pipeline Key: SPARK-23037 URL: https://issues.apache.org/jira/browse/SPARK-23037 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.2.0 Reporter: Bago Amirbekian -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22126) Fix model-specific optimization support for ML tuning
[ https://issues.apache.org/jira/browse/SPARK-22126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16313958#comment-16313958 ] Bago Amirbekian edited comment on SPARK-22126 at 1/5/18 9:48 PM: - > Do you think it's possible to put this kind of execution in fitMultiple and > allow CV to parallelize work for the stages? Yes, absolutely. The iterator can maintain a queue of tasks. Each call to `next` will pick a task off the queue, optionally add more tasks to the queue and return a single model instance. Since models can be returned in any order, the tasks can be organized however is needed to optimally finish the work. If the queue is empty (but the iterator isn't finished), `next` can simply wait for a previous task to finish and put more tasks on the queue. The iterator approach is very flexible. > With my PR, I would do this by having the Pipeline estimator return all > params in getOptimizedParams. The issue here is that when you call `fit(dataset, paramMaps)` you've now fixed the order that you want the models returned. For my purposes I don't see much of a difference between `Seq[(Int, Model)]` and `Iterator[(Integer, Model)]`. The key difference for me between `fitMultiple(..., paramMaps): Lazy[(Int, Model)]` and `fit(..., paramMaps): Lazy[Model]` is the flexibility to produce the models in arbitrary order. was (Author: bago.amirbekian): > Do you think it's possible to put this kind of execution in fitMultiple and > allow CV to parallelize work for the stages? Yes, absolutely. The iterator can maintain a queue of tasks. Each call to `next` will pick a task off the queue, optionally add more tasks to the queue and return a single model instance. Since models can be returned in any order, the tasks can be organized however is needed to optimally finish the work. If the queue is empty, `next` can simply wait for a previous task to finish and put more tasks on the queue. The iterator approach is very flexible. 
> With my PR, I would do this by having the Pipeline estimator return all > params in getOptimizedParams. The issue here is that when you call `fit(dataset, paramMaps)` you've now fixed the order that you want the models returned. For my purposes I don't see much of a difference between `Seq[(Int, Model)]` and `Iterator[(Integer, Model)]`. The key difference for me between `fitMultiple(..., paramMaps): Lazy[(Int, Model)]` and `fit(..., paramMaps): Lazy[Model]` is the flexibility to produce the models in arbitrary order. > Fix model-specific optimization support for ML tuning > - > > Key: SPARK-22126 > URL: https://issues.apache.org/jira/browse/SPARK-22126 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Weichen Xu > > Fix model-specific optimization support for ML tuning. This is discussed in > SPARK-19357 > more discussion is here > https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0 > Anyone who's following might want to scan the design doc (in the links > above), the latest api proposal is: > {code} > def fitMultiple( > dataset: Dataset[_], > paramMaps: Array[ParamMap] > ): java.util.Iterator[scala.Tuple2[java.lang.Integer, Model]] > {code} > Old discussion: > I copy discussion from gist to here: > I propose to design API as: > {code} > def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]): > Array[Callable[Map[Int, M]]] > {code} > Let me use an example to explain the API: > {quote} > It could be possible to still use the current parallelism and still allow > for model-specific optimizations. For example, if we doing cross validation > and have a param map with regParam = (0.1, 0.3) and maxIter = (5, 10). Lets > say that the cross validator could know that maxIter is optimized for the > model being evaluated (e.g. a new method in Estimator that return such > params). 
It would then be straightforward for the cross validator to remove > maxIter from the param map that will be parallelized over and use it to > create 2 arrays of paramMaps: ((regParam=0.1, maxIter=5), (regParam=0.1, > maxIter=10)) and ((regParam=0.3, maxIter=5), (regParam=0.3, maxIter=10)). > {quote} > In this example, we can see that, models computed from ((regParam=0.1, > maxIter=5), (regParam=0.1, maxIter=10)) can only be computed in one thread > code, models computed from ((regParam=0.3, maxIter=5), (regParam=0.3, > maxIter=10)) in another thread. In this example, there're 4 paramMaps, but > we can at most generate two threads to compute the models for them. > The API above allow "callable.call()" to return multiple models, and return > type is {code}Map[Int, M]{code}, key is integer, used to mark the paramMap > index for corresponding model. Use the example a
[jira] [Commented] (SPARK-22126) Fix model-specific optimization support for ML tuning
[ https://issues.apache.org/jira/browse/SPARK-22126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16313958#comment-16313958 ] Bago Amirbekian commented on SPARK-22126: - > Do you think it's possible to put this kind of execution in fitMultiple and > allow CV to parallelize work for the stages? Yes, absolutely. The iterator can maintain a queue of tasks. Each call to `next` will pick a task off the queue, optionally add more tasks to the queue and return a single model instance. Since models can be returned in any order, the tasks can be organized however is needed to optimally finish the work. If the queue is empty, `next` can simply wait for a previous task to finish and put more tasks on the queue. The iterator approach is very flexible. > With my PR, I would do this by having the Pipeline estimator return all > params in getOptimizedParams. The issue here is that when you call `fit(dataset, paramMaps)` you've now fixed the order that you want the models returned. For my purposes I don't see much of a difference between `Seq[(Int, Model)]` and `Iterator[(Integer, Model)]`. The key difference for me between `fitMultiple(..., paramMaps): Lazy[(Int, Model)]` and `fit(..., paramMaps): Lazy[Model]` is the flexibility to produce the models in arbitrary order. > Fix model-specific optimization support for ML tuning > - > > Key: SPARK-22126 > URL: https://issues.apache.org/jira/browse/SPARK-22126 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Weichen Xu > > Fix model-specific optimization support for ML tuning. 
This is discussed in > SPARK-19357 > more discussion is here > https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0 > Anyone who's following might want to scan the design doc (in the links > above), the latest api proposal is: > {code} > def fitMultiple( > dataset: Dataset[_], > paramMaps: Array[ParamMap] > ): java.util.Iterator[scala.Tuple2[java.lang.Integer, Model]] > {code} > Old discussion: > I copy discussion from gist to here: > I propose to design API as: > {code} > def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]): > Array[Callable[Map[Int, M]]] > {code} > Let me use an example to explain the API: > {quote} > It could be possible to still use the current parallelism and still allow > for model-specific optimizations. For example, if we doing cross validation > and have a param map with regParam = (0.1, 0.3) and maxIter = (5, 10). Lets > say that the cross validator could know that maxIter is optimized for the > model being evaluated (e.g. a new method in Estimator that return such > params). It would then be straightforward for the cross validator to remove > maxIter from the param map that will be parallelized over and use it to > create 2 arrays of paramMaps: ((regParam=0.1, maxIter=5), (regParam=0.1, > maxIter=10)) and ((regParam=0.3, maxIter=5), (regParam=0.3, maxIter=10)). > {quote} > In this example, we can see that, models computed from ((regParam=0.1, > maxIter=5), (regParam=0.1, maxIter=10)) can only be computed in one thread > code, models computed from ((regParam=0.3, maxIter=5), (regParam=0.3, > maxIter=10)) in another thread. In this example, there're 4 paramMaps, but > we can at most generate two threads to compute the models for them. > The API above allow "callable.call()" to return multiple models, and return > type is {code}Map[Int, M]{code}, key is integer, used to mark the paramMap > index for corresponding model. 
Use the example above, there're 4 paramMaps, > but only return 2 callable objects, one callable object for ((regParam=0.1, > maxIter=5), (regParam=0.1, maxIter=10)), another one for ((regParam=0.3, > maxIter=5), (regParam=0.3, maxIter=10)). > and the default "fitCallables/fit with paramMaps" can be implemented as > following: > {code} > def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]): > Array[Callable[Map[Int, M]]] = { > paramMaps.zipWithIndex.map { case (paramMap: ParamMap, index: Int) => > new Callable[Map[Int, M]] { > override def call(): Map[Int, M] = Map(index -> fit(dataset, paramMap)) > } > } > } > def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = { >fitCallables(dataset, paramMaps).map { _.call().toSeq } > .flatMap(_).sortBy(_._1).map(_._2) > } > {code} > If use the API I proposed above, the code in > [CrossValidation|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala#L149-L159] > can be changed to: > {code} > val trainingDataset = sparkSession.createDataFrame(training, > schema).cache() > val validationDataset = sparkSession.createDataFrame(validation, > schema).cache() > /
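The queue-backed iterator discussed in the comment above can be sketched in plain Python. This is an illustration of the pattern only, not Spark's implementation: the `train` callable and `parallelism` argument are hypothetical stand-ins for a single Estimator fit call and the tuning parallelism.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fit_multiple(train, dataset, param_maps, parallelism=2):
    """Yield (index, model) pairs as fits finish, in arbitrary order.

    train(dataset, param_map) stands in for one Estimator fit call.
    Because each result is keyed by its paramMap index, the caller can
    match models back to paramMaps regardless of completion order.
    """
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # Submit one task per paramMap and remember each task's index.
        futures = {pool.submit(train, dataset, pm): i
                   for i, pm in enumerate(param_maps)}
        # Yield results in whatever order the fits complete.
        for done in as_completed(futures):
            yield futures[done], done.result()
```

A scheduler that also pushes new tasks onto the queue from inside `next`, as the comment suggests, would replace the fixed `futures` dict with a mutable work queue, but the (index, model) contract stays the same.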
[jira] [Commented] (SPARK-22126) Fix model-specific optimization support for ML tuning
[ https://issues.apache.org/jira/browse/SPARK-22126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312264#comment-16312264 ] Bago Amirbekian commented on SPARK-22126: - [~bryanc] thanks for taking the time to put together the PR and share thoughts. I like the idea of being able to preserve the existing APIs and not needing to add a new fitMultiple API, but I'm concerned the existing APIs aren't quite flexible enough. One of the use cases that motivated the {{ fitMultiple }} API was optimizing the Pipeline Estimator. The Pipeline Estimator seems like an important one to optimize because I believe it's required in order for CrossValidator to be able to exploit optimized implementations of the {{ fit }}/{{ fitMultiple }} methods of Pipeline stages. The way one would optimize the Pipeline Estimator is to group the paramMaps into a tree structure where each level represents a stage with a param that can take multiple values. One would then traverse the tree in depth-first order. Notice that in this case the params need not be estimator params; they could be transformer params as well, since we can avoid applying expensive transformers multiple times. With this approach, all the params for a pipeline estimator after the top level of the tree are "optimizable", so simply grouping on optimizable params isn't sufficient; we need to actually order the paramMaps to match the depth-first traversal of the stages tree. I'm still thinking through all this in my head so let me know if any of it is off base or not clear, but I think the {{ fitMultiple }} approach gives us the full flexibility needed for these kinds of optimizations. 
> Fix model-specific optimization support for ML tuning > - > > Key: SPARK-22126 > URL: https://issues.apache.org/jira/browse/SPARK-22126 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Weichen Xu > > Fix model-specific optimization support for ML tuning. This is discussed in > SPARK-19357 > more discussion is here > https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0 > Anyone who's following might want to scan the design doc (in the links > above), the latest api proposal is: > {code} > def fitMultiple( > dataset: Dataset[_], > paramMaps: Array[ParamMap] > ): java.util.Iterator[scala.Tuple2[java.lang.Integer, Model]] > {code} > Old discussion: > I copy discussion from gist to here: > I propose to design API as: > {code} > def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]): > Array[Callable[Map[Int, M]]] > {code} > Let me use an example to explain the API: > {quote} > It could be possible to still use the current parallelism and still allow > for model-specific optimizations. For example, if we doing cross validation > and have a param map with regParam = (0.1, 0.3) and maxIter = (5, 10). Lets > say that the cross validator could know that maxIter is optimized for the > model being evaluated (e.g. a new method in Estimator that return such > params). It would then be straightforward for the cross validator to remove > maxIter from the param map that will be parallelized over and use it to > create 2 arrays of paramMaps: ((regParam=0.1, maxIter=5), (regParam=0.1, > maxIter=10)) and ((regParam=0.3, maxIter=5), (regParam=0.3, maxIter=10)). > {quote} > In this example, we can see that, models computed from ((regParam=0.1, > maxIter=5), (regParam=0.1, maxIter=10)) can only be computed in one thread > code, models computed from ((regParam=0.3, maxIter=5), (regParam=0.3, > maxIter=10)) in another thread. 
In this example, there're 4 paramMaps, but > we can at most generate two threads to compute the models for them. > The API above allow "callable.call()" to return multiple models, and return > type is {code}Map[Int, M]{code}, key is integer, used to mark the paramMap > index for corresponding model. Use the example above, there're 4 paramMaps, > but only return 2 callable objects, one callable object for ((regParam=0.1, > maxIter=5), (regParam=0.1, maxIter=10)), another one for ((regParam=0.3, > maxIter=5), (regParam=0.3, maxIter=10)). > and the default "fitCallables/fit with paramMaps" can be implemented as > following: > {code} > def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]): > Array[Callable[Map[Int, M]]] = { > paramMaps.zipWithIndex.map { case (paramMap: ParamMap, index: Int) => > new Callable[Map[Int, M]] { > override def call(): Map[Int, M] = Map(index -> fit(dataset, paramMap)) > } > } > } > def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = { >fitCallables(dataset, paramMaps).map { _.call().toSeq } > .flatMap(_).sortBy(_._1).map(_._2) > } > {code} > If use the
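The grouping step in the quoted regParam/maxIter example can be sketched as follows. The dict-based paramMaps and the `optimized_keys` argument are illustrative simplifications (Spark's ParamMap is a richer type), not the proposed API:

```python
from itertools import groupby

def group_param_maps(param_maps, optimized_keys):
    """Group paramMaps so that maps differing only in model-optimized
    params (e.g. maxIter) land in one group, which a single task can
    then fit incrementally."""
    def fixed_part(pm):
        # The non-optimized params identify the group.
        return tuple(sorted((k, v) for k, v in pm.items()
                            if k not in optimized_keys))
    ordered = sorted(param_maps, key=fixed_part)
    return [list(group) for _, group in groupby(ordered, key=fixed_part)]
```

With regParam in (0.1, 0.3), maxIter in (5, 10), and maxIter marked as optimized, this yields the two groups of two paramMaps from the quoted example, each fittable by one thread.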
[jira] [Created] (SPARK-22949) Reduce memory requirement for TrainValidationSplit
Bago Amirbekian created SPARK-22949: --- Summary: Reduce memory requirement for TrainValidationSplit Key: SPARK-22949 URL: https://issues.apache.org/jira/browse/SPARK-22949 Project: Spark Issue Type: Bug Components: ML Affects Versions: 2.3.0 Reporter: Bago Amirbekian Priority: Critical There was a fix in {{ CrossValidator }} to reduce memory usage on the driver; the same patch should be applied to {{ TrainValidationSplit }}. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22922) Python API for fitMultiple
Bago Amirbekian created SPARK-22922: --- Summary: Python API for fitMultiple Key: SPARK-22922 URL: https://issues.apache.org/jira/browse/SPARK-22922 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 2.2.0 Reporter: Bago Amirbekian Implement fitMultiple method on Estimator for pyspark. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
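A minimal sketch of what a default fitMultiple iterator could look like in Python, based only on the proposal quoted in this thread; the class name and the `fit_single` callback are hypothetical, not the final pyspark API:

```python
import threading

class FitMultipleIterator:
    """Thread-safe iterator of (index, model) pairs.

    fit_single(index) stands in for fitting one model for
    paramMaps[index]. Multiple threads may call next() concurrently;
    the lock guarantees each index is handed out exactly once.
    """
    def __init__(self, fit_single, num_models):
        self.fit_single = fit_single
        self.num_models = num_models
        self.counter = 0
        self.lock = threading.Lock()

    def __iter__(self):
        return self

    def __next__(self):
        # Claim the next index under the lock, fit outside it so
        # long-running fits don't serialize each other.
        with self.lock:
            index = self.counter
            if index >= self.num_models:
                raise StopIteration
            self.counter += 1
        return index, self.fit_single(index)
```

An optimized estimator would override this default with an iterator that orders or batches the fits, while callers still consume plain (index, model) pairs.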
[jira] [Comment Edited] (SPARK-22126) Fix model-specific optimization support for ML tuning
[ https://issues.apache.org/jira/browse/SPARK-22126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16295964#comment-16295964 ] Bago Amirbekian edited comment on SPARK-22126 at 12/19/17 12:55 AM: Anyone who's following might want to scan the design doc (in the links above), the latest api proposal is: {code} def fitMultiple( dataset: Dataset[_], paramMaps: Array[ParamMap] ): java.util.Iterator[scala.Tuple2[java.lang.Integer, Model]] {code} was (Author: bago.amirbekian): Anyone who's following might want to scan the design doc, the latest api proposal is: {code} def fitMultiple( dataset: Dataset[_], paramMaps: Array[ParamMap] ): java.util.Iterator[scala.Tuple2[java.lang.Integer, Model]] {code} > Fix model-specific optimization support for ML tuning > - > > Key: SPARK-22126 > URL: https://issues.apache.org/jira/browse/SPARK-22126 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Weichen Xu > > Fix model-specific optimization support for ML tuning. This is discussed in > SPARK-19357 > more discussion is here > https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0 > I copy discussion from gist to here: > I propose to design API as: > {code} > def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]): > Array[Callable[Map[Int, M]]] > {code} > Let me use an example to explain the API: > {quote} > It could be possible to still use the current parallelism and still allow > for model-specific optimizations. For example, if we doing cross validation > and have a param map with regParam = (0.1, 0.3) and maxIter = (5, 10). Lets > say that the cross validator could know that maxIter is optimized for the > model being evaluated (e.g. a new method in Estimator that return such > params). 
It would then be straightforward for the cross validator to remove > maxIter from the param map that will be parallelized over and use it to > create 2 arrays of paramMaps: ((regParam=0.1, maxIter=5), (regParam=0.1, > maxIter=10)) and ((regParam=0.3, maxIter=5), (regParam=0.3, maxIter=10)). > {quote} > In this example, we can see that, models computed from ((regParam=0.1, > maxIter=5), (regParam=0.1, maxIter=10)) can only be computed in one thread > code, models computed from ((regParam=0.3, maxIter=5), (regParam=0.3, > maxIter=10)) in another thread. In this example, there're 4 paramMaps, but > we can at most generate two threads to compute the models for them. > The API above allow "callable.call()" to return multiple models, and return > type is {code}Map[Int, M]{code}, key is integer, used to mark the paramMap > index for corresponding model. Use the example above, there're 4 paramMaps, > but only return 2 callable objects, one callable object for ((regParam=0.1, > maxIter=5), (regParam=0.1, maxIter=10)), another one for ((regParam=0.3, > maxIter=5), (regParam=0.3, maxIter=10)). 
> and the default "fitCallables/fit with paramMaps" can be implemented as > following: > {code} > def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]): > Array[Callable[Map[Int, M]]] = { > paramMaps.zipWithIndex.map { case (paramMap: ParamMap, index: Int) => > new Callable[Map[Int, M]] { > override def call(): Map[Int, M] = Map(index -> fit(dataset, paramMap)) > } > } > } > def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = { >fitCallables(dataset, paramMaps).map { _.call().toSeq } > .flatMap(_).sortBy(_._1).map(_._2) > } > {code} > If use the API I proposed above, the code in > [CrossValidation|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala#L149-L159] > can be changed to: > {code} > val trainingDataset = sparkSession.createDataFrame(training, > schema).cache() > val validationDataset = sparkSession.createDataFrame(validation, > schema).cache() > // Fit models in a Future for training in parallel > val modelMapFutures = fitCallables(trainingDataset, paramMaps).map { > callable => > Future[Map[Int, Model[_]]] { > val modelMap = callable.call() > if (collectSubModelsParam) { >... > } > modelMap > } (executionContext) > } > // Unpersist training data only when all models have trained > Future.sequence[Model[_], Iterable](modelMapFutures)(implicitly, > executionContext) > .onComplete { _ => trainingDataset.unpersist() } (executionContext) > // Evaluate models in a Future that will calulate a metric and allow > model to be cleaned up > val foldMetricMapFutures = modelMapFutures.map { modelMapFuture => > modelMapFuture.map { mode
[jira] [Commented] (SPARK-22126) Fix model-specific optimization support for ML tuning
[ https://issues.apache.org/jira/browse/SPARK-22126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16295964#comment-16295964 ] Bago Amirbekian commented on SPARK-22126: - Anyone who's following might want to scan the design doc, the latest api proposal is: {code} def fitMultiple( dataset: Dataset[_], paramMaps: Array[ParamMap] ): java.util.Iterator[scala.Tuple2[java.lang.Integer, Model]] {code} > Fix model-specific optimization support for ML tuning > - > > Key: SPARK-22126 > URL: https://issues.apache.org/jira/browse/SPARK-22126 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Weichen Xu > > Fix model-specific optimization support for ML tuning. This is discussed in > SPARK-19357 > more discussion is here > https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0 > I copy discussion from gist to here: > I propose to design API as: > {code} > def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]): > Array[Callable[Map[Int, M]]] > {code} > Let me use an example to explain the API: > {quote} > It could be possible to still use the current parallelism and still allow > for model-specific optimizations. For example, if we doing cross validation > and have a param map with regParam = (0.1, 0.3) and maxIter = (5, 10). Lets > say that the cross validator could know that maxIter is optimized for the > model being evaluated (e.g. a new method in Estimator that return such > params). It would then be straightforward for the cross validator to remove > maxIter from the param map that will be parallelized over and use it to > create 2 arrays of paramMaps: ((regParam=0.1, maxIter=5), (regParam=0.1, > maxIter=10)) and ((regParam=0.3, maxIter=5), (regParam=0.3, maxIter=10)). 
> {quote} > In this example, we can see that models computed from ((regParam=0.1, > maxIter=5), (regParam=0.1, maxIter=10)) can only be computed in one thread, > and models computed from ((regParam=0.3, maxIter=5), (regParam=0.3, > maxIter=10)) in another thread. In this example, there are 4 paramMaps, but > we can at most generate two threads to compute the models for them. > The API above allows "callable.call()" to return multiple models; the return > type is {code}Map[Int, M]{code}, where the integer key marks the paramMap > index for the corresponding model. Using the example above, there are 4 paramMaps, > but we only return 2 callable objects: one callable object for ((regParam=0.1, > maxIter=5), (regParam=0.1, maxIter=10)), another one for ((regParam=0.3, > maxIter=5), (regParam=0.3, maxIter=10)). > and the default "fitCallables/fit with paramMaps" can be implemented as > follows: > {code} > def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]): > Array[Callable[Map[Int, M]]] = { > paramMaps.zipWithIndex.map { case (paramMap: ParamMap, index: Int) => > new Callable[Map[Int, M]] { > override def call(): Map[Int, M] = Map(index -> fit(dataset, paramMap)) > } > } > } > def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = { > fitCallables(dataset, paramMaps).flatMap { _.call().toSeq } > .sortBy(_._1).map(_._2) > } > {code} > If we use the API I proposed above, the code in > [CrossValidation|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala#L149-L159] > can be changed to: > {code} > val trainingDataset = sparkSession.createDataFrame(training, > schema).cache() > val validationDataset = sparkSession.createDataFrame(validation, > schema).cache() > // Fit models in a Future for training in parallel > val modelMapFutures = fitCallables(trainingDataset, paramMaps).map { > callable => > Future[Map[Int, Model[_]]] { > val modelMap = callable.call() > if (collectSubModelsParam) { >... 
> } > modelMap > } (executionContext) > } > // Unpersist training data only when all models have trained > Future.sequence[Model[_], Iterable](modelMapFutures)(implicitly, > executionContext) > .onComplete { _ => trainingDataset.unpersist() } (executionContext) > // Evaluate models in a Future that will calculate a metric and allow > model to be cleaned up > val foldMetricMapFutures = modelMapFutures.map { modelMapFuture => > modelMapFuture.map { modelMap => > modelMap.map { case (index: Int, model: Model[_]) => > val metric = eval.evaluate(model.transform(validationDataset, > paramMaps(index))) > (index, metric) > } > } (executionContext) > } > // Wait for metrics to be calculated before unpersisting validation > dataset >
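To make the grouping idea in the quoted default implementation concrete, here is a hedged, Spark-free Python sketch (the names `fit_callables`, `fit`, and the `optimized_key` parameter are hypothetical illustrations, not Spark's API): param maps that differ only in the optimized parameter are grouped together, each group yields one callable, and every callable returns a dict mapping the original paramMap index to its model.

```python
from itertools import groupby

def fit(dataset, param_map):
    # Toy "model": just record what it was trained with.
    return ("model", tuple(sorted(param_map.items())))

def fit_callables(dataset, param_maps, optimized_key="maxIter"):
    """Group param maps that differ only in `optimized_key`; one callable
    per group, each returning {original_index: model}."""
    indexed = list(enumerate(param_maps))

    def group_key(item):
        _, pm = item
        return tuple(sorted((k, v) for k, v in pm.items() if k != optimized_key))

    # groupby requires adjacent equal keys, so sort first.
    indexed.sort(key=group_key)
    callables = []
    for _, grp in groupby(indexed, key=group_key):
        grp = list(grp)

        def call(grp=grp):
            # In real code a single training pass could serve the whole group.
            return {i: fit(dataset, pm) for i, pm in grp}

        callables.append(call)
    return callables

param_maps = [
    {"regParam": 0.1, "maxIter": 5}, {"regParam": 0.1, "maxIter": 10},
    {"regParam": 0.3, "maxIter": 5}, {"regParam": 0.3, "maxIter": 10},
]
calls = fit_callables(None, param_maps)
print(len(calls))  # 2 callables for 4 param maps

models = {}
for c in calls:
    models.update(c())
print(sorted(models))  # [0, 1, 2, 3]
```

This mirrors the example above: 4 paramMaps collapse into 2 callables, and the index keys let the caller reassemble models in the original paramMap order.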
[jira] [Updated] (SPARK-22811) pyspark.ml.tests is missing a py4j import.
[ https://issues.apache.org/jira/browse/SPARK-22811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bago Amirbekian updated SPARK-22811: Priority: Minor (was: Major) > pyspark.ml.tests is missing a py4j import. > -- > > Key: SPARK-22811 > URL: https://issues.apache.org/jira/browse/SPARK-22811 > Project: Spark > Issue Type: Bug > Components: ML, PySpark >Affects Versions: 2.3.0 >Reporter: Bago Amirbekian >Priority: Minor > > This bug isn't getting caught because the relevant code only gets run if the > test environment does not have Hive. > https://github.com/apache/spark/blob/46776234a49742e94c64897322500582d7393d35/python/pyspark/ml/tests.py#L1866 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22811) pyspark.ml.tests is missing a py4j import.
Bago Amirbekian created SPARK-22811: --- Summary: pyspark.ml.tests is missing a py4j import. Key: SPARK-22811 URL: https://issues.apache.org/jira/browse/SPARK-22811 Project: Spark Issue Type: Bug Components: ML, PySpark Affects Versions: 2.3.0 Reporter: Bago Amirbekian This bug isn't getting caught because the relevant code only gets run if the test environment does not have Hive. https://github.com/apache/spark/blob/46776234a49742e94c64897322500582d7393d35/python/pyspark/ml/tests.py#L1866 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
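A quick illustration of why a bug like this slips through CI: the py4j-dependent branch only executes when Hive is absent, so a missing import surfaces as a NameError only in no-Hive environments. This is a hedged, standalone sketch (the `has_hive` flag and function are hypothetical, not the actual test code):

```python
def branch_only_without_hive(has_hive):
    if has_hive:
        return "skipped py4j branch"
    # In the real test this branch needs `from py4j.protocol import Py4JJavaError`;
    # without the import, Python raises NameError only when this branch runs.
    return Py4JJavaError  # noqa: F821

# The Hive path never touches the undefined name, so the bug stays hidden:
print(branch_only_without_hive(True))

try:
    branch_only_without_hive(False)
except NameError:
    print("NameError surfaces only in the no-Hive environment")
```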
[jira] [Commented] (SPARK-22126) Fix model-specific optimization support for ML tuning
[ https://issues.apache.org/jira/browse/SPARK-22126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16283145#comment-16283145 ] Bago Amirbekian commented on SPARK-22126: - Joseph, the way I read your comment is that we should support parallelism and optimized models, but not parallelism combined with optimized models. I think that would cover our current use cases, but I'm wondering if we want to leave open the possibility of optimizing parameters like maxIter & maxDepth and having those optimized implementations play nice with parallelism in CrossValidator. I normally believe in doing the simple thing first and then changing it if needed, but here that would require adding another public API later. > Fix model-specific optimization support for ML tuning > - > > Key: SPARK-22126 > URL: https://issues.apache.org/jira/browse/SPARK-22126 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Weichen Xu > > Fix model-specific optimization support for ML tuning. This is discussed in > SPARK-19357 > more discussion is here > https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0 > I copy the discussion from the gist here: > I propose to design the API as: > {code} > def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]): > Array[Callable[Map[Int, M]]] > {code} > Let me use an example to explain the API: > {quote} > It could be possible to still use the current parallelism and still allow > for model-specific optimizations. For example, suppose we're doing cross validation > and have a param map with regParam = (0.1, 0.3) and maxIter = (5, 10). Let's > say that the cross validator could know that maxIter is optimized for the > model being evaluated (e.g. a new method in Estimator that returns such > params). 
It would then be straightforward for the cross validator to remove > maxIter from the param map that will be parallelized over and use it to > create 2 arrays of paramMaps: ((regParam=0.1, maxIter=5), (regParam=0.1, > maxIter=10)) and ((regParam=0.3, maxIter=5), (regParam=0.3, maxIter=10)). > {quote} > In this example, we can see that models computed from ((regParam=0.1, > maxIter=5), (regParam=0.1, maxIter=10)) can only be computed in one thread, > and models computed from ((regParam=0.3, maxIter=5), (regParam=0.3, > maxIter=10)) in another thread. In this example, there are 4 paramMaps, but > we can at most generate two threads to compute the models for them. > The API above allows "callable.call()" to return multiple models; the return > type is {code}Map[Int, M]{code}, where the integer key marks the paramMap > index for the corresponding model. Using the example above, there are 4 paramMaps, > but we only return 2 callable objects: one callable object for ((regParam=0.1, > maxIter=5), (regParam=0.1, maxIter=10)), another one for ((regParam=0.3, > maxIter=5), (regParam=0.3, maxIter=10)). 
> and the default "fitCallables/fit with paramMaps" can be implemented as > follows: > {code} > def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]): > Array[Callable[Map[Int, M]]] = { > paramMaps.zipWithIndex.map { case (paramMap: ParamMap, index: Int) => > new Callable[Map[Int, M]] { > override def call(): Map[Int, M] = Map(index -> fit(dataset, paramMap)) > } > } > } > def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = { > fitCallables(dataset, paramMaps).flatMap { _.call().toSeq } > .sortBy(_._1).map(_._2) > } > {code} > If we use the API I proposed above, the code in > [CrossValidation|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala#L149-L159] > can be changed to: > {code} > val trainingDataset = sparkSession.createDataFrame(training, > schema).cache() > val validationDataset = sparkSession.createDataFrame(validation, > schema).cache() > // Fit models in a Future for training in parallel > val modelMapFutures = fitCallables(trainingDataset, paramMaps).map { > callable => > Future[Map[Int, Model[_]]] { > val modelMap = callable.call() > if (collectSubModelsParam) { >... > } > modelMap > } (executionContext) > } > // Unpersist training data only when all models have trained > Future.sequence[Model[_], Iterable](modelMapFutures)(implicitly, > executionContext) > .onComplete { _ => trainingDataset.unpersist() } (executionContext) > // Evaluate models in a Future that will calculate a metric and allow > model to be cleaned up > val foldMetricMapFutures = modelMapFutures.map { modelMapFuture => > modelMapFuture.map { modelMap => > modelMap.map { case (index: Int, model: Mo
[jira] [Created] (SPARK-22734) Create Python API for VectorSizeHint
Bago Amirbekian created SPARK-22734: --- Summary: Create Python API for VectorSizeHint Key: SPARK-22734 URL: https://issues.apache.org/jira/browse/SPARK-22734 Project: Spark Issue Type: Improvement Components: ML, PySpark Affects Versions: 2.2.0 Reporter: Bago Amirbekian -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22735) Add VectorSizeHint to ML features documentation
Bago Amirbekian created SPARK-22735: --- Summary: Add VectorSizeHint to ML features documentation Key: SPARK-22735 URL: https://issues.apache.org/jira/browse/SPARK-22735 Project: Spark Issue Type: Improvement Components: ML Affects Versions: 2.2.0 Reporter: Bago Amirbekian -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22126) Fix model-specific optimization support for ML tuning
[ https://issues.apache.org/jira/browse/SPARK-22126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16279537#comment-16279537 ] Bago Amirbekian edited comment on SPARK-22126 at 12/6/17 2:19 AM: -- [~WeichenXu123] Sorry, I misunderstood; I thought you wanted to use java.util.concurrent.Callable. Making our own trait makes sense. https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/Callable.html was (Author: bago.amirbekian): [~WeichenXu123] Sorry, I misunderstood; I thought you wanted to use java.util.concurrent.Callable. https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/Callable.html > Fix model-specific optimization support for ML tuning > - > > Key: SPARK-22126 > URL: https://issues.apache.org/jira/browse/SPARK-22126 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Weichen Xu > > Fix model-specific optimization support for ML tuning. This is discussed in > SPARK-19357 > more discussion is here > https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22126) Fix model-specific optimization support for ML tuning
[ https://issues.apache.org/jira/browse/SPARK-22126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16279537#comment-16279537 ] Bago Amirbekian commented on SPARK-22126: - [~WeichenXu123] Sorry, I misunderstood; I thought you wanted to use java.util.concurrent.Callable. https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/Callable.html > Fix model-specific optimization support for ML tuning > - > > Key: SPARK-22126 > URL: https://issues.apache.org/jira/browse/SPARK-22126 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Weichen Xu > > Fix model-specific optimization support for ML tuning. This is discussed in > SPARK-19357 > more discussion is here > https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22126) Fix model-specific optimization support for ML tuning
[ https://issues.apache.org/jira/browse/SPARK-22126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16279218#comment-16279218 ] Bago Amirbekian edited comment on SPARK-22126 at 12/5/17 9:58 PM: -- I started a discussion about potential fixes to this issue on this [gist|https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0]. I'm going to summarize the gist here and encourage further discussion to take place on this JIRA to increase visibility of the discussion. I proposed that we add a new method to the {{Estimator}} interface, {{fitMultiple(dataset, paramMaps): Array[Callable[Model]]}}. The purpose of this method is to allow estimators to implement model-specific optimizations for fitting each model with multiple paramMaps. This API will also be used by {{CrossValidator}} and other meta transformers when fitting multiple models in parallel. [~WeichenXu123] suggested modifying the API to {{fitMultiple(dataset: Dataset[_], paramMaps: Array[ParamMap]): Array[Callable[Map[Int, M]]]}}. The reasoning is that allowing each callable to return multiple models will make it easier to efficiently schedule these tasks in parallel (eg we will avoid scheduling A and B where B simply waits on A). was (Author: bago.amirbekian): I started a discussion about potential fixes to this issue on this [gist|https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0]. I'm going to summarize the gist here and encourage further discussion to take place on this JIRA to increase visibility of the discussion. I proposed that we add a new method to the `Estimator` interface, `fitMultiple(dataset, paramMaps): Array[Callable[Model]]`. The purpose of this method is to allow estimators to implement model-specific optimizations for fitting each model with multiple paramMaps. This API will also be used by `CrossValidator` and other meta transformers when fitting multiple models in parallel. 
[~WeichenXu123] suggested modifying the API to `fitMultiple(dataset: Dataset[_], paramMaps: Array[ParamMap]): Array[Callable[Map[Int, M]]]`. The reasoning is that allowing each callable to return multiple models will make it easier to efficiently schedule these tasks in parallel (eg we will avoid scheduling A and B where B simply waits on A). > Fix model-specific optimization support for ML tuning > - > > Key: SPARK-22126 > URL: https://issues.apache.org/jira/browse/SPARK-22126 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Weichen Xu > > Fix model-specific optimization support for ML tuning. This is discussed in > SPARK-19357 > more discussion is here > https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-22126) Fix model-specific optimization support for ML tuning
[ https://issues.apache.org/jira/browse/SPARK-22126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16279218#comment-16279218 ] Bago Amirbekian edited comment on SPARK-22126 at 12/5/17 9:53 PM: -- I started a discussion about potential fixes to this issue on this [gist|https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0]. I'm going to summarize the gist here and encourage further discussion to take place on this JIRA to increase visibility of the discussion. I proposed that we add a new method to the `Estimator` interface, `fitMultiple(dataset, paramMaps): Array[Callable[Model]]`. The purpose of this method is to allow estimators to implement model-specific optimizations for fitting each model with multiple paramMaps. This API will also be used by `CrossValidator` and other meta transformers when fitting multiple models in parallel. [~WeichenXu123] suggested modifying the API to `fitMultiple(dataset: Dataset[_], paramMaps: Array[ParamMap]): Array[Callable[Map[Int, M]]]`. The reasoning is that allowing each callable to return multiple models will make it easier to efficiently schedule these tasks in parallel (eg we will avoid scheduling A and B where B simply waits on A). was (Author: bago.amirbekian): I started a discussion about potential fixes to this issue on this gist. I'm going to summarize the gist here and encourage further discussion to take place on this JIRA to increase visibility of the discussion. I proposed that we add a new method to the `Estimator` interface, `fitMultiple(dataset, paramMaps): Array[Callable[Model]]`. The purpose of this method is to allow estimators to implement model-specific optimizations for fitting each model with multiple paramMaps. This API will also be used by `CrossValidator` and other meta transformers when fitting multiple models in parallel. 
[~WeichenXu123] suggested modifying the API to `fitMultiple(dataset: Dataset[_], paramMaps: Array[ParamMap]): Array[Callable[Map[Int, M]]]`. The reasoning is that allowing each callable to return multiple models will make it easier to efficiently schedule these tasks in parallel (eg we will avoid scheduling A and B where B simply waits on A). > Fix model-specific optimization support for ML tuning > - > > Key: SPARK-22126 > URL: https://issues.apache.org/jira/browse/SPARK-22126 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Weichen Xu > > Fix model-specific optimization support for ML tuning. This is discussed in > SPARK-19357 > more discussion is here > https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22126) Fix model-specific optimization support for ML tuning
[ https://issues.apache.org/jira/browse/SPARK-22126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16279218#comment-16279218 ] Bago Amirbekian commented on SPARK-22126: - I started a discussion about potential fixes to this issue on this gist. I'm going to summarize the gist here and encourage further discussion to take place on this JIRA to increase visibility of the discussion. I proposed that we add a new method to the `Estimator` interface, `fitMultiple(dataset, paramMaps): Array[Callable[Model]]`. The purpose of this method is to allow estimators to implement model-specific optimizations for fitting each model with multiple paramMaps. This API will also be used by `CrossValidator` and other meta transformers when fitting multiple models in parallel. [~WeichenXu123] suggested modifying the API to `fitMultiple(dataset: Dataset[_], paramMaps: Array[ParamMap]): Array[Callable[Map[Int, M]]]`. The reasoning is that allowing each callable to return multiple models will make it easier to efficiently schedule these tasks in parallel (eg we will avoid scheduling A and B where B simply waits on A). > Fix model-specific optimization support for ML tuning > - > > Key: SPARK-22126 > URL: https://issues.apache.org/jira/browse/SPARK-22126 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Weichen Xu > > Fix model-specific optimization support for ML tuning. This is discussed in > SPARK-19357 > more discussion is here > https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0 -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
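The scheduling rationale above (never submit a task B that merely waits on a sibling task A) can be sketched with plain `concurrent.futures`. Everything here is a hedged stand-in for the Spark internals: the two lambdas play the role of callables that each train a whole group of models, and `unpersist_training_data` is a toy placeholder for releasing the cached training DataFrame.

```python
from concurrent.futures import ThreadPoolExecutor, wait

def make_callables():
    # Each callable trains a *group* of models and returns {index: model},
    # so the pool never holds a task that only waits on a sibling task.
    return [
        lambda: {0: "m0", 1: "m1"},   # e.g. regParam=0.1, maxIter in (5, 10)
        lambda: {2: "m2", 3: "m3"},   # e.g. regParam=0.3, maxIter in (5, 10)
    ]

events = []

def unpersist_training_data():
    events.append("unpersist")

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(c) for c in make_callables()]
    wait(futures)                      # all groups trained
    unpersist_training_data()          # only now release the cached data
    models = {}
    for f in futures:
        models.update(f.result())

print(sorted(models))   # [0, 1, 2, 3]
print(events)           # ['unpersist']
```

The index keys let the caller reassemble results in the original paramMap order, exactly as in the proposed `Array[Callable[Map[Int, M]]]` signature.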
[jira] [Commented] (SPARK-20586) Add deterministic to ScalaUDF
[ https://issues.apache.org/jira/browse/SPARK-20586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16261393#comment-16261393 ] Bago Amirbekian commented on SPARK-20586: - Also a follow-up question: are there performance implications to using the `deterministic` flag that we should try to avoid by restructuring the ML UDFs so they don't raise within the UDF? > Add deterministic to ScalaUDF > - > > Key: SPARK-20586 > URL: https://issues.apache.org/jira/browse/SPARK-20586 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.3.0 > > > https://hive.apache.org/javadocs/r2.0.1/api/org/apache/hadoop/hive/ql/udf/UDFType.html > Like Hive UDFType, we should allow users to add the extra flags for ScalaUDF > too. {{stateful}}/{{impliesOrder}} are not applicable to ScalaUDF. Thus, we > only add the following two flags. > - deterministic: Certain optimizations should not be applied if UDF is not > deterministic. Deterministic UDF returns same result each time it is invoked > with a particular input. This determinism just needs to hold within the > context of a query. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20586) Add deterministic to ScalaUDF
[ https://issues.apache.org/jira/browse/SPARK-20586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16261377#comment-16261377 ] Bago Amirbekian commented on SPARK-20586: - Is there some documentation somewhere about the right way to use this `deterministic` flag? I bring it up because in `spark.ml` we sometimes raise errors in a UDF when a DataFrame contains invalid data. This can cause bad behavior if the optimizer re-orders the UDF relative to other operations, so we're marking these UDFs as `nonDeterministic` (https://github.com/apache/spark/pull/19662), but that somehow seems wrong. The issue isn't that the UDFs are non-deterministic; they are deterministic and always raise on the same inputs. I guess my questions are: 1) Is this correct usage of the non-deterministic flag, or are we simply abusing it when we should come up with a more specific solution? 2) If this is correct usage, could we rename the flag or document this type of usage somewhere? > Add deterministic to ScalaUDF > - > > Key: SPARK-20586 > URL: https://issues.apache.org/jira/browse/SPARK-20586 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Xiao Li >Assignee: Xiao Li > Fix For: 2.3.0 > > > https://hive.apache.org/javadocs/r2.0.1/api/org/apache/hadoop/hive/ql/udf/UDFType.html > Like Hive UDFType, we should allow users to add the extra flags for ScalaUDF > too. {{stateful}}/{{impliesOrder}} are not applicable to ScalaUDF. Thus, we > only add the following two flags. > - deterministic: Certain optimizations should not be applied if UDF is not > deterministic. Deterministic UDF returns same result each time it is invoked > with a particular input. This determinism just needs to hold within the > context of a query. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
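The hazard being discussed can be reproduced without Spark: if an optimizer is free to move a "deterministic" function past a filter, the function may run on rows the filter would have removed. A hedged toy model of that reordering (`strict_sqrt` is a hypothetical stand-in for an ML UDF that raises on invalid input):

```python
def strict_sqrt(x):
    # Deterministic, but raises on invalid input -- like the ML UDFs described.
    if x < 0:
        raise ValueError("negative input")
    return x ** 0.5

rows = [4.0, -1.0, 9.0]

# The plan the author intends: filter invalid rows *before* applying the UDF.
safe = [strict_sqrt(x) for x in rows if x >= 0]
print(safe)  # [2.0, 3.0]

# A reordered plan (legal if the function is assumed freely movable because it
# is deterministic) applies the UDF first -- and blows up on the bad row:
try:
    reordered = [y for y in (strict_sqrt(x) for x in rows) if y >= 0]
except ValueError as e:
    print("reordered plan failed:", e)
```

Marking the UDF non-deterministic pins its position in the plan, which is why it "works", even though determinism is not really the property at stake.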
[jira] [Commented] (SPARK-22346) Update VectorAssembler to work with Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-22346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16248049#comment-16248049 ] Bago Amirbekian commented on SPARK-22346: - I think [~josephkb]'s version of Option 3 makes the most sense. A transformer that adds size data to a vector column would allow patching pipelines pretty easily, and it could be implemented without breaking any APIs. I'm currently working on a PR based on this approach. > Update VectorAssembler to work with Structured Streaming > > > Key: SPARK-22346 > URL: https://issues.apache.org/jira/browse/SPARK-22346 > Project: Spark > Issue Type: Improvement > Components: ML, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Bago Amirbekian >Priority: Critical > > The issue > In batch mode, VectorAssembler can take multiple columns of VectorType and > assemble and output a new column of VectorType containing the concatenated > vectors. In streaming mode, this transformation can fail because > VectorAssembler does not have enough information to produce metadata > (AttributeGroup) for the new column. Because VectorAssembler is such a > ubiquitous part of mllib pipelines, this issue effectively means spark > structured streaming does not support prediction using mllib pipelines. > I've created this ticket so we can discuss ways to potentially improve > VectorAssembler. Please let me know if there are any issues I have not > considered or potential fixes I haven't outlined. I'm happy to submit a patch > once I know which strategy is the best approach. > Potential fixes > 1) Replace VectorAssembler with an estimator/model pair, as was recently > done with OneHotEncoder, > [SPARK-13030|https://issues.apache.org/jira/browse/SPARK-13030]. The > Estimator can "learn" the size of the input vectors during training and save > it to use during prediction. 
> Pros: > * Possibly simplest of the potential fixes > Cons: > * We'll need to deprecate current VectorAssembler > 2) Drop the metadata (ML Attributes) from Vector columns. This is a pretty > major change, but it could be done in stages. We could first ensure that > metadata is not used during prediction and allow the VectorAssembler to drop > metadata for streaming dataframes. Going forward, it would be important to > not use any metadata on Vector columns for any prediction tasks. > Pros: > * Potentially, easy short term fix for VectorAssembler > (drop metadata for vector columns in streaming). > * Current Attributes implementation is also causing other issues, eg > [SPARK-19141|https://issues.apache.org/jira/browse/SPARK-19141]. > Cons: > * To fully remove ML Attributes would be a major refactor of MLlib and would > most likely require breaking changes. > * A partial removal of ML attributes (eg: ensure ML attributes are not used > during transform, only during fit) might be tricky. This would require > testing or another enforcement mechanism to prevent regressions. > 3) Require Vector columns to have fixed length vectors. Most mllib > transformers that produce vectors already include the size of the vector in > the column metadata. This change would be to deprecate APIs that allow > creating a vector column of unknown length and replace those APIs with > equivalents that enforce a fixed size. > Pros: > * We already treat vectors as fixed size, for example VectorAssembler assumes > the inputs & output col are fixed size vectors and creates metadata > accordingly. In the spirit of explicit is better than implicit, we would be > codifying something we already assume. > * This could potentially enable performance optimizations that are only > possible if the Vector size of a column is fixed & known. > Cons: > * This would require breaking changes. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
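A minimal sketch of what the size-hint transformer endorsed above does (hypothetical names and plain Python lists of dicts; the real VectorSizeHint is a Spark ML transformer, not this API): it stamps a declared size onto a column and fails fast, or skips rows, when a vector doesn't match, so downstream assembly can trust the metadata even on a stream.

```python
def vector_size_hint(rows, col, size, handle_invalid="error"):
    """Attach a size to `col` and validate each row's vector against it.

    handle_invalid: "error" raises on mismatch, "skip" drops the row.
    (Mirrors the spirit, not the API, of Spark's VectorSizeHint.)
    """
    out = []
    for row in rows:
        vec = row[col]
        if len(vec) != size:
            if handle_invalid == "error":
                raise ValueError(f"expected size {size}, got {len(vec)}")
            continue  # "skip" mode: drop the bad row
        out.append(row)
    metadata = {col: {"size": size}}   # what streaming assembly needs up front
    return out, metadata

rows = [{"features": [1.0, 2.0]}, {"features": [3.0]}]
kept, meta = vector_size_hint(rows, "features", 2, handle_invalid="skip")
print(len(kept), meta)  # 1 {'features': {'size': 2}}
```

Because the size is declared rather than inferred from data, the metadata exists before any rows arrive, which is exactly what the streaming case lacks.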
[jira] [Commented] (SPARK-22346) Update VectorAssembler to work with StreamingDataframes
[ https://issues.apache.org/jira/browse/SPARK-22346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219109#comment-16219109 ] Bago Amirbekian commented on SPARK-22346: - Nick, I see that option as a stepping stone to option 2 above (drop metadata for vector columns). If we create this param, we're in essence saying to downstream transformers, "don't rely on the availability of metadata". And if that's the case, metadata should be considered an optimization, and the presence/absence of metadata should never change the result. > Update VectorAssembler to work with StreamingDataframes > --- > > Key: SPARK-22346 > URL: https://issues.apache.org/jira/browse/SPARK-22346 > Project: Spark > Issue Type: Improvement > Components: ML, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Bago Amirbekian >Priority: Critical > > The issue > In batch mode, VectorAssembler can take multiple columns of VectorType and > assemble and output a new column of VectorType containing the concatenated > vectors. In streaming mode, this transformation can fail because > VectorAssembler does not have enough information to produce metadata > (AttributeGroup) for the new column. Because VectorAssembler is such a > ubiquitous part of mllib pipelines, this issue effectively means spark > structured streaming does not support prediction using mllib pipelines. > I've created this ticket so we can discuss ways to potentially improve > VectorAssembler. Please let me know if there are any issues I have not > considered or potential fixes I haven't outlined. I'm happy to submit a patch > once I know which strategy is the best approach. > Potential fixes > 1) Replace VectorAssembler with an estimator/model pair, as was recently > done with OneHotEncoder, > [SPARK-13030|https://issues.apache.org/jira/browse/SPARK-13030]. The > Estimator can "learn" the size of the input vectors during training and save > it to use during prediction. 
> Pros: > * Possibly simplest of the potential fixes > Cons: > * We'll need to deprecate current VectorAssembler > 2) Drop the metadata (ML Attributes) from Vector columns. This is a pretty > major change, but it could be done in stages. We could first ensure that > metadata is not used during prediction and allow the VectorAssembler to drop > metadata for streaming dataframes. Going forward, it would be important to > not use any metadata on Vector columns for any prediction tasks. > Pros: > * Potentially, easy short term fix for VectorAssembler > (drop metadata for vector columns in streaming). > * Current Attributes implementation is also causing other issues, eg > [SPARK-19141|https://issues.apache.org/jira/browse/SPARK-19141]. > Cons: > * To fully remove ML Attributes would be a major refactor of MLlib and would > most likely require breaking changes. > * A partial removal of ML attributes (eg: ensure ML attributes are not used > during transform, only during fit) might be tricky. This would require > testing or another enforcement mechanism to prevent regressions. > 3) Require Vector columns to have fixed length vectors. Most mllib > transformers that produce vectors already include the size of the vector in > the column metadata. This change would be to deprecate APIs that allow > creating a vector column of unknown length and replace those APIs with > equivalents that enforce a fixed size. > Pros: > * We already treat vectors as fixed size, for example VectorAssembler assumes > the inputs & output col are fixed size vectors and creates metadata > accordingly. In the spirit of explicit is better than implicit, we would be > codifying something we already assume. > * This could potentially enable performance optimizations that are only > possible if the Vector size of a column is fixed & known. > Cons: > * This would require breaking changes. 
-- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
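The "metadata is an optimization" principle stated above is directly testable: a transform given column metadata must produce the same result as one that infers sizes from the data. A hedged sketch of that invariant with a toy assembler (plain Python; `assemble` and the dict-based "rows" are illustrative stand-ins, not Spark's VectorAssembler):

```python
def assemble(rows, cols, metadata=None):
    """Concatenate vector columns; sizes come from metadata when present,
    otherwise from the first row (toy stand-in for schema inference)."""
    if metadata is not None:
        sizes = {c: metadata[c]["size"] for c in cols}
    else:
        sizes = {c: len(rows[0][c]) for c in cols}
    out = []
    for row in rows:
        vec = []
        for c in cols:
            assert len(row[c]) == sizes[c], f"ragged column {c}"
            vec.extend(row[c])
        out.append(vec)
    return out

rows = [{"a": [1.0], "b": [2.0, 3.0]}, {"a": [4.0], "b": [5.0, 6.0]}]
meta = {"a": {"size": 1}, "b": {"size": 2}}

# Same answer with and without metadata -- its absence must never change results.
assert assemble(rows, ["a", "b"], meta) == assemble(rows, ["a", "b"])
print(assemble(rows, ["a", "b"]))  # [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
```

If downstream transformers honor this property, metadata can be dropped (as in streaming) without changing any prediction, which is the essence of the comment above.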
[jira] [Updated] (SPARK-22346) Update VectorAssembler to work with StreamingDataframes
[ https://issues.apache.org/jira/browse/SPARK-22346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bago Amirbekian updated SPARK-22346: Description: The issue In batch mode, VectorAssembler can take multiple columns of VectorType and assemble and output a new column of VectorType containing the concatenated vectors. In streaming mode, this transformation can fail because VectorAssembler does not have enough information to produce metadata (AttributeGroup) for the new column. Because VectorAssembler is such a ubiquitous part of mllib pipelines, this issue effectively means Spark structured streaming does not support prediction using mllib pipelines. I've created this ticket so we can discuss ways to potentially improve VectorAssembler. Please let me know if there are any issues I have not considered or potential fixes I haven't outlined. I'm happy to submit a patch once I know which strategy is the best approach. Potential fixes 1) Replace VectorAssembler with an estimator/model pair, as was recently done with OneHotEncoder, [SPARK-13030|https://issues.apache.org/jira/browse/SPARK-13030]. The Estimator can "learn" the size of the input vectors during training and save it to use during prediction. Pros: * Possibly simplest of the potential fixes Cons: * We'll need to deprecate current VectorAssembler 2) Drop the metadata (ML Attributes) from Vector columns. This is a pretty major change, but it could be done in stages. We could first ensure that metadata is not used during prediction and allow the VectorAssembler to drop metadata for streaming dataframes. Going forward, it would be important to not use any metadata on Vector columns for any prediction tasks. Pros: * Potentially an easy short-term fix for VectorAssembler (drop metadata for vector columns in streaming). * Current Attributes implementation is also causing other issues, eg [SPARK-19141|https://issues.apache.org/jira/browse/SPARK-19141].
Cons: * To fully remove ML Attributes would be a major refactor of MLlib and would most likely require breaking changes. * A partial removal of ML attributes (eg: ensure ML attributes are not used during transform, only during fit) might be tricky. This would require testing or other enforcement mechanisms to prevent regressions. 3) Require Vector columns to have fixed-length vectors. Most mllib transformers that produce vectors already include the size of the vector in the column metadata. This change would be to deprecate APIs that allow creating a vector column of unknown length and replace those APIs with equivalents that enforce a fixed size. Pros: * We already treat vectors as fixed size; for example, VectorAssembler assumes the input & output cols are fixed-size vectors and creates metadata accordingly. In the spirit of explicit is better than implicit, we would be codifying something we already assume. * This could potentially enable performance optimizations that are only possible if the Vector size of a column is fixed & known. Cons: * This would require breaking changes. was: The issue In batch mode, VectorAssembler can take multiple columns of VectorType and assemble and output a new column of VectorType containing the concatenated vectors. In streaming mode, this transformation can fail because VectorAssembler does not have enough information to produce metadata (AttributeGroup) for the new column. Because VectorAssembler is such a ubiquitous part of mllib pipelines, this issue effectively means Spark structured streaming does not support prediction using mllib pipelines. I've created this ticket so we can discuss ways to potentially improve VectorAssembler. Please let me know if there are any issues I have not considered or potential fixes I haven't outlined. I'm happy to submit a patch once I know which strategy is the best approach.
Potential fixes 1) Replace VectorAssembler with an estimator/model pair, as was recently done with OneHotEncoder, [SPARK-13030|https://issues.apache.org/jira/browse/SPARK-13030]. The Estimator can "learn" the size of the input vectors during training and save it to use during prediction. Pros: * Possibly simplest of the potential fixes Cons: * We'll need to deprecate current VectorAssembler 2) Drop the metadata (ML Attributes) from Vector columns. This is a pretty major change, but it could be done in stages. We could first ensure that metadata is not used during prediction and allow the VectorAssembler to drop metadata for streaming dataframes. Going forward, it would be important to not use any metadata on Vector columns for any prediction tasks. Pros: * Potentially an easy short-term fix for VectorAssembler * Current Attributes implementation is also causing other issues, eg [SPARK-19141|https://issues.apache.org/jira/browse/SPARK-19141]. Cons: * To fully remove ML Attributes would be a major refactor of MLlib and would most likely require breaking changes.
[jira] [Updated] (SPARK-21926) Compatibility between ML Transformers and Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-21926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bago Amirbekian updated SPARK-21926: Description: We've run into a few cases where ML components don't play nice with streaming dataframes (for prediction). This ticket is meant to help aggregate these known cases in one place and provide a place to discuss possible fixes. Failing cases: 1) VectorAssembler where one of the inputs is a VectorUDT column with no metadata. Possible fixes: More details here SPARK-22346. 2) OneHotEncoder where the input is a column with no metadata. Possible fixes: a) Make OneHotEncoder an estimator (SPARK-13030). b) Allow user to set the cardinality of OneHotEncoder. was: We've run into a few cases where ML components don't play nice with streaming dataframes (for prediction). This ticket is meant to help aggregate these known cases in one place and provide a place to discuss possible fixes. Failing cases: 1) VectorAssembler where one of the inputs is a VectorUDT column with no metadata. Possible fixes: More details here [Spark-22346|https://issues.apache.org/jira/browse/SPARK-22346]. 2) OneHotEncoder where the input is a column with no metadata. Possible fixes: a) Make OneHotEncoder an estimator (SPARK-13030). b) Allow user to set the cardinality of OneHotEncoder. > Compatibility between ML Transformers and Structured Streaming > -- > > Key: SPARK-21926 > URL: https://issues.apache.org/jira/browse/SPARK-21926 > Project: Spark > Issue Type: Umbrella > Components: ML, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Bago Amirbekian > > We've run into a few cases where ML components don't play nice with streaming > dataframes (for prediction). This ticket is meant to help aggregate these > known cases in one place and provide a place to discuss possible fixes. > Failing cases: > 1) VectorAssembler where one of the inputs is a VectorUDT column with no > metadata. > Possible fixes: > More details here SPARK-22346. 
> 2) OneHotEncoder where the input is a column with no metadata. > Possible fixes: > a) Make OneHotEncoder an estimator (SPARK-13030). > b) Allow user to set the cardinality of OneHotEncoder.
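Fix (b) for OneHotEncoder, letting the user supply the cardinality explicitly so the transform needs no column metadata, can be sketched in plain Python. The function name and signature are illustrative, not the Spark API; only the `dropLast`-style behavior (Spark's OneHotEncoder drops the last category by default) is mirrored here.

```python
def one_hot(index, cardinality, drop_last=True):
    """Encode `index` as a one-hot list, given a user-supplied cardinality.
    With drop_last=True the vector has length cardinality - 1 and the last
    category is encoded as all zeros, mirroring Spark's default."""
    size = cardinality - 1 if drop_last else cardinality
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec

assert one_hot(0, 3) == [1.0, 0.0]
assert one_hot(2, 3) == [0.0, 0.0]  # last category: all zeros
assert one_hot(2, 3, drop_last=False) == [0.0, 0.0, 1.0]
```

Because the cardinality is a parameter rather than something read from metadata, the encoding works identically on batch and streaming input.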
[jira] [Updated] (SPARK-21926) Compatibility between ML Transformers and Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-21926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bago Amirbekian updated SPARK-21926: Description: We've run into a few cases where ML components don't play nice with streaming dataframes (for prediction). This ticket is meant to help aggregate these known cases in one place and provide a place to discuss possible fixes. Failing cases: 1) VectorAssembler where one of the inputs is a VectorUDT column with no metadata. Possible fixes: More details here SPARK-22346. 2) OneHotEncoder where the input is a column with no metadata. Possible fixes: a) Make OneHotEncoder an estimator (SPARK-13030). -b) Allow user to set the cardinality of OneHotEncoder.- was: We've run into a few cases where ML components don't play nice with streaming dataframes (for prediction). This ticket is meant to help aggregate these known cases in one place and provide a place to discuss possible fixes. Failing cases: 1) VectorAssembler where one of the inputs is a VectorUDT column with no metadata. Possible fixes: More details here SPARK-22346. 2) OneHotEncoder where the input is a column with no metadata. Possible fixes: a) Make OneHotEncoder an estimator (SPARK-13030). b) Allow user to set the cardinality of OneHotEncoder. > Compatibility between ML Transformers and Structured Streaming > -- > > Key: SPARK-21926 > URL: https://issues.apache.org/jira/browse/SPARK-21926 > Project: Spark > Issue Type: Umbrella > Components: ML, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Bago Amirbekian > > We've run into a few cases where ML components don't play nice with streaming > dataframes (for prediction). This ticket is meant to help aggregate these > known cases in one place and provide a place to discuss possible fixes. > Failing cases: > 1) VectorAssembler where one of the inputs is a VectorUDT column with no > metadata. > Possible fixes: > More details here SPARK-22346. 
> 2) OneHotEncoder where the input is a column with no metadata. > Possible fixes: > a) Make OneHotEncoder an estimator (SPARK-13030). > -b) Allow user to set the cardinality of OneHotEncoder.-
[jira] [Updated] (SPARK-21926) Compatibility between ML Transformers and Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-21926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bago Amirbekian updated SPARK-21926: Description: We've run into a few cases where ML components don't play nice with streaming dataframes (for prediction). This ticket is meant to help aggregate these known cases in one place and provide a place to discuss possible fixes. Failing cases: 1) VectorAssembler where one of the inputs is a VectorUDT column with no metadata. Possible fixes: More details here Spark-22346. 2) OneHotEncoder where the input is a column with no metadata. Possible fixes: a) Make OneHotEncoder an estimator (SPARK-13030). b) Allow user to set the cardinality of OneHotEncoder. was: We've run into a few cases where ML components don't play nice with streaming dataframes (for prediction). This ticket is meant to help aggregate these known cases in one place and provide a place to discuss possible fixes. Failing cases: 1) VectorAssembler where one of the inputs is a VectorUDT column with no metadata. Possible fixes: I've created a jira to track this 2) OneHotEncoder where the input is a column with no metadata. Possible fixes: a) Make OneHotEncoder an estimator (SPARK-13030). b) Allow user to set the cardinality of OneHotEncoder. > Compatibility between ML Transformers and Structured Streaming > -- > > Key: SPARK-21926 > URL: https://issues.apache.org/jira/browse/SPARK-21926 > Project: Spark > Issue Type: Umbrella > Components: ML, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Bago Amirbekian > > We've run into a few cases where ML components don't play nice with streaming > dataframes (for prediction). This ticket is meant to help aggregate these > known cases in one place and provide a place to discuss possible fixes. > Failing cases: > 1) VectorAssembler where one of the inputs is a VectorUDT column with no > metadata. > Possible fixes: > More details here Spark-22346. 
> 2) OneHotEncoder where the input is a column with no metadata. > Possible fixes: > a) Make OneHotEncoder an estimator (SPARK-13030). > b) Allow user to set the cardinality of OneHotEncoder.
[jira] [Updated] (SPARK-21926) Compatibility between ML Transformers and Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-21926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bago Amirbekian updated SPARK-21926: Description: We've run into a few cases where ML components don't play nice with streaming dataframes (for prediction). This ticket is meant to help aggregate these known cases in one place and provide a place to discuss possible fixes. Failing cases: 1) VectorAssembler where one of the inputs is a VectorUDT column with no metadata. Possible fixes: I've created a jira to track this 2) OneHotEncoder where the input is a column with no metadata. Possible fixes: a) Make OneHotEncoder an estimator (SPARK-13030). b) Allow user to set the cardinality of OneHotEncoder. was: We've run into a few cases where ML components don't play nice with streaming dataframes (for prediction). This ticket is meant to help aggregate these known cases in one place and provide a place to discuss possible fixes. Failing cases: 1) VectorAssembler where one of the inputs is a VectorUDT column with no metadata. Possible fixes: a) Re-design vectorUDT metadata to support missing metadata for some elements. (This might be a good thing to do anyways SPARK-19141) b) drop metadata in streaming context. 2) OneHotEncoder where the input is a column with no metadata. Possible fixes: a) Make OneHotEncoder an estimator (SPARK-13030). b) Allow user to set the cardinality of OneHotEncoder. > Compatibility between ML Transformers and Structured Streaming > -- > > Key: SPARK-21926 > URL: https://issues.apache.org/jira/browse/SPARK-21926 > Project: Spark > Issue Type: Umbrella > Components: ML, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Bago Amirbekian > > We've run into a few cases where ML components don't play nice with streaming > dataframes (for prediction). This ticket is meant to help aggregate these > known cases in one place and provide a place to discuss possible fixes. 
> Failing cases: > 1) VectorAssembler where one of the inputs is a VectorUDT column with no > metadata. > Possible fixes: > I've created a jira to track this > 2) OneHotEncoder where the input is a column with no metadata. > Possible fixes: > a) Make OneHotEncoder an estimator (SPARK-13030). > b) Allow user to set the cardinality of OneHotEncoder.
[jira] [Updated] (SPARK-21926) Compatibility between ML Transformers and Structured Streaming
[ https://issues.apache.org/jira/browse/SPARK-21926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bago Amirbekian updated SPARK-21926: Description: We've run into a few cases where ML components don't play nice with streaming dataframes (for prediction). This ticket is meant to help aggregate these known cases in one place and provide a place to discuss possible fixes. Failing cases: 1) VectorAssembler where one of the inputs is a VectorUDT column with no metadata. Possible fixes: More details here [Spark-22346|https://issues.apache.org/jira/browse/SPARK-22346]. 2) OneHotEncoder where the input is a column with no metadata. Possible fixes: a) Make OneHotEncoder an estimator (SPARK-13030). b) Allow user to set the cardinality of OneHotEncoder. was: We've run into a few cases where ML components don't play nice with streaming dataframes (for prediction). This ticket is meant to help aggregate these known cases in one place and provide a place to discuss possible fixes. Failing cases: 1) VectorAssembler where one of the inputs is a VectorUDT column with no metadata. Possible fixes: More details here Spark-22346. 2) OneHotEncoder where the input is a column with no metadata. Possible fixes: a) Make OneHotEncoder an estimator (SPARK-13030). b) Allow user to set the cardinality of OneHotEncoder. > Compatibility between ML Transformers and Structured Streaming > -- > > Key: SPARK-21926 > URL: https://issues.apache.org/jira/browse/SPARK-21926 > Project: Spark > Issue Type: Umbrella > Components: ML, Structured Streaming >Affects Versions: 2.2.0 >Reporter: Bago Amirbekian > > We've run into a few cases where ML components don't play nice with streaming > dataframes (for prediction). This ticket is meant to help aggregate these > known cases in one place and provide a place to discuss possible fixes. > Failing cases: > 1) VectorAssembler where one of the inputs is a VectorUDT column with no > metadata. 
> Possible fixes: > More details here > [Spark-22346|https://issues.apache.org/jira/browse/SPARK-22346]. > 2) OneHotEncoder where the input is a column with no metadata. > Possible fixes: > a) Make OneHotEncoder an estimator (SPARK-13030). > b) Allow user to set the cardinality of OneHotEncoder.
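Fix (3) from SPARK-22346, requiring every vector column to carry a fixed size, amounts to making concatenation metadata-driven. A plain-Python sketch (illustrative names only, not the Spark API) of what VectorAssembler would do if each input column declared its size up front:

```python
def concat_fixed(columns):
    """columns: list of (declared_size, vector) pairs, standing in for
    vector columns whose size is carried as explicit metadata.
    Validates each vector against its declared size and returns
    (output_size, concatenated_vector) without inspecting any data
    beyond the declared sizes -- the streaming-safe property."""
    out = []
    for size, vec in columns:
        if len(vec) != size:
            raise ValueError(f"expected length {size}, got {len(vec)}")
        out.extend(vec)
    # The output size is computable from metadata alone.
    return sum(s for s, _ in columns), out

size, vec = concat_fixed([(2, [1.0, 2.0]), (1, [3.0])])
assert (size, vec) == (3, [1.0, 2.0, 3.0])
```

Note that `sum(s for s, _ in columns)` depends only on the declared sizes, which is why a fixed-size requirement would let the output AttributeGroup be built before any streaming data arrives.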
[jira] [Updated] (SPARK-22346) Update VectorAssembler to work with StreamingDataframes
[ https://issues.apache.org/jira/browse/SPARK-22346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bago Amirbekian updated SPARK-22346: Description: The issue In batch mode, VectorAssembler can take multiple columns of VectorType and assemble and output a new column of VectorType containing the concatenated vectors. In streaming mode, this transformation can fail because VectorAssembler does not have enough information to produce metadata (AttributeGroup) for the new column. Because VectorAssembler is such a ubiquitous part of mllib pipelines, this issue effectively means Spark structured streaming does not support prediction using mllib pipelines. I've created this ticket so we can discuss ways to potentially improve VectorAssembler. Please let me know if there are any issues I have not considered or potential fixes I haven't outlined. I'm happy to submit a patch once I know which strategy is the best approach. Potential fixes 1) Replace VectorAssembler with an estimator/model pair, as was recently done with OneHotEncoder, [SPARK-13030|https://issues.apache.org/jira/browse/SPARK-13030]. The Estimator can "learn" the size of the input vectors during training and save it to use during prediction. Pros: * Possibly simplest of the potential fixes Cons: * We'll need to deprecate current VectorAssembler 2) Drop the metadata (ML Attributes) from Vector columns. This is a pretty major change, but it could be done in stages. We could first ensure that metadata is not used during prediction and allow the VectorAssembler to drop metadata for streaming dataframes. Going forward, it would be important to not use any metadata on Vector columns for any prediction tasks. Pros: * Potentially an easy short-term fix for VectorAssembler * Current Attributes implementation is also causing other issues, eg [SPARK-19141|https://issues.apache.org/jira/browse/SPARK-19141].
Cons: * To fully remove ML Attributes would be a major refactor of MLlib and would most likely require breaking changes. * A partial removal of ML attributes (eg: ensure ML attributes are not used during transform, only during fit) might be tricky. This would require testing or other enforcement mechanisms to prevent regressions. 3) Require Vector columns to have fixed-length vectors. Most mllib transformers that produce vectors already include the size of the vector in the column metadata. This change would be to deprecate APIs that allow creating a vector column of unknown length and replace those APIs with equivalents that enforce a fixed size. Pros: * We already treat vectors as fixed size; for example, VectorAssembler assumes the input & output cols are fixed-size vectors and creates metadata accordingly. In the spirit of explicit is better than implicit, we would be codifying something we already assume. * This could potentially enable performance optimizations that are only possible if the Vector size of a column is fixed & known. Cons: * This would require breaking changes. was: The issue In batch mode, VectorAssembler can take multiple columns of VectorType and assemble and output a new column of VectorType containing the concatenated vectors. In streaming mode, this transformation can fail because VectorAssembler does not have enough information to produce metadata (AttributeGroup) for the new column. Because VectorAssembler is such a ubiquitous part of mllib pipelines, this issue effectively means Spark structured streaming does not support prediction using mllib pipelines. I've created this ticket so we can discuss ways to potentially improve VectorAssembler. Please let me know if there are any issues I have not considered or potential fixes I haven't outlined. I'm happy to submit a patch once I know which strategy is the best approach.
Potential fixes 1) Replace VectorAssembler with an estimator/model pair, as was recently done with OneHotEncoder, [#13030]. The Estimator can "learn" the size of the input vectors during training and save it to use during prediction. Pros: * Possibly simplest of the potential fixes Cons: * We'll need to deprecate current VectorAssembler 2) Drop the metadata (ML Attributes) from Vector columns. This is a pretty major change, but it could be done in stages. We could first ensure that metadata is not used during prediction and allow the VectorAssembler to drop metadata for streaming dataframes. Going forward, it would be important to not use any metadata on Vector columns for any prediction tasks. Pros: * Potentially an easy short-term fix for VectorAssembler * Current Attributes implementation is also causing other issues, eg [SPARK-19141|https://issues.apache.org/jira/browse/SPARK-19141]. Cons: * To fully remove ML Attributes would be a major refactor of MLlib and would most likely require breaking changes. * A partial removal of ML attributes (eg: ensure ML attributes are not used during transform, only during fit) might be tricky.
[jira] [Updated] (SPARK-22346) Update VectorAssembler to work with StreamingDataframes
[ https://issues.apache.org/jira/browse/SPARK-22346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bago Amirbekian updated SPARK-22346: Description: The issue In batch mode, VectorAssembler can take multiple columns of VectorType and assemble and output a new column of VectorType containing the concatenated vectors. In streaming mode, this transformation can fail because VectorAssembler does not have enough information to produce metadata (AttributeGroup) for the new column. Because VectorAssembler is such a ubiquitous part of mllib pipelines, this issue effectively means Spark structured streaming does not support prediction using mllib pipelines. I've created this ticket so we can discuss ways to potentially improve VectorAssembler. Please let me know if there are any issues I have not considered or potential fixes I haven't outlined. I'm happy to submit a patch once I know which strategy is the best approach. Potential fixes 1) Replace VectorAssembler with an estimator/model pair, as was recently done with OneHotEncoder, [#13030]. The Estimator can "learn" the size of the input vectors during training and save it to use during prediction. Pros: * Possibly simplest of the potential fixes Cons: * We'll need to deprecate current VectorAssembler 2) Drop the metadata (ML Attributes) from Vector columns. This is a pretty major change, but it could be done in stages. We could first ensure that metadata is not used during prediction and allow the VectorAssembler to drop metadata for streaming dataframes. Going forward, it would be important to not use any metadata on Vector columns for any prediction tasks. Pros: * Potentially an easy short-term fix for VectorAssembler * Current Attributes implementation is also causing other issues, eg [SPARK-19141|https://issues.apache.org/jira/browse/SPARK-19141]. Cons: * To fully remove ML Attributes would be a major refactor of MLlib and would most likely require breaking changes.
* A partial removal of ML attributes (eg: ensure ML attributes are not used during transform, only during fit) might be tricky. This would require testing or other enforcement mechanisms to prevent regressions. 3) Require Vector columns to have fixed-length vectors. Most mllib transformers that produce vectors already include the size of the vector in the column metadata. This change would be to deprecate APIs that allow creating a vector column of unknown length and replace those APIs with equivalents that enforce a fixed size. Pros: * We already treat vectors as fixed size; for example, VectorAssembler assumes the input & output cols are fixed-size vectors and creates metadata accordingly. In the spirit of explicit is better than implicit, we would be codifying something we already assume. * This could potentially enable performance optimizations that are only possible if the Vector size of a column is fixed & known. Cons: * This would require breaking changes. was: The issue In batch mode, VectorAssembler can take multiple columns of VectorType and assemble and output a new column of VectorType containing the concatenated vectors. In streaming mode, this transformation can fail because VectorAssembler does not have enough information to produce metadata (AttributeGroup) for the new column. Because VectorAssembler is such a ubiquitous part of mllib pipelines, this issue effectively means Spark structured streaming does not support prediction using mllib pipelines. I've created this ticket so we can discuss ways to potentially improve VectorAssembler. Please let me know if there are any issues I have not considered or potential fixes I haven't outlined. I'm happy to submit a patch once I know which strategy is the best approach. Potential fixes 1) Replace VectorAssembler with an estimator/model pair, as was recently done with OneHotEncoder, [#13030]. The Estimator can "learn" the size of the input vectors during training and save it to use during prediction.
Pros: * Possibly simplest of the potential fixes Cons: * We'll need to deprecate current VectorAssembler 2) Drop the metadata (ML Attributes) from Vector columns. This is a pretty major change, but it could be done in stages. We could first ensure that metadata is not used during prediction and allow the VectorAssembler to drop metadata for streaming dataframes. Going forward, it would be important to not use any metadata on Vector columns for any prediction tasks. Pros: * Potentially an easy short-term fix for VectorAssembler * Current Attributes implementation is also causing other issues, eg [#SPARK-19141]. Cons: * To fully remove ML Attributes would be a major refactor of MLlib and would most likely require breaking changes. * A partial removal of ML attributes (eg: ensure ML attributes are not used during transform, only during fit) might be tricky. This would require testing or other enforcement mechanisms to prevent regressions.
[jira] [Updated] (SPARK-22346) Update VectorAssembler to work with StreamingDataframes
[ https://issues.apache.org/jira/browse/SPARK-22346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bago Amirbekian updated SPARK-22346: Description: The issue In batch mode, VectorAssembler can take multiple columns of VectorType and assemble and output a new column of VectorType containing the concatenated vectors. In streaming mode, this transformation can fail because VectorAssembler does not have enough information to produce metadata (AttributeGroup) for the new column. Because VectorAssembler is such a ubiquitous part of mllib pipelines, this issue effectively means Spark structured streaming does not support prediction using mllib pipelines. I've created this ticket so we can discuss ways to potentially improve VectorAssembler. Please let me know if there are any issues I have not considered or potential fixes I haven't outlined. I'm happy to submit a patch once I know which strategy is the best approach. Potential fixes 1) Replace VectorAssembler with an estimator/model pair, as was recently done with OneHotEncoder, [#13030]. The Estimator can "learn" the size of the input vectors during training and save it to use during prediction. Pros: * Possibly simplest of the potential fixes Cons: * We'll need to deprecate current VectorAssembler 2) Drop the metadata (ML Attributes) from Vector columns. This is a pretty major change, but it could be done in stages. We could first ensure that metadata is not used during prediction and allow the VectorAssembler to drop metadata for streaming dataframes. Going forward, it would be important to not use any metadata on Vector columns for any prediction tasks. Pros: * Potentially an easy short-term fix for VectorAssembler * Current Attributes implementation is also causing other issues, eg [#SPARK-19141]. Cons: * To fully remove ML Attributes would be a major refactor of MLlib and would most likely require breaking changes.
* A partial removal of ML attributes (eg: ensure ML attributes are not used during transform, only during fit) might be tricky. This would require testing or other enforcement mechanisms to prevent regressions. 3) Require Vector columns to have fixed-length vectors. Most mllib transformers that produce vectors already include the size of the vector in the column metadata. This change would be to deprecate APIs that allow creating a vector column of unknown length and replace those APIs with equivalents that enforce a fixed size. Pros: * We already treat vectors as fixed size; for example, VectorAssembler assumes the input & output cols are fixed-size vectors and creates metadata accordingly. In the spirit of explicit is better than implicit, we would be codifying something we already assume. * This could potentially enable performance optimizations that are only possible if the Vector size of a column is fixed & known. Cons: * This would require breaking changes. was: The issue In batch mode, VectorAssembler can take multiple columns of VectorType and assemble and output a new column of VectorType containing the concatenated vectors. In streaming mode, this transformation can fail because VectorAssembler does not have enough information to produce metadata (AttributeGroup) for the new column. Because VectorAssembler is such a ubiquitous part of mllib pipelines, this issue effectively means Spark structured streaming does not support prediction using mllib pipelines. I've created this ticket so we can discuss ways to potentially improve VectorAssembler. Please let me know if there are any issues I have not considered or potential fixes I haven't outlined. I'm happy to submit a patch once I know which strategy is the best approach. Potential fixes 1) Replace VectorAssembler with an estimator/model pair, as was recently done with OneHotEncoder, [#13030]. The Estimator can "learn" the size of the input vectors during training and save it to use during prediction.
Pros: * Possibly simplest of the potential fixes Cons: * We'll need to deprecate current VectorAssembler 2) Drop the metadata (ML Attributes) from Vector columns. This is a pretty major change, but it could be done in stages. We could first ensure that metadata is not used during prediction and allow the VectorAssembler to drop metadata for streaming dataframes. Going forward, it would be important to not use any metadata on Vector columns for any prediction tasks. Pros: * Potentially an easy short-term fix for VectorAssembler * Current Attributes implementation is also causing other issues. Cons: * To fully remove ML Attributes would be a major refactor of MLlib and would most likely require breaking changes. * A partial removal of ML attributes (eg: ensure ML attributes are not used during transform, only during fit) might be tricky. This would require testing or other enforcement mechanisms to prevent regressions. 3) Require Vector columns to have fixed-length vectors.
[jira] [Updated] (SPARK-22346) Update VectorAssembler to work with StreamingDataframes
[ https://issues.apache.org/jira/browse/SPARK-22346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bago Amirbekian updated SPARK-22346:

Description:

The issue:

In batch mode, VectorAssembler can take multiple columns of VectorType and assemble and output a new column of VectorType containing the concatenated vectors. In streaming mode, this transformation can fail because VectorAssembler does not have enough information to produce metadata (AttributeGroup) for the new column. Because VectorAssembler is such a ubiquitous part of MLlib pipelines, this issue effectively means Spark Structured Streaming does not support prediction using MLlib pipelines. I've created this ticket so we can discuss ways to potentially improve VectorAssembler. Please let me know if there are any issues I have not considered or potential fixes I haven't outlined. I'm happy to submit a patch once I know which strategy is the best approach.

Potential fixes:

1) Replace VectorAssembler with an estimator/model pair, as was recently done with OneHotEncoder, [#13030]. The Estimator can "learn" the size of the input vectors during training and save it to use during prediction.

Pros:
* Possibly the simplest of the potential fixes.

Cons:
* We'll need to deprecate the current VectorAssembler.

2) Drop the metadata (ML Attributes) from Vector columns. This is a pretty major change, but it could be done in stages. We could first ensure that metadata is not used during prediction and allow the VectorAssembler to drop metadata for streaming dataframes. Going forward, it would be important to not use any metadata on Vector columns for any prediction tasks.

Pros:
* Potentially an easy short-term fix for VectorAssembler.
* The current Attributes implementation is also causing other issues.

Cons:
* Fully removing ML Attributes would be a major refactor of MLlib and would most likely require breaking changes.
* A partial removal of ML attributes (e.g. ensuring ML attributes are not used during transform, only during fit) might be tricky. This would require testing or other enforcement mechanisms to prevent regressions.

3) Require Vector columns to have fixed-length vectors. Most MLlib transformers that produce vectors already include the size of the vector in the column metadata. This change would be to deprecate APIs that allow creating a vector column of unknown length and replace those APIs with equivalents that enforce a fixed size.

Pros:
* We already treat vectors as fixed size; for example, VectorAssembler assumes the input & output columns are fixed-size vectors and creates metadata accordingly. In the spirit of "explicit is better than implicit", we would be codifying something we already assume.
* This could potentially enable performance optimizations that are only possible if the vector size of a column is fixed & known.

Cons:
* This would require breaking changes.

was:

The issue:

In batch mode, VectorAssembler can take multiple columns of VectorType and assemble and output a new column of VectorType containing the concatenated vectors. In streaming mode, this transformation can fail because VectorAssembler does not have enough information to produce metadata (AttributeGroup) for the new column. Because VectorAssembler is such a ubiquitous part of MLlib pipelines, this issue effectively means Spark Structured Streaming does not support prediction using MLlib pipelines. I've created this ticket so we can discuss ways to potentially improve VectorAssembler. Please let me know if there are any issues I have not considered or potential fixes I haven't outlined. I'm happy to submit a patch once I know which strategy is the best approach.

Potential fixes:

1) Replace VectorAssembler with an estimator/model pair, as was recently done with OneHotEncoder, [#13030]. The Estimator can "learn" the size of the input vectors during training and save it to use during prediction.

Pros:
* Possibly the simplest of the potential fixes.

Cons:
* We'll need to deprecate the current VectorAssembler.

2) Drop the metadata (ML Attributes) from Vector columns. This is a pretty major change, but it could be done in stages. We could first ensure that metadata is not used during prediction and allow the VectorAssembler to drop metadata for streaming dataframes. Going forward, it would be important to not use any metadata on Vector columns for any prediction tasks.

Pros:
* Potentially an easy short-term fix for VectorAssembler.

Cons:
* Fully removing ML Attributes would be a major refactor of MLlib and would most likely require breaking changes.
* A partial removal of ML attributes (e.g. ensuring ML attributes are not used during transform, only during fit) might be tricky. This would require testing or other enforcement mechanisms to prevent regressions.

3) Require Vector columns to have fixed-length vectors. Most MLlib transformers that produce vectors already include the size of the vector in the column metadata.
[jira] [Updated] (SPARK-22346) Update VectorAssembler to work with StreamingDataframes
[ https://issues.apache.org/jira/browse/SPARK-22346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bago Amirbekian updated SPARK-22346:

Description:

The issue:

In batch mode, VectorAssembler can take multiple columns of VectorType and assemble and output a new column of VectorType containing the concatenated vectors. In streaming mode, this transformation can fail because VectorAssembler does not have enough information to produce metadata (AttributeGroup) for the new column. Because VectorAssembler is such a ubiquitous part of MLlib pipelines, this issue effectively means Spark Structured Streaming does not support prediction using MLlib pipelines. I've created this ticket so we can discuss ways to potentially improve VectorAssembler. Please let me know if there are any issues I have not considered or potential fixes I haven't outlined. I'm happy to submit a patch once I know which strategy is the best approach.

Potential fixes:

1) Replace VectorAssembler with an estimator/model pair, as was recently done with OneHotEncoder, [#13030]. The Estimator can "learn" the size of the input vectors during training and save it to use during prediction.

Pros:
* Possibly the simplest of the potential fixes.

Cons:
* We'll need to deprecate the current VectorAssembler.

2) Drop the metadata (ML Attributes) from Vector columns. This is a pretty major change, but it could be done in stages. We could first ensure that metadata is not used during prediction and allow the VectorAssembler to drop metadata for streaming dataframes. Going forward, it would be important to not use any metadata on Vector columns for any prediction tasks.

Pros:
* Potentially an easy short-term fix for VectorAssembler.

Cons:
* Fully removing ML Attributes would be a major refactor of MLlib and would most likely require breaking changes.
* A partial removal of ML attributes (e.g. ensuring ML attributes are not used during transform, only during fit) might be tricky. This would require testing or other enforcement mechanisms to prevent regressions.

3) Require Vector columns to have fixed-length vectors. Most MLlib transformers that produce vectors already include the size of the vector in the column metadata. This change would be to deprecate APIs that allow creating a vector column of unknown length and replace those APIs with equivalents that enforce a fixed size.

Pros:
* We already treat vectors as fixed size; for example, VectorAssembler assumes the input & output columns are fixed-size vectors and creates metadata accordingly. In the spirit of "explicit is better than implicit", we would be codifying something we already assume.
* This could potentially enable performance optimizations that are only possible if the vector size of a column is fixed & known.

Cons:
* This would require breaking changes.

was:

The issue:

In batch mode, VectorAssembler can take multiple columns of VectorType and assemble and output a new column of VectorType containing the concatenated vectors. In streaming mode, this transformation can fail because VectorAssembler does not have enough information to produce metadata (AttributeGroup) for the new column. Because VectorAssembler is such a ubiquitous part of MLlib pipelines, this issue effectively means Spark Structured Streaming does not support prediction using MLlib pipelines. I've created this ticket so we can discuss ways to potentially improve VectorAssembler. Please let me know if there are any issues I have not considered or potential fixes I haven't outlined. I'm happy to submit a patch once I know which strategy is the best approach.

Potential fixes:

1) Replace VectorAssembler with an estimator/model pair, as was recently done with OneHotEncoder, [#13030]. The Estimator can "learn" the size of the input vectors during training and save it to use during prediction.

Pros:
* Possibly the simplest of the potential fixes.

Cons:
* We'll need to deprecate the current VectorAssembler.

2) Drop the metadata (ML Attributes) from Vector columns. This is a pretty major change, but it could be done in stages. We could first ensure that metadata is not used during prediction and allow the VectorAssembler to drop metadata for streaming dataframes. Going forward, it would be important to not use any metadata on Vector columns for any prediction tasks.

Pros:
* Potentially an easy short-term fix for VectorAssembler.

Cons:
* Fully removing ML Attributes would be a major refactor of MLlib and would most likely require breaking changes.
* A partial removal of ML attributes (e.g. ensuring ML attributes are not used during transform, only during fit) might be tricky. This would require testing or other enforcement mechanisms to prevent regressions.

3) Require Vector columns to have fixed-length vectors. Most MLlib transformers that produce vectors already include the size of the vector in the column metadata. This change would be to deprecate APIs that allow creating a vector column of unknown length and replace those APIs with equivalents that enforce a fixed size.
[jira] [Created] (SPARK-22346) Update VectorAssembler to work with StreamingDataframes
Bago Amirbekian created SPARK-22346:
---
Summary: Update VectorAssembler to work with StreamingDataframes
Key: SPARK-22346
URL: https://issues.apache.org/jira/browse/SPARK-22346
Project: Spark
Issue Type: Improvement
Components: ML, Structured Streaming
Affects Versions: 2.2.0
Reporter: Bago Amirbekian
Priority: Critical

The issue:

In batch mode, VectorAssembler can take multiple columns of VectorType and assemble and output a new column of VectorType containing the concatenated vectors. In streaming mode, this transformation can fail because VectorAssembler does not have enough information to produce metadata (AttributeGroup) for the new column. Because VectorAssembler is such a ubiquitous part of MLlib pipelines, this issue effectively means Spark Structured Streaming does not support prediction using MLlib pipelines. I've created this ticket so we can discuss ways to potentially improve VectorAssembler. Please let me know if there are any issues I have not considered or potential fixes I haven't outlined. I'm happy to submit a patch once I know which strategy is the best approach.

Potential fixes:

1) Replace VectorAssembler with an estimator/model pair, as was recently done with OneHotEncoder, [#13030]. The Estimator can "learn" the size of the input vectors during training and save it to use during prediction.

Pros:
* Possibly the simplest of the potential fixes.

Cons:
* We'll need to deprecate the current VectorAssembler.

2) Drop the metadata (ML Attributes) from Vector columns. This is a pretty major change, but it could be done in stages. We could first ensure that metadata is not used during prediction and allow the VectorAssembler to drop metadata for streaming dataframes. Going forward, it would be important to not use any metadata on Vector columns for any prediction tasks.

Pros:
* Potentially an easy short-term fix for VectorAssembler.

Cons:
* Fully removing ML Attributes would be a major refactor of MLlib and would most likely require breaking changes.
* A partial removal of ML attributes (e.g. ensuring ML attributes are not used during transform, only during fit) might be tricky. This would require testing or other enforcement mechanisms to prevent regressions.

3) Require Vector columns to have fixed-length vectors. Most MLlib transformers that produce vectors already include the size of the vector in the column metadata. This change would be to deprecate APIs that allow creating a vector column of unknown length and replace those APIs with equivalents that enforce a fixed size.

Pros:
* We already treat vectors as fixed size; for example, VectorAssembler assumes the input & output columns are fixed-size vectors and creates metadata accordingly. In the spirit of "explicit is better than implicit", we would be codifying something we already assume.
* This could potentially enable performance optimizations that are only possible if the vector size of a column is fixed & known.

Cons:
* This would require breaking changes.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
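The metadata gap described in the ticket can be sketched without Spark. Below, the `assemble_metadata` helper is hypothetical (it is not MLlib API); it mimics only the step where an assembler would build the output column's size metadata: the output size is computable only when every input column's size is already known, which a streaming source cannot promise before any rows have been seen.

```python
from typing import List, Optional

def assemble_metadata(input_sizes: List[Optional[int]]) -> Optional[int]:
    """Mimic of the metadata step: the output column needs a fixed size,
    which is computable only if every input size is known up front."""
    if any(size is None for size in input_sizes):
        return None  # streaming case: an input of unknown length
    return sum(input_sizes)

# Batch: upstream transformers recorded each vector's size in metadata.
assert assemble_metadata([3, 2, 1]) == 6
# Streaming: one unknown input size leaves the output metadata undefined.
assert assemble_metadata([3, None, 1]) is None
```

In these terms, fix (1) amounts to having an Estimator fill in the `None` entries during fit so that transform always sees known sizes, and fix (3) amounts to forbidding `None` at column-creation time.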
[jira] [Updated] (SPARK-22232) Row objects in pyspark created using the `Row(**kwars)` syntax do not get serialized/deserialized properly
[ https://issues.apache.org/jira/browse/SPARK-22232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bago Amirbekian updated SPARK-22232: Component/s: SQL > Row objects in pyspark created using the `Row(**kwars)` syntax do not get > serialized/deserialized properly > -- > > Key: SPARK-22232 > URL: https://issues.apache.org/jira/browse/SPARK-22232 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 2.2.0 >Reporter: Bago Amirbekian > > The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should > be accessed by field name, not by position because {{Row.__new__}} sorts the > fields alphabetically by name. It seems like this promise is not being > honored when these Row objects are shuffled. I've included an example to help > reproduce the issue. > {code:none} > from pyspark.sql.types import * > from pyspark.sql import * > def toRow(i): > return Row(a="a", c=3.0, b=2) > schema = StructType([ > # Putting fields in alphabetical order masks the issue > StructField("a", StringType(), False), > StructField("c", FloatType(), False), > StructField("b", IntegerType(), False), > ]) > rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i)) > # As long as we don't shuffle things work fine. > print rdd.toDF(schema).take(2) > # If we introduce a shuffle we have issues > print rdd.repartition(3).toDF(schema).take(2) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
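The sorting behavior in the quoted report can be illustrated without Spark. The `make_row` helper below is hypothetical: it mimics only the documented alphabetical sort in `Row.__new__`, not pyspark's actual implementation, and then shows why pairing the sorted values positionally with a schema declared in a different order goes wrong.

```python
def make_row(**kwargs):
    # Mimic Row(**kwargs): field names are sorted alphabetically,
    # so the caller's declaration order (a, c, b) is discarded.
    names = sorted(kwargs)
    return names, tuple(kwargs[n] for n in names)

names, values = make_row(a="a", c=3.0, b=2)
assert names == ["a", "b", "c"]
assert values == ("a", 2, 3.0)

# Schema declared in (a, c, b) order, as in the repro. Zipping the
# sorted values positionally against it puts the int 2 where the
# float field "c" sits -- the shape of the reported failure.
schema = [("a", str), ("c", float), ("b", int)]
ok = [isinstance(v, t) for (_, t), v in zip(schema, values)]
assert ok == [True, False, False]
```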
[jira] [Updated] (SPARK-22232) Row objects in pyspark created using the `Row(**kwars)` syntax do not get serialized/deserialized properly
[ https://issues.apache.org/jira/browse/SPARK-22232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bago Amirbekian updated SPARK-22232: Summary: Row objects in pyspark created using the `Row(**kwars)` syntax do not get serialized/deserialized properly (was: Row objects in pyspark using the `Row(**kwars)` syntax do not get serialized/deserialized properly) > Row objects in pyspark created using the `Row(**kwars)` syntax do not get > serialized/deserialized properly > -- > > Key: SPARK-22232 > URL: https://issues.apache.org/jira/browse/SPARK-22232 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Bago Amirbekian > > The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should > be accessed by field name, not by position because {{Row.__new__}} sorts the > fields alphabetically by name. It seems like this promise is not being > honored when these Row objects are shuffled. I've included an example to help > reproduce the issue. > {code:none} > from pyspark.sql.types import * > from pyspark.sql import * > def toRow(i): > return Row(a="a", c=3.0, b=2) > schema = StructType([ > # Putting fields in alphabetical order masks the issue > StructField("a", StringType(), False), > StructField("c", FloatType(), False), > StructField("b", IntegerType(), False), > ]) > rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i)) > # As long as we don't shuffle things work fine. > print rdd.toDF(schema).take(2) > # If we introduce a shuffle we have issues > print rdd.repartition(3).toDF(schema).take(2) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22232) Row objects in pyspark using the `Row(**kwars)` syntax do not get serialized/deserialized properly
[ https://issues.apache.org/jira/browse/SPARK-22232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16198068#comment-16198068 ] Bago Amirbekian commented on SPARK-22232:

Full trace:

{code:none}
[Row(a=u'a', c=3.0, b=2), Row(a=u'a', c=3.0, b=2)]
---
Py4JJavaError Traceback (most recent call last)
 in ()
     17
     18 # If we introduce a shuffle we have issues
---> 19 print rdd.repartition(3).toDF(schema).take(2)

/databricks/spark/python/pyspark/sql/dataframe.pyc in take(self, num)
    475 [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
    476 """
--> 477 return self.limit(num).collect()
    478
    479 @since(1.3)

/databricks/spark/python/pyspark/sql/dataframe.pyc in collect(self)
    437 """
    438 with SCCallSiteSync(self._sc) as css:
--> 439 port = self._jdf.collectToPython()
    440 return list(_load_from_socket(port, BatchedSerializer(PickleSerializer(
    441

/databricks/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1131 answer = self.gateway_client.send_command(command)
   1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
   1134
   1135 for temp_arg in temp_args:

/databricks/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
     61 def deco(*a, **kw):
     62 try:
---> 63 return f(*a, **kw)
     64 except py4j.protocol.Py4JJavaError as e:
     65 s = e.java_exception.toString()

/databricks/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    317 raise Py4JJavaError(
    318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
    320 else:
    321 raise Py4JError(

Py4JJavaError: An error occurred while calling o204.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 161.0 failed 4 times, most recent failure: Lost task 0.3 in stage 161.0 (TID 433, 10.0.195.33, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/worker.py", line 177, in main
    process()
  File "/databricks/spark/python/pyspark/worker.py", line 172, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/databricks/spark/python/pyspark/serializers.py", line 285, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/databricks/spark/python/pyspark/sql/session.py", line 520, in prepare
    verify_func(obj, schema)
  File "/databricks/spark/python/pyspark/sql/types.py", line 1458, in _verify_type
    _verify_type(v, f.dataType, f.nullable)
  File "/databricks/spark/python/pyspark/sql/types.py", line 1422, in _verify_type
    raise TypeError("%s can not accept object %r in type %s" % (dataType, obj, type(obj)))
TypeError: FloatType can not accept object 2 in type <type 'int'>
 at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
 at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
 at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
 at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
 at org.apache.spark.scheduler.Task.r
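The TypeError in the trace falls out of positional pairing: the row's fields are sorted to (a, b, c) with values ("a", 2, 3.0), while the schema declares (a, c, b), so position 1 hands the int 2 to the float field "c". A hedged sketch of how name-based pairing would avoid the mismatch; the `verify_by_name` helper is hypothetical, not pyspark code.

```python
def verify_by_name(row_names, row_values, schema):
    # Pair each schema field with the row value of the SAME NAME,
    # instead of zipping by position.
    by_name = dict(zip(row_names, row_values))
    return [isinstance(by_name[field], typ) for field, typ in schema]

# Sorted row fields and values, as produced by Row(a="a", c=3.0, b=2).
row_names, row_values = ["a", "b", "c"], ("a", 2, 3.0)
# Schema in declaration order (a, c, b), as in the repro.
schema = [("a", str), ("c", float), ("b", int)]

# Name-based lookup matches every field's declared type.
assert verify_by_name(row_names, row_values, schema) == [True, True, True]
```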
[jira] [Updated] (SPARK-22232) Row objects in pyspark using the `Row(**kwars)` syntax do not get serialized/deserialized properly
[ https://issues.apache.org/jira/browse/SPARK-22232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bago Amirbekian updated SPARK-22232: Description: The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should be accessed by field name, not by position because {{Row.__new__}} sorts the fields alphabetically by name. It seems like this promise is not being honored when these Row objects are shuffled. I've included an example to help reproduce the issue. {code:none} from pyspark.sql.types import * from pyspark.sql import * def toRow(i): return Row(a="a", c=3.0, b=2) schema = StructType([ # Putting fields in alphabetical order masks the issue StructField("a", StringType(), False), StructField("c", FloatType(), False), StructField("b", IntegerType(), False), ]) rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i)) # As long as we don't shuffle things work fine. print rdd.toDF(schema).take(2) # If we introduce a shuffle we have issues print rdd.repartition(3).toDF(schema).take(2) {code} was: The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should be accessed by field name, not by position because {{Row.__new__}} sorts the fields alphabetically by name. It seems like this promise is not being honored when these Row objects are shuffled. I've included an example to help reproduce the issue. {code:none} from pyspark.sql.types import * from pyspark.sql import * def toRow(i): return Row(a="a", c=3.0, b=2) schema = StructType([ StructField("a", StringType(), False), StructField("c", FloatType(), False), StructField("b", IntegerType(), False), ]) rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i)) # As long as we don't shuffle things work fine. 
print rdd.toDF(schema).take(2) # If we introduce a shuffle we have issues print rdd.repartition(3).toDF(schema).take(2) {code} > Row objects in pyspark using the `Row(**kwars)` syntax do not get > serialized/deserialized properly > -- > > Key: SPARK-22232 > URL: https://issues.apache.org/jira/browse/SPARK-22232 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Bago Amirbekian > > The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should > be accessed by field name, not by position because {{Row.__new__}} sorts the > fields alphabetically by name. It seems like this promise is not being > honored when these Row objects are shuffled. I've included an example to help > reproduce the issue. > {code:none} > from pyspark.sql.types import * > from pyspark.sql import * > def toRow(i): > return Row(a="a", c=3.0, b=2) > schema = StructType([ > # Putting fields in alphabetical order masks the issue > StructField("a", StringType(), False), > StructField("c", FloatType(), False), > StructField("b", IntegerType(), False), > ]) > rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i)) > # As long as we don't shuffle things work fine. > print rdd.toDF(schema).take(2) > # If we introduce a shuffle we have issues > print rdd.repartition(3).toDF(schema).take(2) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22232) Row objects in pyspark using the `Row(**kwars)` syntax do not get serialized/deserialized properly
[ https://issues.apache.org/jira/browse/SPARK-22232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bago Amirbekian updated SPARK-22232: Description: The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should be accessed by field name, not by position because {{Row.__new__}} sorts the fields alphabetically by name. It seems like this promise is not being honored when these Row objects are shuffled. I've included an example to help reproduce the issue. {code:python} from pyspark.sql.types import * from pyspark.sql import * def toRow(i): return Row(a="a", c=3.0, b=2) schema = StructType([ StructField("a", StringType(), False), StructField("c", FloatType(), False), StructField("b", IntegerType(), False), ]) rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i)) # As long as we don't shuffle things work fine. print rdd.toDF(schema).take(2) # If we introduce a shuffle we have issues print rdd.repartition(3).toDF(schema).take(2) {code} was: The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should be accessed by field name, not by position because {{Row.__new__}} sorts the fields alphabetically by name. It seems like this promise is not being honored when these Row objects are shuffled. I've included an example to help reproduce the issue. {{ from pyspark.sql.types import * from pyspark.sql import * def toRow(i): return Row(a="a", c=3.0, b=2) schema = StructType([ StructField("a", StringType(), False), StructField("c", FloatType(), False), StructField("b", IntegerType(), False), ]) rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i)) # As long as we don't shuffle things work fine. 
print rdd.toDF(schema).take(2) # If we introduce a shuffle we have issues print rdd.repartition(3).toDF(schema).take(2) }} > Row objects in pyspark using the `Row(**kwars)` syntax do not get > serialized/deserialized properly > -- > > Key: SPARK-22232 > URL: https://issues.apache.org/jira/browse/SPARK-22232 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Bago Amirbekian > > The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should > be accessed by field name, not by position because {{Row.__new__}} sorts the > fields alphabetically by name. It seems like this promise is not being > honored when these Row objects are shuffled. I've included an example to help > reproduce the issue. > {code:python} > from pyspark.sql.types import * > from pyspark.sql import * > def toRow(i): > return Row(a="a", c=3.0, b=2) > schema = StructType([ > StructField("a", StringType(), False), > StructField("c", FloatType(), False), > StructField("b", IntegerType(), False), > ]) > rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i)) > # As long as we don't shuffle things work fine. > print rdd.toDF(schema).take(2) > # If we introduce a shuffle we have issues > print rdd.repartition(3).toDF(schema).take(2) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22232) Row objects in pyspark using the `Row(**kwars)` syntax do not get serialized/deserialized properly
[ https://issues.apache.org/jira/browse/SPARK-22232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bago Amirbekian updated SPARK-22232: Description: The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should be accessed by field name, not by position because {{Row.__new__}} sorts the fields alphabetically by name. It seems like this promise is not being honored when these Row objects are shuffled. I've included an example to help reproduce the issue. {code:none} from pyspark.sql.types import * from pyspark.sql import * def toRow(i): return Row(a="a", c=3.0, b=2) schema = StructType([ StructField("a", StringType(), False), StructField("c", FloatType(), False), StructField("b", IntegerType(), False), ]) rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i)) # As long as we don't shuffle things work fine. print rdd.toDF(schema).take(2) # If we introduce a shuffle we have issues print rdd.repartition(3).toDF(schema).take(2) {code} was: The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should be accessed by field name, not by position because {{Row.__new__}} sorts the fields alphabetically by name. It seems like this promise is not being honored when these Row objects are shuffled. I've included an example to help reproduce the issue. {code:python} from pyspark.sql.types import * from pyspark.sql import * def toRow(i): return Row(a="a", c=3.0, b=2) schema = StructType([ StructField("a", StringType(), False), StructField("c", FloatType(), False), StructField("b", IntegerType(), False), ]) rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i)) # As long as we don't shuffle things work fine. 
print rdd.toDF(schema).take(2) # If we introduce a shuffle we have issues print rdd.repartition(3).toDF(schema).take(2) {code} > Row objects in pyspark using the `Row(**kwars)` syntax do not get > serialized/deserialized properly > -- > > Key: SPARK-22232 > URL: https://issues.apache.org/jira/browse/SPARK-22232 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.2.0 >Reporter: Bago Amirbekian > > The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should > be accessed by field name, not by position because {{Row.__new__}} sorts the > fields alphabetically by name. It seems like this promise is not being > honored when these Row objects are shuffled. I've included an example to help > reproduce the issue. > {code:none} > from pyspark.sql.types import * > from pyspark.sql import * > def toRow(i): > return Row(a="a", c=3.0, b=2) > schema = StructType([ > StructField("a", StringType(), False), > StructField("c", FloatType(), False), > StructField("b", IntegerType(), False), > ]) > rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i)) > # As long as we don't shuffle things work fine. > print rdd.toDF(schema).take(2) > # If we introduce a shuffle we have issues > print rdd.repartition(3).toDF(schema).take(2) > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22232) Row objects in pyspark using the `Row(**kwargs)` syntax do not get serialized/deserialized properly

Bago Amirbekian created SPARK-22232:
---------------------------------------

             Summary: Row objects in pyspark using the `Row(**kwargs)` syntax do not get serialized/deserialized properly
                 Key: SPARK-22232
                 URL: https://issues.apache.org/jira/browse/SPARK-22232
             Project: Spark
          Issue Type: Bug
          Components: PySpark
    Affects Versions: 2.2.0
            Reporter: Bago Amirbekian

The fields in a Row object created from a dict (i.e. `Row(**kwargs)`) should be accessed by field name, not by position, because `Row.__new__` sorts the fields alphabetically by name. It seems this promise is not honored when these Row objects are shuffled. I've included an example to help reproduce the issue.

```
from pyspark.sql.types import *
from pyspark.sql import *

def toRow(i):
    return Row(a="a", c=3.0, b=2)

schema = StructType([
    StructField("a", StringType(), False),
    StructField("c", FloatType(), False),
    StructField("b", IntegerType(), False),
])

rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

# As long as we don't shuffle things work fine.
print rdd.toDF(schema).take(2)

# If we introduce a shuffle we have issues
print rdd.repartition(3).toDF(schema).take(2)
```

--
This message was sent by Atlassian JIRA (v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
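[Editor's note] The reproduction above needs a live Spark cluster. The field-ordering mismatch at the core of the ticket can be sketched in plain Python; the following is a simplified model of the described `Row(**kwargs)` behaviour (sorted field names), not pyspark's actual implementation:

```python
# Minimal pure-Python model (an assumed simplification of pyspark.sql.Row):
# Row(**kwargs) stores its values sorted alphabetically by field name, so
# matching values to a schema by *position* only works when the schema
# happens to be in alphabetical order.

def make_row(**kwargs):
    """Mimic Row(**kwargs): field names sorted alphabetically by name."""
    names = sorted(kwargs)
    return names, tuple(kwargs[n] for n in names)

schema = ["a", "c", "b"]                  # declared schema order from the ticket
names, values = make_row(a="a", c=3.0, b=2)

# Accessed by name, everything lines up.
by_name = dict(zip(names, values))
assert by_name == {"a": "a", "b": 2, "c": 3.0}

# Accessed by position against the declared schema, "c" silently receives
# the value of "b" and vice versa -- the mismatch behind this ticket.
by_position = dict(zip(schema, values))
assert by_position["c"] == 2              # expected 3.0
assert by_position["b"] == 3.0            # expected 2
```

This is why a shuffle exposes the bug: a code path that rebuilds rows positionally against the declared schema, rather than by field name, swaps the values of any fields that are not already in alphabetical order.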
[jira] [Commented] (SPARK-21926) Some transformers in spark.ml.feature fail when trying to transform streaming dataframes
[ https://issues.apache.org/jira/browse/SPARK-21926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16193392#comment-16193392 ]

Bago Amirbekian commented on SPARK-21926:
-----------------------------------------

[~mslipper] The trickiest thing about 1 (b) is knowing how to test that it won't change behaviour. I'd like to run this past some folks with more MLlib experience to see if there are any obvious issues with this approach that we haven't considered.

> Some transformers in spark.ml.feature fail when trying to transform streaming
> dataframes
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-21926
>                 URL: https://issues.apache.org/jira/browse/SPARK-21926
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, Structured Streaming
>    Affects Versions: 2.2.0
>            Reporter: Bago Amirbekian
>
> We've run into a few cases where ML components don't play nice with streaming
> dataframes (for prediction). This ticket is meant to help aggregate these
> known cases in one place and provide a place to discuss possible fixes.
> Failing cases:
> 1) VectorAssembler where one of the inputs is a VectorUDT column with no
> metadata.
> Possible fixes:
> a) Re-design VectorUDT metadata to support missing metadata for some
> elements. (This might be a good thing to do anyways, SPARK-19141.)
> b) Drop metadata in the streaming context.
> 2) OneHotEncoder where the input is a column with no metadata.
> Possible fixes:
> a) Make OneHotEncoder an estimator (SPARK-13030).
> b) Allow the user to set the cardinality of OneHotEncoder.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13030) Change OneHotEncoder to Estimator
[ https://issues.apache.org/jira/browse/SPARK-13030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16193389#comment-16193389 ]

Bago Amirbekian commented on SPARK-13030:
-----------------------------------------

Just so I'm clear, does multi-column in this context mean applying one-hot-encoder to each column and then joining the resulting vectors? How do you all feel about giving the new OneHotEncoder the same `handleInvalid` semantics as StringIndexer?

> Change OneHotEncoder to Estimator
> ---------------------------------
>
>                 Key: SPARK-13030
>                 URL: https://issues.apache.org/jira/browse/SPARK-13030
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 1.6.0
>            Reporter: Wojciech Jurczyk
>
> OneHotEncoder should be an Estimator, just like in scikit-learn
> (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).
> In its current form, it is impossible to use when the number of categories is
> different between the training dataset and the test dataset.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
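[Editor's note] The estimator-vs-transformer distinction the ticket asks for can be sketched in a few lines. This is a hypothetical toy (class names and API are invented for illustration, not Spark's or scikit-learn's actual interfaces); it shows why the category set must be learned once at fit time:

```python
# Toy sketch of OneHotEncoder as an estimator. The fit() step learns the
# full category set ("cardinality") from training data, so training and test
# splits encode to vectors of the same width.

class OneHotEncoderModel:
    def __init__(self, categories):
        self.categories = categories          # learned at fit time

    def transform(self, column):
        index = {c: i for i, c in enumerate(self.categories)}
        # An unseen category would raise KeyError here; a handleInvalid
        # param (as in StringIndexer) would decide whether to error or skip.
        width = len(self.categories)
        return [[1.0 if index[v] == i else 0.0 for i in range(width)]
                for v in column]

class OneHotEncoderEstimator:
    def fit(self, column):
        return OneHotEncoderModel(sorted(set(column)))

train_col = ["cat", "dog", "fish", "dog"]
test_col = ["dog", "cat"]                     # fewer distinct values

model = OneHotEncoderEstimator().fit(train_col)
# Same width for both splits; a transformer that inferred cardinality
# per-dataset would emit width-2 vectors for test_col and width-3 for train_col.
assert len(model.transform(train_col)[0]) == 3
assert len(model.transform(test_col)[0]) == 3
assert model.transform(test_col)[0] == [0.0, 1.0, 0.0]   # "dog" -> index 1
```

The same reasoning applies to the streaming case in SPARK-21926: a streaming dataframe has no metadata to infer cardinality from, so the category count has to come from a prior fit.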
[jira] [Created] (SPARK-21926) Some transformers in spark.ml.feature fail when trying to transform streaming dataframes
Bago Amirbekian created SPARK-21926:
---------------------------------------

             Summary: Some transformers in spark.ml.feature fail when trying to transform streaming dataframes
                 Key: SPARK-21926
                 URL: https://issues.apache.org/jira/browse/SPARK-21926
             Project: Spark
          Issue Type: Bug
          Components: ML, Structured Streaming
    Affects Versions: 2.2.0
            Reporter: Bago Amirbekian

We've run into a few cases where ML components don't play nice with streaming dataframes (for prediction). This ticket is meant to help aggregate these known cases in one place and provide a place to discuss possible fixes.

Failing cases:

1) VectorAssembler where one of the inputs is a VectorUDT column with no metadata.
Possible fixes:
a) Re-design VectorUDT metadata to support missing metadata for some elements. (This might be a good thing to do anyways, SPARK-19141.)
b) Drop metadata in the streaming context.

2) OneHotEncoder where the input is a column with no metadata.
Possible fixes:
a) Make OneHotEncoder an estimator (SPARK-13030).
b) Allow the user to set the cardinality of OneHotEncoder.

--
This message was sent by Atlassian JIRA (v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20862) LogisticRegressionModel throws TypeError
Bago Amirbekian created SPARK-20862:
---------------------------------------

             Summary: LogisticRegressionModel throws TypeError
                 Key: SPARK-20862
                 URL: https://issues.apache.org/jira/browse/SPARK-20862
             Project: Spark
          Issue Type: Bug
          Components: MLlib, PySpark
    Affects Versions: 2.1.1
            Reporter: Bago Amirbekian
            Priority: Minor

LogisticRegressionModel throws a TypeError under Python 3 and numpy 1.12.1:

File "/Users/bago/repos/spark/python/pyspark/mllib/classification.py", line 155, in __main__.LogisticRegressionModel
Failed example:
    mcm = LogisticRegressionWithLBFGS.train(data, iterations=10, numClasses=3)
Exception raised:
    Traceback (most recent call last):
      File "/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/doctest.py", line 1330, in __run
        compileflags, 1), test.globs)
      File "", line 1, in
        mcm = LogisticRegressionWithLBFGS.train(data, iterations=10, numClasses=3)
      File "/Users/bago/repos/spark/python/pyspark/mllib/classification.py", line 398, in train
        return _regression_train_wrapper(train, LogisticRegressionModel, data, initialWeights)
      File "/Users/bago/repos/spark/python/pyspark/mllib/regression.py", line 216, in _regression_train_wrapper
        return modelClass(weights, intercept, numFeatures, numClasses)
      File "/Users/bago/repos/spark/python/pyspark/mllib/classification.py", line 176, in __init__
        self._dataWithBiasSize)
    TypeError: 'float' object cannot be interpreted as an integer

--
This message was sent by Atlassian JIRA (v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
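[Editor's note] This error class is characteristic of a Python 2 to 3 division change. A minimal standalone reproduction of the failure mode (an assumption about the cause; the variable names below are illustrative, not the actual MLlib code path):

```python
# In Python 2, `/` on two ints returns an int, so a size computed by
# division could be passed anywhere an integer is required. In Python 3,
# `/` always returns a float, and integer-only consumers raise exactly
# the TypeError seen in the doctest above.

num_features, num_classes = 4, 3
weights_len = (num_features + 1) * (num_classes - 1)

size = weights_len / (num_classes - 1)   # true division: 5.0, a float
try:
    list(range(size))                    # any integer-only consumer
except TypeError as e:
    message = str(e)

assert size == 5.0 and isinstance(size, float)
assert message == "'float' object cannot be interpreted as an integer"

# The conventional fix is floor division, which keeps the value an int:
assert weights_len // (num_classes - 1) == 5
```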
[jira] [Commented] (SPARK-20861) Pyspark CrossValidator & TrainValidationSplit should delegate parameter looping to estimators
[ https://issues.apache.org/jira/browse/SPARK-20861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16022137#comment-16022137 ]

Bago Amirbekian commented on SPARK-20861:
-----------------------------------------

[~josephkb]

> Pyspark CrossValidator & TrainValidationSplit should delegate parameter
> looping to estimators
> -----------------------------------------------------------------------
>
>                 Key: SPARK-20861
>                 URL: https://issues.apache.org/jira/browse/SPARK-20861
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, PySpark
>    Affects Versions: 2.1.1
>            Reporter: Bago Amirbekian
>            Priority: Minor
>
> The CrossValidator & TrainValidationSplit should call estimator.fit with all
> their parameter maps instead of passing params one by one to fit. This
> behaviour would make Python Spark more consistent with Scala Spark and allow
> individual estimators to parallelize or optimize fitting over multiple
> parameter maps.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-20861) Pyspark CrossValidator & TrainValidationSplit should delegate parameter looping to estimators
[ https://issues.apache.org/jira/browse/SPARK-20861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bago Amirbekian updated SPARK-20861:
------------------------------------
    Comment: was deleted

(was: I've made a PR to address this issue: https://github.com/apache/spark/pull/18077.)

> Pyspark CrossValidator & TrainValidationSplit should delegate parameter
> looping to estimators
> -----------------------------------------------------------------------
>
>                 Key: SPARK-20861
>                 URL: https://issues.apache.org/jira/browse/SPARK-20861
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, PySpark
>    Affects Versions: 2.1.1
>            Reporter: Bago Amirbekian
>            Priority: Minor
>
> The CrossValidator & TrainValidationSplit should call estimator.fit with all
> their parameter maps instead of passing params one by one to fit. This
> behaviour would make Python Spark more consistent with Scala Spark and allow
> individual estimators to parallelize or optimize fitting over multiple
> parameter maps.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20861) Pyspark CrossValidator & TrainValidationSplit should delegate parameter looping to estimators
[ https://issues.apache.org/jira/browse/SPARK-20861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16022134#comment-16022134 ]

Bago Amirbekian commented on SPARK-20861:
-----------------------------------------

I've made a PR to address this issue: https://github.com/apache/spark/pull/18077.

> Pyspark CrossValidator & TrainValidationSplit should delegate parameter
> looping to estimators
> -----------------------------------------------------------------------
>
>                 Key: SPARK-20861
>                 URL: https://issues.apache.org/jira/browse/SPARK-20861
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, PySpark
>    Affects Versions: 2.1.1
>            Reporter: Bago Amirbekian
>            Priority: Minor
>
> The CrossValidator & TrainValidationSplit should call estimator.fit with all
> their parameter maps instead of passing params one by one to fit. This
> behaviour would make Python Spark more consistent with Scala Spark and allow
> individual estimators to parallelize or optimize fitting over multiple
> parameter maps.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-20861) Pyspark CrossValidator & TrainValidationSplit should delegate parameter looping to estimators
Bago Amirbekian created SPARK-20861:
---------------------------------------

             Summary: Pyspark CrossValidator & TrainValidationSplit should delegate parameter looping to estimators
                 Key: SPARK-20861
                 URL: https://issues.apache.org/jira/browse/SPARK-20861
             Project: Spark
          Issue Type: Improvement
          Components: ML, PySpark
    Affects Versions: 2.1.1
            Reporter: Bago Amirbekian
            Priority: Minor

The CrossValidator & TrainValidationSplit should call estimator.fit with all their parameter maps instead of passing params one by one to fit. This behaviour would make Python Spark more consistent with Scala Spark and allow individual estimators to parallelize or optimize fitting over multiple parameter maps.

--
This message was sent by Atlassian JIRA (v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
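[Editor's note] The delegation the ticket proposes can be sketched with a toy estimator (hypothetical API, invented names; not Spark's actual implementation): the validator makes one fit() call with the whole list of param maps, and the estimator owns the loop.

```python
# Toy sketch: fit() accepts either no param maps (one model) or a list of
# param maps (one model per map). Because the estimator controls the loop,
# it could parallelize it or share preprocessing across settings -- the
# optimization opportunity the ticket describes.

class ToyEstimator:
    def _fit_one(self, dataset, params):
        scale = params.get("scale", 1)
        return sum(dataset) * scale          # stand-in for a trained model

    def fit(self, dataset, param_maps=None):
        if param_maps is None:
            return self._fit_one(dataset, {})
        # One call, many settings: the estimator owns the loop.
        return [self._fit_one(dataset, p) for p in param_maps]

data = [1, 2, 3]
grid = [{"scale": 1}, {"scale": 10}]

models = ToyEstimator().fit(data, grid)
assert models == [6, 60]
```

Under the pre-change behaviour, the validator would instead call `fit(data, p)` once per map in its own loop, leaving the estimator no chance to batch or parallelize across the grid.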
[jira] [Commented] (SPARK-20040) Python API for ml.stat.ChiSquareTest
[ https://issues.apache.org/jira/browse/SPARK-20040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15936902#comment-15936902 ]

Bago Amirbekian commented on SPARK-20040:
-----------------------------------------

I'd like to work on this.

> Python API for ml.stat.ChiSquareTest
> ------------------------------------
>
>                 Key: SPARK-20040
>                 URL: https://issues.apache.org/jira/browse/SPARK-20040
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, PySpark
>    Affects Versions: 2.2.0
>            Reporter: Joseph K. Bradley
>
> Add PySpark wrapper for ChiSquareTest. Note that it's currently called
> ChiSquare, but I'm about to rename it to ChiSquareTest in [SPARK-20039]

--
This message was sent by Atlassian JIRA (v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
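[Editor's note] For context on what the wrapper exposes, the statistic itself is Pearson's chi-squared test of independence. A plain-Python sketch of the computation (an illustration of the math, not the pyspark.ml.stat API):

```python
# Pearson chi-squared statistic for a contingency table of feature value
# (rows) vs. label (columns): sum over cells of (observed - expected)^2 /
# expected, where expected assumes row and column independence.

def chi_square_statistic(table):
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

table = [[10, 20],
         [30, 40]]
stat = chi_square_statistic(table)
# Expected counts are [[12, 18], [28, 42]], giving
# 4/12 + 4/18 + 4/28 + 4/42 ~= 0.7937.
assert abs(stat - 0.79365) < 1e-4
```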