[jira] [Created] (SPARK-29692) SparkContext.defaultParallelism should reflect resource limits when resource limits are set

2019-10-31 Thread Bago Amirbekian (Jira)
Bago Amirbekian created SPARK-29692:
---

 Summary: SparkContext.defaultParallelism should reflect resource 
limits when resource limits are set
 Key: SPARK-29692
 URL: https://issues.apache.org/jira/browse/SPARK-29692
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Bago Amirbekian


With the new GPU/FPGA resource support in Spark, defaultParallelism may not be 
computed correctly. Specifically, defaultParallelism can be much higher than the 
total number of tasks that can actually run concurrently, for example when 
workers have many more cores than GPUs.

Steps to reproduce:
Start a cluster with spark.executor.resource.gpu.amount < cores per executor. 
Set spark.task.resource.gpu.amount = 1. Keep cores per task as 1.
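Below is a minimal sketch (not from the original report) of such a 
configuration; the concrete values, and the GPU discovery-script setup a real 
cluster would also need, are assumptions.

{code:scala}
import org.apache.spark.{SparkConf, SparkContext}

// Hedged sketch: 8 cores but only 2 GPUs per executor, 1 GPU and 1 core per task.
// A real cluster would also need spark.executor.resource.gpu.discoveryScript set.
val conf = new SparkConf()
  .setAppName("default-parallelism-vs-resources")
  .set("spark.executor.cores", "8")
  .set("spark.executor.resource.gpu.amount", "2")
  .set("spark.task.resource.gpu.amount", "1")
  .set("spark.task.cpus", "1")

val sc = new SparkContext(conf)

// Expected: bounded by GPUs (2 task slots per executor).
// Observed: defaultParallelism is derived from total cores (8 per executor),
// so it can be ~4x higher than the number of tasks that can run at once.
println(sc.defaultParallelism)
{code}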






[jira] [Created] (SPARK-27446) RBackend always uses default values for spark confs

2019-04-11 Thread Bago Amirbekian (JIRA)
Bago Amirbekian created SPARK-27446:
---

 Summary: RBackend always uses default values for spark confs
 Key: SPARK-27446
 URL: https://issues.apache.org/jira/browse/SPARK-27446
 Project: Spark
  Issue Type: Bug
  Components: SparkR
Affects Versions: 2.4.1
Reporter: Bago Amirbekian


The RBackend and RBackendHandler create new conf objects that don't pick up 
conf values from the existing SparkSession and therefore always use the default 
conf values instead of values specified by the user.
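A hedged illustration of the general pattern (not the actual RBackend code; the 
config key is just an example): a freshly constructed SparkConf only sees system 
properties and spark-defaults, not settings applied to the running session.

{code:scala}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Assumed example setting; any conf set on the session (rather than via
// system properties or spark-defaults) shows the same behaviour.
val spark = SparkSession.builder()
  .master("local[2]")
  .config("spark.r.backendConnectionTimeout", "300")
  .getOrCreate()

val fresh = new SparkConf()                    // what the backend effectively does
val fromSession = spark.sparkContext.getConf   // what it should consult instead

println(fresh.getOption("spark.r.backendConnectionTimeout"))       // None
println(fromSession.getOption("spark.r.backendConnectionTimeout")) // Some(300)
{code}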






[jira] [Commented] (SPARK-25921) Python worker reuse causes Barrier tasks to run without BarrierTaskContext

2018-11-01 Thread Bago Amirbekian (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16672293#comment-16672293
 ] 

Bago Amirbekian commented on SPARK-25921:
-

[~mengxr] [~jiangxb1987] Could you have a look?

> Python worker reuse causes Barrier tasks to run without BarrierTaskContext
> --
>
> Key: SPARK-25921
> URL: https://issues.apache.org/jira/browse/SPARK-25921
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.4.0
>Reporter: Bago Amirbekian
>Priority: Major
>
> Running a barrier job after a normal spark job causes the barrier job to run 
> without a BarrierTaskContext. Here is some code to reproduce.
>  
> {code:java}
> def task(*args):
>   from pyspark import BarrierTaskContext
>   context = BarrierTaskContext.get()
>   context.barrier()
>   print("in barrier phase")
>   context.barrier()
>   return []
> a = sc.parallelize(list(range(4))).map(lambda x: x ** 2).collect()
> assert a == [0, 1, 4, 9]
> b = sc.parallelize(list(range(4)), 4).barrier().mapPartitions(task).collect()
> {code}
>  
> Here is some of the trace
> {code:java}
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Could 
> not recover from a failed barrier ResultStage. Most recent failure reason: 
> Stage failed because barrier task ResultTask(14, 0) finished unsuccessfully.
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File "/databricks/spark/python/pyspark/worker.py", line 372, in main
> process()
>   File "/databricks/spark/python/pyspark/worker.py", line 367, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/databricks/spark/python/pyspark/rdd.py", line 2482, in func
> return f(iterator)
>   File "", line 4, in task
> AttributeError: 'TaskContext' object has no attribute 'barrier'
> {code}
>  






[jira] [Updated] (SPARK-25921) Python worker reuse causes Barrier tasks to run without BarrierTaskContext

2018-11-01 Thread Bago Amirbekian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bago Amirbekian updated SPARK-25921:

Description: 
Running a barrier job after a normal spark job causes the barrier job to run 
without a BarrierTaskContext. Here is some code to reproduce.

 
{code:java}
def task(*args):
  from pyspark import BarrierTaskContext
  context = BarrierTaskContext.get()
  context.barrier()
  print("in barrier phase")
  context.barrier()
  return []
a = sc.parallelize(list(range(4))).map(lambda x: x ** 2).collect()
assert a == [0, 1, 4, 9]
b = sc.parallelize(list(range(4)), 4).barrier().mapPartitions(task).collect()

{code}
 
Here is some of the trace

{code:java}
Py4JJavaError: An error occurred while calling 
z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Could not 
recover from a failed barrier ResultStage. Most recent failure reason: Stage 
failed because barrier task ResultTask(14, 0) finished unsuccessfully.
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/worker.py", line 372, in main
process()
  File "/databricks/spark/python/pyspark/worker.py", line 367, in process
serializer.dump_stream(func(split_index, iterator), outfile)
  File "/databricks/spark/python/pyspark/rdd.py", line 2482, in func
return f(iterator)
  File "", line 4, in task
AttributeError: 'TaskContext' object has no attribute 'barrier'
{code}
 

  was:
Running a barrier job after a normal spark job causes the barrier job to run 
without a BarrierTaskContext. Here is some code to reproduce.

 
{code:java}
def task(*args):
  from pyspark import BarrierTaskContext
  context = BarrierTaskContext.get()
  context.barrier()
  print("in barrier phase")
  context.barrier()
  return []
a = sc.parallelize(list(range(4))).map(lambda x: x ** 2).collect()
assert a == [0, 1, 4, 9]
b = sc.parallelize(list(range(4)), 4).barrier().mapPartitions(task).collect()

{code}
 


> Python worker reuse causes Barrier tasks to run without BarrierTaskContext
> --
>
> Key: SPARK-25921
> URL: https://issues.apache.org/jira/browse/SPARK-25921
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.4.0
>Reporter: Bago Amirbekian
>Priority: Major
>
> Running a barrier job after a normal spark job causes the barrier job to run 
> without a BarrierTaskContext. Here is some code to reproduce.
>  
> {code:java}
> def task(*args):
>   from pyspark import BarrierTaskContext
>   context = BarrierTaskContext.get()
>   context.barrier()
>   print("in barrier phase")
>   context.barrier()
>   return []
> a = sc.parallelize(list(range(4))).map(lambda x: x ** 2).collect()
> assert a == [0, 1, 4, 9]
> b = sc.parallelize(list(range(4)), 4).barrier().mapPartitions(task).collect()
> {code}
>  
> Here is some of the trace
> {code:java}
> Py4JJavaError: An error occurred while calling 
> z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Could 
> not recover from a failed barrier ResultStage. Most recent failure reason: 
> Stage failed because barrier task ResultTask(14, 0) finished unsuccessfully.
> org.apache.spark.api.python.PythonException: Traceback (most recent call 
> last):
>   File "/databricks/spark/python/pyspark/worker.py", line 372, in main
> process()
>   File "/databricks/spark/python/pyspark/worker.py", line 367, in process
> serializer.dump_stream(func(split_index, iterator), outfile)
>   File "/databricks/spark/python/pyspark/rdd.py", line 2482, in func
> return f(iterator)
>   File "", line 4, in task
> AttributeError: 'TaskContext' object has no attribute 'barrier'
> {code}
>  






[jira] [Updated] (SPARK-25921) Python worker reuse causes Barrier tasks to run without BarrierTaskContext

2018-11-01 Thread Bago Amirbekian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bago Amirbekian updated SPARK-25921:

Description: 
Running a barrier job after a normal spark job causes the barrier job to run 
without a BarrierTaskContext. Here is some code to reproduce.

 
{code:java}
def task(*args):
  from pyspark import BarrierTaskContext
  context = BarrierTaskContext.get()
  context.barrier()
  print("in barrier phase")
  context.barrier()
  return []
a = sc.parallelize(list(range(4))).map(lambda x: x ** 2).collect()
assert a == [0, 1, 4, 9]
b = sc.parallelize(list(range(4)), 4).barrier().mapPartitions(task).collect()

{code}
 

  was:
Running a barrier job after a normal spark job causes the barrier job to run 
without a BarrierTaskContext. Here is some code to reproduce.



 
{code:java}
def task(*args):
 from pyspark import BarrierTaskContext
 context = BarrierTaskContext.get()
 context.barrier()
 print("in barrier phase")
 context.barrier()
 return []
a = sc.parallelize(list(range(4))).map(lambda x: x ** 2).collect()
assert a == [0, 1, 4, 9]
b = sc.parallelize(list(range(4)), 4).barrier().mapPartitions(task).collect()

{code}
 


> Python worker reuse causes Barrier tasks to run without BarrierTaskContext
> --
>
> Key: SPARK-25921
> URL: https://issues.apache.org/jira/browse/SPARK-25921
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, Spark Core
>Affects Versions: 2.4.0
>Reporter: Bago Amirbekian
>Priority: Major
>
> Running a barrier job after a normal spark job causes the barrier job to run 
> without a BarrierTaskContext. Here is some code to reproduce.
>  
> {code:java}
> def task(*args):
>   from pyspark import BarrierTaskContext
>   context = BarrierTaskContext.get()
>   context.barrier()
>   print("in barrier phase")
>   context.barrier()
>   return []
> a = sc.parallelize(list(range(4))).map(lambda x: x ** 2).collect()
> assert a == [0, 1, 4, 9]
> b = sc.parallelize(list(range(4)), 4).barrier().mapPartitions(task).collect()
> {code}
>  






[jira] [Created] (SPARK-25921) Python worker reuse causes Barrier tasks to run without BarrierTaskContext

2018-11-01 Thread Bago Amirbekian (JIRA)
Bago Amirbekian created SPARK-25921:
---

 Summary: Python worker reuse causes Barrier tasks to run without 
BarrierTaskContext
 Key: SPARK-25921
 URL: https://issues.apache.org/jira/browse/SPARK-25921
 Project: Spark
  Issue Type: Bug
  Components: PySpark, Spark Core
Affects Versions: 2.4.0
Reporter: Bago Amirbekian


Running a barrier job after a normal spark job causes the barrier job to run 
without a BarrierTaskContext. Here is some code to reproduce.



 
{code:java}
def task(*args):
 from pyspark import BarrierTaskContext
 context = BarrierTaskContext.get()
 context.barrier()
 print("in barrier phase")
 context.barrier()
 return []
a = sc.parallelize(list(range(4))).map(lambda x: x ** 2).collect()
assert a == [0, 1, 4, 9]
b = sc.parallelize(list(range(4)), 4).barrier().mapPartitions(task).collect()

{code}
 






[jira] [Created] (SPARK-25268) runParallelPersonalizedPageRank throws serialization Exception

2018-08-28 Thread Bago Amirbekian (JIRA)
Bago Amirbekian created SPARK-25268:
---

 Summary: runParallelPersonalizedPageRank throws serialization 
Exception
 Key: SPARK-25268
 URL: https://issues.apache.org/jira/browse/SPARK-25268
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 2.4.0
Reporter: Bago Amirbekian


A recent change to PageRank introduced a bug in the 
ParallelPersonalizedPageRank implementation. The change prevents serialization 
of a Map which needs to be broadcast to all workers. The issue is in this line 
here: 
[https://github.com/apache/spark/blob/6c5cb85856235efd464b109558896f81ae2c4c75/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala#L201]

Because GraphX unit tests are run in local mode, the serialization issue is 
not caught.
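The non-serializable class in the trace below is the lazy view returned by 
Scala's mapValues; here is a hedged, standalone sketch of that Scala-level 
behaviour (not the PageRank code itself):

{code:scala}
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// In Scala 2.11/2.12, Map.mapValues returns a lazy view
// (scala.collection.immutable.MapLike$$anon$2) that is not serializable.
val lazyView     = Map(1L -> 0, 2L -> 1, 3L -> 2).mapValues(_ + 1)
val materialized = lazyView.map(identity)   // forces a plain, serializable Map

def isSerializable(o: AnyRef): Boolean =
  try { new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(o); true }
  catch { case _: NotSerializableException => false }

println(isSerializable(lazyView))     // false -- this is what broadcast hits
println(isSerializable(materialized)) // true  -- the usual fix is .map(identity)
{code}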

 
{code:java}
[info] - Star example parallel personalized PageRank *** FAILED *** (2 seconds, 
160 milliseconds)
[info] java.io.NotSerializableException: 
scala.collection.immutable.MapLike$$anon$2
[info] Serialization stack:
[info] - object not serializable (class: 
scala.collection.immutable.MapLike$$anon$2, value: Map(1 -> 
SparseVector(3)((0,1.0)), 2 -> SparseVector(3)((1,1.0)), 3 -> 
SparseVector(3)((2,1.0
[info] at 
org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
[info] at 
org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
[info] at 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:291)
[info] at 
org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:291)
[info] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1348)
[info] at 
org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:292)
[info] at 
org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:127)
[info] at 
org.apache.spark.broadcast.TorrentBroadcast.(TorrentBroadcast.scala:88)
[info] at 
org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
[info] at 
org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
[info] at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1489)
[info] at 
org.apache.spark.graphx.lib.PageRank$.runParallelPersonalizedPageRank(PageRank.scala:205)
[info] at 
org.apache.spark.graphx.lib.GraphXHelpers$.runParallelPersonalizedPageRank(GraphXHelpers.scala:31)
[info] at 
org.graphframes.lib.ParallelPersonalizedPageRank$.run(ParallelPersonalizedPageRank.scala:115)
[info] at 
org.graphframes.lib.ParallelPersonalizedPageRank.run(ParallelPersonalizedPageRank.scala:84)
[info] at 
org.graphframes.lib.ParallelPersonalizedPageRankSuite$$anonfun$2.apply$mcV$sp(ParallelPersonalizedPageRankSuite.scala:62)
[info] at 
org.graphframes.lib.ParallelPersonalizedPageRankSuite$$anonfun$2.apply(ParallelPersonalizedPageRankSuite.scala:51)
[info] at 
org.graphframes.lib.ParallelPersonalizedPageRankSuite$$anonfun$2.apply(ParallelPersonalizedPageRankSuite.scala:51)
[info] at 
org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
[info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
[info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info] at org.scalatest.Transformer.apply(Transformer.scala:22)
[info] at org.scalatest.Transformer.apply(Transformer.scala:20)
[info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
[info] at org.graphframes.SparkFunSuite.withFixture(SparkFunSuite.scala:40)
[info] at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
[info] at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
[info] at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
[info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
[info] at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
[info] at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
[info] at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
[info] at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
[info] at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
[info] at scala.collection.immutable.List.foreach(List.scala:383)
[info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
[info] at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:396)
[info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:483)
[info] at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:208)
[info] at org.scalatest.FunSuite.runTests(FunSuite.scala:1555)
[info] at org.scalatest.Suite$class.run(Suite.sca
{code}

[jira] [Updated] (SPARK-25149) Personalized Page Rank raises an error if vertexIDs are > MaxInt

2018-08-17 Thread Bago Amirbekian (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bago Amirbekian updated SPARK-25149:

Summary: Personalized Page Rank raises an error if vertexIDs are > MaxInt  
(was: ParallelPersonalizedPageRank raises an error if vertexIDs are > MaxInt)

> Personalized Page Rank raises an error if vertexIDs are > MaxInt
> 
>
> Key: SPARK-25149
> URL: https://issues.apache.org/jira/browse/SPARK-25149
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.3.1
>Reporter: Bago Amirbekian
>Priority: Major
> Fix For: 2.4.0
>
>
> Looking at the implementation, I don't think we actually need this check. In 
> the current implementation, the sparse vectors used and returned by the method 
> are indexed by the _position_ of the vertexId in `sources`, not by the vertex 
> ID itself. We should remove this check and add a test to confirm the 
> implementation works with large vertex IDs.






[jira] [Created] (SPARK-25149) ParallelPersonalizedPageRank raises an error if vertexIDs are > MaxInt

2018-08-17 Thread Bago Amirbekian (JIRA)
Bago Amirbekian created SPARK-25149:
---

 Summary: ParallelPersonalizedPageRank raises an error if vertexIDs 
are > MaxInt
 Key: SPARK-25149
 URL: https://issues.apache.org/jira/browse/SPARK-25149
 Project: Spark
  Issue Type: Bug
  Components: GraphX
Affects Versions: 2.3.1
Reporter: Bago Amirbekian
 Fix For: 2.4.0


Looking at the implementation, I don't think we actually need this check. In the 
current implementation, the sparse vectors used and returned by the method are 
indexed by the _position_ of the vertexId in `sources`, not by the vertex ID 
itself. We should remove this check and add a test to confirm the implementation 
works with large vertex IDs.
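A hedged sketch of the kind of test described above (the graph shape and source 
choice are made up); the vertex IDs deliberately exceed Int.MaxValue and a 
SparkContext `sc` is assumed to be in scope:

{code:scala}
import org.apache.spark.graphx.{Edge, Graph}

val big = Int.MaxValue.toLong + 1L
val edges = sc.parallelize(Seq(
  Edge(big, big + 1L, 1.0),
  Edge(big + 1L, big + 2L, 1.0),
  Edge(big + 2L, big, 1.0)))
val graph = Graph.fromEdges(edges, defaultValue = 1.0)

// The returned sparse vectors are indexed by position in `sources`,
// so IDs larger than MaxInt should work once the check is removed.
val sources = Array(big, big + 1L)
val ranks = graph.staticParallelPersonalizedPageRank(sources, 10, 0.15)
ranks.vertices.collect().foreach(println)
{code}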






[jira] [Created] (SPARK-24852) Have spark.ml training use updated `Instrumentation` APIs.

2018-07-18 Thread Bago Amirbekian (JIRA)
Bago Amirbekian created SPARK-24852:
---

 Summary: Have spark.ml training use updated `Instrumentation` APIs.
 Key: SPARK-24852
 URL: https://issues.apache.org/jira/browse/SPARK-24852
 Project: Spark
  Issue Type: Story
  Components: ML
Affects Versions: 2.4.0
Reporter: Bago Amirbekian


Port spark.ml code to use the new methods on the `Instrumentation` class and 
remove the old methods & constructor.






[jira] [Created] (SPARK-24747) Make spark.ml.util.Instrumentation class more flexible

2018-07-05 Thread Bago Amirbekian (JIRA)
Bago Amirbekian created SPARK-24747:
---

 Summary: Make spark.ml.util.Instrumentation class more flexible
 Key: SPARK-24747
 URL: https://issues.apache.org/jira/browse/SPARK-24747
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.3.1
Reporter: Bago Amirbekian


The Instrumentation class (which is an internal private class) is somewhat 
limited by its current APIs. The class requires that an estimator and dataset be 
passed to the constructor, which limits how it can be used. Furthermore, the 
current APIs make it hard to intercept failures and record anything related to 
those failures.

The following changes could make the instrumentation class easier to work with. 
All these changes are for private APIs and should not be visible to users.
{code}
// New no-argument constructor:
Instrumentation()

// New api to log previous constructor arguments.
logTrainingContext(estimator: Estimator[_], dataset: Dataset[_])

logFailure(e: Throwable): Unit

// Log success with no arguments
logSuccess(): Unit

// Log result model explicitly instead of passing to logSuccess
logModel(model: Model[_]): Unit

// On Companion object
Instrumentation.instrumented[T](body: (Instrumentation => T)): T

// The above API will allow us to write instrumented methods more clearly and
// handle logging success and failure automatically:
def someMethod(...): T = instrumented { instr =>
  instr.logNamedValue(name, value)
  // more code here
  instr.logModel(model)
}

{code}






[jira] [Created] (SPARK-23686) Make better usage of org.apache.spark.ml.util.Instrumentation

2018-03-14 Thread Bago Amirbekian (JIRA)
Bago Amirbekian created SPARK-23686:
---

 Summary: Make better usage of 
org.apache.spark.ml.util.Instrumentation
 Key: SPARK-23686
 URL: https://issues.apache.org/jira/browse/SPARK-23686
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.3.0
Reporter: Bago Amirbekian


This Jira is a bit high level and might require subtasks or other jiras for 
more specific tasks.

I've noticed that we don't make the best use of the instrumentation class. 
Specifically, sometimes we bypass the instrumentation class and use the debugger 
instead. For example, 
[https://github.com/apache/spark/blob/9b9827759af2ca3eea146a6032f9165f640ce152/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L143]

Also there are some things that might be useful to log in the instrumentation 
class that we currently don't. For example:

number of training examples
mean/var of label (regression)

I know computing these things can be expensive in some cases, but especially 
when this data is already available we can log it for free. For example, 
Logistic Regression Summarizer computes some useful data including numRows that 
we don't log.
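A hedged sketch of what that could look like using the (internal, private) 
Instrumentation API proposed in SPARK-24747; the helper and names here are 
illustrative, not existing Spark code:

{code:scala}
import org.apache.spark.ml.util.Instrumentation.instrumented
import org.apache.spark.sql.Dataset

// Illustrative only: inside an ML estimator's train(), stats that a summarizer
// has already computed can be logged at essentially no extra cost.
def trainWithStats[M](dataset: Dataset[_])(doTrain: Dataset[_] => M): M =
  instrumented { instr =>
    val numExamples = dataset.count()   // or reuse the summarizer's numRows
    instr.logNamedValue("numExamples", numExamples)
    doTrain(dataset)
  }
{code}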

 






[jira] [Created] (SPARK-23562) RFormula handleInvalid should handle invalid values in non-string columns.

2018-03-01 Thread Bago Amirbekian (JIRA)
Bago Amirbekian created SPARK-23562:
---

 Summary: RFormula handleInvalid should handle invalid values in 
non-string columns.
 Key: SPARK-23562
 URL: https://issues.apache.org/jira/browse/SPARK-23562
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.3.0
Reporter: Bago Amirbekian


Currently, when handleInvalid is set to 'keep' or 'skip', this only applies to 
String fields. Numeric fields that are null will either cause the transformer 
to fail or end up as null in the resulting label column.

I'm not sure what the semantics of keep might be for numeric columns with null 
values, but we should be able to at least support skip for these types.
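A hedged illustration of the current gap (toy column names; assumes a 
SparkSession named `spark` with implicits in scope):

{code:scala}
import org.apache.spark.ml.feature.RFormula
import spark.implicits._

val df = Seq(
  (Some(1.0), "a", 1.0),
  (None,      "b", 0.0)    // null numeric feature
).toDF("x", "cat", "label")

val formula = new RFormula()
  .setFormula("label ~ x + cat")
  .setHandleInvalid("skip")  // today this only guards the string column "cat"

// The null in "x" still reaches the underlying VectorAssembler, so this either
// fails or leaves nulls behind instead of skipping the row.
formula.fit(df).transform(df).show()
{code}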






[jira] [Comment Edited] (SPARK-23333) SparkML VectorAssembler.transform slow when needing to invoke .first() on sorted DataFrame

2018-02-27 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379296#comment-16379296
 ] 

Bago Amirbekian edited comment on SPARK-23333 at 2/27/18 9:24 PM:
--

[~MBALearnsToCode] you can use a `VectorSizeHint` transformer to include 
`numAttributes` in the dataframe column metadata and avoid the call to `first`. 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorSizeHint.scala#L42


was (Author: bago.amirbekian):
[~MBALearnsToCode] you can use a `VectorSizeHint` transformer to include 
`numAttributes` in the dataframe column metadata and avoid the call to `first`. 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorSizeHint.scala

> SparkML VectorAssembler.transform slow when needing to invoke .first() on 
> sorted DataFrame
> --
>
> Key: SPARK-23333
> URL: https://issues.apache.org/jira/browse/SPARK-23333
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, SQL
>Affects Versions: 2.2.1
>Reporter: V Luong
>Priority: Minor
>
> Under certain circumstances, newDF = vectorAssembler.transform(oldDF) invokes 
> oldDF.first() in order to establish some metadata/attributes: 
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L88.]
>  When oldDF is sorted, the above triggering of oldDF.first() can be very slow.
> For the purpose of establishing metadata, taking an arbitrary row from oldDF 
> will be just as good as taking oldDF.first(). Is there hence a way we can 
> speed up a great deal by somehow grabbing a random row, instead of relying on 
> oldDF.first()?






[jira] [Commented] (SPARK-23333) SparkML VectorAssembler.transform slow when needing to invoke .first() on sorted DataFrame

2018-02-27 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379296#comment-16379296
 ] 

Bago Amirbekian commented on SPARK-23333:
-

[~MBALearnsToCode] you can use a `VectorSizeHint` transformer to include 
`numAttributes` in the dataframe column metadata and avoid the call to `first`. 
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorSizeHint.scala
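A hedged sketch of that suggestion (column names and the size value are 
placeholders):

{code:scala}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{VectorAssembler, VectorSizeHint}

// Declaring the vector size writes it into the column metadata up front,
// so VectorAssembler no longer needs to call first() on the (sorted) DataFrame.
val sizeHint = new VectorSizeHint()
  .setInputCol("rawFeatures")
  .setSize(3)                 // the known numAttributes for this column
  .setHandleInvalid("error")

val assembler = new VectorAssembler()
  .setInputCols(Array("rawFeatures", "otherCol"))
  .setOutputCol("features")

val pipeline = new Pipeline().setStages(Array(sizeHint, assembler))
{code}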

> SparkML VectorAssembler.transform slow when needing to invoke .first() on 
> sorted DataFrame
> --
>
> Key: SPARK-23333
> URL: https://issues.apache.org/jira/browse/SPARK-23333
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib, SQL
>Affects Versions: 2.2.1
>Reporter: V Luong
>Priority: Minor
>
> Under certain circumstances, newDF = vectorAssembler.transform(oldDF) invokes 
> oldDF.first() in order to establish some metadata/attributes: 
> [https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L88.]
>  When oldDF is sorted, the above triggering of oldDF.first() can be very slow.
> For the purpose of establishing metadata, taking an arbitrary row from oldDF 
> will be just as good as taking oldDF.first(). Is there hence a way we can 
> speed up a great deal by somehow grabbing a random row, instead of relying on 
> oldDF.first()?






[jira] [Commented] (SPARK-19947) RFormulaModel always throws Exception on transforming data with NULL or Unseen labels

2018-02-27 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379282#comment-16379282
 ] 

Bago Amirbekian commented on SPARK-19947:
-

I think this was resolved by [https://github.com/apache/spark/pull/18496] and 
[https://github.com/apache/spark/pull/18613].

> RFormulaModel always throws Exception on transforming data with NULL or 
> Unseen labels
> -
>
> Key: SPARK-19947
> URL: https://issues.apache.org/jira/browse/SPARK-19947
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Andrey Yatsuk
>Priority: Major
>
> I have a trained ML model and a big data table in Parquet. I want to add a new 
> column with predicted values to this table. I can't lose any data, but the 
> data can contain null values.
> The RFormulaModel.fit() method creates a new StringIndexer with the default 
> (handleInvalid="error") parameter. VectorAssembler also throws an exception on 
> NULL values. So I must call df.na.drop() before transforming this DataFrame, 
> and I don't want to do that.
> We need to add a new parameter to RFormula, like handleInvalid in StringIndexer.
> Or add a transform(Seq): Vector method which the user can use as a UDF in 
> df.withColumn("predicted", functions.callUDF(rFormulaModel::transform, 
> Seq))






[jira] [Comment Edited] (SPARK-23471) RandomForestClassificationModel save() - incorrect metadata

2018-02-27 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379202#comment-16379202
 ] 

Bago Amirbekian edited comment on SPARK-23471 at 2/27/18 8:07 PM:
--

[~Keepun], `train` is a protected API, it's called by `Predictor.fit` which 
also copies the values of Params to the newly created Model instance, 
[here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118].
 Do you get this same issue if you use `RandomForestClassifier.fit`?
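A hedged sketch of that suggestion in Scala (the reporter's snippet is Java; 
`data` and the output path are assumed):

{code:scala}
import org.apache.spark.ml.classification.RandomForestClassifier

val rf = new RandomForestClassifier()
  .setFeaturesCol("features")
  .setLabelCol("result")
  .setNumTrees(100)
  .setMaxDepth(30)

// fit() calls train() internally and then copies the param values onto the
// returned model, so numTrees=100 / maxDepth=30 end up in the saved metadata.
val model = rf.fit(data)
model.write.overwrite().save("/tmp/rf-model")
{code}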


was (Author: bago.amirbekian):
[~Keepun] 
{noformat} train {noformat} is a protected API, it's called by {Predictor.fit} 
which also copies the values of Params to the newly created Model instance, 
[here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118].
 Do you get this same issue if you use {RandomForestClassifier.fit}?

> RandomForestClassificationModel save() - incorrect metadata
> ---
>
> Key: SPARK-23471
> URL: https://issues.apache.org/jira/browse/SPARK-23471
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Keepun
>Priority: Major
>
> RandomForestClassificationMode.load() does not work after save() 
> {code:java}
> RandomForestClassifier rf = new RandomForestClassifier()
> .setFeaturesCol("features")
> .setLabelCol("result")
> .setNumTrees(100)
> .setMaxDepth(30)
> .setMinInstancesPerNode(1)
> //.setCacheNodeIds(true)
> .setMaxMemoryInMB(500)
> .setSeed(System.currentTimeMillis() + System.nanoTime());
> RandomForestClassificationModel rfmodel = rf.train(data);
>try {
>   rfmodel.save(args[2] + "." + System.currentTimeMillis());
>} catch (IOException e) {
>   LOG.error(e.getMessage(), e);
>   e.printStackTrace();
>}
> {code}
> File metadata\part-0: 
> {code:java}
> {"class":"org.apache.spark.ml.classification.RandomForestClassificationModel",
> "timestamp":1519136783983,"sparkVersion":"2.2.1","uid":"rfc_7c7e84ce7488",
> "paramMap":{"featureSubsetStrategy":"auto","cacheNodeIds":false,"impurity":"gini",
> "checkpointInterval":10,
> "numTrees":20,"maxDepth":5,
> "probabilityCol":"probability","labelCol":"label","featuresCol":"features",
> "maxMemoryInMB":256,"minInstancesPerNode":1,"subsamplingRate":1.0,
> "rawPredictionCol":"rawPrediction","predictionCol":"prediction","maxBins":32,
> "minInfoGain":0.0,"seed":-491520797},"numFeatures":1354,"numClasses":2,
> "numTrees":20}
> {code}
> should be:
> {code:java}
> "numTrees":100,"maxDepth":30,{code}
>  






[jira] [Comment Edited] (SPARK-23471) RandomForestClassificationModel save() - incorrect metadata

2018-02-27 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379202#comment-16379202
 ] 

Bago Amirbekian edited comment on SPARK-23471 at 2/27/18 8:06 PM:
--

[~Keepun] 
{code:java}
train
{code}
 is a protected API, it's called by {Predictor.fit} which also copies the 
values of Params to the newly created Model instance, 
[here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118].
 Do you get this same issue if you use {RandomForestClassifier.fit}?


was (Author: bago.amirbekian):
[~Keepun] {train} is a protected API, it's called by {Predictor.fit} which also 
copies the values of Params to the newly created Model instance, 
[here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118].
 Do you get this same issue if you use {RandomForestClassifier.fit}?

> RandomForestClassificationModel save() - incorrect metadata
> ---
>
> Key: SPARK-23471
> URL: https://issues.apache.org/jira/browse/SPARK-23471
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Keepun
>Priority: Major
>
> RandomForestClassificationMode.load() does not work after save() 
> {code:java}
> RandomForestClassifier rf = new RandomForestClassifier()
> .setFeaturesCol("features")
> .setLabelCol("result")
> .setNumTrees(100)
> .setMaxDepth(30)
> .setMinInstancesPerNode(1)
> //.setCacheNodeIds(true)
> .setMaxMemoryInMB(500)
> .setSeed(System.currentTimeMillis() + System.nanoTime());
> RandomForestClassificationModel rfmodel = rf.train(data);
>try {
>   rfmodel.save(args[2] + "." + System.currentTimeMillis());
>} catch (IOException e) {
>   LOG.error(e.getMessage(), e);
>   e.printStackTrace();
>}
> {code}
> File metadata\part-0: 
> {code:java}
> {"class":"org.apache.spark.ml.classification.RandomForestClassificationModel",
> "timestamp":1519136783983,"sparkVersion":"2.2.1","uid":"rfc_7c7e84ce7488",
> "paramMap":{"featureSubsetStrategy":"auto","cacheNodeIds":false,"impurity":"gini",
> "checkpointInterval":10,
> "numTrees":20,"maxDepth":5,
> "probabilityCol":"probability","labelCol":"label","featuresCol":"features",
> "maxMemoryInMB":256,"minInstancesPerNode":1,"subsamplingRate":1.0,
> "rawPredictionCol":"rawPrediction","predictionCol":"prediction","maxBins":32,
> "minInfoGain":0.0,"seed":-491520797},"numFeatures":1354,"numClasses":2,
> "numTrees":20}
> {code}
> should be:
> {code:java}
> "numTrees":100,"maxDepth":30,{code}
>  






[jira] [Comment Edited] (SPARK-23471) RandomForestClassificationModel save() - incorrect metadata

2018-02-27 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379202#comment-16379202
 ] 

Bago Amirbekian edited comment on SPARK-23471 at 2/27/18 8:06 PM:
--

[~Keepun] 
{noformat} train {noformat} is a protected API, it's called by {Predictor.fit} 
which also copies the values of Params to the newly created Model instance, 
[here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118].
 Do you get this same issue if you use {RandomForestClassifier.fit}?


was (Author: bago.amirbekian):
[~Keepun] 
{code:java}
train
{code}
 is a protected API, it's called by {Predictor.fit} which also copies the 
values of Params to the newly created Model instance, 
[here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118].
 Do you get this same issue if you use {RandomForestClassifier.fit}?

> RandomForestClassificationModel save() - incorrect metadata
> ---
>
> Key: SPARK-23471
> URL: https://issues.apache.org/jira/browse/SPARK-23471
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Keepun
>Priority: Major
>
> RandomForestClassificationMode.load() does not work after save() 
> {code:java}
> RandomForestClassifier rf = new RandomForestClassifier()
> .setFeaturesCol("features")
> .setLabelCol("result")
> .setNumTrees(100)
> .setMaxDepth(30)
> .setMinInstancesPerNode(1)
> //.setCacheNodeIds(true)
> .setMaxMemoryInMB(500)
> .setSeed(System.currentTimeMillis() + System.nanoTime());
> RandomForestClassificationModel rfmodel = rf.train(data);
>try {
>   rfmodel.save(args[2] + "." + System.currentTimeMillis());
>} catch (IOException e) {
>   LOG.error(e.getMessage(), e);
>   e.printStackTrace();
>}
> {code}
> File metadata\part-0: 
> {code:java}
> {"class":"org.apache.spark.ml.classification.RandomForestClassificationModel",
> "timestamp":1519136783983,"sparkVersion":"2.2.1","uid":"rfc_7c7e84ce7488",
> "paramMap":{"featureSubsetStrategy":"auto","cacheNodeIds":false,"impurity":"gini",
> "checkpointInterval":10,
> "numTrees":20,"maxDepth":5,
> "probabilityCol":"probability","labelCol":"label","featuresCol":"features",
> "maxMemoryInMB":256,"minInstancesPerNode":1,"subsamplingRate":1.0,
> "rawPredictionCol":"rawPrediction","predictionCol":"prediction","maxBins":32,
> "minInfoGain":0.0,"seed":-491520797},"numFeatures":1354,"numClasses":2,
> "numTrees":20}
> {code}
> should be:
> {code:java}
> "numTrees":100,"maxDepth":30,{code}
>  






[jira] [Comment Edited] (SPARK-23471) RandomForestClassificationModel save() - incorrect metadata

2018-02-27 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379202#comment-16379202
 ] 

Bago Amirbekian edited comment on SPARK-23471 at 2/27/18 8:04 PM:
--

[~Keepun] {train} is a protected API, it's called by {Predictor.fit} which also 
copies the values of Params to the newly created Model instance, 
[here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118].
 Do you get this same issue if you use {RandomForestClassifier.fit|?


was (Author: bago.amirbekian):
[~Keepun] `train` is a protected API, it's called by `Predictor.fit` which also 
copies the values of Params to the newly created Model instance, 
[here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118].
 Do you get this same issue if you use `RandomForestClassifier.fit`?

> RandomForestClassificationModel save() - incorrect metadata
> ---
>
> Key: SPARK-23471
> URL: https://issues.apache.org/jira/browse/SPARK-23471
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Keepun
>Priority: Major
>
> RandomForestClassificationMode.load() does not work after save() 
> {code:java}
> RandomForestClassifier rf = new RandomForestClassifier()
> .setFeaturesCol("features")
> .setLabelCol("result")
> .setNumTrees(100)
> .setMaxDepth(30)
> .setMinInstancesPerNode(1)
> //.setCacheNodeIds(true)
> .setMaxMemoryInMB(500)
> .setSeed(System.currentTimeMillis() + System.nanoTime());
> RandomForestClassificationModel rfmodel = rf.train(data);
>try {
>   rfmodel.save(args[2] + "." + System.currentTimeMillis());
>} catch (IOException e) {
>   LOG.error(e.getMessage(), e);
>   e.printStackTrace();
>}
> {code}
> File metadata\part-0: 
> {code:java}
> {"class":"org.apache.spark.ml.classification.RandomForestClassificationModel",
> "timestamp":1519136783983,"sparkVersion":"2.2.1","uid":"rfc_7c7e84ce7488",
> "paramMap":{"featureSubsetStrategy":"auto","cacheNodeIds":false,"impurity":"gini",
> "checkpointInterval":10,
> "numTrees":20,"maxDepth":5,
> "probabilityCol":"probability","labelCol":"label","featuresCol":"features",
> "maxMemoryInMB":256,"minInstancesPerNode":1,"subsamplingRate":1.0,
> "rawPredictionCol":"rawPrediction","predictionCol":"prediction","maxBins":32,
> "minInfoGain":0.0,"seed":-491520797},"numFeatures":1354,"numClasses":2,
> "numTrees":20}
> {code}
> should be:
> {code:java}
> "numTrees":100,"maxDepth":30,{code}
>  






[jira] [Comment Edited] (SPARK-23471) RandomForestClassificationModel save() - incorrect metadata

2018-02-27 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379202#comment-16379202
 ] 

Bago Amirbekian edited comment on SPARK-23471 at 2/27/18 8:04 PM:
--

[~Keepun] {train} is a protected API, it's called by {Predictor.fit} which also 
copies the values of Params to the newly created Model instance, 
[here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118].
 Do you get this same issue if you use {RandomForestClassifier.fit}?


was (Author: bago.amirbekian):
[~Keepun] {train} is a protected API, it's called by {Predictor.fit} which also 
copies the values of Params to the newly created Model instance, 
[here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118].
 Do you get this same issue if you use {RandomForestClassifier.fit|?

> RandomForestClassificationModel save() - incorrect metadata
> ---
>
> Key: SPARK-23471
> URL: https://issues.apache.org/jira/browse/SPARK-23471
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Keepun
>Priority: Major
>
> RandomForestClassificationMode.load() does not work after save() 
> {code:java}
> RandomForestClassifier rf = new RandomForestClassifier()
> .setFeaturesCol("features")
> .setLabelCol("result")
> .setNumTrees(100)
> .setMaxDepth(30)
> .setMinInstancesPerNode(1)
> //.setCacheNodeIds(true)
> .setMaxMemoryInMB(500)
> .setSeed(System.currentTimeMillis() + System.nanoTime());
> RandomForestClassificationModel rfmodel = rf.train(data);
>try {
>   rfmodel.save(args[2] + "." + System.currentTimeMillis());
>} catch (IOException e) {
>   LOG.error(e.getMessage(), e);
>   e.printStackTrace();
>}
> {code}
> File metadata\part-0: 
> {code:java}
> {"class":"org.apache.spark.ml.classification.RandomForestClassificationModel",
> "timestamp":1519136783983,"sparkVersion":"2.2.1","uid":"rfc_7c7e84ce7488",
> "paramMap":{"featureSubsetStrategy":"auto","cacheNodeIds":false,"impurity":"gini",
> "checkpointInterval":10,
> "numTrees":20,"maxDepth":5,
> "probabilityCol":"probability","labelCol":"label","featuresCol":"features",
> "maxMemoryInMB":256,"minInstancesPerNode":1,"subsamplingRate":1.0,
> "rawPredictionCol":"rawPrediction","predictionCol":"prediction","maxBins":32,
> "minInfoGain":0.0,"seed":-491520797},"numFeatures":1354,"numClasses":2,
> "numTrees":20}
> {code}
> should be:
> {code:java}
> "numTrees":100,"maxDepth":30,{code}
>  






[jira] [Commented] (SPARK-23471) RandomForestClassificationModel save() - incorrect metadata

2018-02-27 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16379202#comment-16379202
 ] 

Bago Amirbekian commented on SPARK-23471:
-

[~Keepun] `train` is a protected API, it's called by `Predictor.fit` which also 
copies the values of Params to the newly created Model instance, 
[here|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Predictor.scala#L118].
 Do you get this same issue if you use `RandomForestClassifier.fit`?

> RandomForestClassificationModel save() - incorrect metadata
> ---
>
> Key: SPARK-23471
> URL: https://issues.apache.org/jira/browse/SPARK-23471
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Keepun
>Priority: Major
>
> RandomForestClassificationMode.load() does not work after save() 
> {code:java}
> RandomForestClassifier rf = new RandomForestClassifier()
> .setFeaturesCol("features")
> .setLabelCol("result")
> .setNumTrees(100)
> .setMaxDepth(30)
> .setMinInstancesPerNode(1)
> //.setCacheNodeIds(true)
> .setMaxMemoryInMB(500)
> .setSeed(System.currentTimeMillis() + System.nanoTime());
> RandomForestClassificationModel rfmodel = rf.train(data);
>try {
>   rfmodel.save(args[2] + "." + System.currentTimeMillis());
>} catch (IOException e) {
>   LOG.error(e.getMessage(), e);
>   e.printStackTrace();
>}
> {code}
> File metadata\part-0: 
> {code:java}
> {"class":"org.apache.spark.ml.classification.RandomForestClassificationModel",
> "timestamp":1519136783983,"sparkVersion":"2.2.1","uid":"rfc_7c7e84ce7488",
> "paramMap":{"featureSubsetStrategy":"auto","cacheNodeIds":false,"impurity":"gini",
> "checkpointInterval":10,
> "numTrees":20,"maxDepth":5,
> "probabilityCol":"probability","labelCol":"label","featuresCol":"features",
> "maxMemoryInMB":256,"minInstancesPerNode":1,"subsamplingRate":1.0,
> "rawPredictionCol":"rawPrediction","predictionCol":"prediction","maxBins":32,
> "minInfoGain":0.0,"seed":-491520797},"numFeatures":1354,"numClasses":2,
> "numTrees":20}
> {code}
> should be:
> {code:java}
> "numTrees":100,"maxDepth":30,{code}
>  






[jira] [Commented] (SPARK-23265) Update multi-column error handling logic in QuantileDiscretizer

2018-02-15 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16366277#comment-16366277
 ] 

Bago Amirbekian commented on SPARK-23265:
-

What's the status of this? Will this be a change in behaviour?

> Update multi-column error handling logic in QuantileDiscretizer
> ---
>
> Key: SPARK-23265
> URL: https://issues.apache.org/jira/browse/SPARK-23265
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Nick Pentreath
>Priority: Major
>
> SPARK-22397 added support for multiple columns to {{QuantileDiscretizer}}. If 
> both single- and multi-column params are set (specifically {{inputCol}} / 
> {{inputCols}}), an error is thrown.
> However, SPARK-22799 added more comprehensive error logic for {{Bucketizer}}. 
> The logic for {{QuantileDiscretizer}} should be updated to match. *Note* that 
> for this transformer, it is acceptable to set the single-column param for 
> {{numBuckets}} when transforming multiple columns, since that is then applied 
> to all columns.






[jira] [Updated] (SPARK-23377) Bucketizer with multiple columns persistence bug

2018-02-09 Thread Bago Amirbekian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bago Amirbekian updated SPARK-23377:

Description: 
A Bucketizer with multiple input/output columns gets "inputCol" set to the 
default value on write -> read, which causes it to throw an error on transform. 
Here's an example.


{code:java}
import org.apache.spark.ml.feature._

val splits = Array(Double.NegativeInfinity, 0, 10, 100, Double.PositiveInfinity)
val bucketizer = new Bucketizer()
  .setSplitsArray(Array(splits, splits))
  .setInputCols(Array("foo1", "foo2"))
  .setOutputCols(Array("bar1", "bar2"))

val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2")
bucketizer.transform(data)

val path = "/temp/bucketrizer-persist-test"
bucketizer.write.overwrite.save(path)
val bucketizerAfterRead = Bucketizer.read.load(path)
println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol))
// This line throws an error because "outputCol" is set
bucketizerAfterRead.transform(data)
{code}

And the trace:

{code:java}
java.lang.IllegalArgumentException: Bucketizer bucketizer_6f0acc3341f7 has the 
inputCols Param set for multi-column transform. The following Params are not 
applicable and should not be set: outputCol.
at 
org.apache.spark.ml.param.ParamValidators$.checkExclusiveParams$1(params.scala:300)
at 
org.apache.spark.ml.param.ParamValidators$.checkSingleVsMultiColumnParams(params.scala:314)
at 
org.apache.spark.ml.feature.Bucketizer.transformSchema(Bucketizer.scala:189)
at 
org.apache.spark.ml.feature.Bucketizer.transform(Bucketizer.scala:141)
at 
line251821108a8a433da484ee31f166c83725.$read$$iw$$iw$$iw$$iw$$iw$$iw.(command-6079631:17)

{code}



  was:
A Bucketizer with multiple input/output columns gets "inputCol" set to the 
default value on write -> read, which causes it to throw an error on transform. 
Here's an example.


{code:java}
import org.apache.spark.ml.feature._

val splits = Array(Double.NegativeInfinity, 0, 10, 100, Double.PositiveInfinity)
val bucketizer = new Bucketizer()
  .setSplitsArray(Array(splits, splits))
  .setInputCols(Array("foo1", "foo2"))
  .setOutputCols(Array("bar1", "bar2"))

val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2")
bucketizer.transform(data)

val path = "/temp/bucketrizer-persist-test"
bucketizer.write.overwrite.save(path)
val bucketizerAfterRead = Bucketizer.read.load(path)
println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol))
// This line throws an error because "outputCol" is set
bucketizerAfterRead.transform(data)
{code}


> Bucketizer with multiple columns persistence bug
> 
>
> Key: SPARK-23377
> URL: https://issues.apache.org/jira/browse/SPARK-23377
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Priority: Major
>
> A Bucketizer with multiple input/output columns gets "inputCol" set to the 
> default value on write -> read, which causes it to throw an error on 
> transform. Here's an example.
> {code:java}
> import org.apache.spark.ml.feature._
> val splits = Array(Double.NegativeInfinity, 0, 10, 100, 
> Double.PositiveInfinity)
> val bucketizer = new Bucketizer()
>   .setSplitsArray(Array(splits, splits))
>   .setInputCols(Array("foo1", "foo2"))
>   .setOutputCols(Array("bar1", "bar2"))
> val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2")
> bucketizer.transform(data)
> val path = "/temp/bucketrizer-persist-test"
> bucketizer.write.overwrite.save(path)
> val bucketizerAfterRead = Bucketizer.read.load(path)
> println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol))
> // This line throws an error because "outputCol" is set
> bucketizerAfterRead.transform(data)
> {code}
> And the trace:
> {code:java}
> java.lang.IllegalArgumentException: Bucketizer bucketizer_6f0acc3341f7 has 
> the inputCols Param set for multi-column transform. The following Params are 
> not applicable and should not be set: outputCol.
>   at 
> org.apache.spark.ml.param.ParamValidators$.checkExclusiveParams$1(params.scala:300)
>   at 
> org.apache.spark.ml.param.ParamValidators$.checkSingleVsMultiColumnParams(params.scala:314)
>   at 
> org.apache.spark.ml.feature.Bucketizer.transformSchema(Bucketizer.scala:189)
>   at 
> org.apache.spark.ml.feature.Bucketizer.transform(Bucketizer.scala:141)
>   at 
> line251821108a8a433da484ee31f166c83725.$read$$iw$$iw$$iw$$iw$$iw$$iw.(command-6079631:17)
> {code}






[jira] [Created] (SPARK-23377) Bucketizer with multiple columns persistence bug

2018-02-09 Thread Bago Amirbekian (JIRA)
Bago Amirbekian created SPARK-23377:
---

 Summary: Bucketizer with multiple columns persistence bug
 Key: SPARK-23377
 URL: https://issues.apache.org/jira/browse/SPARK-23377
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.3.0
Reporter: Bago Amirbekian


A Bucketizer with multiple input/output columns gets "inputCol" set to the 
default value on write -> read, which causes it to throw an error on transform. 
Here's an example.


{code:java}
import org.apache.spark.ml.feature._

val splits = Array(Double.NegativeInfinity, 0, 10, 100, Double.PositiveInfinity)
val bucketizer = new Bucketizer()
  .setSplitsArray(Array(splits, splits))
  .setInputCols(Array("foo1", "foo2"))
  .setOutputCols(Array("bar1", "bar2"))

val data = Seq((1.0, 2.0), (10.0, 100.0), (101.0, -1.0)).toDF("foo1", "foo2")
bucketizer.transform(data)

val path = "/temp/bucketrizer-persist-test"
bucketizer.write.overwrite.save(path)
val bucketizerAfterRead = Bucketizer.read.load(path)
println(bucketizerAfterRead.isDefined(bucketizerAfterRead.outputCol))
// This line throws an error because "outputCol" is set
bucketizerAfterRead.transform(data)
{code}






[jira] [Commented] (SPARK-23105) Spark MLlib, GraphX 2.3 QA umbrella

2018-01-25 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340448#comment-16340448
 ] 

Bago Amirbekian commented on SPARK-23105:
-

[~mlnick] We can update the sub-tasks to target 2.3 if you think it's 
appropriate. I don't know how involved Joseph can be for this release, so we 
might need another committer to shepherd these tasks; I can take on some of it.

> Spark MLlib, GraphX 2.3 QA umbrella
> ---
>
> Key: SPARK-23105
> URL: https://issues.apache.org/jira/browse/SPARK-23105
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
>
> This JIRA lists tasks for the next Spark release's QA period for MLlib and 
> GraphX. *SparkR is separate: SPARK-23114.*
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
>  * Check binary API compatibility for Scala/Java
>  * Audit new public APIs (from the generated html doc)
>  ** Scala
>  ** Java compatibility
>  ** Python coverage
>  * Check Experimental, DeveloperApi tags
> h2. Algorithms and performance
>  * Performance tests
> h2. Documentation and example code
>  * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
>  * Update Programming Guide
>  * Update website






[jira] [Commented] (SPARK-23109) ML 2.3 QA: API: Python API coverage

2018-01-25 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340399#comment-16340399
 ] 

Bago Amirbekian commented on SPARK-23109:
-

[~bryanc] One reason the Python API might be different is that in Python we 
can use `imageRow.height` in place of `getHeight(imageRow)`, so the getters 
don't add much value. Also, `toNDArray` doesn't make sense in Scala. I think we 
should add `columnSchema` to the Python API, but it doesn't need to block 
the release IMHO.

> ML 2.3 QA: API: Python API coverage
> ---
>
> Key: SPARK-23109
> URL: https://issues.apache.org/jira/browse/SPARK-23109
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, PySpark
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> For new public APIs added to MLlib ({{spark.ml}} only), we need to check the 
> generated HTML doc and compare the Scala & Python versions.
> * *GOAL*: Audit and create JIRAs to fix in the next release.
> * *NON-GOAL*: This JIRA is _not_ for fixing the API parity issues.
> We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> *Please use a _separate_ JIRA (linked below as "requires") for this list of 
> to-do items.*



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23106) ML, Graph 2.3 QA: API: Binary incompatible changes

2018-01-25 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340234#comment-16340234
 ] 

Bago Amirbekian edited comment on SPARK-23106 at 1/25/18 10:49 PM:
---

I ran MiMa in branch-2.3 and got the following output:

{code}
[info] spark-graphx: found 0 potential binary incompatibilities while checking 
against org.apache.spark:spark-graphx_2.11:2.2.0 (filtered 1)
[info] spark-streaming-kafka-0-10-assembly: previous-artifact not set, not 
analyzing binary compatibility
[info] spark-streaming-kafka-0-10: found 0 potential binary incompatibilities 
while checking against org.apache.spark:spark-streaming-kafka-0-10_2.11:2.2.0 
(filtered 3)
[info] spark-catalyst: previous-artifact not set, not analyzing binary 
compatibility
[info] spark-streaming: found 0 potential binary incompatibilities while 
checking against org.apache.spark:spark-streaming_2.11:2.2.0 (filtered 25)
[info] spark-sql-kafka-0-10: previous-artifact not set, not analyzing binary 
compatibility
[info] spark-hive: previous-artifact not set, not analyzing binary compatibility
[info] spark-repl: previous-artifact not set, not analyzing binary compatibility
[info] spark-assembly: previous-artifact not set, not analyzing binary 
compatibility
[info] spark-examples: previous-artifact not set, not analyzing binary 
compatibility
[info] spark-core: found 0 potential binary incompatibilities while checking 
against org.apache.spark:spark-core_2.11:2.2.0 (filtered 1112)
[info] spark-mllib: found 0 potential binary incompatibilities while checking 
against org.apache.spark:spark-mllib_2.11:2.2.0 (filtered 143)
{code}

I don't think I can assign this task to myself or change its status.


was (Author: bago.amirbekian):
I ran MiMa in branch-2.3 and got the following output:

 
{code}
[info] spark-graphx: found 0 potential binary incompatibilities while checking 
against org.apache.spark:spark-graphx_2.11:2.2.0 (filtered 1)
[info] spark-streaming-kafka-0-10-assembly: previous-artifact not set, not 
analyzing binary compatibility
[info] spark-streaming-kafka-0-10: found 0 potential binary incompatibilities 
while checking against org.apache.spark:spark-streaming-kafka-0-10_2.11:2.2.0 
(filtered 3)
[info] spark-catalyst: previous-artifact not set, not analyzing binary 
compatibility
[info] spark-streaming: found 0 potential binary incompatibilities while 
checking against org.apache.spark:spark-streaming_2.11:2.2.0 (filtered 25)
[info] spark-sql-kafka-0-10: previous-artifact not set, not analyzing binary 
compatibility
[info] spark-hive: previous-artifact not set, not analyzing binary compatibility
[info] spark-repl: previous-artifact not set, not analyzing binary compatibility
[info] spark-assembly: previous-artifact not set, not analyzing binary 
compatibility
[info] spark-examples: previous-artifact not set, not analyzing binary 
compatibility
[info] spark-core: found 0 potential binary incompatibilities while checking 
against org.apache.spark:spark-core_2.11:2.2.0 (filtered 1112)
[info] spark-mllib: found 0 potential binary incompatibilities while checking 
against org.apache.spark:spark-mllib_2.11:2.2.0 (filtered 143)
{code}

I don't think I can assign this task to myself or change its status.

> ML, Graph 2.3 QA: API: Binary incompatible changes
> --
>
> Key: SPARK-23106
> URL: https://issues.apache.org/jira/browse/SPARK-23106
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23106) ML, Graph 2.3 QA: API: Binary incompatible changes

2018-01-25 Thread Bago Amirbekian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bago Amirbekian resolved SPARK-23106.
-
Resolution: Resolved

> ML, Graph 2.3 QA: API: Binary incompatible changes
> --
>
> Key: SPARK-23106
> URL: https://issues.apache.org/jira/browse/SPARK-23106
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23106) ML, Graph 2.3 QA: API: Binary incompatible changes

2018-01-25 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340234#comment-16340234
 ] 

Bago Amirbekian edited comment on SPARK-23106 at 1/25/18 10:49 PM:
---

I ran MiMa in branch-2.3 and got the following output:

 
{code}
[info] spark-graphx: found 0 potential binary incompatibilities while checking 
against org.apache.spark:spark-graphx_2.11:2.2.0 (filtered 1)
[info] spark-streaming-kafka-0-10-assembly: previous-artifact not set, not 
analyzing binary compatibility
[info] spark-streaming-kafka-0-10: found 0 potential binary incompatibilities 
while checking against org.apache.spark:spark-streaming-kafka-0-10_2.11:2.2.0 
(filtered 3)
[info] spark-catalyst: previous-artifact not set, not analyzing binary 
compatibility
[info] spark-streaming: found 0 potential binary incompatibilities while 
checking against org.apache.spark:spark-streaming_2.11:2.2.0 (filtered 25)
[info] spark-sql-kafka-0-10: previous-artifact not set, not analyzing binary 
compatibility
[info] spark-hive: previous-artifact not set, not analyzing binary compatibility
[info] spark-repl: previous-artifact not set, not analyzing binary compatibility
[info] spark-assembly: previous-artifact not set, not analyzing binary 
compatibility
[info] spark-examples: previous-artifact not set, not analyzing binary 
compatibility
[info] spark-core: found 0 potential binary incompatibilities while checking 
against org.apache.spark:spark-core_2.11:2.2.0 (filtered 1112)
[info] spark-mllib: found 0 potential binary incompatibilities while checking 
against org.apache.spark:spark-mllib_2.11:2.2.0 (filtered 143)
{code}

I don't think I can assign this task to myself or change its status.


was (Author: bago.amirbekian):
I ran MiMa in branch-2.3 and got the following output:
{{
[info] spark-graphx: found 0 potential binary incompatibilities while checking 
against org.apache.spark:spark-graphx_2.11:2.2.0 (filtered 1)
[info] spark-streaming-kafka-0-10-assembly: previous-artifact not set, not 
analyzing binary compatibility
[info] spark-streaming-kafka-0-10: found 0 potential binary incompatibilities 
while checking against org.apache.spark:spark-streaming-kafka-0-10_2.11:2.2.0 
(filtered 3)
[info] spark-catalyst: previous-artifact not set, not analyzing binary 
compatibility
[info] spark-streaming: found 0 potential binary incompatibilities while 
checking against org.apache.spark:spark-streaming_2.11:2.2.0 (filtered 25)
[info] spark-sql-kafka-0-10: previous-artifact not set, not analyzing binary 
compatibility
[info] spark-hive: previous-artifact not set, not analyzing binary compatibility
[info] spark-repl: previous-artifact not set, not analyzing binary compatibility
[info] spark-assembly: previous-artifact not set, not analyzing binary 
compatibility
[info] spark-examples: previous-artifact not set, not analyzing binary 
compatibility
[info] spark-core: found 0 potential binary incompatibilities while checking 
against org.apache.spark:spark-core_2.11:2.2.0 (filtered 1112)
[info] spark-mllib: found 0 potential binary incompatibilities while checking 
against org.apache.spark:spark-mllib_2.11:2.2.0 (filtered 143)
}}

I don't think I can assign this task to myself or change its status.

> ML, Graph 2.3 QA: API: Binary incompatible changes
> --
>
> Key: SPARK-23106
> URL: https://issues.apache.org/jira/browse/SPARK-23106
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-23106) ML, Graph 2.3 QA: API: Binary incompatible changes

2018-01-25 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340234#comment-16340234
 ] 

Bago Amirbekian edited comment on SPARK-23106 at 1/25/18 10:49 PM:
---

I ran MiMa in branch-2.3 and got the following output:

{code}
[info] spark-graphx: found 0 potential binary incompatibilities while checking 
against org.apache.spark:spark-graphx_2.11:2.2.0 (filtered 1)
[info] spark-streaming-kafka-0-10-assembly: previous-artifact not set, not 
analyzing binary compatibility
[info] spark-streaming-kafka-0-10: found 0 potential binary incompatibilities 
while checking against org.apache.spark:spark-streaming-kafka-0-10_2.11:2.2.0 
(filtered 3)
[info] spark-catalyst: previous-artifact not set, not analyzing binary 
compatibility
[info] spark-streaming: found 0 potential binary incompatibilities while 
checking against org.apache.spark:spark-streaming_2.11:2.2.0 (filtered 25)
[info] spark-sql-kafka-0-10: previous-artifact not set, not analyzing binary 
compatibility
[info] spark-hive: previous-artifact not set, not analyzing binary compatibility
[info] spark-repl: previous-artifact not set, not analyzing binary compatibility
[info] spark-assembly: previous-artifact not set, not analyzing binary 
compatibility
[info] spark-examples: previous-artifact not set, not analyzing binary 
compatibility
[info] spark-core: found 0 potential binary incompatibilities while checking 
against org.apache.spark:spark-core_2.11:2.2.0 (filtered 1112)
[info] spark-mllib: found 0 potential binary incompatibilities while checking 
against org.apache.spark:spark-mllib_2.11:2.2.0 (filtered 143)
{code}

I don't think I can assign this task to myself.


was (Author: bago.amirbekian):
I ran MiMa in branch-2.3 and got the following output:

{code}
[info] spark-graphx: found 0 potential binary incompatibilities while checking 
against org.apache.spark:spark-graphx_2.11:2.2.0 (filtered 1)
[info] spark-streaming-kafka-0-10-assembly: previous-artifact not set, not 
analyzing binary compatibility
[info] spark-streaming-kafka-0-10: found 0 potential binary incompatibilities 
while checking against org.apache.spark:spark-streaming-kafka-0-10_2.11:2.2.0 
(filtered 3)
[info] spark-catalyst: previous-artifact not set, not analyzing binary 
compatibility
[info] spark-streaming: found 0 potential binary incompatibilities while 
checking against org.apache.spark:spark-streaming_2.11:2.2.0 (filtered 25)
[info] spark-sql-kafka-0-10: previous-artifact not set, not analyzing binary 
compatibility
[info] spark-hive: previous-artifact not set, not analyzing binary compatibility
[info] spark-repl: previous-artifact not set, not analyzing binary compatibility
[info] spark-assembly: previous-artifact not set, not analyzing binary 
compatibility
[info] spark-examples: previous-artifact not set, not analyzing binary 
compatibility
[info] spark-core: found 0 potential binary incompatibilities while checking 
against org.apache.spark:spark-core_2.11:2.2.0 (filtered 1112)
[info] spark-mllib: found 0 potential binary incompatibilities while checking 
against org.apache.spark:spark-mllib_2.11:2.2.0 (filtered 143)
{code}

I don't think I can assign this task to myself or change its status.

> ML, Graph 2.3 QA: API: Binary incompatible changes
> --
>
> Key: SPARK-23106
> URL: https://issues.apache.org/jira/browse/SPARK-23106
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23106) ML, Graph 2.3 QA: API: Binary incompatible changes

2018-01-25 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340234#comment-16340234
 ] 

Bago Amirbekian commented on SPARK-23106:
-

I ran MiMa in branch-2.3 and got the following output:
{{
[info] spark-graphx: found 0 potential binary incompatibilities while checking 
against org.apache.spark:spark-graphx_2.11:2.2.0 (filtered 1)
[info] spark-streaming-kafka-0-10-assembly: previous-artifact not set, not 
analyzing binary compatibility
[info] spark-streaming-kafka-0-10: found 0 potential binary incompatibilities 
while checking against org.apache.spark:spark-streaming-kafka-0-10_2.11:2.2.0 
(filtered 3)
[info] spark-catalyst: previous-artifact not set, not analyzing binary 
compatibility
[info] spark-streaming: found 0 potential binary incompatibilities while 
checking against org.apache.spark:spark-streaming_2.11:2.2.0 (filtered 25)
[info] spark-sql-kafka-0-10: previous-artifact not set, not analyzing binary 
compatibility
[info] spark-hive: previous-artifact not set, not analyzing binary compatibility
[info] spark-repl: previous-artifact not set, not analyzing binary compatibility
[info] spark-assembly: previous-artifact not set, not analyzing binary 
compatibility
[info] spark-examples: previous-artifact not set, not analyzing binary 
compatibility
[info] spark-core: found 0 potential binary incompatibilities while checking 
against org.apache.spark:spark-core_2.11:2.2.0 (filtered 1112)
[info] spark-mllib: found 0 potential binary incompatibilities while checking 
against org.apache.spark:spark-mllib_2.11:2.2.0 (filtered 143)
}}

I don't think I can assign this task to myself or change its status.

> ML, Graph 2.3 QA: API: Binary incompatible changes
> --
>
> Key: SPARK-23106
> URL: https://issues.apache.org/jira/browse/SPARK-23106
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23048) Update mllib docs to replace OneHotEncoder with OneHotEncoderEstimator

2018-01-11 Thread Bago Amirbekian (JIRA)
Bago Amirbekian created SPARK-23048:
---

 Summary: Update mllib docs to replace OneHotEncoder with 
OneHotEncoderEstimator 
 Key: SPARK-23048
 URL: https://issues.apache.org/jira/browse/SPARK-23048
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Affects Versions: 2.3.0
Reporter: Bago Amirbekian


Since we're deprecating OneHotEncoder, we should update the docs to reference 
its replacement, OneHotEncoderEstimator.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23046) Have RFormula include VectorSizeHint in pipeline

2018-01-11 Thread Bago Amirbekian (JIRA)
Bago Amirbekian created SPARK-23046:
---

 Summary: Have RFormula include VectorSizeHint in pipeline
 Key: SPARK-23046
 URL: https://issues.apache.org/jira/browse/SPARK-23046
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 2.3.0
Reporter: Bago Amirbekian






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23045) Have RFormula use OneHotEncoderEstimator

2018-01-11 Thread Bago Amirbekian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bago Amirbekian updated SPARK-23045:

Summary: Have RFormula use OneHotEncoderEstimator  (was: Have RFormula use 
OneHotEstimator)

> Have RFormula use OneHotEncoderEstimator
> ---
>
> Key: SPARK-23045
> URL: https://issues.apache.org/jira/browse/SPARK-23045
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23037) RFormula should not use deprecated OneHotEncoder and should include VectorSizeHint in pipeline

2018-01-11 Thread Bago Amirbekian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23037?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bago Amirbekian updated SPARK-23037:

Affects Version/s: (was: 2.2.0)
   2.3.0

> RFormula should not use deprecated OneHotEncoder and should include 
> VectorSizeHint in pipeline
> --
>
> Key: SPARK-23037
> URL: https://issues.apache.org/jira/browse/SPARK-23037
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23045) Have RFormula use OneHotEstimator

2018-01-11 Thread Bago Amirbekian (JIRA)
Bago Amirbekian created SPARK-23045:
---

 Summary: Have RFormula use OneHotEstimator
 Key: SPARK-23045
 URL: https://issues.apache.org/jira/browse/SPARK-23045
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 2.3.0
Reporter: Bago Amirbekian






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23037) RFormula should not use deprecated OneHotEncoder and should include VectorSizeHint in pipeline

2018-01-10 Thread Bago Amirbekian (JIRA)
Bago Amirbekian created SPARK-23037:
---

 Summary: RFormula should not use deprecated OneHotEncoder and 
should include VectorSizeHint in pipeline
 Key: SPARK-23037
 URL: https://issues.apache.org/jira/browse/SPARK-23037
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.2.0
Reporter: Bago Amirbekian






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22126) Fix model-specific optimization support for ML tuning

2018-01-05 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16313958#comment-16313958
 ] 

Bago Amirbekian edited comment on SPARK-22126 at 1/5/18 9:48 PM:
-

> Do you think it's possible to put this kind of execution in fitMultiple and 
> allow CV to parallelize work for the stages?

Yes, absolutely. The iterator can maintain a queue of tasks. Each call to 
`next` will pick a task off the queue, optionally add more tasks to the queue 
and return a single model instance. Since models can be returned in any order, 
the tasks can be organized however is needed to optimally finish the work. If 
the queue is empty (but the iterator isn't finished), `next` can simply wait 
for a previous task to finish and put more tasks on the queue. The iterator 
approach is very flexible.

> With my PR, I would do this by having the Pipeline estimator return all 
> params in getOptimizedParams.

The issue here is that when you call `fit(dataset, paramMaps)` you've now fixed 
the order in which you want the models returned. For my purposes I don't see 
much of a difference between `Seq[(Int, Model)]` and `Iterator[(Integer, 
Model)]`. The key difference for me between `fitMultiple(..., paramMaps): 
Lazy[(Int, Model)]` and `fit(..., paramMaps): Lazy[Model]` is the flexibility 
to produce the models in arbitrary order.
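
A minimal, single-threaded sketch of the queue-backed iterator idea 
(illustrative only, not an actual implementation; the task type and the 
{{numModels}} bookkeeping are assumptions):

{code}
// Illustrative sketch: an iterator over (paramMap index, model) pairs driven by
// a work queue. Running a task returns one model and may enqueue follow-up
// tasks, so models can be produced in whatever order is most efficient.
import scala.collection.mutable

def queueBackedIterator[M](
    numModels: Int,
    initialTasks: Seq[mutable.Queue[() => (Int, M)] => (Int, M)]): Iterator[(Int, M)] =
  new Iterator[(Int, M)] {
    private val queue = mutable.Queue[() => (Int, M)]()
    // Bind each initial task to the shared queue so it can enqueue follow-up work.
    initialTasks.foreach(task => queue.enqueue(() => task(queue)))
    private var produced = 0

    override def hasNext: Boolean = produced < numModels
    override def next(): (Int, M) = {
      val result = queue.dequeue()()  // run one task; it may push more tasks onto the queue
      produced += 1
      result
    }
  }
{code}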


was (Author: bago.amirbekian):
> Do you think it's possible to put this kind of execution in fitMultiple and 
> allow CV to parallelize work for the stages?

Yes, absolutely. The iterator can maintain a queue of tasks. Each call to 
`next` will pick a task off the queue, optionally add more tasks to the queue 
and return a single model instance. Since models can be returned in any order, 
the tasks can be organized however is needed to optimally finish the work. If 
the queue is empty, `next` can simply wait for a previous task to finish and 
put more tasks on the queue. The iterator approach is very flexible.

> With my PR, I would do this by having the Pipeline estimator return all 
> params in getOptimizedParams.

The issue here is that when you call `fit(dataset, paramMaps)` you've now fixed 
the order in which you want the models returned. For my purposes I don't see 
much of a difference between `Seq[(Int, Model)]` and `Iterator[(Integer, 
Model)]`. The key difference for me between `fitMultiple(..., paramMaps): 
Lazy[(Int, Model)]` and `fit(..., paramMaps): Lazy[Model]` is the flexibility 
to produce the models in arbitrary order.

> Fix model-specific optimization support for ML tuning
> -
>
> Key: SPARK-22126
> URL: https://issues.apache.org/jira/browse/SPARK-22126
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>
> Fix model-specific optimization support for ML tuning. This is discussed in 
> SPARK-19357
> more discussion is here
>  https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0
> Anyone who's following might want to scan the design doc (in the links 
> above), the latest api proposal is:
> {code}
> def fitMultiple(
> dataset: Dataset[_],
> paramMaps: Array[ParamMap]
>   ): java.util.Iterator[scala.Tuple2[java.lang.Integer, Model]]
> {code}
> Old discussion:
> I copy discussion from gist to here:
> I propose to design API as:
> {code}
> def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]): 
> Array[Callable[Map[Int, M]]]
> {code}
> Let me use an example to explain the API:
> {quote}
>  It could be possible to still use the current parallelism and still allow 
> for model-specific optimizations. For example, if we doing cross validation 
> and have a param map with regParam = (0.1, 0.3) and maxIter = (5, 10). Lets 
> say that the cross validator could know that maxIter is optimized for the 
> model being evaluated (e.g. a new method in Estimator that return such 
> params). It would then be straightforward for the cross validator to remove 
> maxIter from the param map that will be parallelized over and use it to 
> create 2 arrays of paramMaps: ((regParam=0.1, maxIter=5), (regParam=0.1, 
> maxIter=10)) and ((regParam=0.3, maxIter=5), (regParam=0.3, maxIter=10)).
> {quote}
> In this example, we can see that, models computed from ((regParam=0.1, 
> maxIter=5), (regParam=0.1, maxIter=10)) can only be computed in one thread 
> code, models computed from ((regParam=0.3, maxIter=5), (regParam=0.3, 
> maxIter=10))  in another thread. In this example, there're 4 paramMaps, but 
> we can at most generate two threads to compute the models for them.
> The API above allow "callable.call()" to return multiple models, and return 
> type is {code}Map[Int, M]{code}, key is integer, used to mark the paramMap 
> index for corresponding model. Use the example a

[jira] [Commented] (SPARK-22126) Fix model-specific optimization support for ML tuning

2018-01-05 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16313958#comment-16313958
 ] 

Bago Amirbekian commented on SPARK-22126:
-

> Do you think it's possible to put this kind of execution in fitMultiple and 
> allow CV to parallelize work for the stages?

Yes, absolutely. The iterator can maintain a queue of tasks. Each call to 
`next` will pick a task off the queue, optionally add more tasks to the queue 
and return a single model instance. Since models can be returned in any order, 
the tasks can be organized however is needed to optimally finish the work. If 
the queue is empty, `next` can simply wait for a previous task to finish and 
put more tasks on the queue. The iterator approach is very flexible.

> With my PR, I would do this by having the Pipeline estimator return all 
> params in getOptimizedParams.

The issue here is that when you call `fit(dataset, paramMaps)` you've now fixed 
the order in which you want the models returned. For my purposes I don't see 
much of a difference between `Seq[(Int, Model)]` and `Iterator[(Integer, 
Model)]`. The key difference for me between `fitMultiple(..., paramMaps): 
Lazy[(Int, Model)]` and `fit(..., paramMaps): Lazy[Model]` is the flexibility 
to produce the models in arbitrary order.

> Fix model-specific optimization support for ML tuning
> -
>
> Key: SPARK-22126
> URL: https://issues.apache.org/jira/browse/SPARK-22126
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>
> Fix model-specific optimization support for ML tuning. This is discussed in 
> SPARK-19357
> more discussion is here
>  https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0
> Anyone who's following might want to scan the design doc (in the links 
> above), the latest api proposal is:
> {code}
> def fitMultiple(
> dataset: Dataset[_],
> paramMaps: Array[ParamMap]
>   ): java.util.Iterator[scala.Tuple2[java.lang.Integer, Model]]
> {code}
> Old discussion:
> I copy discussion from gist to here:
> I propose to design API as:
> {code}
> def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]): 
> Array[Callable[Map[Int, M]]]
> {code}
> Let me use an example to explain the API:
> {quote}
>  It could be possible to still use the current parallelism and still allow 
> for model-specific optimizations. For example, if we doing cross validation 
> and have a param map with regParam = (0.1, 0.3) and maxIter = (5, 10). Lets 
> say that the cross validator could know that maxIter is optimized for the 
> model being evaluated (e.g. a new method in Estimator that return such 
> params). It would then be straightforward for the cross validator to remove 
> maxIter from the param map that will be parallelized over and use it to 
> create 2 arrays of paramMaps: ((regParam=0.1, maxIter=5), (regParam=0.1, 
> maxIter=10)) and ((regParam=0.3, maxIter=5), (regParam=0.3, maxIter=10)).
> {quote}
> In this example, we can see that, models computed from ((regParam=0.1, 
> maxIter=5), (regParam=0.1, maxIter=10)) can only be computed in one thread 
> code, models computed from ((regParam=0.3, maxIter=5), (regParam=0.3, 
> maxIter=10))  in another thread. In this example, there're 4 paramMaps, but 
> we can at most generate two threads to compute the models for them.
> The API above allow "callable.call()" to return multiple models, and return 
> type is {code}Map[Int, M]{code}, key is integer, used to mark the paramMap 
> index for corresponding model. Use the example above, there're 4 paramMaps, 
> but only return 2 callable objects, one callable object for ((regParam=0.1, 
> maxIter=5), (regParam=0.1, maxIter=10)), another one for ((regParam=0.3, 
> maxIter=5), (regParam=0.3, maxIter=10)).
> and the default "fitCallables/fit with paramMaps" can be implemented as 
> following:
> {code}
> def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]):
> Array[Callable[Map[Int, M]]] = {
>   paramMaps.zipWithIndex.map { case (paramMap: ParamMap, index: Int) =>
> new Callable[Map[Int, M]] {
>   override def call(): Map[Int, M] = Map(index -> fit(dataset, paramMap))
> }
>   }
> }
> def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = {
>fitCallables(dataset, paramMaps).map { _.call().toSeq }
>  .flatMap(_).sortBy(_._1).map(_._2)
> }
> {code}
> If use the API I proposed above, the code in 
> [CrossValidation|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala#L149-L159]
> can be changed to:
> {code}
>   val trainingDataset = sparkSession.createDataFrame(training, 
> schema).cache()
>   val validationDataset = sparkSession.createDataFrame(validation, 
> schema).cache()
>   /

[jira] [Commented] (SPARK-22126) Fix model-specific optimization support for ML tuning

2018-01-04 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16312264#comment-16312264
 ] 

Bago Amirbekian commented on SPARK-22126:
-

[~bryanc] thanks for taking the time to put together the PR and share your 
thoughts. I like the idea of being able to preserve the existing APIs and not 
needing to add a new fitMultiple API, but I'm concerned the existing APIs 
aren't quite flexible enough.

One of the use cases that motivated the {{ fitMultiple }} API was optimizing 
the Pipeline Estimator. The Pipeline Estimator seems like an important one to 
optimize because I believe it's required in order for CrossValidator to be able 
to exploit optimized implementations of the {{ fit }}/{{ fitMultiple }} methods 
of Pipeline stages.

The way one would optimize the Pipeline Estimator is to group the paramMaps 
into a tree structure where each level represents a stage with a param that can 
take multiple values. One would then traverse the tree in depth first order. 
Notice that in this case the params need not be estimator params, but could 
actually be transformer params as well since we can avoid applying expensive 
transformers multiple times.

With this approach all the params for a pipeline estimator after the top level 
of the tree are "optimizable", so simply grouping on optimizable params isn't 
sufficient; we need to actually order the paramMaps to match the depth-first 
traversal of the stages tree.

I'm still thinking through all this in my head so let me know if any of it is 
off base or not clear, but I think the advantage of the {{ fitMultiple }} 
approach is that it gives us full flexibility to do these kinds of 
optimizations.
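
As a rough sketch (illustrative only, not actual Spark code), ordering the 
paramMaps this way amounts to grouping them by the value of the expensive 
upstream-stage param and walking the groups depth first:

{code}
// Hedged sketch: reorder paramMaps so that maps agreeing on an "expensive"
// upstream-stage param end up adjacent, i.e. a depth-first walk of the
// (expensive param -> remaining params) tree.
import org.apache.spark.ml.param.{Param, ParamMap}

def depthFirstOrder[T](paramMaps: Array[ParamMap], expensiveParam: Param[T]): Array[ParamMap] =
  paramMaps.groupBy(_.get(expensiveParam)).values.flatten.toArray
{code}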

> Fix model-specific optimization support for ML tuning
> -
>
> Key: SPARK-22126
> URL: https://issues.apache.org/jira/browse/SPARK-22126
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>
> Fix model-specific optimization support for ML tuning. This is discussed in 
> SPARK-19357
> more discussion is here
>  https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0
> Anyone who's following might want to scan the design doc (in the links 
> above), the latest api proposal is:
> {code}
> def fitMultiple(
> dataset: Dataset[_],
> paramMaps: Array[ParamMap]
>   ): java.util.Iterator[scala.Tuple2[java.lang.Integer, Model]]
> {code}
> Old discussion:
> I copy discussion from gist to here:
> I propose to design API as:
> {code}
> def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]): 
> Array[Callable[Map[Int, M]]]
> {code}
> Let me use an example to explain the API:
> {quote}
>  It could be possible to still use the current parallelism and still allow 
> for model-specific optimizations. For example, if we doing cross validation 
> and have a param map with regParam = (0.1, 0.3) and maxIter = (5, 10). Lets 
> say that the cross validator could know that maxIter is optimized for the 
> model being evaluated (e.g. a new method in Estimator that return such 
> params). It would then be straightforward for the cross validator to remove 
> maxIter from the param map that will be parallelized over and use it to 
> create 2 arrays of paramMaps: ((regParam=0.1, maxIter=5), (regParam=0.1, 
> maxIter=10)) and ((regParam=0.3, maxIter=5), (regParam=0.3, maxIter=10)).
> {quote}
> In this example, we can see that, models computed from ((regParam=0.1, 
> maxIter=5), (regParam=0.1, maxIter=10)) can only be computed in one thread 
> code, models computed from ((regParam=0.3, maxIter=5), (regParam=0.3, 
> maxIter=10))  in another thread. In this example, there're 4 paramMaps, but 
> we can at most generate two threads to compute the models for them.
> The API above allow "callable.call()" to return multiple models, and return 
> type is {code}Map[Int, M]{code}, key is integer, used to mark the paramMap 
> index for corresponding model. Use the example above, there're 4 paramMaps, 
> but only return 2 callable objects, one callable object for ((regParam=0.1, 
> maxIter=5), (regParam=0.1, maxIter=10)), another one for ((regParam=0.3, 
> maxIter=5), (regParam=0.3, maxIter=10)).
> and the default "fitCallables/fit with paramMaps" can be implemented as 
> following:
> {code}
> def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]):
> Array[Callable[Map[Int, M]]] = {
>   paramMaps.zipWithIndex.map { case (paramMap: ParamMap, index: Int) =>
> new Callable[Map[Int, M]] {
>   override def call(): Map[Int, M] = Map(index -> fit(dataset, paramMap))
> }
>   }
> }
> def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = {
>fitCallables(dataset, paramMaps).map { _.call().toSeq }
>  .flatMap(_).sortBy(_._1).map(_._2)
> }
> {code}
> If use the 

[jira] [Created] (SPARK-22949) Reduce memory requirement for TrainValidationSplit

2018-01-03 Thread Bago Amirbekian (JIRA)
Bago Amirbekian created SPARK-22949:
---

 Summary: Reduce memory requirement for TrainValidationSplit
 Key: SPARK-22949
 URL: https://issues.apache.org/jira/browse/SPARK-22949
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.3.0
Reporter: Bago Amirbekian
Priority: Critical


There was a fix in {{ CrossValidator }} to reduce memory usage on the driver; 
the same patch should be applied to {{ TrainValidationSplit }}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22922) Python API for fitMultiple

2017-12-28 Thread Bago Amirbekian (JIRA)
Bago Amirbekian created SPARK-22922:
---

 Summary: Python API for fitMultiple
 Key: SPARK-22922
 URL: https://issues.apache.org/jira/browse/SPARK-22922
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 2.2.0
Reporter: Bago Amirbekian


Implement fitMultiple method on Estimator for pyspark.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22126) Fix model-specific optimization support for ML tuning

2017-12-18 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16295964#comment-16295964
 ] 

Bago Amirbekian edited comment on SPARK-22126 at 12/19/17 12:55 AM:


Anyone who's following might want to scan the design doc (in the links above); 
the latest API proposal is:

{code}
def fitMultiple(
dataset: Dataset[_],
paramMaps: Array[ParamMap]
  ): java.util.Iterator[scala.Tuple2[java.lang.Integer, Model]]

{code}


was (Author: bago.amirbekian):
Anyone who's following might want to scan the design doc; the latest API 
proposal is:

{code}
def fitMultiple(
dataset: Dataset[_],
paramMaps: Array[ParamMap]
  ): java.util.Iterator[scala.Tuple2[java.lang.Integer, Model]]

{code}

> Fix model-specific optimization support for ML tuning
> -
>
> Key: SPARK-22126
> URL: https://issues.apache.org/jira/browse/SPARK-22126
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>
> Fix model-specific optimization support for ML tuning. This is discussed in 
> SPARK-19357
> more discussion is here
>  https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0
> I copy discussion from gist to here:
> I propose to design API as:
> {code}
> def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]): 
> Array[Callable[Map[Int, M]]]
> {code}
> Let me use an example to explain the API:
> {quote}
>  It could be possible to still use the current parallelism and still allow 
> for model-specific optimizations. For example, if we doing cross validation 
> and have a param map with regParam = (0.1, 0.3) and maxIter = (5, 10). Lets 
> say that the cross validator could know that maxIter is optimized for the 
> model being evaluated (e.g. a new method in Estimator that return such 
> params). It would then be straightforward for the cross validator to remove 
> maxIter from the param map that will be parallelized over and use it to 
> create 2 arrays of paramMaps: ((regParam=0.1, maxIter=5), (regParam=0.1, 
> maxIter=10)) and ((regParam=0.3, maxIter=5), (regParam=0.3, maxIter=10)).
> {quote}
> In this example, we can see that, models computed from ((regParam=0.1, 
> maxIter=5), (regParam=0.1, maxIter=10)) can only be computed in one thread 
> code, models computed from ((regParam=0.3, maxIter=5), (regParam=0.3, 
> maxIter=10))  in another thread. In this example, there're 4 paramMaps, but 
> we can at most generate two threads to compute the models for them.
> The API above allow "callable.call()" to return multiple models, and return 
> type is {code}Map[Int, M]{code}, key is integer, used to mark the paramMap 
> index for corresponding model. Use the example above, there're 4 paramMaps, 
> but only return 2 callable objects, one callable object for ((regParam=0.1, 
> maxIter=5), (regParam=0.1, maxIter=10)), another one for ((regParam=0.3, 
> maxIter=5), (regParam=0.3, maxIter=10)).
> and the default "fitCallables/fit with paramMaps" can be implemented as 
> following:
> {code}
> def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]):
> Array[Callable[Map[Int, M]]] = {
>   paramMaps.zipWithIndex.map { case (paramMap: ParamMap, index: Int) =>
> new Callable[Map[Int, M]] {
>   override def call(): Map[Int, M] = Map(index -> fit(dataset, paramMap))
> }
>   }
> }
> def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = {
>fitCallables(dataset, paramMaps).map { _.call().toSeq }
>  .flatMap(_).sortBy(_._1).map(_._2)
> }
> {code}
> If use the API I proposed above, the code in 
> [CrossValidation|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala#L149-L159]
> can be changed to:
> {code}
>   val trainingDataset = sparkSession.createDataFrame(training, 
> schema).cache()
>   val validationDataset = sparkSession.createDataFrame(validation, 
> schema).cache()
>   // Fit models in a Future for training in parallel
>   val modelMapFutures = fitCallables(trainingDataset, paramMaps).map { 
> callable =>
>  Future[Map[Int, Model[_]]] {
> val modelMap = callable.call()
> if (collectSubModelsParam) {
>...
> }
> modelMap
>  } (executionContext)
>   }
>   // Unpersist training data only when all models have trained
>   Future.sequence[Model[_], Iterable](modelMapFutures)(implicitly, 
> executionContext)
> .onComplete { _ => trainingDataset.unpersist() } (executionContext)
>   // Evaluate models in a Future that will calulate a metric and allow 
> model to be cleaned up
>   val foldMetricMapFutures = modelMapFutures.map { modelMapFuture =>
> modelMapFuture.map { mode

[jira] [Commented] (SPARK-22126) Fix model-specific optimization support for ML tuning

2017-12-18 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16295964#comment-16295964
 ] 

Bago Amirbekian commented on SPARK-22126:
-

Anyone who's following might want to scan the design doc; the latest API 
proposal is:

{code}
def fitMultiple(
dataset: Dataset[_],
paramMaps: Array[ParamMap]
  ): java.util.Iterator[scala.Tuple2[java.lang.Integer, Model]]

{code}

> Fix model-specific optimization support for ML tuning
> -
>
> Key: SPARK-22126
> URL: https://issues.apache.org/jira/browse/SPARK-22126
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>
> Fix model-specific optimization support for ML tuning. This is discussed in 
> SPARK-19357
> more discussion is here
>  https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0
> I copy discussion from gist to here:
> I propose to design API as:
> {code}
> def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]): 
> Array[Callable[Map[Int, M]]]
> {code}
> Let me use an example to explain the API:
> {quote}
>  It could be possible to still use the current parallelism and still allow 
> for model-specific optimizations. For example, if we doing cross validation 
> and have a param map with regParam = (0.1, 0.3) and maxIter = (5, 10). Lets 
> say that the cross validator could know that maxIter is optimized for the 
> model being evaluated (e.g. a new method in Estimator that return such 
> params). It would then be straightforward for the cross validator to remove 
> maxIter from the param map that will be parallelized over and use it to 
> create 2 arrays of paramMaps: ((regParam=0.1, maxIter=5), (regParam=0.1, 
> maxIter=10)) and ((regParam=0.3, maxIter=5), (regParam=0.3, maxIter=10)).
> {quote}
> In this example, we can see that, models computed from ((regParam=0.1, 
> maxIter=5), (regParam=0.1, maxIter=10)) can only be computed in one thread 
> code, models computed from ((regParam=0.3, maxIter=5), (regParam=0.3, 
> maxIter=10))  in another thread. In this example, there're 4 paramMaps, but 
> we can at most generate two threads to compute the models for them.
> The API above allow "callable.call()" to return multiple models, and return 
> type is {code}Map[Int, M]{code}, key is integer, used to mark the paramMap 
> index for corresponding model. Use the example above, there're 4 paramMaps, 
> but only return 2 callable objects, one callable object for ((regParam=0.1, 
> maxIter=5), (regParam=0.1, maxIter=10)), another one for ((regParam=0.3, 
> maxIter=5), (regParam=0.3, maxIter=10)).
> and the default "fitCallables/fit with paramMaps" can be implemented as 
> following:
> {code}
> def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]):
> Array[Callable[Map[Int, M]]] = {
>   paramMaps.zipWithIndex.map { case (paramMap: ParamMap, index: Int) =>
> new Callable[Map[Int, M]] {
>   override def call(): Map[Int, M] = Map(index -> fit(dataset, paramMap))
> }
>   }
> }
> def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = {
>fitCallables(dataset, paramMaps).map { _.call().toSeq }
>  .flatMap(_).sortBy(_._1).map(_._2)
> }
> {code}
> If use the API I proposed above, the code in 
> [CrossValidation|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala#L149-L159]
> can be changed to:
> {code}
>   val trainingDataset = sparkSession.createDataFrame(training, 
> schema).cache()
>   val validationDataset = sparkSession.createDataFrame(validation, 
> schema).cache()
>   // Fit models in a Future for training in parallel
>   val modelMapFutures = fitCallables(trainingDataset, paramMaps).map { 
> callable =>
>  Future[Map[Int, Model[_]]] {
> val modelMap = callable.call()
> if (collectSubModelsParam) {
>...
> }
> modelMap
>  } (executionContext)
>   }
>   // Unpersist training data only when all models have trained
>   Future.sequence[Model[_], Iterable](modelMapFutures)(implicitly, 
> executionContext)
> .onComplete { _ => trainingDataset.unpersist() } (executionContext)
>   // Evaluate models in a Future that will calulate a metric and allow 
> model to be cleaned up
>   val foldMetricMapFutures = modelMapFutures.map { modelMapFuture =>
> modelMapFuture.map { modelMap =>
>   modelMap.map { case (index: Int, model: Model[_]) =>
> val metric = eval.evaluate(model.transform(validationDataset, 
> paramMaps(index)))
> (index, metric)
>   }
> } (executionContext)
>   }
>   // Wait for metrics to be calculated before unpersisting validation 
> dataset
>   

[jira] [Updated] (SPARK-22811) pyspark.ml.tests is missing a py4j import.

2017-12-15 Thread Bago Amirbekian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bago Amirbekian updated SPARK-22811:

Priority: Minor  (was: Major)

> pyspark.ml.tests is missing a py4j import.
> --
>
> Key: SPARK-22811
> URL: https://issues.apache.org/jira/browse/SPARK-22811
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Priority: Minor
>
> This bug isn't getting caught because the relevant code only gets run if the 
> test environment does not have Hive.
> https://github.com/apache/spark/blob/46776234a49742e94c64897322500582d7393d35/python/pyspark/ml/tests.py#L1866



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22811) pyspark.ml.tests is missing a py4j import.

2017-12-15 Thread Bago Amirbekian (JIRA)
Bago Amirbekian created SPARK-22811:
---

 Summary: pyspark.ml.tests is missing a py4j import.
 Key: SPARK-22811
 URL: https://issues.apache.org/jira/browse/SPARK-22811
 Project: Spark
  Issue Type: Bug
  Components: ML, PySpark
Affects Versions: 2.3.0
Reporter: Bago Amirbekian


This bug isn't getting caught because the relevant code only gets run if the 
test environment does not have Hive.

https://github.com/apache/spark/blob/46776234a49742e94c64897322500582d7393d35/python/pyspark/ml/tests.py#L1866



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22126) Fix model-specific optimization support for ML tuning

2017-12-07 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16283145#comment-16283145
 ] 

Bago Amirbekian commented on SPARK-22126:
-

Joseph, the way I read your comment is to say that we should support 
parallelism and optimized models, but not parallelism combined with optimized 
models. I think that would cover our current use cases, but I'm wondering if we 
want to leave open the possibility of optimizing parameters like maxIter & 
maxDepth and having those optimized implementations play nice with parallelism 
in CrossValidator.

I normally believe in doing the simple thing first and then changing it if 
needed, but that would require adding another public API later.

> Fix model-specific optimization support for ML tuning
> -
>
> Key: SPARK-22126
> URL: https://issues.apache.org/jira/browse/SPARK-22126
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>
> Fix model-specific optimization support for ML tuning. This is discussed in 
> SPARK-19357
> more discussion is here
>  https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0
> I copy discussion from gist to here:
> I propose to design API as:
> {code}
> def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]): 
> Array[Callable[Map[Int, M]]]
> {code}
> Let me use an example to explain the API:
> {quote}
>  It could be possible to still use the current parallelism and still allow 
> for model-specific optimizations. For example, if we doing cross validation 
> and have a param map with regParam = (0.1, 0.3) and maxIter = (5, 10). Lets 
> say that the cross validator could know that maxIter is optimized for the 
> model being evaluated (e.g. a new method in Estimator that return such 
> params). It would then be straightforward for the cross validator to remove 
> maxIter from the param map that will be parallelized over and use it to 
> create 2 arrays of paramMaps: ((regParam=0.1, maxIter=5), (regParam=0.1, 
> maxIter=10)) and ((regParam=0.3, maxIter=5), (regParam=0.3, maxIter=10)).
> {quote}
> In this example, we can see that, models computed from ((regParam=0.1, 
> maxIter=5), (regParam=0.1, maxIter=10)) can only be computed in one thread 
> code, models computed from ((regParam=0.3, maxIter=5), (regParam=0.3, 
> maxIter=10))  in another thread. In this example, there're 4 paramMaps, but 
> we can at most generate two threads to compute the models for them.
> The API above allow "callable.call()" to return multiple models, and return 
> type is {code}Map[Int, M]{code}, key is integer, used to mark the paramMap 
> index for corresponding model. Use the example above, there're 4 paramMaps, 
> but only return 2 callable objects, one callable object for ((regParam=0.1, 
> maxIter=5), (regParam=0.1, maxIter=10)), another one for ((regParam=0.3, 
> maxIter=5), (regParam=0.3, maxIter=10)).
> and the default "fitCallables/fit with paramMaps" can be implemented as 
> following:
> {code}
> def fitCallables(dataset: Dataset[_], paramMaps: Array[ParamMap]):
> Array[Callable[Map[Int, M]]] = {
>   paramMaps.zipWithIndex.map { case (paramMap: ParamMap, index: Int) =>
> new Callable[Map[Int, M]] {
>   override def call(): Map[Int, M] = Map(index -> fit(dataset, paramMap))
> }
>   }
> }
> def fit(dataset: Dataset[_], paramMaps: Array[ParamMap]): Seq[M] = {
>fitCallables(dataset, paramMaps).map { _.call().toSeq }
>  .flatMap(_).sortBy(_._1).map(_._2)
> }
> {code}
> If use the API I proposed above, the code in 
> [CrossValidation|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tuning/CrossValidator.scala#L149-L159]
> can be changed to:
> {code}
>   val trainingDataset = sparkSession.createDataFrame(training, 
> schema).cache()
>   val validationDataset = sparkSession.createDataFrame(validation, 
> schema).cache()
>   // Fit models in a Future for training in parallel
>   val modelMapFutures = fitCallables(trainingDataset, paramMaps).map { 
> callable =>
>  Future[Map[Int, Model[_]]] {
> val modelMap = callable.call()
> if (collectSubModelsParam) {
>...
> }
> modelMap
>  } (executionContext)
>   }
>   // Unpersist training data only when all models have trained
>   Future.sequence[Model[_], Iterable](modelMapFutures)(implicitly, 
> executionContext)
> .onComplete { _ => trainingDataset.unpersist() } (executionContext)
>   // Evaluate models in a Future that will calulate a metric and allow 
> model to be cleaned up
>   val foldMetricMapFutures = modelMapFutures.map { modelMapFuture =>
> modelMapFuture.map { modelMap =>
>   modelMap.map { case (index: Int, model: Mo

[jira] [Created] (SPARK-22734) Create Python API for VectorSizeHint

2017-12-07 Thread Bago Amirbekian (JIRA)
Bago Amirbekian created SPARK-22734:
---

 Summary: Create Python API for VectorSizeHint
 Key: SPARK-22734
 URL: https://issues.apache.org/jira/browse/SPARK-22734
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 2.2.0
Reporter: Bago Amirbekian






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22735) Add VectorSizeHint to ML features documentation

2017-12-07 Thread Bago Amirbekian (JIRA)
Bago Amirbekian created SPARK-22735:
---

 Summary: Add VectorSizeHint to ML features documentation
 Key: SPARK-22735
 URL: https://issues.apache.org/jira/browse/SPARK-22735
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.2.0
Reporter: Bago Amirbekian






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22126) Fix model-specific optimization support for ML tuning

2017-12-05 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16279537#comment-16279537
 ] 

Bago Amirbekian edited comment on SPARK-22126 at 12/6/17 2:19 AM:
--

[~WeichenXu123] Sorry I misunderstood, I thought you wanted to use 
java.util.concurrent.Callable. Making our own trait makes sense.

https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/Callable.html


was (Author: bago.amirbekian):
[~WeichenXu123] Sorry I misunderstood, I thought you wanted to use 
java.util.concurrent.Callable.

https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/Callable.html

> Fix model-specific optimization support for ML tuning
> -
>
> Key: SPARK-22126
> URL: https://issues.apache.org/jira/browse/SPARK-22126
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>
> Fix model-specific optimization support for ML tuning. This is discussed in 
> SPARK-19357
> more discussion is here
>  https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22126) Fix model-specific optimization support for ML tuning

2017-12-05 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16279537#comment-16279537
 ] 

Bago Amirbekian commented on SPARK-22126:
-

[~WeichenXu123] Sorry I misunderstood, I thought you wanted to use 
java.util.concurrent.Callable.

https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/Callable.html

> Fix model-specific optimization support for ML tuning
> -
>
> Key: SPARK-22126
> URL: https://issues.apache.org/jira/browse/SPARK-22126
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>
> Fix model-specific optimization support for ML tuning. This is discussed in 
> SPARK-19357
> more discussion is here
>  https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22126) Fix model-specific optimization support for ML tuning

2017-12-05 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16279218#comment-16279218
 ] 

Bago Amirbekian edited comment on SPARK-22126 at 12/5/17 9:58 PM:
--

I started a discussion about potential fixes for this issue on this 
[gist|https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0]. I'm 
going to summarize the gist here and encourage further discussion to take 
place on this JIRA to increase the visibility of the discussion.

I proposed that we add a new method to the {{Estimator}} interface, 
{{fitMultiple(dataset, paramMaps): Array[Callable[Model]]}}. The purpose of 
this method is to allow estimators to implement model-specific optimizations 
when fitting a model for each of multiple paramMaps. This API would also be 
used by {{CrossValidator}} and other meta transformers when fitting multiple 
models in parallel.

[~WeichenXu123] suggested modifying the API to {{fitMultiple(dataset: 
Dataset[_], paramMaps: Array[ParamMap]): Array[Callable[Map[Int, M]]]}}. The 
reasoning is that allowing each callable to return multiple models makes it 
easier to schedule these tasks efficiently in parallel (e.g. we avoid 
scheduling A and B where B simply waits on A).
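
For concreteness, here is a minimal sketch of what the proposed addition could 
look like, assuming we define our own callable-style trait rather than reuse 
java.util.concurrent.Callable. All names and signatures below are illustrative 
assumptions, not a final API:

{code:scala}
import org.apache.spark.ml.Model
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.sql.Dataset

// Stand-in for Callable / the custom trait discussed above; each call() may
// train one or more models and returns them keyed by the index of the
// corresponding ParamMap in `paramMaps`.
trait FitTask[M] {
  def call(): Map[Int, M]
}

// Illustrative shape of the proposed Estimator addition.
trait FitMultipleSupport[M <: Model[M]] {
  def fitMultiple(dataset: Dataset[_], paramMaps: Array[ParamMap]): Array[FitTask[M]]
}
{code}

A meta-algorithm such as CrossValidator could then submit each returned task to 
its own execution context and collect the (index, model) pairs as they complete.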


was (Author: bago.amirbekian):
I started a discussion about potential fixes for this issue on this 
[gist|https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0]. I'm 
going to summarize the gist here and encourage further discussion to take 
place on this JIRA to increase the visibility of the discussion.


I proposed that we add a new method to the `Estimator` interface, 
`fitMultiple(dataset, paramMaps): Array[Callable[Model]]`. The purpose of this 
method is to allow estimators to implement model-specific optimizations when 
fitting a model for each of multiple paramMaps. This API would also be used by 
`CrossValidator` and other meta transformers when fitting multiple models in 
parallel.

[~WeichenXu123] suggested modifying the API to `fitMultiple(dataset: 
Dataset[_], paramMaps: Array[ParamMap]): Array[Callable[Map[Int, M]]]`. The 
reasoning is that allowing each callable to return multiple models makes it 
easier to schedule these tasks efficiently in parallel (e.g. we avoid 
scheduling A and B where B simply waits on A).

> Fix model-specific optimization support for ML tuning
> -
>
> Key: SPARK-22126
> URL: https://issues.apache.org/jira/browse/SPARK-22126
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>
> Fix model-specific optimization support for ML tuning. This is discussed in 
> SPARK-19357
> more discussion is here
>  https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-22126) Fix model-specific optimization support for ML tuning

2017-12-05 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16279218#comment-16279218
 ] 

Bago Amirbekian edited comment on SPARK-22126 at 12/5/17 9:53 PM:
--

I started a discussion about potential fixes for this issue on this 
[gist|https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0]. I'm 
going to summarize the gist here and encourage further discussion to take 
place on this JIRA to increase the visibility of the discussion.


I proposed that we add a new method to the `Estimator` interface, 
`fitMultiple(dataset, paramMaps): Array[Callable[Model]]`. The purpose of this 
method is to allow estimators to implement model-specific optimizations when 
fitting a model for each of multiple paramMaps. This API would also be used by 
`CrossValidator` and other meta transformers when fitting multiple models in 
parallel.

[~WeichenXu123] suggested modifying the API to `fitMultiple(dataset: 
Dataset[_], paramMaps: Array[ParamMap]): Array[Callable[Map[Int, M]]]`. The 
reasoning is that allowing each callable to return multiple models makes it 
easier to schedule these tasks efficiently in parallel (e.g. we avoid 
scheduling A and B where B simply waits on A).


was (Author: bago.amirbekian):
I started a discussion about potential fixes for this issue on this gist. I'm 
going to summarize the gist here and encourage further discussion to take 
place on this JIRA to increase the visibility of the discussion.

I proposed that we add a new method to the `Estimator` interface, 
`fitMultiple(dataset, paramMaps): Array[Callable[Model]]`. The purpose of this 
method is to allow estimators to implement model-specific optimizations when 
fitting a model for each of multiple paramMaps. This API would also be used by 
`CrossValidator` and other meta transformers when fitting multiple models in 
parallel.

[~WeichenXu123] suggested modifying the API to `fitMultiple(dataset: 
Dataset[_], paramMaps: Array[ParamMap]): Array[Callable[Map[Int, M]]]`. The 
reasoning is that allowing each callable to return multiple models makes it 
easier to schedule these tasks efficiently in parallel (e.g. we avoid 
scheduling A and B where B simply waits on A).

> Fix model-specific optimization support for ML tuning
> -
>
> Key: SPARK-22126
> URL: https://issues.apache.org/jira/browse/SPARK-22126
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>
> Fix model-specific optimization support for ML tuning. This is discussed in 
> SPARK-19357
> more discussion is here
>  https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22126) Fix model-specific optimization support for ML tuning

2017-12-05 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16279218#comment-16279218
 ] 

Bago Amirbekian commented on SPARK-22126:
-

I started a discussion about potential fixes for this issue on this gist. I'm 
going to summarize the gist here and encourage further discussion to take 
place on this JIRA to increase the visibility of the discussion.

I proposed that we add a new method to the `Estimator` interface, 
`fitMultiple(dataset, paramMaps): Array[Callable[Model]]`. The purpose of this 
method is to allow estimators to implement model-specific optimizations when 
fitting a model for each of multiple paramMaps. This API would also be used by 
`CrossValidator` and other meta transformers when fitting multiple models in 
parallel.

[~WeichenXu123] suggested modifying the API to `fitMultiple(dataset: 
Dataset[_], paramMaps: Array[ParamMap]): Array[Callable[Map[Int, M]]]`. The 
reasoning is that allowing each callable to return multiple models makes it 
easier to schedule these tasks efficiently in parallel (e.g. we avoid 
scheduling A and B where B simply waits on A).

> Fix model-specific optimization support for ML tuning
> -
>
> Key: SPARK-22126
> URL: https://issues.apache.org/jira/browse/SPARK-22126
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>
> Fix model-specific optimization support for ML tuning. This is discussed in 
> SPARK-19357
> more discussion is here
>  https://gist.github.com/MrBago/f501b9e7712dc6a67dc9fea24e309bf0



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20586) Add deterministic to ScalaUDF

2017-11-21 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16261393#comment-16261393
 ] 

Bago Amirbekian commented on SPARK-20586:
-

Also, a follow-up question: are there performance implications to using the 
`deterministic` flag that we should try to avoid by restructuring the ML UDFs 
so they don't raise within the UDF?

> Add deterministic to ScalaUDF
> -
>
> Key: SPARK-20586
> URL: https://issues.apache.org/jira/browse/SPARK-20586
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.3.0
>
>
> https://hive.apache.org/javadocs/r2.0.1/api/org/apache/hadoop/hive/ql/udf/UDFType.html
> Like Hive UDFType, we should allow users to add the extra flags for ScalaUDF 
> too. {{stateful}}/{{impliesOrder}} are not applicable to ScalaUDF. Thus, we 
> only add the following two flags. 
> - deterministic: Certain optimizations should not be applied if UDF is not 
> deterministic. Deterministic UDF returns same result each time it is invoked 
> with a particular input. This determinism just needs to hold within the 
> context of a query.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20586) Add deterministic to ScalaUDF

2017-11-21 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16261377#comment-16261377
 ] 

Bago Amirbekian commented on SPARK-20586:
-

Is there some documentation somewhere about the right way to use this 
`deterministic` flag?

I bring it up because in `spark.ml` we sometimes raise errors in a UDF when a 
DataFrame contains invalid data. This can cause bad behavior if the optimizer 
re-orders the UDF with other operations, so we're marking these UDFs as 
`nonDeterministic` (https://github.com/apache/spark/pull/19662), but that 
somehow seems wrong. The issue isn't that the UDFs are non-deterministic; they 
are deterministic and always raise on the same inputs.

My questions are:
1) Is this a correct usage of the non-deterministic flag, or are we simply 
abusing it when we should come up with a more specific solution?
2) If this is correct usage, could we rename the flag or document this type of 
usage somewhere?
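
For reference, a hedged sketch of the pattern in question, using the 
`asNondeterministic` method added to `UserDefinedFunction` in Spark 2.3; the 
validating UDF itself is a made-up example, not the actual `spark.ml` code:

{code:scala}
import org.apache.spark.sql.functions.udf

// A UDF that validates its input and raises on invalid data. Marking it
// non-deterministic keeps the optimizer from reordering it past filters that
// would otherwise have screened out the invalid rows first.
val checkedLog = udf((x: Double) => {
  require(x > 0.0, s"Expected a positive value but got $x.")
  math.log(x)
}).asNondeterministic()

// e.g. df.filter($"x" > 0).withColumn("logX", checkedLog($"x"))
{code}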

> Add deterministic to ScalaUDF
> -
>
> Key: SPARK-20586
> URL: https://issues.apache.org/jira/browse/SPARK-20586
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Xiao Li
>Assignee: Xiao Li
> Fix For: 2.3.0
>
>
> https://hive.apache.org/javadocs/r2.0.1/api/org/apache/hadoop/hive/ql/udf/UDFType.html
> Like Hive UDFType, we should allow users to add the extra flags for ScalaUDF 
> too. {{stateful}}/{{impliesOrder}} are not applicable to ScalaUDF. Thus, we 
> only add the following two flags. 
> - deterministic: Certain optimizations should not be applied if UDF is not 
> deterministic. Deterministic UDF returns same result each time it is invoked 
> with a particular input. This determinism just needs to hold within the 
> context of a query.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22346) Update VectorAssembler to work with Structured Streaming

2017-11-10 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16248049#comment-16248049
 ] 

Bago Amirbekian commented on SPARK-22346:
-

I think [~josephkb]'s version of Option 3 makes the most sense. A transformer 
that adds size data to a vector column would allow patching pipelines pretty 
easily, and it could be implemented without breaking any APIs.

I'm currently working on a PR based on this approach. 
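
A rough sketch of how such a transformer could be used to patch an existing 
pipeline; the `VectorSizeHint` name and params below reflect the feature that 
eventually shipped in Spark 2.3 and are illustrative here, not the in-progress PR:

{code:scala}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{VectorAssembler, VectorSizeHint}

// Declare the size of the incoming vector column so VectorAssembler can build
// the output metadata even when the streaming input carries none.
val sizeHint = new VectorSizeHint()
  .setInputCol("userFeatures")
  .setSize(3)
  .setHandleInvalid("error")

val assembler = new VectorAssembler()
  .setInputCols(Array("userFeatures", "hourOfDay"))
  .setOutputCol("features")

val pipeline = new Pipeline().setStages(Array(sizeHint, assembler))
{code}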

> Update VectorAssembler to work with Structured Streaming
> 
>
> Key: SPARK-22346
> URL: https://issues.apache.org/jira/browse/SPARK-22346
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>Priority: Critical
>
> The issue
> In batch mode, VectorAssembler can take multiple columns of VectorType and 
> assemble and output a new column of VectorType containing the concatenated 
> vectors. In streaming mode, this transformation can fail because 
> VectorAssembler does not have enough information to produce metadata 
> (AttributeGroup) for the new column. Because VectorAssembler is such a 
> ubiquitous part of mllib pipelines, this issue effectively means spark 
> structured streaming does not support prediction using mllib pipelines.
> I've created this ticket so we can discuss ways to potentially improve 
> VectorAssembler. Please let me know if there are any issues I have not 
> considered or potential fixes I haven't outlined. I'm happy to submit a patch 
> once I know which strategy is the best approach.
> Potential fixes
> 1) Replace VectorAssembler with an estimator/model pair like was recently 
> done with OneHotEncoder, 
> [SPARK-13030|https://issues.apache.org/jira/browse/SPARK-13030]. The 
> Estimator can "learn" the size of the inputs vectors during training and save 
> it to use during prediction.
> Pros:
> * Possibly simplest of the potential fixes
> Cons:
> * We'll need to deprecate current VectorAssembler
> 2) Drop the metadata (ML Attributes) from Vector columns. This is pretty 
> major change, but it could be done in stages. We could first ensure that 
> metadata is not used during prediction and allow the VectorAssembler to drop 
> metadata for streaming dataframes. Going forward, it would be important to 
> not use any metadata on Vector columns for any prediction tasks.
> Pros:
> * Potentially, easy short term fix for VectorAssembler
> (drop metadata for vector columns in streaming).
> * Current Attributes implementation is also causing other issues, eg 
> [SPARK-19141|https://issues.apache.org/jira/browse/SPARK-19141].
> Cons:
> * To fully remove ML Attributes would be a major refactor of MLlib and would 
> most likely require breaking changes.
> * A partial removal of ML attributes (eg: ensure ML attributes are not used 
> during transform, only during fit) might be tricky. This would require 
> testing or other enforcement mechanism to prevent regressions.
> 3) Require Vector columns to have fixed length vectors. Most mllib 
> transformers that produce vectors already include the size of the vector in 
> the column metadata. This change would be to deprecate APIs that allow 
> creating a vector column of unknown length and replace those APIs with 
> equivalents that enforce a fixed size.
> Pros:
> * We already treat vectors as fixed size, for example VectorAssembler assumes 
> the input and output cols are fixed size vectors and creates metadata 
> accordingly. In the spirit of explicit is better than implicit, we would be 
> codifying something we already assume.
> * This could potentially enable performance optimizations that are only 
> possible if the Vector size of a column is fixed & known.
> Cons:
> * This would require breaking changes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22346) Update VectorAssembler to work with StreamingDataframes

2017-10-25 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219109#comment-16219109
 ] 

Bago Amirbekian commented on SPARK-22346:
-

Nick, I see that option as a stepping stone to option 2 above (drop metadata 
for vector columns). If we create this param, we're in essence saying to 
downstream transformers, "don't rely on the availability of metadata". And if 
that's the case, metadata should be considered an optimization, and the 
presence/absence of metadata should never change the result.
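
As a concrete illustration of what that metadata is: a fixed-size 
`AttributeGroup` can be attached to (and read back from) a Vector column, and 
the point above is that downstream stages should treat it as an optional hint 
rather than a requirement. The column name below is made up:

{code:scala}
import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.sql.functions.col

// Build metadata declaring a 3-element vector column named "features".
val meta = new AttributeGroup("features", 3).toMetadata()

// Attach it:    df.withColumn("features", col("features").as("features", meta))
// Read it back: AttributeGroup.fromStructField(df.schema("features"))
{code}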

> Update VectorAssembler to work with StreamingDataframes
> ---
>
> Key: SPARK-22346
> URL: https://issues.apache.org/jira/browse/SPARK-22346
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>Priority: Critical
>
> The issue
> In batch mode, VectorAssembler can take multiple columns of VectorType and 
> assemble and output a new column of VectorType containing the concatenated 
> vectors. In streaming mode, this transformation can fail because 
> VectorAssembler does not have enough information to produce metadata 
> (AttributeGroup) for the new column. Because VectorAssembler is such a 
> ubiquitous part of mllib pipelines, this issue effectively means spark 
> structured streaming does not support prediction using mllib pipelines.
> I've created this ticket so we can discuss ways to potentially improve 
> VectorAssembler. Please let me know if there are any issues I have not 
> considered or potential fixes I haven't outlined. I'm happy to submit a patch 
> once I know which strategy is the best approach.
> Potential fixes
> 1) Replace VectorAssembler with an estimator/model pair like was recently 
> done with OneHotEncoder, 
> [SPARK-13030|https://issues.apache.org/jira/browse/SPARK-13030]. The 
> Estimator can "learn" the size of the inputs vectors during training and save 
> it to use during prediction.
> Pros:
> * Possibly simplest of the potential fixes
> Cons:
> * We'll need to deprecate current VectorAssembler
> 2) Drop the metadata (ML Attributes) from Vector columns. This is pretty 
> major change, but it could be done in stages. We could first ensure that 
> metadata is not used during prediction and allow the VectorAssembler to drop 
> metadata for streaming dataframes. Going forward, it would be important to 
> not use any metadata on Vector columns for any prediction tasks.
> Pros:
> * Potentially, easy short term fix for VectorAssembler
> (drop metadata for vector columns in streaming).
> * Current Attributes implementation is also causing other issues, eg 
> [SPARK-19141|https://issues.apache.org/jira/browse/SPARK-19141].
> Cons:
> * To fully remove ML Attributes would be a major refactor of MLlib and would 
> most likely require breaking changes.
> * A partial removal of ML attributes (eg: ensure ML attributes are not used 
> during transform, only during fit) might be tricky. This would require 
> testing or other enforcement mechanism to prevent regressions.
> 3) Require Vector columns to have fixed length vectors. Most mllib 
> transformers that produce vectors already include the size of the vector in 
> the column metadata. This change would be to deprecate APIs that allow 
> creating a vector column of unknown length and replace those APIs with 
> equivalents that enforce a fixed size.
> Pros:
> * We already treat vectors as fixed size, for example VectorAssembler assumes 
> the input and output cols are fixed size vectors and creates metadata 
> accordingly. In the spirit of explicit is better than implicit, we would be 
> codifying something we already assume.
> * This could potentially enable performance optimizations that are only 
> possible if the Vector size of a column is fixed & known.
> Cons:
> * This would require breaking changes.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22346) Update VectorAssembler to work with StreamingDataframes

2017-10-25 Thread Bago Amirbekian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bago Amirbekian updated SPARK-22346:

Description: 
The issue
In batch mode, VectorAssembler can take multiple columns of VectorType and 
assemble and output a new column of VectorType containing the concatenated 
vectors. In streaming mode, this transformation can fail because 
VectorAssembler does not have enough information to produce metadata 
(AttributeGroup) for the new column. Because VectorAssembler is such a 
ubiquitous part of mllib pipelines, this issue effectively means spark 
structured streaming does not support prediction using mllib pipelines.

I've created this ticket so we can discuss ways to potentially improve 
VectorAssembler. Please let me know if there are any issues I have not 
considered or potential fixes I haven't outlined. I'm happy to submit a patch 
once I know which strategy is the best approach.

Potential fixes
1) Replace VectorAssembler with an estimator/model pair like was recently done 
with OneHotEncoder, 
[SPARK-13030|https://issues.apache.org/jira/browse/SPARK-13030]. The Estimator 
can "learn" the size of the inputs vectors during training and save it to use 
during prediction.

Pros:
* Possibly simplest of the potential fixes

Cons:
* We'll need to deprecate current VectorAssembler

2) Drop the metadata (ML Attributes) from Vector columns. This is pretty major 
change, but it could be done in stages. We could first ensure that metadata is 
not used during prediction and allow the VectorAssembler to drop metadata for 
streaming dataframes. Going forward, it would be important to not use any 
metadata on Vector columns for any prediction tasks.

Pros:
* Potentially, easy short term fix for VectorAssembler
(drop metadata for vector columns in streaming).
* Current Attributes implementation is also causing other issues, eg 
[SPARK-19141|https://issues.apache.org/jira/browse/SPARK-19141].

Cons:
* To fully remove ML Attributes would be a major refactor of MLlib and would 
most likely require breaking changes.
* A partial removal of ML attributes (eg: ensure ML attributes are not used 
during transform, only during fit) might be tricky. This would require testing 
or other enforcement mechanism to prevent regressions.

3) Require Vector columns to have fixed length vectors. Most mllib transformers 
that produce vectors already include the size of the vector in the column 
metadata. This change would be to deprecate APIs that allow creating a vector 
column of unknown length and replace those APIs with equivalents that enforce a 
fixed size.

Pros:
* We already treat vectors as fixed size, for example VectorAssembler assumes 
the input and output cols are fixed size vectors and creates metadata 
accordingly. In the spirit of explicit is better than implicit, we would be 
codifying something we already assume.
* This could potentially enable performance optimizations that are only 
possible if the Vector size of a column is fixed & known.

Cons:
* This would require breaking changes.




  was:
The issue
In batch mode, VectorAssembler can take multiple columns of VectorType and 
assemble and output a new column of VectorType containing the concatenated 
vectors. In streaming mode, this transformation can fail because 
VectorAssembler does not have enough information to produce metadata 
(AttributeGroup) for the new column. Because VectorAssembler is such a 
ubiquitous part of mllib pipelines, this issue effectively means spark 
structured streaming does not support prediction using mllib pipelines.

I've created this ticket so we can discuss ways to potentially improve 
VectorAssembler. Please let me know if there are any issues I have not 
considered or potential fixes I haven't outlined. I'm happy to submit a patch 
once I know which strategy is the best approach.

Potential fixes
1) Replace VectorAssembler with an estimator/model pair like was recently done 
with OneHotEncoder, 
[SPARK-13030|https://issues.apache.org/jira/browse/SPARK-13030]. The Estimator 
can "learn" the size of the inputs vectors during training and save it to use 
during prediction.

Pros:
* Possibly simplest of the potential fixes

Cons:
* We'll need to deprecate current VectorAssembler

2) Drop the metadata (ML Attributes) from Vector columns. This is pretty major 
change, but it could be done in stages. We could first ensure that metadata is 
not used during prediction and allow the VectorAssembler to drop metadata for 
streaming dataframes. Going forward, it would be important to not use any 
metadata on Vector columns for any prediction tasks.

Pros:
* Potentially, easy short term fix for VectorAssembler
* Current Attributes implementation is also causing other issues, eg 
[SPARK-19141|https://issues.apache.org/jira/browse/SPARK-19141].

Cons:
* To fully remove ML Attributes would be a major refactor of MLlib and would 
most likely r

[jira] [Updated] (SPARK-21926) Compatibility between ML Transformers and Structured Streaming

2017-10-25 Thread Bago Amirbekian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bago Amirbekian updated SPARK-21926:

Description: 
We've run into a few cases where ML components don't play nice with streaming 
dataframes (for prediction). This ticket is meant to help aggregate these known 
cases in one place and provide a place to discuss possible fixes.

Failing cases:
1) VectorAssembler where one of the inputs is a VectorUDT column with no 
metadata.
Possible fixes:
More details here SPARK-22346.

2) OneHotEncoder where the input is a column with no metadata.
Possible fixes:
a) Make OneHotEncoder an estimator (SPARK-13030).
b) Allow user to set the cardinality of OneHotEncoder.
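
For fix (a), a minimal sketch of the estimator/model pair approach, using the 
OneHotEncoderEstimator API that SPARK-13030 introduced in Spark 2.3; the column 
names are made up:

{code:scala}
import org.apache.spark.ml.feature.OneHotEncoderEstimator

// The estimator learns the cardinality of the input column during fit(), so
// the resulting model can encode a streaming DataFrame without relying on
// column metadata being present.
val encoder = new OneHotEncoderEstimator()
  .setInputCols(Array("categoryIndex"))
  .setOutputCols(Array("categoryVec"))

// val model = encoder.fit(trainingDF)   // cardinality captured here
// model.transform(streamingDF)          // works without column metadata
{code}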

  was:
We've run into a few cases where ML components don't play nice with streaming 
dataframes (for prediction). This ticket is meant to help aggregate these known 
cases in one place and provide a place to discuss possible fixes.

Failing cases:
1) VectorAssembler where one of the inputs is a VectorUDT column with no 
metadata.
Possible fixes:
More details here 
[Spark-22346|https://issues.apache.org/jira/browse/SPARK-22346].

2) OneHotEncoder where the input is a column with no metadata.
Possible fixes:
a) Make OneHotEncoder an estimator (SPARK-13030).
b) Allow user to set the cardinality of OneHotEncoder.


> Compatibility between ML Transformers and Structured Streaming
> --
>
> Key: SPARK-21926
> URL: https://issues.apache.org/jira/browse/SPARK-21926
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>
> We've run into a few cases where ML components don't play nice with streaming 
> dataframes (for prediction). This ticket is meant to help aggregate these 
> known cases in one place and provide a place to discuss possible fixes.
> Failing cases:
> 1) VectorAssembler where one of the inputs is a VectorUDT column with no 
> metadata.
> Possible fixes:
> More details here SPARK-22346.
> 2) OneHotEncoder where the input is a column with no metadata.
> Possible fixes:
> a) Make OneHotEncoder an estimator (SPARK-13030).
> b) Allow user to set the cardinality of OneHotEncoder.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21926) Compatibility between ML Transformers and Structured Streaming

2017-10-25 Thread Bago Amirbekian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bago Amirbekian updated SPARK-21926:

Description: 
We've run into a few cases where ML components don't play nice with streaming 
dataframes (for prediction). This ticket is meant to help aggregate these known 
cases in one place and provide a place to discuss possible fixes.

Failing cases:
1) VectorAssembler where one of the inputs is a VectorUDT column with no 
metadata.
Possible fixes:
More details here SPARK-22346.

2) OneHotEncoder where the input is a column with no metadata.
Possible fixes:
a) Make OneHotEncoder an estimator (SPARK-13030).
-b) Allow user to set the cardinality of OneHotEncoder.-

  was:
We've run into a few cases where ML components don't play nice with streaming 
dataframes (for prediction). This ticket is meant to help aggregate these known 
cases in one place and provide a place to discuss possible fixes.

Failing cases:
1) VectorAssembler where one of the inputs is a VectorUDT column with no 
metadata.
Possible fixes:
More details here SPARK-22346.

2) OneHotEncoder where the input is a column with no metadata.
Possible fixes:
a) Make OneHotEncoder an estimator (SPARK-13030).
b) Allow user to set the cardinality of OneHotEncoder.


> Compatibility between ML Transformers and Structured Streaming
> --
>
> Key: SPARK-21926
> URL: https://issues.apache.org/jira/browse/SPARK-21926
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>
> We've run into a few cases where ML components don't play nice with streaming 
> dataframes (for prediction). This ticket is meant to help aggregate these 
> known cases in one place and provide a place to discuss possible fixes.
> Failing cases:
> 1) VectorAssembler where one of the inputs is a VectorUDT column with no 
> metadata.
> Possible fixes:
> More details here SPARK-22346.
> 2) OneHotEncoder where the input is a column with no metadata.
> Possible fixes:
> a) Make OneHotEncoder an estimator (SPARK-13030).
> -b) Allow user to set the cardinality of OneHotEncoder.-



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21926) Compatibility between ML Transformers and Structured Streaming

2017-10-25 Thread Bago Amirbekian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bago Amirbekian updated SPARK-21926:

Description: 
We've run into a few cases where ML components don't play nice with streaming 
dataframes (for prediction). This ticket is meant to help aggregate these known 
cases in one place and provide a place to discuss possible fixes.

Failing cases:
1) VectorAssembler where one of the inputs is a VectorUDT column with no 
metadata.
Possible fixes:
More details here Spark-22346.

2) OneHotEncoder where the input is a column with no metadata.
Possible fixes:
a) Make OneHotEncoder an estimator (SPARK-13030).
b) Allow user to set the cardinality of OneHotEncoder.

  was:
We've run into a few cases where ML components don't play nice with streaming 
dataframes (for prediction). This ticket is meant to help aggregate these known 
cases in one place and provide a place to discuss possible fixes.

Failing cases:
1) VectorAssembler where one of the inputs is a VectorUDT column with no 
metadata.
Possible fixes:
I've created a jira to track this 

2) OneHotEncoder where the input is a column with no metadata.
Possible fixes:
a) Make OneHotEncoder an estimator (SPARK-13030).
b) Allow user to set the cardinality of OneHotEncoder.


> Compatibility between ML Transformers and Structured Streaming
> --
>
> Key: SPARK-21926
> URL: https://issues.apache.org/jira/browse/SPARK-21926
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>
> We've run into a few cases where ML components don't play nice with streaming 
> dataframes (for prediction). This ticket is meant to help aggregate these 
> known cases in one place and provide a place to discuss possible fixes.
> Failing cases:
> 1) VectorAssembler where one of the inputs is a VectorUDT column with no 
> metadata.
> Possible fixes:
> More details here Spark-22346.
> 2) OneHotEncoder where the input is a column with no metadata.
> Possible fixes:
> a) Make OneHotEncoder an estimator (SPARK-13030).
> b) Allow user to set the cardinality of OneHotEncoder.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21926) Compatibility between ML Transformers and Structured Streaming

2017-10-25 Thread Bago Amirbekian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bago Amirbekian updated SPARK-21926:

Description: 
We've run into a few cases where ML components don't play nice with streaming 
dataframes (for prediction). This ticket is meant to help aggregate these known 
cases in one place and provide a place to discuss possible fixes.

Failing cases:
1) VectorAssembler where one of the inputs is a VectorUDT column with no 
metadata.
Possible fixes:
I've created a jira to track this 

2) OneHotEncoder where the input is a column with no metadata.
Possible fixes:
a) Make OneHotEncoder an estimator (SPARK-13030).
b) Allow user to set the cardinality of OneHotEncoder.

  was:
We've run into a few cases where ML components don't play nice with streaming 
dataframes (for prediction). This ticket is meant to help aggregate these known 
cases in one place and provide a place to discuss possible fixes.

Failing cases:
1) VectorAssembler where one of the inputs is a VectorUDT column with no 
metadata.
Possible fixes:
a) Re-design vectorUDT metadata to support missing metadata for some elements. 
(This might be a good thing to do anyways SPARK-19141)
b) drop metadata in streaming context.

2) OneHotEncoder where the input is a column with no metadata.
Possible fixes:
a) Make OneHotEncoder an estimator (SPARK-13030).
b) Allow user to set the cardinality of OneHotEncoder.


> Compatibility between ML Transformers and Structured Streaming
> --
>
> Key: SPARK-21926
> URL: https://issues.apache.org/jira/browse/SPARK-21926
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>
> We've run into a few cases where ML components don't play nice with streaming 
> dataframes (for prediction). This ticket is meant to help aggregate these 
> known cases in one place and provide a place to discuss possible fixes.
> Failing cases:
> 1) VectorAssembler where one of the inputs is a VectorUDT column with no 
> metadata.
> Possible fixes:
> I've created a jira to track this 
> 2) OneHotEncoder where the input is a column with no metadata.
> Possible fixes:
> a) Make OneHotEncoder an estimator (SPARK-13030).
> b) Allow user to set the cardinality of OneHotEncoder.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21926) Compatibility between ML Transformers and Structured Streaming

2017-10-25 Thread Bago Amirbekian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bago Amirbekian updated SPARK-21926:

Description: 
We've run into a few cases where ML components don't play nice with streaming 
dataframes (for prediction). This ticket is meant to help aggregate these known 
cases in one place and provide a place to discuss possible fixes.

Failing cases:
1) VectorAssembler where one of the inputs is a VectorUDT column with no 
metadata.
Possible fixes:
More details here 
[Spark-22346|https://issues.apache.org/jira/browse/SPARK-22346].

2) OneHotEncoder where the input is a column with no metadata.
Possible fixes:
a) Make OneHotEncoder an estimator (SPARK-13030).
b) Allow user to set the cardinality of OneHotEncoder.

  was:
We've run into a few cases where ML components don't play nice with streaming 
dataframes (for prediction). This ticket is meant to help aggregate these known 
cases in one place and provide a place to discuss possible fixes.

Failing cases:
1) VectorAssembler where one of the inputs is a VectorUDT column with no 
metadata.
Possible fixes:
More details here Spark-22346.

2) OneHotEncoder where the input is a column with no metadata.
Possible fixes:
a) Make OneHotEncoder an estimator (SPARK-13030).
b) Allow user to set the cardinality of OneHotEncoder.


> Compatibility between ML Transformers and Structured Streaming
> --
>
> Key: SPARK-21926
> URL: https://issues.apache.org/jira/browse/SPARK-21926
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>
> We've run into a few cases where ML components don't play nice with streaming 
> dataframes (for prediction). This ticket is meant to help aggregate these 
> known cases in one place and provide a place to discuss possible fixes.
> Failing cases:
> 1) VectorAssembler where one of the inputs is a VectorUDT column with no 
> metadata.
> Possible fixes:
> More details here 
> [Spark-22346|https://issues.apache.org/jira/browse/SPARK-22346].
> 2) OneHotEncoder where the input is a column with no metadata.
> Possible fixes:
> a) Make OneHotEncoder an estimator (SPARK-13030).
> b) Allow user to set the cardinality of OneHotEncoder.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22346) Update VectorAssembler to work with StreamingDataframes

2017-10-24 Thread Bago Amirbekian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bago Amirbekian updated SPARK-22346:

Description: 
The issue
In batch mode, VectorAssembler can take multiple columns of VectorType and 
assemble and output a new column of VectorType containing the concatenated 
vectors. In streaming mode, this transformation can fail because 
VectorAssembler does not have enough information to produce metadata 
(AttributeGroup) for the new column. Because VectorAssembler is such a 
ubiquitous part of mllib pipelines, this issue effectively means spark 
structured streaming does not support prediction using mllib pipelines.

I've created this ticket so we can discuss ways to potentially improve 
VectorAssembler. Please let me know if there are any issues I have not 
considered or potential fixes I haven't outlined. I'm happy to submit a patch 
once I know which strategy is the best approach.

Potential fixes
1) Replace VectorAssembler with an estimator/model pair like was recently done 
with OneHotEncoder, 
[SPARK-13030|https://issues.apache.org/jira/browse/SPARK-13030]. The Estimator 
can "learn" the size of the inputs vectors during training and save it to use 
during prediction.

Pros:
* Possibly simplest of the potential fixes

Cons:
* We'll need to deprecate current VectorAssembler

2) Drop the metadata (ML Attributes) from Vector columns. This is pretty major 
change, but it could be done in stages. We could first ensure that metadata is 
not used during prediction and allow the VectorAssembler to drop metadata for 
streaming dataframes. Going forward, it would be important to not use any 
metadata on Vector columns for any prediction tasks.

Pros:
* Potentially, easy short term fix for VectorAssembler
* Current Attributes implementation is also causing other issues, eg 
[SPARK-19141|https://issues.apache.org/jira/browse/SPARK-19141].

Cons:
* To fully remove ML Attributes would be a major refactor of MLlib and would 
most likely require breaking changes.
* A partial removal of ML attributes (eg: ensure ML attributes are not used 
during transform, only during fit) might be tricky. This would require testing 
or other enforcement mechanism to prevent regressions.

3) Require Vector columns to have fixed length vectors. Most mllib transformers 
that produce vectors already include the size of the vector in the column 
metadata. This change would be to deprecate APIs that allow creating a vector 
column of unknown length and replace those APIs with equivalents that enforce a 
fixed size.

Pros:
* We already treat vectors as fixed size, for example VectorAssembler assumes 
the input and output cols are fixed size vectors and creates metadata 
accordingly. In the spirit of explicit is better than implicit, we would be 
codifying something we already assume.
* This could potentially enable performance optimizations that are only 
possible if the Vector size of a column is fixed & known.

Cons:
* This would require breaking changes.




  was:
The issue
In batch mode, VectorAssembler can take multiple columns of VectorType and 
assemble and output a new column of VectorType containing the concatenated 
vectors. In streaming mode, this transformation can fail because 
VectorAssembler does not have enough information to produce metadata 
(AttributeGroup) for the new column. Because VectorAssembler is such a 
ubiquitous part of mllib pipelines, this issue effectively means spark 
structured streaming does not support prediction using mllib pipelines.

I've created this ticket so we can discuss ways to potentially improve 
VectorAssembler. Please let me know if there are any issues I have not 
considered or potential fixes I haven't outlined. I'm happy to submit a patch 
once I know which strategy is the best approach.

Potential fixes
1) Replace VectorAssembler with an estimator/model pair like was recently done 
with OneHotEncoder, [#13030]. The Estimator can "learn" the size of the inputs 
vectors during training and save it to use during prediction.

Pros:
* Possibly simplest of the potential fixes

Cons:
* We'll need to deprecate current VectorAssembler

2) Drop the metadata (ML Attributes) from Vector columns. This is pretty major 
change, but it could be done in stages. We could first ensure that metadata is 
not used during prediction and allow the VectorAssembler to drop metadata for 
streaming dataframes. Going forward, it would be important to not use any 
metadata on Vector columns for any prediction tasks.

Pros:
* Potentially, easy short term fix for VectorAssembler
* Current Attributes implementation is also causing other issues, eg 
[SPARK-19141|https://issues.apache.org/jira/browse/SPARK-19141].

Cons:
* To fully remove ML Attributes would be a major refactor of MLlib and would 
most likely require breaking changes.
* A partial removal of ML attributes (eg: ensure ML attributes are not used 
du

[jira] [Updated] (SPARK-22346) Update VectorAssembler to work with StreamingDataframes

2017-10-24 Thread Bago Amirbekian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bago Amirbekian updated SPARK-22346:

Description: 
The issue
In batch mode, VectorAssembler can take multiple columns of VectorType and 
assemble and output a new column of VectorType containing the concatenated 
vectors. In streaming mode, this transformation can fail because 
VectorAssembler does not have enough information to produce metadata 
(AttributeGroup) for the new column. Because VectorAssembler is such a 
ubiquitous part of mllib pipelines, this issue effectively means spark 
structured streaming does not support prediction using mllib pipelines.

I've created this ticket so we can discuss ways to potentially improve 
VectorAssembler. Please let me know if there are any issues I have not 
considered or potential fixes I haven't outlined. I'm happy to submit a patch 
once I know which strategy is the best approach.

Potential fixes
1) Replace VectorAssembler with an estimator/model pair like was recently done 
with OneHotEncoder, [#13030]. The Estimator can "learn" the size of the inputs 
vectors during training and save it to use during prediction.

Pros:
* Possibly simplest of the potential fixes

Cons:
* We'll need to deprecate current VectorAssembler

2) Drop the metadata (ML Attributes) from Vector columns. This is pretty major 
change, but it could be done in stages. We could first ensure that metadata is 
not used during prediction and allow the VectorAssembler to drop metadata for 
streaming dataframes. Going forward, it would be important to not use any 
metadata on Vector columns for any prediction tasks.

Pros:
* Potentially, easy short term fix for VectorAssembler
* Current Attributes implementation is also causing other issues, eg 
[SPARK-19141|https://issues.apache.org/jira/browse/SPARK-19141].

Cons:
* To fully remove ML Attributes would be a major refactor of MLlib and would 
most likely require breaking changes.
* A partial removal of ML attributes (eg: ensure ML attributes are not used 
during transform, only during fit) might be tricky. This would require testing 
or other enforcement mechanism to prevent regressions.

3) Require Vector columns to have fixed length vectors. Most mllib transformers 
that produce vectors already include the size of the vector in the column 
metadata. This change would be to deprecate APIs that allow creating a vector 
column of unknown length and replace those APIs with equivalents that enforce a 
fixed size.

Pros:
* We already treat vectors as fixed size, for example VectorAssembler assumes 
the input and output cols are fixed size vectors and creates metadata 
accordingly. In the spirit of explicit is better than implicit, we would be 
codifying something we already assume.
* This could potentially enable performance optimizations that are only 
possible if the Vector size of a column is fixed & known.

Cons:
* This would require breaking changes.




  was:
The issue
In batch mode, VectorAssembler can take multiple columns of VectorType and 
assemble and output a new column of VectorType containing the concatenated 
vectors. In streaming mode, this transformation can fail because 
VectorAssembler does not have enough information to produce metadata 
(AttributeGroup) for the new column. Because VectorAssembler is such a 
ubiquitous part of mllib pipelines, this issue effectively means spark 
structured streaming does not support prediction using mllib pipelines.

I've created this ticket so we can discuss ways to potentially improve 
VectorAssembler. Please let me know if there are any issues I have not 
considered or potential fixes I haven't outlined. I'm happy to submit a patch 
once I know which strategy is the best approach.

Potential fixes
1) Replace VectorAssembler with an estimator/model pair like was recently done 
with OneHotEncoder, [#13030]. The Estimator can "learn" the size of the inputs 
vectors during training and save it to use during prediction.

Pros:
* Possibly simplest of the potential fixes

Cons:
* We'll need to deprecate current VectorAssembler

2) Drop the metadata (ML Attributes) from Vector columns. This is pretty major 
change, but it could be done in stages. We could first ensure that metadata is 
not used during prediction and allow the VectorAssembler to drop metadata for 
streaming dataframes. Going forward, it would be important to not use any 
metadata on Vector columns for any prediction tasks.

Pros:
* Potentially, easy short term fix for VectorAssembler
* Current Attributes implementation is also causing other issues, eg 
[#SPARK-19141].

Cons:
* To fully remove ML Attributes would be a major refactor of MLlib and would 
most likely require breaking changes.
* A partial removal of ML attributes (eg: ensure ML attributes are not used 
during transform, only during fit) might be tricky. This would require testing 
or other enforcement mechani

[jira] [Updated] (SPARK-22346) Update VectorAssembler to work with StreamingDataframes

2017-10-24 Thread Bago Amirbekian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bago Amirbekian updated SPARK-22346:

Description: 
The issue
In batch mode, VectorAssembler can take multiple columns of VectorType and 
assemble and output a new column of VectorType containing the concatenated 
vectors. In streaming mode, this transformation can fail because 
VectorAssembler does not have enough information to produce metadata 
(AttributeGroup) for the new column. Because VectorAssembler is such a 
ubiquitous part of mllib pipelines, this issue effectively means spark 
structured streaming does not support prediction using mllib pipelines.

I've created this ticket so we can discuss ways to potentially improve 
VectorAssembler. Please let me know if there are any issues I have not 
considered or potential fixes I haven't outlined. I'm happy to submit a patch 
once I know which strategy is the best approach.

Potential fixes
1) Replace VectorAssembler with an estimator/model pair like was recently done 
with OneHotEncoder, [#13030]. The Estimator can "learn" the size of the inputs 
vectors during training and save it to use during prediction.

Pros:
* Possibly simplest of the potential fixes

Cons:
* We'll need to deprecate current VectorAssembler

2) Drop the metadata (ML Attributes) from Vector columns. This is pretty major 
change, but it could be done in stages. We could first ensure that metadata is 
not used during prediction and allow the VectorAssembler to drop metadata for 
streaming dataframes. Going forward, it would be important to not use any 
metadata on Vector columns for any prediction tasks.

Pros:
* Potentially, easy short term fix for VectorAssembler
* Current Attributes implementation is also causing other issues, eg 
[#SPARK-19141].

Cons:
* To fully remove ML Attributes would be a major refactor of MLlib and would 
most likely require breaking changes.
* A partial removal of ML attributes (eg: ensure ML attributes are not used 
during transform, only during fit) might be tricky. This would require testing 
or other enforcement mechanism to prevent regressions.

3) Require Vector columns to have fixed length vectors. Most mllib transformers 
that produce vectors already include the size of the vector in the column 
metadata. This change would be to deprecate APIs that allow creating a vector 
column of unknown length and replace those APIs with equivalents that enforce a 
fixed size.

Pros:
* We already treat vectors as fixed size, for example VectorAssembler assumes 
the input and output cols are fixed size vectors and creates metadata 
accordingly. In the spirit of explicit is better than implicit, we would be 
codifying something we already assume.
* This could potentially enable performance optimizations that are only 
possible if the Vector size of a column is fixed & known.

Cons:
* This would require breaking changes.




  was:
The issue
In batch mode, VectorAssembler can take multiple columns of VectorType and 
assemble and output a new column of VectorType containing the concatenated 
vectors. In streaming mode, this transformation can fail because 
VectorAssembler does not have enough information to produce metadata 
(AttributeGroup) for the new column. Because VectorAssembler is such a 
ubiquitous part of mllib pipelines, this issue effectively means spark 
structured streaming does not support prediction using mllib pipelines.

I've created this ticket so we can discuss ways to potentially improve 
VectorAssembler. Please let me know if there are any issues I have not 
considered or potential fixes I haven't outlined. I'm happy to submit a patch 
once I know which strategy is the best approach.

Potential fixes
1) Replace VectorAssembler with an estimator/model pair like was recently done 
with OneHotEncoder, [#13030]. The Estimator can "learn" the size of the inputs 
vectors during training and save it to use during prediction.

Pros:
* Possibly simplest of the potential fixes

Cons:
* We'll need to deprecate current VectorAssembler

2) Drop the metadata (ML Attributes) from Vector columns. This is pretty major 
change, but it could be done in stages. We could first ensure that metadata is 
not used during prediction and allow the VectorAssembler to drop metadata for 
streaming dataframes. Going forward, it would be important to not use any 
metadata on Vector columns for any prediction tasks.

Pros:
* Potentially, easy short term fix for VectorAssembler
* Current Attributes implementation is also causing other issues, .

Cons:
* To fully remove ML Attributes would be a major refactor of MLlib and would 
most likely require breaking changes.
* A partial removal of ML attributes (eg: ensure ML attributes are not used 
during transform, only during fit) might be tricky. This would require testing 
or other enforcement mechanism to prevent regressions.

3) Require Vector columns to have fixed

[jira] [Updated] (SPARK-22346) Update VectorAssembler to work with StreamingDataframes

2017-10-24 Thread Bago Amirbekian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bago Amirbekian updated SPARK-22346:

Description: 
The issue
In batch mode, VectorAssembler can take multiple columns of VectorType and 
assemble and output a new column of VectorType containing the concatenated 
vectors. In streaming mode, this transformation can fail because 
VectorAssembler does not have enough information to produce metadata 
(AttributeGroup) for the new column. Because VectorAssembler is such a 
ubiquitous part of mllib pipelines, this issue effectively means spark 
structured streaming does not support prediction using mllib pipelines.

I've created this ticket so we can discuss ways to potentially improve 
VectorAssembler. Please let me know if there are any issues I have not 
considered or potential fixes I haven't outlined. I'm happy to submit a patch 
once I know which strategy is the best approach.

Potential fixes
1) Replace VectorAssembler with an estimator/model pair like was recently done 
with OneHotEncoder, [#13030]. The Estimator can "learn" the size of the inputs 
vectors during training and save it to use during prediction.

Pros:
* Possibly simplest of the potential fixes

Cons:
* We'll need to deprecate current VectorAssembler

2) Drop the metadata (ML Attributes) from Vector columns. This is pretty major 
change, but it could be done in stages. We could first ensure that metadata is 
not used during prediction and allow the VectorAssembler to drop metadata for 
streaming dataframes. Going forward, it would be important to not use any 
metadata on Vector columns for any prediction tasks.

Pros:
* Potentially, easy short term fix for VectorAssembler
* Current Attributes implementation is also causing other issues, .

Cons:
* To fully remove ML Attributes would be a major refactor of MLlib and would 
most likely require breaking changes.
* A partial removal of ML attributes (eg: ensure ML attributes are not used 
during transform, only during fit) might be tricky. This would require testing 
or other enforcement mechanism to prevent regressions.

3) Require Vector columns to have fixed length vectors. Most mllib transformers 
that produce vectors already include the size of the vector in the column 
metadata. This change would be to deprecate APIs that allow creating a vector 
column of unknown length and replace those APIs with equivalents that enforce a 
fixed size.

Pros:
* We already treat vectors as fixed size, for example VectorAssembler assumes 
the input and output cols are fixed size vectors and creates metadata 
accordingly. In the spirit of explicit is better than implicit, we would be 
codifying something we already assume.
* This could potentially enable performance optimizations that are only 
possible if the Vector size of a column is fixed & known.

Cons:
* This would require breaking changes.




  was:
The issue
In batch mode, VectorAssembler can take multiple columns of VectorType and 
assemble and output a new column of VectorType containing the concatenated 
vectors. In streaming mode, this transformation can fail because 
VectorAssembler does not have enough information to produce metadata 
(AttributeGroup) for the new column. Because VectorAssembler is such a 
ubiquitous part of mllib pipelines, this issue effectively means spark 
structured streaming does not support prediction using mllib pipelines.

I've created this ticket so we can discuss ways to potentially improve 
VectorAssembler. Please let me know if there are any issues I have not 
considered or potential fixes I haven't outlined. I'm happy to submit a patch 
once I know which strategy is the best approach.

Potential fixes
1) Replace VectorAssembler with an estimator/model pair like was recently done 
with OneHotEncoder, [#13030]. The Estimator can "learn" the size of the inputs 
vectors during training and save it to use during prediction.

Pros:
* Possibly simplest of the potential fixes

Cons:
* We'll need to deprecate current VectorAssembler

2) Drop the metadata (ML Attributes) from Vector columns. This is pretty major 
change, but it could be done in stages. We could first ensure that metadata is 
not used during prediction and allow the VectorAssembler to drop metadata for 
streaming dataframes. Going forward, it would be important to not use any 
metadata on Vector columns for any prediction tasks.

Pros:
* Potentially, easy short term fix for VectorAssembler

Cons:
* To fully remove ML Attributes would be a major refactor of MLlib and would 
most likely require breaking changes.
* A partial removal of ML attributes (e.g., ensure ML attributes are not used 
during transform, only during fit) might be tricky. This would require testing 
or other enforcement mechanisms to prevent regressions.

3) Require Vector columns to have fixed length vectors. Most mllib transformers 
that produce vectors already include the size of the vector in the column metadata.

[jira] [Updated] (SPARK-22346) Update VectorAssembler to work with StreamingDataframes

2017-10-24 Thread Bago Amirbekian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bago Amirbekian updated SPARK-22346:

Description: 
The issue
In batch mode, VectorAssembler can take multiple columns of VectorType and 
assemble and output a new column of VectorType containing the concatenated 
vectors. In streaming mode, this transformation can fail because 
VectorAssembler does not have enough information to produce metadata 
(AttributeGroup) for the new column. Because VectorAssembler is such a 
ubiquitous part of mllib pipelines, this issue effectively means spark 
structured streaming does not support prediction using mllib pipelines.

I've created this ticket so we can discuss ways to potentially improve 
VectorAssembler. Please let me know if there are any issues I have not 
considered or potential fixes I haven't outlined. I'm happy to submit a patch 
once I know which strategy is the best approach.

Potential fixes
1) Replace VectorAssembler with an estimator/model pair like was recently done 
with OneHotEncoder, [#13030]. The Estimator can "learn" the size of the inputs 
vectors during training and save it to use during prediction.

Pros:
* Possibly simplest of the potential fixes

Cons:
* We'll need to deprecate current VectorAssembler

2) Drop the metadata (ML Attributes) from Vector columns. This is pretty major 
change, but it could be done in stages. We could first ensure that metadata is 
not used during prediction and allow the VectorAssembler to drop metadata for 
streaming dataframes. Going forward, it would be important to not use any 
metadata on Vector columns for any prediction tasks.

Pros:
* Potentially, easy short term fix for VectorAssembler

Cons:
* To fully remove ML Attributes would be a major refactor of MLlib and would 
most likely require breaking changes.
* A partial removal of ML attributes (e.g., ensure ML attributes are not used 
during transform, only during fit) might be tricky. This would require testing 
or other enforcement mechanisms to prevent regressions.

3) Require Vector columns to have fixed length vectors. Most mllib transformers 
that produce vectors already include the size of the vector in the column 
metadata. This change would be to deprecate APIs that allow creating a vector 
column of unknown length and replace those APIs with equivalents that enforce a 
fixed size.

Pros:
* We already treat vectors as fixed size, for example VectorAssembler assumes 
the input & output columns are fixed size vectors and creates metadata 
accordingly. In the spirit of explicit is better than implicit, we would be 
codifying something we already assume.
* This could potentially enable performance optimizations that are only 
possible if the Vector size of a column is fixed & known.

Cons:
* This would require breaking changes.




  was:
The issue
In batch mode, VectorAssembler can take multiple columns of VectorType and 
assemble and output a new column of VectorType containing the concatenated 
vectors. In streaming mode, this transformation can fail because 
VectorAssembler does not have enough information to produce metadata 
(AttributeGroup) for the new column. Because VectorAssembler is such a 
ubiquitous part of mllib pipelines, this issue effectively means spark 
structured streaming does not support prediction using mllib pipelines.

I've created this ticket so we can discuss ways to potentially improve 
VectorAssembler. Please let me know if there are any issues I have not 
considered or potential fixes I haven't outlined. I'm happy to submit a patch 
once I know which strategy is the best approach.

Potential fixes
1) Replace VectorAssembler with an estimator/model pair like was recently done 
with OneHotEncoder, [#13030]. The Estimator can "learn" the size of the inputs 
vectors during training and save it to use during prediction.
Pros:
* Possibly simplest of the potential fixes
Cons:
* We'll need to deprecate current VectorAssembler

2) Drop the metadata (ML Attributes) from Vector columns. This is pretty major 
change, but it could be done in stages. We could first ensure that metadata is 
not used during prediction and allow the VectorAssembler to drop metadata for 
streaming dataframes. Going forward, it would be important to not use any 
metadata on Vector columns for any prediction tasks.
Pros:
* Potentially, easy short term fix for VectorAssembler
Cons:
* To fully remove ML Attributes would be a major refactor of MLlib and would 
most likely require breaking changes.
* A partial removal of ML attributes (e.g., ensure ML attributes are not used 
during transform, only during fit) might be tricky. This would require testing 
or other enforcement mechanisms to prevent regressions.

3) Require Vector columns to have fixed length vectors. Most mllib transformers 
that produce vectors already include the size of the vector in the column 
metadata. This change would be to deprecate APIs that allow creating a vector 
column of unknown length and replace those APIs with equivalents that enforce a 
fixed size.

[jira] [Created] (SPARK-22346) Update VectorAssembler to work with StreamingDataframes

2017-10-24 Thread Bago Amirbekian (JIRA)
Bago Amirbekian created SPARK-22346:
---

 Summary: Update VectorAssembler to work with StreamingDataframes
 Key: SPARK-22346
 URL: https://issues.apache.org/jira/browse/SPARK-22346
 Project: Spark
  Issue Type: Improvement
  Components: ML, Structured Streaming
Affects Versions: 2.2.0
Reporter: Bago Amirbekian
Priority: Critical


The issue
In batch mode, VectorAssembler can take multiple columns of VectorType and 
assemble and output a new column of VectorType containing the concatenated 
vectors. In streaming mode, this transformation can fail because 
VectorAssembler does not have enough information to produce metadata 
(AttributeGroup) for the new column. Because VectorAssembler is such a 
ubiquitous part of mllib pipelines, this issue effectively means spark 
structured streaming does not support prediction using mllib pipelines.

I've created this ticket so we can discuss ways to potentially improve 
VectorAssembler. Please let me know if there are any issues I have not 
considered or potential fixes I haven't outlined. I'm happy to submit a patch 
once I know which strategy is the best approach.

Potential fixes
1) Replace VectorAssembler with an estimator/model pair like was recently done 
with OneHotEncoder, [#13030]. The Estimator can "learn" the size of the inputs 
vectors during training and save it to use during prediction.
Pros:
* Possibly simplest of the potential fixes
Cons:
* We'll need to deprecate current VectorAssembler

2) Drop the metadata (ML Attributes) from Vector columns. This is pretty major 
change, but it could be done in stages. We could first ensure that metadata is 
not used during prediction and allow the VectorAssembler to drop metadata for 
streaming dataframes. Going forward, it would be important to not use any 
metadata on Vector columns for any prediction tasks.
Pros:
* Potentially, easy short term fix for VectorAssembler
Cons:
* To fully remove ML Attributes would be a major refactor of MLlib and would 
most likely require breaking changes.
* A partial removal of ML attributes (e.g., ensure ML attributes are not used 
during transform, only during fit) might be tricky. This would require testing 
or other enforcement mechanisms to prevent regressions.

3) Require Vector columns to have fixed length vectors. Most mllib transformers 
that produce vectors already include the size of the vector in the column 
metadata. This change would be to deprecate APIs that allow creating a vector 
column of unknown length and replace those APIs with equivalents that enforce a 
fixed size.
Pros:
* We already treat vectors as fixed size, for example VectorAssembler assumes 
the input & output columns are fixed size vectors and creates metadata 
accordingly. In the spirit of explicit is better than implicit, we would be 
codifying something we already assume.
* This could potentially enable performance optimizations that are only 
possible if the Vector size of a column is fixed & known.
Cons:
* This would require breaking changes.






--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22232) Row objects in pyspark created using the `Row(**kwars)` syntax do not get serialized/deserialized properly

2017-10-13 Thread Bago Amirbekian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bago Amirbekian updated SPARK-22232:

Component/s: SQL

> Row objects in pyspark created using the `Row(**kwars)` syntax do not get 
> serialized/deserialized properly
> --
>
> Key: SPARK-22232
> URL: https://issues.apache.org/jira/browse/SPARK-22232
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>
> The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should 
> be accessed by field name, not by position because {{Row.__new__}} sorts the 
> fields alphabetically by name. It seems like this promise is not being 
> honored when these Row objects are shuffled. I've included an example to help 
> reproduce the issue.
> {code:none}
> from pyspark.sql.types import *
> from pyspark.sql import *
> def toRow(i):
>   return Row(a="a", c=3.0, b=2)
> schema = StructType([
>   # Putting fields in alphabetical order masks the issue
>   StructField("a", StringType(),  False),
>   StructField("c", FloatType(), False),
>   StructField("b", IntegerType(), False),
> ])
> rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))
> # As long as we don't shuffle things work fine.
> print rdd.toDF(schema).take(2)
> # If we introduce a shuffle we have issues
> print rdd.repartition(3).toDF(schema).take(2)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22232) Row objects in pyspark created using the `Row(**kwars)` syntax do not get serialized/deserialized properly

2017-10-09 Thread Bago Amirbekian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bago Amirbekian updated SPARK-22232:

Summary: Row objects in pyspark created using the `Row(**kwars)` syntax do 
not get serialized/deserialized properly  (was: Row objects in pyspark using 
the `Row(**kwars)` syntax do not get serialized/deserialized properly)

> Row objects in pyspark created using the `Row(**kwars)` syntax do not get 
> serialized/deserialized properly
> --
>
> Key: SPARK-22232
> URL: https://issues.apache.org/jira/browse/SPARK-22232
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>
> The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should 
> be accessed by field name, not by position because {{Row.__new__}} sorts the 
> fields alphabetically by name. It seems like this promise is not being 
> honored when these Row objects are shuffled. I've included an example to help 
> reproduce the issue.
> {code:none}
> from pyspark.sql.types import *
> from pyspark.sql import *
> def toRow(i):
>   return Row(a="a", c=3.0, b=2)
> schema = StructType([
>   # Putting fields in alphabetical order masks the issue
>   StructField("a", StringType(),  False),
>   StructField("c", FloatType(), False),
>   StructField("b", IntegerType(), False),
> ])
> rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))
> # As long as we don't shuffle things work fine.
> print rdd.toDF(schema).take(2)
> # If we introduce a shuffle we have issues
> print rdd.repartition(3).toDF(schema).take(2)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22232) Row objects in pyspark using the `Row(**kwars)` syntax do not get serialized/deserialized properly

2017-10-09 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16198068#comment-16198068
 ] 

Bago Amirbekian commented on SPARK-22232:
-

Full trace:

{code:none}
[Row(a=u'a', c=3.0, b=2), Row(a=u'a', c=3.0, b=2)]


---
Py4JJavaError Traceback (most recent call last)
 in ()
 17 
 18 # If we introduce a shuffle we have issues
---> 19 print rdd.repartition(3).toDF(schema).take(2)

/databricks/spark/python/pyspark/sql/dataframe.pyc in take(self, num)
475 [Row(age=2, name=u'Alice'), Row(age=5, name=u'Bob')]
476 """
--> 477 return self.limit(num).collect()
478 
479 @since(1.3)

/databricks/spark/python/pyspark/sql/dataframe.pyc in collect(self)
437 """
438 with SCCallSiteSync(self._sc) as css:
--> 439 port = self._jdf.collectToPython()
440 return list(_load_from_socket(port, 
BatchedSerializer(PickleSerializer(
441 

/databricks/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in 
__call__(self, *args)
   1131 answer = self.gateway_client.send_command(command)
   1132 return_value = get_return_value(
-> 1133 answer, self.gateway_client, self.target_id, self.name)
   1134 
   1135 for temp_arg in temp_args:

/databricks/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
 61 def deco(*a, **kw):
 62 try:
---> 63 return f(*a, **kw)
 64 except py4j.protocol.Py4JJavaError as e:
 65 s = e.java_exception.toString()

/databricks/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in 
get_return_value(answer, gateway_client, target_id, name)
317 raise Py4JJavaError(
318 "An error occurred while calling {0}{1}{2}.\n".
--> 319 format(target_id, ".", name), value)
320 else:
321 raise Py4JError(

Py4JJavaError: An error occurred while calling o204.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
stage 161.0 failed 4 times, most recent failure: Lost task 0.3 in stage 161.0 
(TID 433, 10.0.195.33, executor 0): 
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/worker.py", line 177, in main
process()
  File "/databricks/spark/python/pyspark/worker.py", line 172, in process
serializer.dump_stream(func(split_index, iterator), outfile)
  File "/databricks/spark/python/pyspark/serializers.py", line 285, in 
dump_stream
vs = list(itertools.islice(iterator, batch))
  File "/databricks/spark/python/pyspark/sql/session.py", line 520, in prepare
verify_func(obj, schema)
  File "/databricks/spark/python/pyspark/sql/types.py", line 1458, in 
_verify_type
_verify_type(v, f.dataType, f.nullable)
  File "/databricks/spark/python/pyspark/sql/types.py", line 1422, in 
_verify_type
raise TypeError("%s can not accept object %r in type %s" % (dataType, obj, 
type(obj)))
TypeError: FloatType can not accept object 2 in type <type 'int'>

at 
org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at 
org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at 
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.r

[jira] [Updated] (SPARK-22232) Row objects in pyspark using the `Row(**kwars)` syntax do not get serialized/deserialized properly

2017-10-09 Thread Bago Amirbekian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bago Amirbekian updated SPARK-22232:

Description: 
The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should be 
accessed by field name, not by position because {{Row.__new__}} sorts the 
fields alphabetically by name. It seems like this promise is not being honored 
when these Row objects are shuffled. I've included an example to help reproduce 
the issue.



{code:none}
from pyspark.sql.types import *
from pyspark.sql import *

def toRow(i):
  return Row(a="a", c=3.0, b=2)

schema = StructType([
  # Putting fields in alphabetical order masks the issue
  StructField("a", StringType(),  False),
  StructField("c", FloatType(), False),
  StructField("b", IntegerType(), False),
])

rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

# As long as we don't shuffle things work fine.
print rdd.toDF(schema).take(2)

# If we introduce a shuffle we have issues
print rdd.repartition(3).toDF(schema).take(2)
{code}
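
One hedged workaround sketch while the underlying serialization issue stands: keep the row field order and the schema order in agreement, either by declaring the StructFields alphabetically or, as below, by building rows positionally from a {{Row}} class whose field order matches the schema, so a shuffle cannot misalign values and types. Python 3 style is used and the names are illustrative.

{code:python}
from pyspark.sql import Row, SparkSession
from pyspark.sql.types import (StructType, StructField,
                               StringType, FloatType, IntegerType)

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

schema = StructType([
    StructField("a", StringType(), False),
    StructField("c", FloatType(), False),
    StructField("b", IntegerType(), False),
])

# Positional Row construction: field order is exactly as declared, not sorted,
# so it stays aligned with the schema even after a shuffle.
MyRow = Row("a", "c", "b")
rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: MyRow("a", 3.0, 2))

print(rdd.repartition(3).toDF(schema).take(2))
{code}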


  was:
The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should be 
accessed by field name, not by position because {{Row.__new__}} sorts the 
fields alphabetically by name. It seems like this promise is not being honored 
when these Row objects are shuffled. I've included an example to help reproduce 
the issue.



{code:none}
from pyspark.sql.types import *
from pyspark.sql import *

def toRow(i):
  return Row(a="a", c=3.0, b=2)

schema = StructType([
  StructField("a", StringType(),  False),
  StructField("c", FloatType(), False),
  StructField("b", IntegerType(), False),
])

rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

# As long as we don't shuffle things work fine.
print rdd.toDF(schema).take(2)

# If we introduce a shuffle we have issues
print rdd.repartition(3).toDF(schema).take(2)
{code}



> Row objects in pyspark using the `Row(**kwars)` syntax do not get 
> serialized/deserialized properly
> --
>
> Key: SPARK-22232
> URL: https://issues.apache.org/jira/browse/SPARK-22232
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>
> The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should 
> be accessed by field name, not by position because {{Row.__new__}} sorts the 
> fields alphabetically by name. It seems like this promise is not being 
> honored when these Row objects are shuffled. I've included an example to help 
> reproduce the issue.
> {code:none}
> from pyspark.sql.types import *
> from pyspark.sql import *
> def toRow(i):
>   return Row(a="a", c=3.0, b=2)
> schema = StructType([
>   # Putting fields in alphabetical order masks the issue
>   StructField("a", StringType(),  False),
>   StructField("c", FloatType(), False),
>   StructField("b", IntegerType(), False),
> ])
> rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))
> # As long as we don't shuffle things work fine.
> print rdd.toDF(schema).take(2)
> # If we introduce a shuffle we have issues
> print rdd.repartition(3).toDF(schema).take(2)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22232) Row objects in pyspark using the `Row(**kwars)` syntax do not get serialized/deserialized properly

2017-10-09 Thread Bago Amirbekian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bago Amirbekian updated SPARK-22232:

Description: 
The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should be 
accessed by field name, not by position because {{Row.__new__}} sorts the 
fields alphabetically by name. It seems like this promise is not being honored 
when these Row objects are shuffled. I've included an example to help reproduce 
the issue.



{code:python}
from pyspark.sql.types import *
from pyspark.sql import *

def toRow(i):
  return Row(a="a", c=3.0, b=2)

schema = StructType([
  StructField("a", StringType(),  False),
  StructField("c", FloatType(), False),
  StructField("b", IntegerType(), False),
])

rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

# As long as we don't shuffle things work fine.
print rdd.toDF(schema).take(2)

# If we introduce a shuffle we have issues
print rdd.repartition(3).toDF(schema).take(2)
{code}


  was:
The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should be 
accessed by field name, not by position because {{Row.__new__}} sorts the 
fields alphabetically by name. It seems like this promise is not being honored 
when these Row objects are shuffled. I've included an example to help reproduce 
the issue.



{{
from pyspark.sql.types import *
from pyspark.sql import *

def toRow(i):
  return Row(a="a", c=3.0, b=2)

schema = StructType([
  StructField("a", StringType(),  False),
  StructField("c", FloatType(), False),
  StructField("b", IntegerType(), False),
])

rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

# As long as we don't shuffle things work fine.
print rdd.toDF(schema).take(2)

# If we introduce a shuffle we have issues
print rdd.repartition(3).toDF(schema).take(2)
}}



> Row objects in pyspark using the `Row(**kwars)` syntax do not get 
> serialized/deserialized properly
> --
>
> Key: SPARK-22232
> URL: https://issues.apache.org/jira/browse/SPARK-22232
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>
> The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should 
> be accessed by field name, not by position because {{Row.__new__}} sorts the 
> fields alphabetically by name. It seems like this promise is not being 
> honored when these Row objects are shuffled. I've included an example to help 
> reproduce the issue.
> {code:python}
> from pyspark.sql.types import *
> from pyspark.sql import *
> def toRow(i):
>   return Row(a="a", c=3.0, b=2)
> schema = StructType([
>   StructField("a", StringType(),  False),
>   StructField("c", FloatType(), False),
>   StructField("b", IntegerType(), False),
> ])
> rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))
> # As long as we don't shuffle things work fine.
> print rdd.toDF(schema).take(2)
> # If we introduce a shuffle we have issues
> print rdd.repartition(3).toDF(schema).take(2)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22232) Row objects in pyspark using the `Row(**kwars)` syntax do not get serialized/deserialized properly

2017-10-09 Thread Bago Amirbekian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bago Amirbekian updated SPARK-22232:

Description: 
The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should be 
accessed by field name, not by position because {{Row.__new__}} sorts the 
fields alphabetically by name. It seems like this promise is not being honored 
when these Row objects are shuffled. I've included an example to help reproduce 
the issue.



{code:none}
from pyspark.sql.types import *
from pyspark.sql import *

def toRow(i):
  return Row(a="a", c=3.0, b=2)

schema = StructType([
  StructField("a", StringType(),  False),
  StructField("c", FloatType(), False),
  StructField("b", IntegerType(), False),
])

rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

# As long as we don't shuffle things work fine.
print rdd.toDF(schema).take(2)

# If we introduce a shuffle we have issues
print rdd.repartition(3).toDF(schema).take(2)
{code}


  was:
The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should be 
accessed by field name, not by position because {{Row.__new__}} sorts the 
fields alphabetically by name. It seems like this promise is not being honored 
when these Row objects are shuffled. I've included an example to help reproduce 
the issue.



{code:python}
from pyspark.sql.types import *
from pyspark.sql import *

def toRow(i):
  return Row(a="a", c=3.0, b=2)

schema = StructType([
  StructField("a", StringType(),  False),
  StructField("c", FloatType(), False),
  StructField("b", IntegerType(), False),
])

rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

# As long as we don't shuffle things work fine.
print rdd.toDF(schema).take(2)

# If we introduce a shuffle we have issues
print rdd.repartition(3).toDF(schema).take(2)
{code}



> Row objects in pyspark using the `Row(**kwars)` syntax do not get 
> serialized/deserialized properly
> --
>
> Key: SPARK-22232
> URL: https://issues.apache.org/jira/browse/SPARK-22232
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>
> The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should 
> be accessed by field name, not by position because {{Row.__new__}} sorts the 
> fields alphabetically by name. It seems like this promise is not being 
> honored when these Row objects are shuffled. I've included an example to help 
> reproduce the issue.
> {code:none}
> from pyspark.sql.types import *
> from pyspark.sql import *
> def toRow(i):
>   return Row(a="a", c=3.0, b=2)
> schema = StructType([
>   StructField("a", StringType(),  False),
>   StructField("c", FloatType(), False),
>   StructField("b", IntegerType(), False),
> ])
> rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))
> # As long as we don't shuffle things work fine.
> print rdd.toDF(schema).take(2)
> # If we introduce a shuffle we have issues
> print rdd.repartition(3).toDF(schema).take(2)
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22232) Row objects in pyspark using the `Row(**kwars)` syntax do not get serialized/deserialized properly

2017-10-09 Thread Bago Amirbekian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bago Amirbekian updated SPARK-22232:

Description: 
The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should be 
accessed by field name, not by position because {{Row.__new__}} sorts the 
fields alphabetically by name. It seems like this promise is not being honored 
when these Row objects are shuffled. I've included an example to help reproduce 
the issue.



{{
from pyspark.sql.types import *
from pyspark.sql import *

def toRow(i):
  return Row(a="a", c=3.0, b=2)

schema = StructType([
  StructField("a", StringType(),  False),
  StructField("c", FloatType(), False),
  StructField("b", IntegerType(), False),
])

rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

# As long as we don't shuffle things work fine.
print rdd.toDF(schema).take(2)

# If we introduce a shuffle we have issues
print rdd.repartition(3).toDF(schema).take(2)
}}


  was:
The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should be 
accessed by field name, not by position because `Row.__new__` sorts the fields 
alphabetically by name. It seems like this promise is not being honored when 
these Row objects are shuffled. I've included an example to help reproduce the 
issue.



{{
from pyspark.sql.types import *
from pyspark.sql import *

def toRow(i):
  return Row(a="a", c=3.0, b=2)

schema = StructType([
  StructField("a", StringType(),  False),
  StructField("c", FloatType(), False),
  StructField("b", IntegerType(), False),
])

rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

# As long as we don't shuffle things work fine.
print rdd.toDF(schema).take(2)

# If we introduce a shuffle we have issues
print rdd.repartition(3).toDF(schema).take(2)
}}



> Row objects in pyspark using the `Row(**kwars)` syntax do not get 
> serialized/deserialized properly
> --
>
> Key: SPARK-22232
> URL: https://issues.apache.org/jira/browse/SPARK-22232
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>
> The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should 
> be accessed by field name, not by position because {{Row.__new__}} sorts the 
> fields alphabetically by name. It seems like this promise is not being 
> honored when these Row objects are shuffled. I've included an example to help 
> reproduce the issue.
> {{
> from pyspark.sql.types import *
> from pyspark.sql import *
> def toRow(i):
>   return Row(a="a", c=3.0, b=2)
> schema = StructType([
>   StructField("a", StringType(),  False),
>   StructField("c", FloatType(), False),
>   StructField("b", IntegerType(), False),
> ])
> rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))
> # As long as we don't shuffle things work fine.
> print rdd.toDF(schema).take(2)
> # If we introduce a shuffle we have issues
> print rdd.repartition(3).toDF(schema).take(2)
> }}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22232) Row objects in pyspark using the `Row(**kwars)` syntax do not get serialized/deserialized properly

2017-10-09 Thread Bago Amirbekian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bago Amirbekian updated SPARK-22232:

Description: 
The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should be 
accessed by field name, not by position because `Row.__new__` sorts the fields 
alphabetically by name. It seems like this promise is not being honored when 
these Row objects are shuffled. I've included an example to help reproduce the 
issue.



{{
from pyspark.sql.types import *
from pyspark.sql import *

def toRow(i):
  return Row(a="a", c=3.0, b=2)

schema = StructType([
  StructField("a", StringType(),  False),
  StructField("c", FloatType(), False),
  StructField("b", IntegerType(), False),
])

rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

# As long as we don't shuffle things work fine.
print rdd.toDF(schema).take(2)

# If we introduce a shuffle we have issues
print rdd.repartition(3).toDF(schema).take(2)
}}


  was:
bq. The fields in a Row object created from a dict (ie `Row(**kwargs)`) should 
be accessed by field name, not by position because `Row.__new__` sorts the 
fields alphabetically by name. It seems like this promise is not being honored 
when these Row objects are shuffled. I've included an example to help reproduce 
the issue.



{{
from pyspark.sql.types import *
from pyspark.sql import *

def toRow(i):
  return Row(a="a", c=3.0, b=2)

schema = StructType([
  StructField("a", StringType(),  False),
  StructField("c", FloatType(), False),
  StructField("b", IntegerType(), False),
])

rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

# As long as we don't shuffle things work fine.
print rdd.toDF(schema).take(2)

# If we introduce a shuffle we have issues
print rdd.repartition(3).toDF(schema).take(2)
}}



> Row objects in pyspark using the `Row(**kwars)` syntax do not get 
> serialized/deserialized properly
> --
>
> Key: SPARK-22232
> URL: https://issues.apache.org/jira/browse/SPARK-22232
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>
> The fields in a Row object created from a dict (ie {{Row(**kwargs)}}) should 
> be accessed by field name, not by position because `Row.__new__` sorts the 
> fields alphabetically by name. It seems like this promise is not being 
> honored when these Row objects are shuffled. I've included an example to help 
> reproduce the issue.
> {{
> from pyspark.sql.types import *
> from pyspark.sql import *
> def toRow(i):
>   return Row(a="a", c=3.0, b=2)
> schema = StructType([
>   StructField("a", StringType(),  False),
>   StructField("c", FloatType(), False),
>   StructField("b", IntegerType(), False),
> ])
> rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))
> # As long as we don't shuffle things work fine.
> print rdd.toDF(schema).take(2)
> # If we introduce a shuffle we have issues
> print rdd.repartition(3).toDF(schema).take(2)
> }}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22232) Row objects in pyspark using the `Row(**kwars)` syntax do not get serialized/deserialized properly

2017-10-09 Thread Bago Amirbekian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bago Amirbekian updated SPARK-22232:

Description: 
The fields in a Row object created from a dict (ie `Row(**kwargs)`) should be 
accessed by field name, not by position because `Row.__new__` sorts the fields 
alphabetically by name. It seems like this promise is not being honored when 
these Row objects are shuffled. I've included an example to help reproduce the 
issue.



{{
from pyspark.sql.types import *
from pyspark.sql import *

def toRow(i):
  return Row(a="a", c=3.0, b=2)

schema = StructType([
  StructField("a", StringType(),  False),
  StructField("c", FloatType(), False),
  StructField("b", IntegerType(), False),
])

rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

# As long as we don't shuffle things work fine.
print rdd.toDF(schema).take(2)

# If we introduce a shuffle we have issues
print rdd.repartition(3).toDF(schema).take(2)
}}


  was:
The fields in a Row object created from a dict (ie `Row(**kwargs)`) should be 
accessed by field name, not by position because `Row.__new__` sorts the fields 
alphabetically by name. It seems like this promise is not being honored when 
these Row objects are shuffled. I've included an example to help reproduce the 
issue.



{{from pyspark.sql.types import *
from pyspark.sql import *

def toRow(i):
  return Row(a="a", c=3.0, b=2)

schema = StructType([
  StructField("a", StringType(),  False),
  StructField("c", FloatType(), False),
  StructField("b", IntegerType(), False),
])

rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

# As long as we don't shuffle things work fine.
print rdd.toDF(schema).take(2)

# If we introduce a shuffle we have issues
print rdd.repartition(3).toDF(schema).take(2)}}



> Row objects in pyspark using the `Row(**kwars)` syntax do not get 
> serialized/deserialized properly
> --
>
> Key: SPARK-22232
> URL: https://issues.apache.org/jira/browse/SPARK-22232
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>
> The fields in a Row object created from a dict (ie `Row(**kwargs)`) should be 
> accessed by field name, not by position because `Row.__new__` sorts the 
> fields alphabetically by name. It seems like this promise is not being 
> honored when these Row objects are shuffled. I've included an example to help 
> reproduce the issue.
> {{
> from pyspark.sql.types import *
> from pyspark.sql import *
> def toRow(i):
>   return Row(a="a", c=3.0, b=2)
> schema = StructType([
>   StructField("a", StringType(),  False),
>   StructField("c", FloatType(), False),
>   StructField("b", IntegerType(), False),
> ])
> rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))
> # As long as we don't shuffle things work fine.
> print rdd.toDF(schema).take(2)
> # If we introduce a shuffle we have issues
> print rdd.repartition(3).toDF(schema).take(2)
> }}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22232) Row objects in pyspark using the `Row(**kwars)` syntax do not get serialized/deserialized properly

2017-10-09 Thread Bago Amirbekian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bago Amirbekian updated SPARK-22232:

Description: 
bq. The fields in a Row object created from a dict (ie `Row(**kwargs)`) should 
be accessed by field name, not by position because `Row.__new__` sorts the 
fields alphabetically by name. It seems like this promise is not being honored 
when these Row objects are shuffled. I've included an example to help reproduce 
the issue.



{{
from pyspark.sql.types import *
from pyspark.sql import *

def toRow(i):
  return Row(a="a", c=3.0, b=2)

schema = StructType([
  StructField("a", StringType(),  False),
  StructField("c", FloatType(), False),
  StructField("b", IntegerType(), False),
])

rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

# As long as we don't shuffle things work fine.
print rdd.toDF(schema).take(2)

# If we introduce a shuffle we have issues
print rdd.repartition(3).toDF(schema).take(2)
}}


  was:
The fields in a Row object created from a dict (ie `Row(**kwargs)`) should be 
accessed by field name, not by position because `Row.__new__` sorts the fields 
alphabetically by name. It seems like this promise is not being honored when 
these Row objects are shuffled. I've included an example to help reproduce the 
issue.



{{
from pyspark.sql.types import *
from pyspark.sql import *

def toRow(i):
  return Row(a="a", c=3.0, b=2)

schema = StructType([
  StructField("a", StringType(),  False),
  StructField("c", FloatType(), False),
  StructField("b", IntegerType(), False),
])

rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

# As long as we don't shuffle things work fine.
print rdd.toDF(schema).take(2)

# If we introduce a shuffle we have issues
print rdd.repartition(3).toDF(schema).take(2)
}}



> Row objects in pyspark using the `Row(**kwars)` syntax do not get 
> serialized/deserialized properly
> --
>
> Key: SPARK-22232
> URL: https://issues.apache.org/jira/browse/SPARK-22232
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>
> bq. The fields in a Row object created from a dict (ie `Row(**kwargs)`) 
> should be accessed by field name, not by position because `Row.__new__` sorts 
> the fields alphabetically by name. It seems like this promise is not being 
> honored when these Row objects are shuffled. I've included an example to help 
> reproduce the issue.
> {{
> from pyspark.sql.types import *
> from pyspark.sql import *
> def toRow(i):
>   return Row(a="a", c=3.0, b=2)
> schema = StructType([
>   StructField("a", StringType(),  False),
>   StructField("c", FloatType(), False),
>   StructField("b", IntegerType(), False),
> ])
> rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))
> # As long as we don't shuffle things work fine.
> print rdd.toDF(schema).take(2)
> # If we introduce a shuffle we have issues
> print rdd.repartition(3).toDF(schema).take(2)
> }}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22232) Row objects in pyspark using the `Row(**kwars)` syntax do not get serialized/deserialized properly

2017-10-09 Thread Bago Amirbekian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bago Amirbekian updated SPARK-22232:

Description: 
The fields in a Row object created from a dict (ie `Row(**kwargs)`) should be 
accessed by field name, not by position because `Row.__new__` sorts the fields 
alphabetically by name. It seems like this promise is not being honored when 
these Row objects are shuffled. I've included an example to help reproduce the 
issue.



{{from pyspark.sql.types import *
from pyspark.sql import *

def toRow(i):
  return Row(a="a", c=3.0, b=2)

schema = StructType([
  StructField("a", StringType(),  False),
  StructField("c", FloatType(), False),
  StructField("b", IntegerType(), False),
])

rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

# As long as we don't shuffle things work fine.
print rdd.toDF(schema).take(2)

# If we introduce a shuffle we have issues
print rdd.repartition(3).toDF(schema).take(2)}}


  was:
The fields in a Row object created from a dict (ie `Row(**kwargs)`) should be 
accessed by field name, not by position because `Row.__new__` sorts the fields 
alphabetically by name. It seems like this promise is not being honored when 
these Row objects are shuffled. I've included an example to help reproduce the 
issue.



```
from pyspark.sql.types import *
from pyspark.sql import *

def toRow(i):
  return Row(a="a", c=3.0, b=2)

schema = StructType([
  StructField("a", StringType(),  False),
  StructField("c", FloatType(), False),
  StructField("b", IntegerType(), False),
])

rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

# As long as we don't shuffle things work fine.
print rdd.toDF(schema).take(2)

# If we introduce a shuffle we have issues
print rdd.repartition(3).toDF(schema).take(2)
```


> Row objects in pyspark using the `Row(**kwars)` syntax do not get 
> serialized/deserialized properly
> --
>
> Key: SPARK-22232
> URL: https://issues.apache.org/jira/browse/SPARK-22232
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>
> The fields in a Row object created from a dict (ie `Row(**kwargs)`) should be 
> accessed by field name, not by position because `Row.__new__` sorts the 
> fields alphabetically by name. It seems like this promise is not being 
> honored when these Row objects are shuffled. I've included an example to help 
> reproduce the issue.
> {{from pyspark.sql.types import *
> from pyspark.sql import *
> def toRow(i):
>   return Row(a="a", c=3.0, b=2)
> schema = StructType([
>   StructField("a", StringType(),  False),
>   StructField("c", FloatType(), False),
>   StructField("b", IntegerType(), False),
> ])
> rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))
> # As long as we don't shuffle things work fine.
> print rdd.toDF(schema).take(2)
> # If we introduce a shuffle we have issues
> print rdd.repartition(3).toDF(schema).take(2)}}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22232) Row objects in pyspark using the `Row(**kwars)` syntax do not get serialized/deserialized properly

2017-10-09 Thread Bago Amirbekian (JIRA)
Bago Amirbekian created SPARK-22232:
---

 Summary: Row objects in pyspark using the `Row(**kwars)` syntax do 
not get serialized/deserialized properly
 Key: SPARK-22232
 URL: https://issues.apache.org/jira/browse/SPARK-22232
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.2.0
Reporter: Bago Amirbekian


The fields in a Row object created from a dict (ie `Row(**kwargs)`) should be 
accessed by field name, not by position because `Row.__new__` sorts the fields 
alphabetically by name. It seems like this promise is not being honored when 
these Row objects are shuffled. I've included an example to help reproduce the 
issue.


```
from pyspark.sql.types import *
from pyspark.sql import *

def toRow(i):
  return Row(a="a", c=3.0, b=2)

schema = StructType([
  StructField("a", StringType(),  False),
  StructField("c", FloatType(), False),
  StructField("b", IntegerType(), False),
])

rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

# As long as we don't shuffle things work fine.
print rdd.toDF(schema).take(2)

# If we introduce a shuffle we have issues
print rdd.repartition(3).toDF(schema).take(2)
```



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22232) Row objects in pyspark using the `Row(**kwars)` syntax do not get serialized/deserialized properly

2017-10-09 Thread Bago Amirbekian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bago Amirbekian updated SPARK-22232:

Description: 
The fields in a Row object created from a dict (ie `Row(**kwargs)`) should be 
accessed by field name, not by position because `Row.__new__` sorts the fields 
alphabetically by name. It seems like this promise is not being honored when 
these Row objects are shuffled. I've included an example to help reproduce the 
issue.



```
from pyspark.sql.types import *
from pyspark.sql import *

def toRow(i):
  return Row(a="a", c=3.0, b=2)

schema = StructType([
  StructField("a", StringType(),  False),
  StructField("c", FloatType(), False),
  StructField("b", IntegerType(), False),
])

rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

# As long as we don't shuffle things work fine.
print rdd.toDF(schema).take(2)

# If we introduce a shuffle we have issues
print rdd.repartition(3).toDF(schema).take(2)
```

  was:
The fields in a Row object created from a dict (ie `Row(**kwargs)`) should be 
accessed by field name, not by position because `Row.__new__` sorts the fields 
alphabetically by name. It seems like this promise is not being honored when 
these Row objects are shuffled. I've included an example to help reproduce the 
issue.


```
from pyspark.sql.types import *
from pyspark.sql import *

def toRow(i):
  return Row(a="a", c=3.0, b=2)

schema = StructType([
  StructField("a", StringType(),  False),
  StructField("c", FloatType(), False),
  StructField("b", IntegerType(), False),
])

rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))

# As long as we don't shuffle things work fine.
print rdd.toDF(schema).take(2)

# If we introduce a shuffle we have issues
print rdd.repartition(3).toDF(schema).take(2)
```


> Row objects in pyspark using the `Row(**kwars)` syntax do not get 
> serialized/deserialized properly
> --
>
> Key: SPARK-22232
> URL: https://issues.apache.org/jira/browse/SPARK-22232
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>
> The fields in a Row object created from a dict (ie `Row(**kwargs)`) should be 
> accessed by field name, not by position because `Row.__new__` sorts the 
> fields alphabetically by name. It seems like this promise is not being 
> honored when these Row objects are shuffled. I've included an example to help 
> reproduce the issue.
> ```
> from pyspark.sql.types import *
> from pyspark.sql import *
> def toRow(i):
>   return Row(a="a", c=3.0, b=2)
> schema = StructType([
>   StructField("a", StringType(),  False),
>   StructField("c", FloatType(), False),
>   StructField("b", IntegerType(), False),
> ])
> rdd = sc.parallelize(range(10)).repartition(2).map(lambda i: toRow(i))
> # As long as we don't shuffle things work fine.
> print rdd.toDF(schema).take(2)
> # If we introduce a shuffle we have issues
> print rdd.repartition(3).toDF(schema).take(2)
> ```



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21926) Some transformers in spark.ml.feature fail when trying to transform streaming dataframes

2017-10-05 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16193392#comment-16193392
 ] 

Bago Amirbekian commented on SPARK-21926:
-

[~mslipper] The trickiest thing about 1 (b) is knowing how to test that it 
won't change behaviour. I'd like to run this past some folks with more MLlib 
experience to see if there are any obvious issues with this approach that we 
haven't considered.

> Some transformers in spark.ml.feature fail when trying to transform streaming 
> dataframes
> 
>
> Key: SPARK-21926
> URL: https://issues.apache.org/jira/browse/SPARK-21926
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>
> We've run into a few cases where ML components don't play nice with streaming 
> dataframes (for prediction). This ticket is meant to help aggregate these 
> known cases in one place and provide a place to discuss possible fixes.
> Failing cases:
> 1) VectorAssembler where one of the inputs is a VectorUDT column with no 
> metadata.
> Possible fixes:
> a) Re-design vectorUDT metadata to support missing metadata for some 
> elements. (This might be a good thing to do anyways SPARK-19141)
> b) drop metadata in streaming context.
> 2) OneHotEncoder where the input is a column with no metadata.
> Possible fixes:
> a) Make OneHotEncoder an estimator (SPARK-13030).
> b) Allow user to set the cardinality of OneHotEncoder.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13030) Change OneHotEncoder to Estimator

2017-10-05 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16193389#comment-16193389
 ] 

Bago Amirbekian commented on SPARK-13030:
-

Just so I'm clear, does multi-column in this context mean applying the 
one-hot-encoder to each column and then joining the resulting vectors?

How do you all feel about giving the new OneHotEncoder the same `handleInvalid` 
semantics as StringIndexer?
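
For the multi-column question, a hedged sketch of what per-column encoding followed by joining the resulting vectors looks like today with single-column stages; the proposed estimator would presumably fold the per-column loop and a StringIndexer-style {{handleInvalid}} param into one stage. Column names and the data are illustrative, and this is not a reference to any particular shipped API.

{code:python}
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("red", "s"), ("blue", "m"), ("red", "l")],
                           ["color", "size"])

# One indexer + encoder per column, then join the per-column vectors.
stages = []
for c in ["color", "size"]:
    stages.append(StringIndexer(inputCol=c, outputCol=c + "_idx", handleInvalid="keep"))
    stages.append(OneHotEncoder(inputCol=c + "_idx", outputCol=c + "_vec"))
stages.append(VectorAssembler(inputCols=["color_vec", "size_vec"], outputCol="features"))

model = Pipeline(stages=stages).fit(df)
model.transform(df).select("features").show(truncate=False)
{code}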

> Change OneHotEncoder to Estimator
> -
>
> Key: SPARK-13030
> URL: https://issues.apache.org/jira/browse/SPARK-13030
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.6.0
>Reporter: Wojciech Jurczyk
>
> OneHotEncoder should be an Estimator, just like in scikit-learn 
> (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html).
> In its current form, it is impossible to use when number of categories is 
> different between training dataset and test dataset.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21926) Some transformers in spark.ml.feature fail when trying to transform steaming dataframes

2017-09-05 Thread Bago Amirbekian (JIRA)
Bago Amirbekian created SPARK-21926:
---

 Summary: Some transformers in spark.ml.feature fail when trying to 
transform steaming dataframes
 Key: SPARK-21926
 URL: https://issues.apache.org/jira/browse/SPARK-21926
 Project: Spark
  Issue Type: Bug
  Components: ML, Structured Streaming
Affects Versions: 2.2.0
Reporter: Bago Amirbekian


We've run into a few cases where ML components don't play nice with streaming 
dataframes (for prediction). This ticket is meant to help aggregate these known 
cases in one place and provide a place to discuss possible fixes.

Failing cases:
1) VectorAssembler where one of the inputs is a VectorUDT column with no 
metadata.
Possible fixes:
a) Re-design vectorUDT metadata to support missing metadata for some elements. 
(This might be a good thing to do anyways SPARK-19141)
b) drop metadata in streaming context.

2) OneHotEncoder where the input is a column with no metadata.
Possible fixes:
a) Make OneHotEncoder an estimator (SPARK-13030).
b) Allow user to set the cardinality of OneHotEncoder.
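
A small batch-mode illustration of why case (1) bites; the mechanism described in the comments is an assumption drawn from observed behaviour, not a definitive explanation. The transformers look for ML attribute metadata on their input columns, and a column built directly from data (as a streaming source or a plain UDF would produce) carries none.

{code:python}
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import Vectors
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A vector column built directly from data carries no ML attribute metadata,
# which is also what a streaming source or an arbitrary UDF would give you.
batch = spark.createDataFrame([(1.0, Vectors.dense([2.0, 3.0]))], ["x", "v"])
print(batch.schema["v"].metadata)          # {}

# In batch mode VectorAssembler can fall back to inspecting the data, and the
# column it produces does carry attribute metadata under "ml_attr"; a streaming
# query has no such fallback, hence the failure described above.
assembled = VectorAssembler(inputCols=["x", "v"], outputCol="features").transform(batch)
print(assembled.schema["features"].metadata)
{code}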



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20862) LogisticRegressionModel throws TypeError

2017-05-23 Thread Bago Amirbekian (JIRA)
Bago Amirbekian created SPARK-20862:
---

 Summary: LogisticRegressionModel throws TypeError
 Key: SPARK-20862
 URL: https://issues.apache.org/jira/browse/SPARK-20862
 Project: Spark
  Issue Type: Bug
  Components: MLlib, PySpark
Affects Versions: 2.1.1
Reporter: Bago Amirbekian
Priority: Minor


LogisticRegressionModel throws a TypeError using python3 and numpy 1.12.1:

**
File "/Users/bago/repos/spark/python/pyspark/mllib/classification.py", line 
155, in __main__.LogisticRegressionModel
Failed example:
mcm = LogisticRegressionWithLBFGS.train(data, iterations=10, numClasses=3)
Exception raised:
Traceback (most recent call last):
  File 
"/usr/local/Cellar/python3/3.6.1/Frameworks/Python.framework/Versions/3.6/lib/python3.6/doctest.py",
 line 1330, in __run
compileflags, 1), test.globs)
  File "", line 1, in 
mcm = LogisticRegressionWithLBFGS.train(data, iterations=10, 
numClasses=3)
  File "/Users/bago/repos/spark/python/pyspark/mllib/classification.py", 
line 398, in train
return _regression_train_wrapper(train, LogisticRegressionModel, data, 
initialWeights)
  File "/Users/bago/repos/spark/python/pyspark/mllib/regression.py", line 
216, in _regression_train_wrapper
return modelClass(weights, intercept, numFeatures, numClasses)
  File "/Users/bago/repos/spark/python/pyspark/mllib/classification.py", 
line 176, in __init__
self._dataWithBiasSize)
TypeError: 'float' object cannot be interpreted as an integer
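
For context, a minimal standalone illustration of this class of failure (an assumption about the root cause, not the actual patch): under Python 3 and newer NumPy, {{/}} always returns a float, so a size computed with true division can no longer be passed where NumPy requires an integer, such as a reshape dimension. Integer division or an explicit {{int()}} avoids the error.

{code:python}
import numpy as np

weights = np.zeros(8)
num_classes = 3

size = weights.size / (num_classes - 1)      # 4.0 -- a float under Python 3
try:
    weights.reshape(size, 2)
except TypeError as e:
    print(e)                                 # 'float' object cannot be interpreted as an integer

size = weights.size // (num_classes - 1)     # 4 -- an int
print(weights.reshape(size, 2).shape)        # (4, 2)
{code}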




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20861) Pyspark CrossValidator & TrainValidationSplit should delegate parameter looping to estimators

2017-05-23 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16022137#comment-16022137
 ] 

Bago Amirbekian commented on SPARK-20861:
-

[~josephkb]

> Pyspark CrossValidator & TrainValidationSplit should delegate parameter 
> looping to estimators
> -
>
> Key: SPARK-20861
> URL: https://issues.apache.org/jira/browse/SPARK-20861
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.1.1
>Reporter: Bago Amirbekian
>Priority: Minor
>
> The CrossValidator & TrainValidationSplit should call estimator.fit with all 
> their parameter maps instead of passing params one by one to fit. This 
> behaviour would make PySpark more consistent with Scala Spark and allow 
> individual estimators to parallelize or optimize fitting over multiple 
> parameter maps.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-20861) Pyspark CrossValidator & TrainValidationSplit should delegate parameter looping to estimators

2017-05-23 Thread Bago Amirbekian (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bago Amirbekian updated SPARK-20861:

Comment: was deleted

(was: I've made a PR to address this issue: 
https://github.com/apache/spark/pull/18077.)

> Pyspark CrossValidator & TrainValidationSplit should delegate parameter 
> looping to estimators
> -
>
> Key: SPARK-20861
> URL: https://issues.apache.org/jira/browse/SPARK-20861
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.1.1
>Reporter: Bago Amirbekian
>Priority: Minor
>
> The CrossValidator & TrainValidationSplit should call estimator.fit with all 
> their parameter maps instead of passing params one by one to fit. This 
> behaviour would make PySpark more consistent with Scala Spark and allow 
> individual estimators to parallelize or optimize fitting over multiple 
> parameter maps.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20861) Pyspark CrossValidator & TrainValidationSplit should delegate parameter looping to estimators

2017-05-23 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16022134#comment-16022134
 ] 

Bago Amirbekian commented on SPARK-20861:
-

I've made a PR to address this issue: 
https://github.com/apache/spark/pull/18077.

> Pyspark CrossValidator & TrainValidationSplit should delegate parameter 
> looping to estimators
> -
>
> Key: SPARK-20861
> URL: https://issues.apache.org/jira/browse/SPARK-20861
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.1.1
>Reporter: Bago Amirbekian
>Priority: Minor
>
> The CrossValidator & TrainValidationSplit should call estimator.fit with all 
> their parameter maps instead of passing params one by one to fit. This 
> behaviour would make PySpark more consistent with Scala Spark and allow 
> individual estimators to parallelize or optimize fitting over multiple 
> parameter maps.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20861) Pyspark CrossValidator & TrainValidationSplit should delegate parameter looping to estimators

2017-05-23 Thread Bago Amirbekian (JIRA)
Bago Amirbekian created SPARK-20861:
---

 Summary: Pyspark CrossValidator & TrainValidationSplit should 
delegate parameter looping to estimators
 Key: SPARK-20861
 URL: https://issues.apache.org/jira/browse/SPARK-20861
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 2.1.1
Reporter: Bago Amirbekian
Priority: Minor


The CrossValidator & TrainValidationSplit should call estimator.fit with all 
their parameter maps instead of passing params one by one to fit. This 
behaviour would make PySpark more consistent with Scala Spark and allow 
individual estimators to parallelize or optimize fitting over multiple 
parameter maps.
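
A sketch contrasting the two call patterns (the data and parameter grid are 
illustrative). Estimator.fit already accepts a list of param maps and returns 
a list of models, which is what the tuners could delegate to:

{code:python}
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.tuning import ParamGridBuilder

spark = SparkSession.builder.getOrCreate()
train = spark.createDataFrame(
    [(0.0, Vectors.dense([0.0, 1.0])), (1.0, Vectors.dense([1.0, 0.0]))],
    ["label", "features"])

lr = LogisticRegression(maxIter=5)
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1, 1.0]).build()

# What the tuners effectively do today: loop in Python, one fit per ParamMap.
models_one_by_one = [lr.fit(train, params) for params in grid]

# What this ticket proposes they do instead: pass the whole list to fit(),
# which returns a list of models and leaves room for the estimator to
# parallelize or otherwise optimize fitting over multiple parameter maps.
models_all_at_once = lr.fit(train, grid)
{code}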



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20040) Python API for ml.stat.ChiSquareTest

2017-03-22 Thread Bago Amirbekian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20040?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15936902#comment-15936902
 ] 

Bago Amirbekian commented on SPARK-20040:
-

I'd like to work on this.

> Python API for ml.stat.ChiSquareTest
> 
>
> Key: SPARK-20040
> URL: https://issues.apache.org/jira/browse/SPARK-20040
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>
> Add PySpark wrapper for ChiSquareTest.  Note that it's currently called 
> ChiSquare, but I'm about to rename it to ChiSquareTest in [SPARK-20039]
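
A sketch of the intended Python usage, mirroring the Scala ml.stat API (the 
exact method and result-column names are assumptions until the wrapper lands):

{code:python}
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.stat import ChiSquareTest

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(0.0, Vectors.dense([0.5, 10.0])),
     (1.0, Vectors.dense([1.5, 20.0]))],
    ["label", "features"])

# Expected to return a single-row DataFrame with pValues, degreesOfFreedom
# and statistics columns, as on the Scala side.
result = ChiSquareTest.test(df, "features", "label")
result.show(truncate=False)
{code}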



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org