[jira] [Resolved] (SPARK-23758) MLlib 2.4 Roadmap

2019-07-19 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-23758.
---
Resolution: Done

> MLlib 2.4 Roadmap
> -
>
> Key: SPARK-23758
> URL: https://issues.apache.org/jira/browse/SPARK-23758
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> h1. Roadmap process
> This roadmap is a master list for MLlib improvements we are working on during 
> this release.  This includes ML-related changes in PySpark and SparkR.
> *What is planned for the next release?*
> * This roadmap lists issues which at least one Committer has prioritized.  
> See details below in "Instructions for committers."
> * This roadmap only lists larger or more critical issues.
> *How can contributors influence this roadmap?*
> * If you believe an issue should be in this roadmap, please discuss the issue 
> on JIRA and/or the dev mailing list.  Make sure to ping Committers since at 
> least one must agree to shepherd the issue.
> * For general discussions, use this JIRA or the dev mailing list.  For 
> specific issues, please comment on those issues or the mailing list.
> * Vote for & watch issues which are important to you.
> ** MLlib, sorted by: [Votes | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20votes%20DESC]
>  or [Watchers | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20Watchers%20DESC]
> ** SparkR, sorted by: [Votes | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(SparkR)%20ORDER%20BY%20votes%20DESC]
>  or [Watchers | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(SparkR)%20ORDER%20BY%20Watchers%20DESC]
> h2. Target Version and Priority
> This section describes the meaning of Target Version and Priority.
> || Category | Target Version | Priority | Shepherd | Put on roadmap? | In 
> next release? ||
> | [1 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Blocker%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)]
>  | next release | Blocker | *must* | *must* | *must* |
> | [2 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Critical%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)]
>  | next release | Critical | *must* | yes, unless small | *best effort* |
> | [3 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Major%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)]
>  | next release | Major | *must* | optional | *best effort* |
> | [4 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Minor%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)]
>  | next release | Minor | optional | no | maybe |
> | [5 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Trivial%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)]
>  | next release | Trivial | optional | no | maybe |
> | [6 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20"In%20Progress"%2C%20Reopened)%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20"Target%20Version%2Fs"%20in%20(EMPTY)%20AND%20Shepherd%20not%20in%20(EMPTY)%20ORDER%20BY%20priority%20DESC]
>  | (empty) | (any) | yes | no | maybe |
> | [7 | 
> 

[jira] [Updated] (SPARK-23758) MLlib 2.4 Roadmap

2019-07-19 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-23758:
--
Affects Version/s: (was: 3.0.0)
   2.4.0

> MLlib 2.4 Roadmap
> -
>
> Key: SPARK-23758
> URL: https://issues.apache.org/jira/browse/SPARK-23758
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> h1. Roadmap process
> This roadmap is a master list for MLlib improvements we are working on during 
> this release.  This includes ML-related changes in PySpark and SparkR.
> *What is planned for the next release?*
> * This roadmap lists issues which at least one Committer has prioritized.  
> See details below in "Instructions for committers."
> * This roadmap only lists larger or more critical issues.
> *How can contributors influence this roadmap?*
> * If you believe an issue should be in this roadmap, please discuss the issue 
> on JIRA and/or the dev mailing list.  Make sure to ping Committers since at 
> least one must agree to shepherd the issue.
> * For general discussions, use this JIRA or the dev mailing list.  For 
> specific issues, please comment on those issues or the mailing list.
> * Vote for & watch issues which are important to you.
> ** MLlib, sorted by: [Votes | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20votes%20DESC]
>  or [Watchers | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20Watchers%20DESC]
> ** SparkR, sorted by: [Votes | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(SparkR)%20ORDER%20BY%20votes%20DESC]
>  or [Watchers | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(SparkR)%20ORDER%20BY%20Watchers%20DESC]
> h2. Target Version and Priority
> This section describes the meaning of Target Version and Priority.
> || Category | Target Version | Priority | Shepherd | Put on roadmap? | In 
> next release? ||
> | [1 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Blocker%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)]
>  | next release | Blocker | *must* | *must* | *must* |
> | [2 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Critical%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)]
>  | next release | Critical | *must* | yes, unless small | *best effort* |
> | [3 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Major%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)]
>  | next release | Major | *must* | optional | *best effort* |
> | [4 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Minor%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)]
>  | next release | Minor | optional | no | maybe |
> | [5 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Trivial%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)]
>  | next release | Trivial | optional | no | maybe |
> | [6 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20"In%20Progress"%2C%20Reopened)%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20"Target%20Version%2Fs"%20in%20(EMPTY)%20AND%20Shepherd%20not%20in%20(EMPTY)%20ORDER%20BY%20priority%20DESC]
>  | (empty) | (any) | yes | no | maybe |
> | [7 | 
> 

[jira] [Commented] (SPARK-23758) MLlib 2.4 Roadmap

2019-07-19 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16889240#comment-16889240
 ] 

Joseph K. Bradley commented on SPARK-23758:
---

Ah sorry, we stopped using this.  I'll close it.

> MLlib 2.4 Roadmap
> -
>
> Key: SPARK-23758
> URL: https://issues.apache.org/jira/browse/SPARK-23758
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 3.0.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> h1. Roadmap process
> This roadmap is a master list for MLlib improvements we are working on during 
> this release.  This includes ML-related changes in PySpark and SparkR.
> *What is planned for the next release?*
> * This roadmap lists issues which at least one Committer has prioritized.  
> See details below in "Instructions for committers."
> * This roadmap only lists larger or more critical issues.
> *How can contributors influence this roadmap?*
> * If you believe an issue should be in this roadmap, please discuss the issue 
> on JIRA and/or the dev mailing list.  Make sure to ping Committers since at 
> least one must agree to shepherd the issue.
> * For general discussions, use this JIRA or the dev mailing list.  For 
> specific issues, please comment on those issues or the mailing list.
> * Vote for & watch issues which are important to you.
> ** MLlib, sorted by: [Votes | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20votes%20DESC]
>  or [Watchers | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20Watchers%20DESC]
> ** SparkR, sorted by: [Votes | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(SparkR)%20ORDER%20BY%20votes%20DESC]
>  or [Watchers | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(SparkR)%20ORDER%20BY%20Watchers%20DESC]
> h2. Target Version and Priority
> This section describes the meaning of Target Version and Priority.
> || Category | Target Version | Priority | Shepherd | Put on roadmap? | In 
> next release? ||
> | [1 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Blocker%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)]
>  | next release | Blocker | *must* | *must* | *must* |
> | [2 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Critical%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)]
>  | next release | Critical | *must* | yes, unless small | *best effort* |
> | [3 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Major%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)]
>  | next release | Major | *must* | optional | *best effort* |
> | [4 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Minor%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)]
>  | next release | Minor | optional | no | maybe |
> | [5 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20priority%20%3D%20Trivial%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20%22Target%20Version%2Fs%22%20in%20(2.4.0%2C%203.0.0)]
>  | next release | Trivial | optional | no | maybe |
> | [6 | 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20"In%20Progress"%2C%20Reopened)%20AND%20component%20in%20(GraphX%2C%20ML%2C%20MLlib%2C%20SparkR)%20AND%20"Target%20Version%2Fs"%20in%20(EMPTY)%20AND%20Shepherd%20not%20in%20(EMPTY)%20ORDER%20BY%20priority%20DESC]
>  | (empty) | (any) | yes | no | maybe |
> | [7 | 
> 

[jira] [Updated] (SPARK-26960) Reduce flakiness of Spark ML Listener test suite by waiting for listener bus to clear

2019-02-21 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-26960:
--
Description: 
[SPARK-23674] added SparkListeners for some spark.ml events, as well as a test 
suite.  I've observed flakiness in the test suite (though I unfortunately only 
have local logs for failures, not permalinks).  Looking at logs, here's my 
assessment, which I confirmed with [~zsxwing].

Test failure I saw:
{code}
[info] - pipeline read/write events *** FAILED *** (10 seconds, 253 
milliseconds)
[info]   The code passed to eventually never returned normally. Attempted 20 
times over 10.01409 seconds. Last failure message: Unexpected event thrown: 
SaveInstanceEnd(org.apache.spark.ml.util.DefaultParamsWriter@6daf89c2,/home/jenkins/workspace/mllib/target/tmp/spark-572952d8-5d2d-4a6c-bd4f-940bb0bbc3d5/pipeline/stages/0_writableStage).
 (MLEventsSuite.scala:190)
[info]   org.scalatest.exceptions.TestFailedDueToTimeoutException:
[info]   at 
org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:421)
[info]   at 
org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:439)
[info]   at org.apache.spark.ml.MLEventsSuite.eventually(MLEventsSuite.scala:39)
[info]   at 
org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:308)
[info]   at org.apache.spark.ml.MLEventsSuite.eventually(MLEventsSuite.scala:39)
[info]   at 
org.apache.spark.ml.MLEventsSuite$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(MLEventsSuite.scala:190)
[info]   at 
org.apache.spark.ml.MLEventsSuite$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(MLEventsSuite.scala:165)
[info]   at org.apache.spark.SparkFunSuite.withTempDir(SparkFunSuite.scala:314)
[info]   at 
org.apache.spark.ml.MLEventsSuite$$anonfun$1.apply$mcV$sp(MLEventsSuite.scala:165)
[info]   at 
org.apache.spark.ml.MLEventsSuite$$anonfun$1.apply(MLEventsSuite.scala:161)
[info]   at 
org.apache.spark.ml.MLEventsSuite$$anonfun$1.apply(MLEventsSuite.scala:161)
[info]   at 
org.apache.spark.SparkFunSuite$$anonfun$test$1.apply(SparkFunSuite.scala:284)
[info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
[info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:184)
[info]   at 
org.apache.spark.SparkFunSuite.runWithCredentials(SparkFunSuite.scala:301)
[info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:165)
[info]   at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183)
[info]   at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
[info]   at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
[info]   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196)
[info]   at 
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:64)
[info]   at 
org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:221)
[info]   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:64)
[info]   at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
[info]   at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
[info]   at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
[info]   at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
[info]   at scala.collection.immutable.List.foreach(List.scala:392)
[info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
[info]   at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
[info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
[info]   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229)
[info]   at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
[info]   at org.scalatest.Suite$class.run(Suite.scala:1147)
[info]   at 
org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
[info]   at 
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
[info]   at 
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
[info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
[info]   at org.scalatest.FunSuiteLike$class.run(FunSuiteLike.scala:233)
[info]   at 
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:64)
[info]   at 
org.scalatest.BeforeAndAfterAll$class.liftedTree1$1(BeforeAndAfterAll.scala:213)
[info]   at 

[jira] [Created] (SPARK-26960) Reduce flakiness of Spark ML Listener test suite by waiting for listener bus to clear

2019-02-21 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-26960:
-

 Summary: Reduce flakiness of Spark ML Listener test suite by 
waiting for listener bus to clear
 Key: SPARK-26960
 URL: https://issues.apache.org/jira/browse/SPARK-26960
 Project: Spark
  Issue Type: Improvement
  Components: ML, Tests
Affects Versions: 3.0.0
Reporter: Joseph K. Bradley
Assignee: Joseph K. Bradley


[SPARK-23674] added SparkListeners for some spark.ml events, as well as a test 
suite.  I've observed flakiness in the test suite (though I unfortunately only 
have local logs for failures, not permalinks).  Looking at logs, here's my 
assessment, which I confirmed with [~zsxwing].

Test failure I saw:
{code}
[info] - pipeline read/write events *** FAILED *** (10 seconds, 253 
milliseconds)
[info]   The code passed to eventually never returned normally. Attempted 20 
times over 10.01409 seconds. Last failure message: Unexpected event thrown: 
SaveInstanceEnd(org.apache.spark.ml.util.DefaultParamsWriter@6daf89c2,/home/jenkins/workspace/mllib/target/tmp/spark-572952d8-5d2d-4a6c-bd4f-940bb0bbc3d5/pipeline/stages/0_writableStage).
 (MLEventsSuite.scala:190)
[info]   org.scalatest.exceptions.TestFailedDueToTimeoutException:
[info]   at 
org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:421)
[info]   at 
org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:439)
[info]   at org.apache.spark.ml.MLEventsSuite.eventually(MLEventsSuite.scala:39)
[info]   at 
org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:308)
[info]   at org.apache.spark.ml.MLEventsSuite.eventually(MLEventsSuite.scala:39)
[info]   at 
org.apache.spark.ml.MLEventsSuite$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(MLEventsSuite.scala:190)
[info]   at 
org.apache.spark.ml.MLEventsSuite$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(MLEventsSuite.scala:165)
[info]   at org.apache.spark.SparkFunSuite.withTempDir(SparkFunSuite.scala:314)
[info]   at 
org.apache.spark.ml.MLEventsSuite$$anonfun$1.apply$mcV$sp(MLEventsSuite.scala:165)
[info]   at 
org.apache.spark.ml.MLEventsSuite$$anonfun$1.apply(MLEventsSuite.scala:161)
[info]   at 
org.apache.spark.ml.MLEventsSuite$$anonfun$1.apply(MLEventsSuite.scala:161)
[info]   at 
org.apache.spark.SparkFunSuite$$anonfun$test$1.apply(SparkFunSuite.scala:284)
[info]   at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
[info]   at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:184)
[info]   at 
org.apache.spark.SparkFunSuite.runWithCredentials(SparkFunSuite.scala:301)
[info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:165)
[info]   at 
org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:183)
[info]   at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
[info]   at 
org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:196)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:289)
[info]   at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:196)
[info]   at 
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:64)
[info]   at 
org.scalatest.BeforeAndAfterEach$class.runTest(BeforeAndAfterEach.scala:221)
[info]   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:64)
[info]   at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
[info]   at 
org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:229)
[info]   at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:396)
[info]   at 
org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:384)
[info]   at scala.collection.immutable.List.foreach(List.scala:392)
[info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:384)
[info]   at 
org.scalatest.SuperEngine.org$scalatest$SuperEngine$$runTestsInBranch(Engine.scala:379)
[info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:461)
[info]   at org.scalatest.FunSuiteLike$class.runTests(FunSuiteLike.scala:229)
[info]   at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
[info]   at org.scalatest.Suite$class.run(Suite.scala:1147)
[info]   at 
org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
[info]   at 
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
[info]   at 
org.scalatest.FunSuiteLike$$anonfun$run$1.apply(FunSuiteLike.scala:233)
[info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:521)
[info]   at 

[jira] [Updated] (SPARK-25994) SPIP: Property Graphs, Cypher Queries, and Algorithms

2018-11-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-25994:
--
Labels: SPIP  (was: )

> SPIP: Property Graphs, Cypher Queries, and Algorithms
> -
>
> Key: SPARK-25994
> URL: https://issues.apache.org/jira/browse/SPARK-25994
> Project: Spark
>  Issue Type: New Feature
>  Components: GraphX
>Affects Versions: 3.0.0
>Reporter: Xiangrui Meng
>Assignee: Martin Junghanns
>Priority: Major
>  Labels: SPIP
>
> Copied from the SPIP doc:
> {quote}
> GraphX was one of the foundational pillars of the Spark project, and is the 
> current graph component. This reflects the importance of the graph data 
> model, which naturally pairs with an important class of analytic functions: 
> network and graph algorithms. 
> However, GraphX is not actively maintained. It is based on RDDs, and cannot 
> exploit Spark 2’s Catalyst query engine. GraphX is only available to Scala 
> users.
> GraphFrames is a Spark package, which implements DataFrame-based graph 
> algorithms, and also incorporates simple graph pattern matching with fixed 
> length patterns (called “motifs”). GraphFrames is based on DataFrames, but 
> has a semantically weak graph data model (based on untyped edges and 
> vertices). The motif pattern matching facility is very limited by comparison 
> with the well-established Cypher language. 
> The Property Graph data model has become quite widespread in recent years, 
> and is the primary focus of commercial graph data management and of graph 
> data research, both for on-premises and cloud data management. Many users of 
> transactional graph databases also wish to work with immutable graphs in 
> Spark.
> The idea is to define a Cypher-compatible Property Graph type based on 
> DataFrames; to replace GraphFrames querying with Cypher; to reimplement 
> GraphX/GraphFrames algos on the PropertyGraph type. 
> To achieve this goal, a core subset of Cypher for Apache Spark (CAPS), 
> reusing existing proven designs and code, will be employed in Spark 3.0. This 
> graph query processor, like CAPS, will overlay and drive the SparkSQL 
> Catalyst query engine, using the CAPS graph query planner.
> {quote}
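
For readers unfamiliar with the GraphFrames "motifs" mentioned above, here is a minimal, illustrative PySpark sketch of fixed-length motif matching, i.e. the facility the SPIP proposes to supersede with Cypher. It assumes the external {{graphframes}} package is installed (it is not bundled with Spark); the data and column values are made up for illustration.

{code:python}
# Illustrative sketch only: GraphFrames fixed-length "motif" matching.
# Assumes the external `graphframes` package is available on the cluster.
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.getOrCreate()

# GraphFrames expects an "id" column on vertices and "src"/"dst" columns on edges.
vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
edges = spark.createDataFrame(
    [("a", "b", "follows"), ("b", "c", "follows")], ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)

# Fixed-length motif: two edges that share the middle vertex.
# In Cypher this would read roughly: MATCH (a)-[]->(b)-[]->(c) RETURN a, c
g.find("(a)-[e1]->(b); (b)-[e2]->(c)").show()
{code}

Cypher adds variable-length paths, richer predicates, and a typed property-graph model on top of this kind of fixed-length pattern, which is what the SPIP targets.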






[jira] [Updated] (SPARK-25324) ML 2.4 QA: API: Java compatibility, docs

2018-09-28 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-25324:
--
Fix Version/s: 2.4.0

> ML 2.4 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-25324
> URL: https://issues.apache.org/jira/browse/SPARK-25324
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Java API, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Blocker
> Fix For: 2.4.0
>
>
> Check Java compatibility for this release:
> * APIs in {{spark.ml}}
> * New APIs in {{spark.mllib}} (There should be few, if any.)
> Checking compatibility means:
> * Checking for differences in how Scala and Java handle types. Some items to 
> look out for are:
> ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
> *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
> ** Check Scala objects (especially with nesting!) carefully.  These may not 
> be understood in Java, or they may be accessible only via the weirdly named 
> Java types (with "$" or "#") which are generated by the Scala compiler.
> ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc.  (In {{spark.ml}}, we have largely tried to avoid 
> using enumerations, and have instead favored plain strings.)
> * Check for differences in generated Scala vs Java docs.  E.g., one past 
> issue was that Javadocs did not respect Scala's package private modifier.
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here as "requires".
> * Remember that we should not break APIs from previous releases.  If you find 
> a problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).
> * If needed for complex issues, create small Java unit tests which execute 
> each method.  (Algorithmic correctness can be checked in Scala.)
> Recommendations for how to complete this task:
> * There are no great tools for this.  In the past, this task has been done by:
> ** Generating API docs
> ** Building JAR and outputting the Java class signatures for MLlib
> ** Manually inspecting and searching the docs and class signatures for issues
> * If you do have ideas for better tooling, please share them so that we can 
> make this task easier in the future!






[jira] [Updated] (SPARK-25320) ML, Graph 2.4 QA: API: Binary incompatible changes

2018-09-28 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-25320:
--
Fix Version/s: 2.4.0

> ML, Graph 2.4 QA: API: Binary incompatible changes
> --
>
> Key: SPARK-25320
> URL: https://issues.apache.org/jira/browse/SPARK-25320
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Blocker
> Fix For: 2.4.0
>
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, look at the analogous task from the previous 
> release QA, and ping the Assignee for advice.






[jira] [Commented] (SPARK-25321) ML, Graph 2.4 QA: API: New Scala APIs, docs

2018-09-17 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16618118#comment-16618118
 ] 

Joseph K. Bradley commented on SPARK-25321:
---

[~WeichenXu123] Have you been able to look into reverting those changes, or 
have you discussed reverting them with [~mengxr]?  Thanks!

> ML, Graph 2.4 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-25321
> URL: https://issues.apache.org/jira/browse/SPARK-25321
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX. Take note of:
>  * Protected/public classes or methods. If access can be more private, then 
> it should be.
>  * Also look for non-sealed traits.
>  * Documentation: Missing? Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue. 
> For *user guide issues*, link the new JIRAs to the relevant user guide QA issue.






[jira] [Commented] (SPARK-25321) ML, Graph 2.4 QA: API: New Scala APIs, docs

2018-09-12 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16612436#comment-16612436
 ] 

Joseph K. Bradley commented on SPARK-25321:
---

You're right; these are breaking changes.  If we're sticking with the rules, 
then we should revert these in branch-2.4, but we could keep them in master if 
the next release is 3.0.  Is it easy to revert these PRs, or have they 
accumulated merge conflicts by now?

> ML, Graph 2.4 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-25321
> URL: https://issues.apache.org/jira/browse/SPARK-25321
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.4.0
>Reporter: Weichen Xu
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX. Take note of:
>  * Protected/public classes or methods. If access can be more private, then 
> it should be.
>  * Also look for non-sealed traits.
>  * Documentation: Missing? Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue. 
> For *user guide issues*, link the new JIRAs to the relevant user guide QA issue.






[jira] [Commented] (SPARK-25397) SparkSession.conf fails when given default value with Python 3

2018-09-10 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25397?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16609593#comment-16609593
 ] 

Joseph K. Bradley commented on SPARK-25397:
---

CC [~smilegator], [~cloud_fan] for visibility

> SparkSession.conf fails when given default value with Python 3
> --
>
> Key: SPARK-25397
> URL: https://issues.apache.org/jira/browse/SPARK-25397
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Joseph K. Bradley
>Priority: Major
>
> Spark 2.3.1 has a Python 3 incompatibility when requesting a conf value from 
> SparkSession with a non-string default value.  Reproduce via the following 
> SparkSession call:
> {{spark.conf.get("myConf", False)}}
> This gives the error:
> {code}
> >>> spark.conf.get("myConf", False)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File 
> "/Users/josephkb/work/spark-bin/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/conf.py",
>  line 51, in get
> self._checkType(default, "default")
>   File 
> "/Users/josephkb/work/spark-bin/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/conf.py",
>  line 62, in _checkType
> if not isinstance(obj, str) and not isinstance(obj, unicode):
> *NameError: name 'unicode' is not defined*
> {code}
> The offending line in branch-2.3 is in 
> https://github.com/apache/spark/blob/branch-2.3/python/pyspark/sql/conf.py 
> and uses the name {{unicode}}, which does not exist in Python 3.






[jira] [Updated] (SPARK-25397) SparkSession.conf fails when given default value with Python 3

2018-09-10 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25397?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-25397:
--
Priority: Minor  (was: Major)

> SparkSession.conf fails when given default value with Python 3
> --
>
> Key: SPARK-25397
> URL: https://issues.apache.org/jira/browse/SPARK-25397
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Spark 2.3.1 has a Python 3 incompatibility when requesting a conf value from 
> SparkSession with a non-string default value.  Reproduce via the following 
> SparkSession call:
> {{spark.conf.get("myConf", False)}}
> This gives the error:
> {code}
> >>> spark.conf.get("myConf", False)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File 
> "/Users/josephkb/work/spark-bin/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/conf.py",
>  line 51, in get
> self._checkType(default, "default")
>   File 
> "/Users/josephkb/work/spark-bin/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/conf.py",
>  line 62, in _checkType
> if not isinstance(obj, str) and not isinstance(obj, unicode):
> *NameError: name 'unicode' is not defined*
> {code}
> The offending line in branch-2.3 is in 
> https://github.com/apache/spark/blob/branch-2.3/python/pyspark/sql/conf.py 
> and uses the name {{unicode}}, which does not exist in Python 3.






[jira] [Created] (SPARK-25397) SparkSession.conf fails when given default value with Python 3

2018-09-10 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-25397:
-

 Summary: SparkSession.conf fails when given default value with 
Python 3
 Key: SPARK-25397
 URL: https://issues.apache.org/jira/browse/SPARK-25397
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.1
Reporter: Joseph K. Bradley


Spark 2.3.1 has a Python 3 incompatibility when requesting a conf value from 
SparkSession with a non-string default value.  Reproduce via the following 
SparkSession call:
{{spark.conf.get("myConf", False)}}

This gives the error:
{code}
>>> spark.conf.get("myConf", False)
Traceback (most recent call last):
  File "", line 1, in 
  File 
"/Users/josephkb/work/spark-bin/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/conf.py",
 line 51, in get
self._checkType(default, "default")
  File 
"/Users/josephkb/work/spark-bin/spark-2.3.1-bin-hadoop2.7/python/pyspark/sql/conf.py",
 line 62, in _checkType
if not isinstance(obj, str) and not isinstance(obj, unicode):
*NameError: name 'unicode' is not defined*
{code}

The offending line in branch-2.3 is in 
https://github.com/apache/spark/blob/branch-2.3/python/pyspark/sql/conf.py 
and uses the name {{unicode}}, which does not exist in Python 3.
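
A minimal sketch of one way to make that type check version-agnostic. This is illustrative only, not necessarily the patch that was merged; the helper name {{_check_type}} is made up here and only mirrors the {{_checkType}} helper shown in the traceback.

{code:python}
# Illustrative sketch: avoid referencing `unicode`, which only exists in Python 2.
import sys

if sys.version_info[0] >= 3:
    _string_types = (str,)           # Python 3 has a single text type
else:
    _string_types = (str, unicode)   # Python 2: byte strings and unicode text


def _check_type(obj, identifier):
    """Raise TypeError if obj is not a string; Python-2/3 compatible."""
    if not isinstance(obj, _string_types):
        raise TypeError("expected %s '%s' to be a string" % (identifier, obj))


# With this check, spark.conf.get("myConf", False) would fail with a clear
# TypeError on both Python 2 and Python 3 instead of a NameError.
{code}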






[jira] [Resolved] (SPARK-25268) runParallelPersonalizedPageRank throws serialization Exception

2018-09-06 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-25268.
---
   Resolution: Fixed
Fix Version/s: 2.4.0
   3.0.0

Issue resolved by pull request 22271
[https://github.com/apache/spark/pull/22271]

> runParallelPersonalizedPageRank throws serialization Exception
> --
>
> Key: SPARK-25268
> URL: https://issues.apache.org/jira/browse/SPARK-25268
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.4.0
>Reporter: Bago Amirbekian
>Assignee: shahid
>Priority: Critical
> Fix For: 3.0.0, 2.4.0
>
>
> A recent change to PageRank introduced a bug in the 
> ParallelPersonalizedPageRank implementation. The change prevents 
> serialization of a Map which needs to be broadcast to all workers. The issue 
> is in this line here: 
> [https://github.com/apache/spark/blob/6c5cb85856235efd464b109558896f81ae2c4c75/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala#L201]
> Because the GraphX unit tests are run in local mode, the serialization issue is 
> not caught.
>  
> {code:java}
> [info] - Star example parallel personalized PageRank *** FAILED *** (2 
> seconds, 160 milliseconds)
> [info] java.io.NotSerializableException: 
> scala.collection.immutable.MapLike$$anon$2
> [info] Serialization stack:
> [info] - object not serializable (class: 
> scala.collection.immutable.MapLike$$anon$2, value: Map(1 -> 
> SparseVector(3)((0,1.0)), 2 -> SparseVector(3)((1,1.0)), 3 -> 
> SparseVector(3)((2,1.0
> [info] at 
> org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
> [info] at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
> [info] at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:291)
> [info] at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:291)
> [info] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1348)
> [info] at 
> org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:292)
> [info] at 
> org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:127)
> [info] at 
> org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:88)
> [info] at 
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
> [info] at 
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
> [info] at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1489)
> [info] at 
> org.apache.spark.graphx.lib.PageRank$.runParallelPersonalizedPageRank(PageRank.scala:205)
> [info] at 
> org.apache.spark.graphx.lib.GraphXHelpers$.runParallelPersonalizedPageRank(GraphXHelpers.scala:31)
> [info] at 
> org.graphframes.lib.ParallelPersonalizedPageRank$.run(ParallelPersonalizedPageRank.scala:115)
> [info] at 
> org.graphframes.lib.ParallelPersonalizedPageRank.run(ParallelPersonalizedPageRank.scala:84)
> [info] at 
> org.graphframes.lib.ParallelPersonalizedPageRankSuite$$anonfun$2.apply$mcV$sp(ParallelPersonalizedPageRankSuite.scala:62)
> [info] at 
> org.graphframes.lib.ParallelPersonalizedPageRankSuite$$anonfun$2.apply(ParallelPersonalizedPageRankSuite.scala:51)
> [info] at 
> org.graphframes.lib.ParallelPersonalizedPageRankSuite$$anonfun$2.apply(ParallelPersonalizedPageRankSuite.scala:51)
> [info] at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
> [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
> [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info] at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info] at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
> [info] at org.graphframes.SparkFunSuite.withFixture(SparkFunSuite.scala:40)
> [info] at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
> [info] at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
> [info] at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
> [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
> [info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
> [info] at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
> [info] at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
> [info] at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
> [info] at 
> 

[jira] [Assigned] (SPARK-25268) runParallelPersonalizedPageRank throws serialization Exception

2018-09-05 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-25268:
-

Assignee: shahid

> runParallelPersonalizedPageRank throws serialization Exception
> --
>
> Key: SPARK-25268
> URL: https://issues.apache.org/jira/browse/SPARK-25268
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.4.0
>Reporter: Bago Amirbekian
>Assignee: shahid
>Priority: Critical
>
> A recent change to PageRank introduced a bug in the 
> ParallelPersonalizedPageRank implementation. The change prevents 
> serialization of a Map which needs to be broadcast to all workers. The issue 
> is in this line here: 
> [https://github.com/apache/spark/blob/6c5cb85856235efd464b109558896f81ae2c4c75/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala#L201]
> Because the GraphX unit tests are run in local mode, the serialization issue is 
> not caught.
>  
> {code:java}
> [info] - Star example parallel personalized PageRank *** FAILED *** (2 
> seconds, 160 milliseconds)
> [info] java.io.NotSerializableException: 
> scala.collection.immutable.MapLike$$anon$2
> [info] Serialization stack:
> [info] - object not serializable (class: 
> scala.collection.immutable.MapLike$$anon$2, value: Map(1 -> 
> SparseVector(3)((0,1.0)), 2 -> SparseVector(3)((1,1.0)), 3 -> 
> SparseVector(3)((2,1.0
> [info] at 
> org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
> [info] at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
> [info] at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:291)
> [info] at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:291)
> [info] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1348)
> [info] at 
> org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:292)
> [info] at 
> org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:127)
> [info] at 
> org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:88)
> [info] at 
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
> [info] at 
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
> [info] at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1489)
> [info] at 
> org.apache.spark.graphx.lib.PageRank$.runParallelPersonalizedPageRank(PageRank.scala:205)
> [info] at 
> org.apache.spark.graphx.lib.GraphXHelpers$.runParallelPersonalizedPageRank(GraphXHelpers.scala:31)
> [info] at 
> org.graphframes.lib.ParallelPersonalizedPageRank$.run(ParallelPersonalizedPageRank.scala:115)
> [info] at 
> org.graphframes.lib.ParallelPersonalizedPageRank.run(ParallelPersonalizedPageRank.scala:84)
> [info] at 
> org.graphframes.lib.ParallelPersonalizedPageRankSuite$$anonfun$2.apply$mcV$sp(ParallelPersonalizedPageRankSuite.scala:62)
> [info] at 
> org.graphframes.lib.ParallelPersonalizedPageRankSuite$$anonfun$2.apply(ParallelPersonalizedPageRankSuite.scala:51)
> [info] at 
> org.graphframes.lib.ParallelPersonalizedPageRankSuite$$anonfun$2.apply(ParallelPersonalizedPageRankSuite.scala:51)
> [info] at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
> [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
> [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info] at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info] at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
> [info] at org.graphframes.SparkFunSuite.withFixture(SparkFunSuite.scala:40)
> [info] at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
> [info] at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
> [info] at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
> [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
> [info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
> [info] at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
> [info] at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
> [info] at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
> [info] at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
> [info] at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
> [info] at 

[jira] [Updated] (SPARK-25268) runParallelPersonalizedPageRank throws serialization Exception

2018-09-05 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-25268:
--
Shepherd: Joseph K. Bradley

> runParallelPersonalizedPageRank throws serialization Exception
> --
>
> Key: SPARK-25268
> URL: https://issues.apache.org/jira/browse/SPARK-25268
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.4.0
>Reporter: Bago Amirbekian
>Priority: Critical
>
> A recent change to PageRank introduced a bug in the 
> ParallelPersonalizedPageRank implementation. The change prevents 
> serialization of a Map which needs to be broadcast to all workers. The issue 
> is in this line here: 
> [https://github.com/apache/spark/blob/6c5cb85856235efd464b109558896f81ae2c4c75/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala#L201]
> Because the GraphX unit tests are run in local mode, the serialization issue is 
> not caught.
>  
> {code:java}
> [info] - Star example parallel personalized PageRank *** FAILED *** (2 
> seconds, 160 milliseconds)
> [info] java.io.NotSerializableException: 
> scala.collection.immutable.MapLike$$anon$2
> [info] Serialization stack:
> [info] - object not serializable (class: 
> scala.collection.immutable.MapLike$$anon$2, value: Map(1 -> 
> SparseVector(3)((0,1.0)), 2 -> SparseVector(3)((1,1.0)), 3 -> 
> SparseVector(3)((2,1.0
> [info] at 
> org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
> [info] at 
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
> [info] at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:291)
> [info] at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$blockifyObject$2.apply(TorrentBroadcast.scala:291)
> [info] at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1348)
> [info] at 
> org.apache.spark.broadcast.TorrentBroadcast$.blockifyObject(TorrentBroadcast.scala:292)
> [info] at 
> org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:127)
> [info] at 
> org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:88)
> [info] at 
> org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
> [info] at 
> org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
> [info] at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1489)
> [info] at 
> org.apache.spark.graphx.lib.PageRank$.runParallelPersonalizedPageRank(PageRank.scala:205)
> [info] at 
> org.apache.spark.graphx.lib.GraphXHelpers$.runParallelPersonalizedPageRank(GraphXHelpers.scala:31)
> [info] at 
> org.graphframes.lib.ParallelPersonalizedPageRank$.run(ParallelPersonalizedPageRank.scala:115)
> [info] at 
> org.graphframes.lib.ParallelPersonalizedPageRank.run(ParallelPersonalizedPageRank.scala:84)
> [info] at 
> org.graphframes.lib.ParallelPersonalizedPageRankSuite$$anonfun$2.apply$mcV$sp(ParallelPersonalizedPageRankSuite.scala:62)
> [info] at 
> org.graphframes.lib.ParallelPersonalizedPageRankSuite$$anonfun$2.apply(ParallelPersonalizedPageRankSuite.scala:51)
> [info] at 
> org.graphframes.lib.ParallelPersonalizedPageRankSuite$$anonfun$2.apply(ParallelPersonalizedPageRankSuite.scala:51)
> [info] at 
> org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
> [info] at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
> [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
> [info] at org.scalatest.Transformer.apply(Transformer.scala:22)
> [info] at org.scalatest.Transformer.apply(Transformer.scala:20)
> [info] at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
> [info] at org.graphframes.SparkFunSuite.withFixture(SparkFunSuite.scala:40)
> [info] at 
> org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
> [info] at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
> [info] at 
> org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
> [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
> [info] at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
> [info] at org.scalatest.FunSuite.runTest(FunSuite.scala:1555)
> [info] at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
> [info] at 
> org.scalatest.FunSuiteLike$$anonfun$runTests$1.apply(FunSuiteLike.scala:208)
> [info] at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:413)
> [info] at 
> org.scalatest.SuperEngine$$anonfun$traverseSubNodes$1$1.apply(Engine.scala:401)
> [info] at scala.collection.immutable.List.foreach(List.scala:383)
> [info] at 

[jira] [Resolved] (SPARK-25124) VectorSizeHint.size is buggy, breaking streaming pipeline

2018-08-24 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-25124.
---
   Resolution: Fixed
Fix Version/s: 2.3.2

Issue resolved by pull request 8
[https://github.com/apache/spark/pull/8]

> VectorSizeHint.size is buggy, breaking streaming pipeline
> -
>
> Key: SPARK-25124
> URL: https://issues.apache.org/jira/browse/SPARK-25124
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Timothy Hunter
>Assignee: Huaxin Gao
>Priority: Major
>  Labels: beginner, starter
> Fix For: 2.4.0, 2.3.2
>
>
> Currently, when using {{VectorSizeHint().setSize(3)}} in an ML pipeline, 
> transforming a stream raises a nondescript exception about the stream not 
> having started. At the core are the following bugs: {{setSize}} and {{getSize}} 
> do not explicitly {{return}} values, so they return {{None}}:
> https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py#L3846
> How to reproduce, using the example in the doc:
> {code}
> from pyspark.ml.linalg import Vectors
> from pyspark.ml import Pipeline, PipelineModel
> from pyspark.ml.feature import VectorAssembler, VectorSizeHint
> data = [(Vectors.dense([1., 2., 3.]), 4.)]
> df = spark.createDataFrame(data, ["vector", "float"])
> sizeHint = VectorSizeHint(inputCol="vector", handleInvalid="skip").setSize(3) 
> # Will fail
> vecAssembler = VectorAssembler(inputCols=["vector", "float"], 
> outputCol="assembled")
> pipeline = Pipeline(stages=[sizeHint, vecAssembler])
> pipelineModel = pipeline.fit(df)
> pipelineModel.transform(df).head().assembled
> {code}
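
A simplified sketch of the kind of fix the description implies: the setter and getter must return values instead of falling through and returning {{None}}. It follows the standard pyspark {{Params}} conventions ({{_set}} / {{getOrDefault}}); the class name {{SizeHintSketch}} is made up for illustration, and the real {{VectorSizeHint}} carries more machinery (input column and {{handleInvalid}} params, etc.), so this is not the exact patch that was merged.

{code:python}
# Illustrative, simplified sketch only.
from pyspark.ml.param import Param, Params, TypeConverters


class SizeHintSketch(Params):
    """Minimal stand-in for VectorSizeHint showing a setter/getter that return values."""

    size = Param(Params._dummy(), "size", "size of vectors in the column",
                 typeConverter=TypeConverters.toInt)

    def __init__(self):
        super(SizeHintSketch, self).__init__()

    def setSize(self, value):
        # Returning the result of _set (i.e. self) is what makes
        # VectorSizeHint(...).setSize(3) usable inside a pipeline definition.
        return self._set(size=value)

    def getSize(self):
        # Return the stored value instead of implicitly returning None.
        return self.getOrDefault(self.size)


# SizeHintSketch().setSize(3).getSize() evaluates to 3 rather than None.
{code}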






[jira] [Updated] (SPARK-25124) VectorSizeHint.size is buggy, breaking streaming pipeline

2018-08-23 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-25124:
--
Target Version/s: 2.3.2

> VectorSizeHint.size is buggy, breaking streaming pipeline
> -
>
> Key: SPARK-25124
> URL: https://issues.apache.org/jira/browse/SPARK-25124
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Timothy Hunter
>Assignee: Huaxin Gao
>Priority: Major
>  Labels: beginner, starter
> Fix For: 2.4.0
>
>
> Currently, when using {{VectorSizeHint().setSize(3)}} in an ML pipeline, 
> transforming a stream raises a nondescript exception about the stream not 
> having started. At the core are the following bugs: {{setSize}} and {{getSize}} 
> do not explicitly {{return}} values, so they return {{None}}:
> https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py#L3846
> How to reproduce, using the example in the doc:
> {code}
> from pyspark.ml.linalg import Vectors
> from pyspark.ml import Pipeline, PipelineModel
> from pyspark.ml.feature import VectorAssembler, VectorSizeHint
> data = [(Vectors.dense([1., 2., 3.]), 4.)]
> df = spark.createDataFrame(data, ["vector", "float"])
> sizeHint = VectorSizeHint(inputCol="vector", handleInvalid="skip").setSize(3) 
> # Will fail
> vecAssembler = VectorAssembler(inputCols=["vector", "float"], 
> outputCol="assembled")
> pipeline = Pipeline(stages=[sizeHint, vecAssembler])
> pipelineModel = pipeline.fit(df)
> pipelineModel.transform(df).head().assembled
> {code}






[jira] [Commented] (SPARK-25124) VectorSizeHint.size is buggy, breaking streaming pipeline

2018-08-23 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-25124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16590919#comment-16590919
 ] 

Joseph K. Bradley commented on SPARK-25124:
---

I merged https://github.com/apache/spark/pull/22136 into master for target 
2.4.0.  I'll leave this open until we backport it to branch-2.3

> VectorSizeHint.size is buggy, breaking streaming pipeline
> -
>
> Key: SPARK-25124
> URL: https://issues.apache.org/jira/browse/SPARK-25124
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Timothy Hunter
>Assignee: Huaxin Gao
>Priority: Major
>  Labels: beginner, starter
> Fix For: 2.4.0
>
>
> Currently, when using {{VectorSizeHint().setSize(3)}} in an ML pipeline, 
> transforming a stream fails with a nondescript exception saying the stream 
> has not been started. The root cause is that {{setSize}} and {{getSize}} do 
> not {{return}} values; they return {{None}}:
> https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py#L3846
> How to reproduce, using the example in the doc:
> {code}
> from pyspark.ml.linalg import Vectors
> from pyspark.ml import Pipeline, PipelineModel
> from pyspark.ml.feature import VectorAssembler, VectorSizeHint
> data = [(Vectors.dense([1., 2., 3.]), 4.)]
> df = spark.createDataFrame(data, ["vector", "float"])
> sizeHint = VectorSizeHint(inputCol="vector", handleInvalid="skip").setSize(3) 
> # Will fail
> vecAssembler = VectorAssembler(inputCols=["vector", "float"], 
> outputCol="assembled")
> pipeline = Pipeline(stages=[sizeHint, vecAssembler])
> pipelineModel = pipeline.fit(df)
> pipelineModel.transform(df).head().assembled
> {code}






[jira] [Updated] (SPARK-25124) VectorSizeHint.size is buggy, breaking streaming pipeline

2018-08-23 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-25124:
--
Target Version/s: 2.3.2, 2.4.0  (was: 2.3.2)

> VectorSizeHint.size is buggy, breaking streaming pipeline
> -
>
> Key: SPARK-25124
> URL: https://issues.apache.org/jira/browse/SPARK-25124
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Timothy Hunter
>Assignee: Huaxin Gao
>Priority: Major
>  Labels: beginner, starter
> Fix For: 2.4.0
>
>
> Currently, when using {{VectorSizeHint().setSize(3)}} in an ML pipeline, 
> transforming a stream fails with a nondescript exception saying the stream 
> has not been started. The root cause is that {{setSize}} and {{getSize}} do 
> not {{return}} values; they return {{None}}:
> https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py#L3846
> How to reproduce, using the example in the doc:
> {code}
> from pyspark.ml.linalg import Vectors
> from pyspark.ml import Pipeline, PipelineModel
> from pyspark.ml.feature import VectorAssembler, VectorSizeHint
> data = [(Vectors.dense([1., 2., 3.]), 4.)]
> df = spark.createDataFrame(data, ["vector", "float"])
> sizeHint = VectorSizeHint(inputCol="vector", handleInvalid="skip").setSize(3) 
> # Will fail
> vecAssembler = VectorAssembler(inputCols=["vector", "float"], 
> outputCol="assembled")
> pipeline = Pipeline(stages=[sizeHint, vecAssembler])
> pipelineModel = pipeline.fit(df)
> pipelineModel.transform(df).head().assembled
> {code}






[jira] [Updated] (SPARK-25124) VectorSizeHint.size is buggy, breaking streaming pipeline

2018-08-23 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-25124:
--
Fix Version/s: 2.4.0

> VectorSizeHint.size is buggy, breaking streaming pipeline
> -
>
> Key: SPARK-25124
> URL: https://issues.apache.org/jira/browse/SPARK-25124
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Timothy Hunter
>Assignee: Huaxin Gao
>Priority: Major
>  Labels: beginner, starter
> Fix For: 2.4.0
>
>
> Currently, when using {{VectorSizeHint().setSize(3)}} in an ML pipeline, 
> transforming a stream fails with a nondescript exception saying the stream 
> has not been started. The root cause is that {{setSize}} and {{getSize}} do 
> not {{return}} values; they return {{None}}:
> https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py#L3846
> How to reproduce, using the example in the doc:
> {code}
> from pyspark.ml.linalg import Vectors
> from pyspark.ml import Pipeline, PipelineModel
> from pyspark.ml.feature import VectorAssembler, VectorSizeHint
> data = [(Vectors.dense([1., 2., 3.]), 4.)]
> df = spark.createDataFrame(data, ["vector", "float"])
> sizeHint = VectorSizeHint(inputCol="vector", handleInvalid="skip").setSize(3) 
> # Will fail
> vecAssembler = VectorAssembler(inputCols=["vector", "float"], 
> outputCol="assembled")
> pipeline = Pipeline(stages=[sizeHint, vecAssembler])
> pipelineModel = pipeline.fit(df)
> pipelineModel.transform(df).head().assembled
> {code}






[jira] [Updated] (SPARK-25124) VectorSizeHint.size is buggy, breaking streaming pipeline

2018-08-23 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-25124:
--
Shepherd: Joseph K. Bradley

> VectorSizeHint.size is buggy, breaking streaming pipeline
> -
>
> Key: SPARK-25124
> URL: https://issues.apache.org/jira/browse/SPARK-25124
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Timothy Hunter
>Assignee: Huaxin Gao
>Priority: Major
>  Labels: beginner, starter
>
> Currently, when using {{VectorSizeHint().setSize(3)}} in an ML pipeline, 
> transforming a stream fails with a nondescript exception saying the stream 
> has not been started. The root cause is that {{setSize}} and {{getSize}} do 
> not {{return}} values; they return {{None}}:
> https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py#L3846
> How to reproduce, using the example in the doc:
> {code}
> from pyspark.ml.linalg import Vectors
> from pyspark.ml import Pipeline, PipelineModel
> from pyspark.ml.feature import VectorAssembler, VectorSizeHint
> data = [(Vectors.dense([1., 2., 3.]), 4.)]
> df = spark.createDataFrame(data, ["vector", "float"])
> sizeHint = VectorSizeHint(inputCol="vector", handleInvalid="skip").setSize(3) 
> # Will fail
> vecAssembler = VectorAssembler(inputCols=["vector", "float"], 
> outputCol="assembled")
> pipeline = Pipeline(stages=[sizeHint, vecAssembler])
> pipelineModel = pipeline.fit(df)
> pipelineModel.transform(df).head().assembled
> {code}






[jira] [Assigned] (SPARK-25124) VectorSizeHint.size is buggy, breaking streaming pipeline

2018-08-23 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-25124:
-

Assignee: Huaxin Gao

> VectorSizeHint.size is buggy, breaking streaming pipeline
> -
>
> Key: SPARK-25124
> URL: https://issues.apache.org/jira/browse/SPARK-25124
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Timothy Hunter
>Assignee: Huaxin Gao
>Priority: Major
>  Labels: beginner, starter
>
> Currently, when using {{VectorSizeHint().setSize(3)}} in an ML pipeline, 
> transforming a stream fails with a nondescript exception saying the stream 
> has not been started. The root cause is that {{setSize}} and {{getSize}} do 
> not {{return}} values; they return {{None}}:
> https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py#L3846
> How to reproduce, using the example in the doc:
> {code}
> from pyspark.ml.linalg import Vectors
> from pyspark.ml import Pipeline, PipelineModel
> from pyspark.ml.feature import VectorAssembler, VectorSizeHint
> data = [(Vectors.dense([1., 2., 3.]), 4.)]
> df = spark.createDataFrame(data, ["vector", "float"])
> sizeHint = VectorSizeHint(inputCol="vector", handleInvalid="skip").setSize(3) 
> # Will fail
> vecAssembler = VectorAssembler(inputCols=["vector", "float"], 
> outputCol="assembled")
> pipeline = Pipeline(stages=[sizeHint, vecAssembler])
> pipelineModel = pipeline.fit(df)
> pipelineModel.transform(df).head().assembled
> {code}






[jira] [Resolved] (SPARK-25149) Personalized PageRank raises an error if vertexIDs are > MaxInt

2018-08-21 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-25149.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 22139
[https://github.com/apache/spark/pull/22139]

> Personalized PageRank raises an error if vertexIDs are > MaxInt
> ---
>
> Key: SPARK-25149
> URL: https://issues.apache.org/jira/browse/SPARK-25149
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.3.1
>Reporter: Bago Amirbekian
>Assignee: Bago Amirbekian
>Priority: Major
> Fix For: 2.4.0
>
>
> Looking at the implementation, I think we don't actually need this check. In 
> the current implementation, the sparse vectors used and returned by the 
> method are indexed by the _position_ of the vertexId in `sources`, not by the 
> vertex ID itself. We should remove this check and add a test to confirm the 
> implementation works with large vertex IDs.






[jira] [Assigned] (SPARK-25149) Personalized PageRank raises an error if vertexIDs are > MaxInt

2018-08-21 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-25149:
-

Assignee: Bago Amirbekian

> Personalized PageRank raises an error if vertexIDs are > MaxInt
> ---
>
> Key: SPARK-25149
> URL: https://issues.apache.org/jira/browse/SPARK-25149
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.3.1
>Reporter: Bago Amirbekian
>Assignee: Bago Amirbekian
>Priority: Major
>
> Looking at the implementation, I think we don't actually need this check. In 
> the current implementation, the sparse vectors used and returned by the 
> method are indexed by the _position_ of the vertexId in `sources`, not by the 
> vertex ID itself. We should remove this check and add a test to confirm the 
> implementation works with large vertex IDs.






[jira] [Updated] (SPARK-25149) Personalized PageRank raises an error if vertexIDs are > MaxInt

2018-08-21 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-25149:
--
Summary: Personalized PageRank raises an error if vertexIDs are > MaxInt  
(was: Personalized Page Rank raises an error if vertexIDs are > MaxInt)

> Personalized PageRank raises an error if vertexIDs are > MaxInt
> ---
>
> Key: SPARK-25149
> URL: https://issues.apache.org/jira/browse/SPARK-25149
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.3.1
>Reporter: Bago Amirbekian
>Priority: Major
>
> Looking at the implementation, I think we don't actually need this check. In 
> the current implementation, the sparse vectors used and returned by the 
> method are indexed by the _position_ of the vertexId in `sources`, not by the 
> vertex ID itself. We should remove this check and add a test to confirm the 
> implementation works with large vertex IDs.






[jira] [Updated] (SPARK-25149) Personalized Page Rank raises an error if vertexIDs are > MaxInt

2018-08-21 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25149?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-25149:
--
Shepherd: Joseph K. Bradley

> Personalized Page Rank raises an error if vertexIDs are > MaxInt
> 
>
> Key: SPARK-25149
> URL: https://issues.apache.org/jira/browse/SPARK-25149
> Project: Spark
>  Issue Type: Bug
>  Components: GraphX
>Affects Versions: 2.3.1
>Reporter: Bago Amirbekian
>Priority: Major
>
> Looking at the implementation, I think we don't actually need this check. In 
> the current implementation, the sparse vectors used and returned by the 
> method are indexed by the _position_ of the vertexId in `sources`, not by the 
> vertex ID itself. We should remove this check and add a test to confirm the 
> implementation works with large vertex IDs.






[jira] [Assigned] (SPARK-24632) Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers for persistence

2018-08-01 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-24632:
-

Assignee: (was: Joseph K. Bradley)

> Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers 
> for persistence
> --
>
> Key: SPARK-24632
> URL: https://issues.apache.org/jira/browse/SPARK-24632
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> This is a follow-up for [SPARK-17025], which allowed users to implement 
> Python PipelineStages in 3rd-party libraries, include them in Pipelines, and 
> use Pipeline persistence.  This task is to make it easier for 3rd-party 
> libraries to have PipelineStages written in Java and then to use pyspark.ml 
> abstractions to create wrappers around those Java classes.  This is currently 
> possible, except that users hit bugs around persistence.
> I spent a bit of time thinking about this and wrote up thoughts and a 
> proposal in the doc linked below.  Summary of proposal:
> Require that 3rd-party libraries whose Java classes have Python wrappers 
> implement a trait which provides the corresponding Python classpath in some 
> field:
> {code}
> trait PythonWrappable {
>   def pythonClassPath: String = …
> }
> MyJavaType extends PythonWrappable
> {code}
> This will not be required for MLlib wrappers, which we can handle specially.
> One issue for this task will be that we may have trouble writing unit tests.  
> They would ideally test a Java class + Python wrapper class pair sitting 
> outside of pyspark.






[jira] [Commented] (SPARK-24632) Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers for persistence

2018-08-01 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16565769#comment-16565769
 ] 

Joseph K. Bradley commented on SPARK-24632:
---

I'm unassigning myself since I don't have time to work on this right now.  : (

> Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers 
> for persistence
> --
>
> Key: SPARK-24632
> URL: https://issues.apache.org/jira/browse/SPARK-24632
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
>
> This is a follow-up for [SPARK-17025], which allowed users to implement 
> Python PipelineStages in 3rd-party libraries, include them in Pipelines, and 
> use Pipeline persistence.  This task is to make it easier for 3rd-party 
> libraries to have PipelineStages written in Java and then to use pyspark.ml 
> abstractions to create wrappers around those Java classes.  This is currently 
> possible, except that users hit bugs around persistence.
> I spent a bit of time thinking about this and wrote up thoughts and a 
> proposal in the doc linked below.  Summary of proposal:
> Require that 3rd-party libraries whose Java classes have Python wrappers 
> implement a trait which provides the corresponding Python classpath in some 
> field:
> {code}
> trait PythonWrappable {
>   def pythonClassPath: String = …
> }
> MyJavaType extends PythonWrappable
> {code}
> This will not be required for MLlib wrappers, which we can handle specially.
> One issue for this task will be that we may have trouble writing unit tests.  
> They would ideally test a Java class + Python wrapper class pair sitting 
> outside of pyspark.






[jira] [Commented] (SPARK-24632) Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers for persistence

2018-08-01 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16565767#comment-16565767
 ] 

Joseph K. Bradley commented on SPARK-24632:
---

That's a good point.  Let's do it your way.  : )
You're right that putting this knowledge of wrapper classpaths on the Python 
side is better organized.  That will allow users to wrap Scala classes later 
without breaking APIs (by adding new mix-ins).
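To make the mix-in idea concrete, here is a rough, hypothetical sketch of a 3rd-party Python wrapper when the Java classpath is declared on the Python side (the attribute name {{_java_class}} and the class {{MyLibTransformer}} are made up for illustration, not an agreed API):

{code}
from pyspark.ml.wrapper import JavaTransformer

class MyLibTransformer(JavaTransformer):
    """Hypothetical 3rd-party wrapper around a Java Transformer.  The Python
    class itself declares which Java class it wraps, so persistence code does
    not have to derive the Java classpath from the Python module name."""
    _java_class = "com.example.mylib.MyLibTransformer"  # assumed attribute name

    def __init__(self):
        super(MyLibTransformer, self).__init__()
        # _new_java_obj instantiates the named Java class on the JVM side.
        self._java_obj = self._new_java_obj(self._java_class, self.uid)
{code}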

> Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers 
> for persistence
> --
>
> Key: SPARK-24632
> URL: https://issues.apache.org/jira/browse/SPARK-24632
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
>
> This is a follow-up for [SPARK-17025], which allowed users to implement 
> Python PipelineStages in 3rd-party libraries, include them in Pipelines, and 
> use Pipeline persistence.  This task is to make it easier for 3rd-party 
> libraries to have PipelineStages written in Java and then to use pyspark.ml 
> abstractions to create wrappers around those Java classes.  This is currently 
> possible, except that users hit bugs around persistence.
> I spent a bit of time thinking about this and wrote up thoughts and a 
> proposal in the doc linked below.  Summary of proposal:
> Require that 3rd-party libraries whose Java classes have Python wrappers 
> implement a trait which provides the corresponding Python classpath in some 
> field:
> {code}
> trait PythonWrappable {
>   def pythonClassPath: String = …
> }
> MyJavaType extends PythonWrappable
> {code}
> This will not be required for MLlib wrappers, which we can handle specially.
> One issue for this task will be that we may have trouble writing unit tests.  
> They would ideally test a Java class + Python wrapper class pair sitting 
> outside of pyspark.






[jira] [Resolved] (SPARK-24852) Have spark.ml training use updated `Instrumentation` APIs.

2018-07-20 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-24852.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21799
[https://github.com/apache/spark/pull/21799]

> Have spark.ml training use updated `Instrumentation` APIs.
> --
>
> Key: SPARK-24852
> URL: https://issues.apache.org/jira/browse/SPARK-24852
> Project: Spark
>  Issue Type: Story
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Bago Amirbekian
>Assignee: Bago Amirbekian
>Priority: Major
> Fix For: 2.4.0
>
>
> Port spark.ml code to use the new methods on the `Instrumentation` class and 
> remove the old methods & constructor.






[jira] [Updated] (SPARK-24852) Have spark.ml training use updated `Instrumentation` APIs.

2018-07-18 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-24852:
--
Shepherd: Joseph K. Bradley

> Have spark.ml training use updated `Instrumentation` APIs.
> --
>
> Key: SPARK-24852
> URL: https://issues.apache.org/jira/browse/SPARK-24852
> Project: Spark
>  Issue Type: Story
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Bago Amirbekian
>Assignee: Bago Amirbekian
>Priority: Major
>
> Port spark.ml code to use the new methods on the `Instrumentation` class and 
> remove the old methods & constructor.






[jira] [Assigned] (SPARK-24852) Have spark.ml training use updated `Instrumentation` APIs.

2018-07-18 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-24852:
-

Assignee: Bago Amirbekian

> Have spark.ml training use updated `Instrumentation` APIs.
> --
>
> Key: SPARK-24852
> URL: https://issues.apache.org/jira/browse/SPARK-24852
> Project: Spark
>  Issue Type: Story
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Bago Amirbekian
>Assignee: Bago Amirbekian
>Priority: Major
>
> Port spark.ml code to use the new methods on the `Instrumentation` class and 
> remove the old methods & constructor.






[jira] [Resolved] (SPARK-24747) Make spark.ml.util.Instrumentation class more flexible

2018-07-17 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-24747.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21719
[https://github.com/apache/spark/pull/21719]

> Make spark.ml.util.Instrumentation class more flexible
> --
>
> Key: SPARK-24747
> URL: https://issues.apache.org/jira/browse/SPARK-24747
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.1
>Reporter: Bago Amirbekian
>Assignee: Bago Amirbekian
>Priority: Major
> Fix For: 2.4.0
>
>
> The Instrumentation class (which is an internal private class) is somewhat 
> limited by its current APIs. The class requires that an estimator and dataset 
> be passed to the constructor, which limits how it can be used. Furthermore, 
> the current APIs make it hard to intercept failures and record anything 
> related to those failures.
> The following changes could make the instrumentation class easier to work 
> with. All these changes are for private APIs and should not be visible to 
> users.
> {code}
> // New no-argument constructor:
> Instrumentation()
> // New api to log previous constructor arguments.
> logTrainingContext(estimator: Estimator[_], dataset: Dataset[_])
> logFailure(e: Throwable): Unit
> // Log success with no arguments
> logSuccess(): Unit
> // Log result model explicitly instead of passing to logSuccess
> logModel(model: Model[_]): Unit
> // On Companion object
> Instrumentation.instrumented[T](body: (Instrumentation => T)): T
> // The above API will allow us to write instrumented methods more clearly and 
> handle logging success and failure automatically:
> def someMethod(...): T = instrumented { instr =>
>   instr.logNamedValue(name, value)
>   // more code here
>   instr.logModel(model)
> }
> {code}






[jira] [Commented] (SPARK-24632) Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers for persistence

2018-06-25 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16522557#comment-16522557
 ] 

Joseph K. Bradley commented on SPARK-24632:
---

CC [~yanboliang], [~holden.ka...@gmail.com]: You may be interested in this

> Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers 
> for persistence
> --
>
> Key: SPARK-24632
> URL: https://issues.apache.org/jira/browse/SPARK-24632
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
>
> This is a follow-up for [SPARK-17025], which allowed users to implement 
> Python PipelineStages in 3rd-party libraries, include them in Pipelines, and 
> use Pipeline persistence.  This task is to make it easier for 3rd-party 
> libraries to have PipelineStages written in Java and then to use pyspark.ml 
> abstractions to create wrappers around those Java classes.  This is currently 
> possible, except that users hit bugs around persistence.
> I spent a bit of time thinking about this and wrote up thoughts and a 
> proposal in the doc linked below.  Summary of proposal:
> Require that 3rd-party libraries whose Java classes have Python wrappers 
> implement a trait which provides the corresponding Python classpath in some 
> field:
> {code}
> trait PythonWrappable {
>   def pythonClassPath: String = …
> }
> MyJavaType extends PythonWrappable
> {code}
> This will not be required for MLlib wrappers, which we can handle specially.
> One issue for this task will be that we may have trouble writing unit tests.  
> They would ideally test a Java class + Python wrapper class pair sitting 
> outside of pyspark.






[jira] [Updated] (SPARK-24632) Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers for persistence

2018-06-25 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-24632:
--
Description: 
This is a follow-up for [SPARK-17025], which allowed users to implement Python 
PipelineStages in 3rd-party libraries, include them in Pipelines, and use 
Pipeline persistence.  This task is to make it easier for 3rd-party libraries 
to have PipelineStages written in Java and then to use pyspark.ml abstractions 
to create wrappers around those Java classes.  This is currently possible, 
except that users hit bugs around persistence.

I spent a bit of time thinking about this and wrote up thoughts and a proposal 
in the doc linked below.  Summary of proposal:

Require that 3rd-party libraries whose Java classes have Python wrappers 
implement a trait which provides the corresponding Python classpath in some 
field:
{code}
trait PythonWrappable {
  def pythonClassPath: String = …
}
MyJavaType extends PythonWrappable
{code}
This will not be required for MLlib wrappers, which we can handle specially.

One issue for this task will be that we may have trouble writing unit tests.  
They would ideally test a Java class + Python wrapper class pair sitting 
outside of pyspark.

  was:
This is a follow-up for [SPARK-17025], which allowed users to implement Python 
PipelineStages in 3rd-party libraries, include them in Pipelines, and use 
Pipeline persistence.  This task is to make it easier for 3rd-party libraries 
to have PipelineStages written in Java and then to use pyspark.ml abstractions 
to create wrappers around those Java classes.  This is currently possible, 
except that users hit bugs around persistence.

Some fixes we'll need include:
* an overridable method for converting between Python and Java classpaths. See 
https://github.com/apache/spark/blob/b56e9c613fb345472da3db1a567ee129621f6bf3/python/pyspark/ml/util.py#L284
* 
https://github.com/apache/spark/blob/4e7d8678a3d9b12797d07f5497e0ed9e471428dd/python/pyspark/ml/pipeline.py#L378

One unusual thing for this task will be to write unit tests which test a custom 
PipelineStage written outside of the pyspark package.


> Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers 
> for persistence
> --
>
> Key: SPARK-24632
> URL: https://issues.apache.org/jira/browse/SPARK-24632
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
>
> This is a follow-up for [SPARK-17025], which allowed users to implement 
> Python PipelineStages in 3rd-party libraries, include them in Pipelines, and 
> use Pipeline persistence.  This task is to make it easier for 3rd-party 
> libraries to have PipelineStages written in Java and then to use pyspark.ml 
> abstractions to create wrappers around those Java classes.  This is currently 
> possible, except that users hit bugs around persistence.
> I spent a bit of time thinking about this and wrote up thoughts and a 
> proposal in the doc linked below.  Summary of proposal:
> Require that 3rd-party libraries whose Java classes have Python wrappers 
> implement a trait which provides the corresponding Python classpath in some 
> field:
> {code}
> trait PythonWrappable {
>   def pythonClassPath: String = …
> }
> MyJavaType extends PythonWrappable
> {code}
> This will not be required for MLlib wrappers, which we can handle specially.
> One issue for this task will be that we may have trouble writing unit tests.  
> They would ideally test a Java class + Python wrapper class pair sitting 
> outside of pyspark.






[jira] [Assigned] (SPARK-24632) Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers for persistence

2018-06-22 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-24632:
-

Assignee: Joseph K. Bradley

> Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers 
> for persistence
> --
>
> Key: SPARK-24632
> URL: https://issues.apache.org/jira/browse/SPARK-24632
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
>
> This is a follow-up for [SPARK-17025], which allowed users to implement 
> Python PipelineStages in 3rd-party libraries, include them in Pipelines, and 
> use Pipeline persistence.  This task is to make it easier for 3rd-party 
> libraries to have PipelineStages written in Java and then to use pyspark.ml 
> abstractions to create wrappers around those Java classes.  This is currently 
> possible, except that users hit bugs around persistence.
> Some fixes we'll need include:
> * an overridable method for converting between Python and Java classpaths. 
> See 
> https://github.com/apache/spark/blob/b56e9c613fb345472da3db1a567ee129621f6bf3/python/pyspark/ml/util.py#L284
> * 
> https://github.com/apache/spark/blob/4e7d8678a3d9b12797d07f5497e0ed9e471428dd/python/pyspark/ml/pipeline.py#L378
> One unusual thing for this task will be to write unit tests which test a 
> custom PipelineStage written outside of the pyspark package.






[jira] [Updated] (SPARK-24632) Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers for persistence

2018-06-22 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-24632:
--
Description: 
This is a follow-up for [SPARK-17025], which allowed users to implement Python 
PipelineStages in 3rd-party libraries, include them in Pipelines, and use 
Pipeline persistence.  This task is to make it easier for 3rd-party libraries 
to have PipelineStages written in Java and then to use pyspark.ml abstractions 
to create wrappers around those Java classes.  This is currently possible, 
except that users hit bugs around persistence.

Some fixes we'll need include:
* an overridable method for converting between Python and Java classpaths. See 
https://github.com/apache/spark/blob/b56e9c613fb345472da3db1a567ee129621f6bf3/python/pyspark/ml/util.py#L284
* 
https://github.com/apache/spark/blob/4e7d8678a3d9b12797d07f5497e0ed9e471428dd/python/pyspark/ml/pipeline.py#L378

One unusual thing for this task will be to write unit tests which test a custom 
PipelineStage written outside of the pyspark package.

  was:
This is a follow-up for [SPARK-17025], which allowed users to implement Python 
PipelineStages in 3rd-party libraries, include them in Pipelines, and use 
Pipeline persistence.  This task is to make it easier for 3rd-party libraries 
to have PipelineStages written in Java and then to use pyspark.ml abstractions 
to create wrappers around those Java classes.  This is currently possible, 
except that users hit bugs around persistence.

One fix we'll need is an overridable method for converting between Python and 
Java classpaths. See 
https://github.com/apache/spark/blob/b56e9c613fb345472da3db1a567ee129621f6bf3/python/pyspark/ml/util.py#L284

One unusual thing for this task will be to write unit tests which test a custom 
PipelineStage written outside of the pyspark package.


> Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers 
> for persistence
> --
>
> Key: SPARK-24632
> URL: https://issues.apache.org/jira/browse/SPARK-24632
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> This is a follow-up for [SPARK-17025], which allowed users to implement 
> Python PipelineStages in 3rd-party libraries, include them in Pipelines, and 
> use Pipeline persistence.  This task is to make it easier for 3rd-party 
> libraries to have PipelineStages written in Java and then to use pyspark.ml 
> abstractions to create wrappers around those Java classes.  This is currently 
> possible, except that users hit bugs around persistence.
> Some fixes we'll need include:
> * an overridable method for converting between Python and Java classpaths. 
> See 
> https://github.com/apache/spark/blob/b56e9c613fb345472da3db1a567ee129621f6bf3/python/pyspark/ml/util.py#L284
> * 
> https://github.com/apache/spark/blob/4e7d8678a3d9b12797d07f5497e0ed9e471428dd/python/pyspark/ml/pipeline.py#L378
> One unusual thing for this task will be to write unit tests which test a 
> custom PipelineStage written outside of the pyspark package.






[jira] [Resolved] (SPARK-21926) Compatibility between ML Transformers and Structured Streaming

2018-06-22 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-21926.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Marking fix version as 2.3.0 since the issues were fixed in 2.3, even though 
tests were not completed in time for 2.3.

> Compatibility between ML Transformers and Structured Streaming
> --
>
> Key: SPARK-21926
> URL: https://issues.apache.org/jira/browse/SPARK-21926
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>Priority: Major
> Fix For: 2.3.0
>
>
> We've run into a few cases where ML components don't play nice with streaming 
> dataframes (for prediction). This ticket is meant to help aggregate these 
> known cases in one place and provide a place to discuss possible fixes.
> Failing cases:
> 1) VectorAssembler where one of the inputs is a VectorUDT column with no 
> metadata.
> Possible fixes:
> More details here SPARK-22346.
> 2) OneHotEncoder where the input is a column with no metadata.
> Possible fixes:
> a) Make OneHotEncoder an estimator (SPARK-13030).
> -b) Allow user to set the cardinality of OneHotEncoder.-
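For context, the failing pattern in these cases is applying a fitted pipeline to a streaming DataFrame for prediction.  A rough sketch of that pattern, assuming an active SparkSession bound to {{spark}} (schema, paths, and column names are made up):

{code}
from pyspark.sql.types import StructType, StructField, StringType, DoubleType
from pyspark.ml import PipelineModel

# A pipeline fitted on a static DataFrame is reused on a stream.  Stages that
# depend on column metadata (e.g. VectorAssembler on a VectorUDT column, or the
# old OneHotEncoder) can fail here because streaming sources carry no ML
# attribute metadata.
schema = StructType([StructField("category", StringType()),
                     StructField("amount", DoubleType())])
model = PipelineModel.load("/tmp/fitted_pipeline")           # hypothetical path
stream = spark.readStream.schema(schema).json("/tmp/input")  # hypothetical source
query = (model.transform(stream)
              .writeStream
              .format("memory")
              .queryName("predictions")
              .start())
{code}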






[jira] [Updated] (SPARK-24465) LSHModel should support Structured Streaming for transform

2018-06-22 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-24465:
--
Description: 
Locality Sensitive Hashing (LSH) Models (BucketedRandomProjectionLSHModel, 
MinHashLSHModel) are not compatible with Structured Streaming.

This task is to add unit tests for streaming (as in [SPARK-22644]) for 
LSHModels after [SPARK-12878] has been fixed.

  was:
Locality Sensitive Hashing (LSH) Models (BucketedRandomProjectionLSHModel, 
MinHashLSHModel) are not compatible with Structured Streaming (and I believe 
they are the last remaining Transformers which are not compatible).  These do 
not work because Spark SQL does not support nested types containing UDTs, but 
LSH models produce exactly such nested types (arrays of vectors).

This task is to add unit tests for streaming (as in [SPARK-22644]) for 
LSHModels after [SPARK-12878] has been fixed.


> LSHModel should support Structured Streaming for transform
> --
>
> Key: SPARK-24465
> URL: https://issues.apache.org/jira/browse/SPARK-24465
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> Locality Sensitive Hashing (LSH) Models (BucketedRandomProjectionLSHModel, 
> MinHashLSHModel) are not compatible with Structured Streaming.
> This task is to add unit tests for streaming (as in [SPARK-22644]) for 
> LSHModels after [SPARK-12878] has been fixed.
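As a rough illustration of the transform path such streaming tests need to cover (not the actual test code; the source path and schema are made up, and an active SparkSession bound to {{spark}} is assumed):

{code}
from pyspark.ml.feature import MinHashLSH
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql.types import StructType, StructField, LongType

# Fit the LSH model on a static DataFrame.
train = spark.createDataFrame(
    [(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0])),
     (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]))],
    ["id", "features"])
model = MinHashLSH(inputCol="features", outputCol="hashes",
                   numHashTables=3).fit(train)

# Apply the fitted model to a streaming DataFrame.  transform() adds an
# array-of-vectors column, which is where nested-UDT support matters.
schema = StructType([StructField("id", LongType()),
                     StructField("features", VectorUDT())])
stream = spark.readStream.schema(schema).parquet("/tmp/lsh_input")  # made-up path
query = (model.transform(stream)
              .writeStream.format("memory").queryName("hashed").start())
{code}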






[jira] [Comment Edited] (SPARK-24465) LSHModel should support Structured Streaming for transform

2018-06-22 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16520677#comment-16520677
 ] 

Joseph K. Bradley edited comment on SPARK-24465 at 6/22/18 6:39 PM:


Oh actually I think I made this by mistake or forgot to update it?  I fixed 
this in [SPARK-22883]


was (Author: josephkb):
Oh actually I think I made this by mistake?  I fixed this in [SPARK-22883]

> LSHModel should support Structured Streaming for transform
> --
>
> Key: SPARK-24465
> URL: https://issues.apache.org/jira/browse/SPARK-24465
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> Locality Sensitive Hashing (LSH) Models (BucketedRandomProjectionLSHModel, 
> MinHashLSHModel) are not compatible with Structured Streaming (and I believe 
> they are the last remaining Transformers which are not compatible).  These do 
> not work because Spark SQL does not support nested types containing UDTs, but 
> LSH models produce exactly such nested types (arrays of vectors).
> This task is to add unit tests for streaming (as in [SPARK-22644]) for 
> LSHModels after [SPARK-12878] has been fixed.






[jira] [Commented] (SPARK-24465) LSHModel should support Structured Streaming for transform

2018-06-22 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16520677#comment-16520677
 ] 

Joseph K. Bradley commented on SPARK-24465:
---

Oh actually I think I made this by mistake?  I fixed this in [SPARK-22883]

> LSHModel should support Structured Streaming for transform
> --
>
> Key: SPARK-24465
> URL: https://issues.apache.org/jira/browse/SPARK-24465
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> Locality Sensitive Hashing (LSH) Models (BucketedRandomProjectionLSHModel, 
> MinHashLSHModel) are not compatible with Structured Streaming (and I believe 
> they are the last remaining Transformers which are not compatible).  These do 
> not work because Spark SQL does not support nested types containing UDTs, but 
> LSH models produce exactly such nested types (arrays of vectors).
> This task is to add unit tests for streaming (as in [SPARK-22644]) for 
> LSHModels after [SPARK-12878] has been fixed.






[jira] [Resolved] (SPARK-24465) LSHModel should support Structured Streaming for transform

2018-06-22 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-24465.
---
   Resolution: Fixed
 Assignee: Joseph K. Bradley
Fix Version/s: 2.3.1
   2.4.0

> LSHModel should support Structured Streaming for transform
> --
>
> Key: SPARK-24465
> URL: https://issues.apache.org/jira/browse/SPARK-24465
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
> Fix For: 2.4.0, 2.3.1
>
>
> Locality Sensitive Hashing (LSH) Models (BucketedRandomProjectionLSHModel, 
> MinHashLSHModel) are not compatible with Structured Streaming (and I believe 
> they are the last remaining Transformers which are not compatible).  These do 
> not work because Spark SQL does not support nested types containing UDTs, but 
> LSH models produce exactly such nested types (arrays of vectors).
> This task is to add unit tests for streaming (as in [SPARK-22644]) for 
> LSHModels after [SPARK-12878] has been fixed.






[jira] [Updated] (SPARK-24465) LSHModel should support Structured Streaming for transform

2018-06-22 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-24465:
--
Description: 
Locality Sensitive Hashing (LSH) Models (BucketedRandomProjectionLSHModel, 
MinHashLSHModel) are not compatible with Structured Streaming (and I believe 
they are the last remaining Transformers which are not compatible).  These do 
not work because Spark SQL does not support nested types containing UDTs, but 
LSH models produce exactly such nested types (arrays of vectors).

This task is to add unit tests for streaming (as in [SPARK-22644]) for 
LSHModels after [SPARK-12878] has been fixed.

  was:
Locality Sensitive Hashing (LSH) Models (BucketedRandomProjectionLSHModel, 
MinHashLSHModel) are not compatible with Structured Streaming (and I believe 
are the final Transformers which are not compatible).  These do not work 
because Spark SQL does not support nested types containing UDTs; see 
[SPARK-12878].

This task is to add unit tests for streaming (as in [SPARK-22644]) for 
LSHModels after [SPARK-12878] has been fixed.


> LSHModel should support Structured Streaming for transform
> --
>
> Key: SPARK-24465
> URL: https://issues.apache.org/jira/browse/SPARK-24465
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> Locality Sensitive Hashing (LSH) Models (BucketedRandomProjectionLSHModel, 
> MinHashLSHModel) are not compatible with Structured Streaming (and I believe 
> they are the last remaining Transformers which are not compatible).  These do 
> not work because Spark SQL does not support nested types containing UDTs, but 
> LSH models produce exactly such nested types (arrays of vectors).
> This task is to add unit tests for streaming (as in [SPARK-22644]) for 
> LSHModels after [SPARK-12878] has been fixed.






[jira] [Commented] (SPARK-24465) LSHModel should support Structured Streaming for transform

2018-06-22 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16520671#comment-16520671
 ] 

Joseph K. Bradley commented on SPARK-24465:
---

You're right; I did not read [SPARK-12878] carefully enough.  I'll update this 
JIRA's description to be more specific.

> LSHModel should support Structured Streaming for transform
> --
>
> Key: SPARK-24465
> URL: https://issues.apache.org/jira/browse/SPARK-24465
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> Locality Sensitive Hashing (LSH) Models (BucketedRandomProjectionLSHModel, 
> MinHashLSHModel) are not compatible with Structured Streaming (and I believe 
> are the final Transformers which are not compatible).  These do not work 
> because Spark SQL does not support nested types containing UDTs; see 
> [SPARK-12878].
> This task is to add unit tests for streaming (as in [SPARK-22644]) for 
> LSHModels after [SPARK-12878] has been fixed.






[jira] [Updated] (SPARK-12878) Dataframe fails with nested User Defined Types

2018-06-22 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-12878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-12878:
--
Description: 
Spark 1.6.0 crashes when using nested User Defined Types in a DataFrame. 
In version 1.5.2 the code below worked just fine:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.GenericMutableRow
// needed for GenericArrayData used in AUDT.serialize below
import org.apache.spark.sql.catalyst.util.GenericArrayData
import org.apache.spark.sql.types._

@SQLUserDefinedType(udt = classOf[AUDT])
case class A(list:Seq[B])

class AUDT extends UserDefinedType[A] {
  override def sqlType: DataType = StructType(Seq(StructField("list", 
ArrayType(BUDT, containsNull = false), nullable = true)))
  override def userClass: Class[A] = classOf[A]
  override def serialize(obj: Any): Any = obj match {
case A(list) =>
  val row = new GenericMutableRow(1)
  row.update(0, new GenericArrayData(list.map(_.asInstanceOf[Any]).toArray))
  row
  }

  override def deserialize(datum: Any): A = {
datum match {
  case row: InternalRow => new A(row.getArray(0).toArray(BUDT).toSeq)
}
  }
}

object AUDT extends AUDT

@SQLUserDefinedType(udt = classOf[BUDT])
case class B(text:Int)

class BUDT extends UserDefinedType[B] {
  override def sqlType: DataType = StructType(Seq(StructField("num", 
IntegerType, nullable = false)))
  override def userClass: Class[B] = classOf[B]
  override def serialize(obj: Any): Any = obj match {
case B(text) =>
  val row = new GenericMutableRow(1)
  row.setInt(0, text)
  row
  }

  override def deserialize(datum: Any): B = {
datum match {  case row: InternalRow => new B(row.getInt(0))  }
  }
}

object BUDT extends BUDT

object Test {
  def main(args:Array[String]) = {

val col = Seq(new A(Seq(new B(1), new B(2))),
  new A(Seq(new B(3), new B(4))))

val sc = new SparkContext(new 
SparkConf().setMaster("local[1]").setAppName("TestSpark"))
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._

val df = sc.parallelize(1 to 2 zip col).toDF("id","b")
df.select("b").show()
df.collect().foreach(println)
  }
}
{code}

In the new version (1.6.0) I needed to include the following import:

`import org.apache.spark.sql.catalyst.expressions.GenericMutableRow`

However, Spark crashes at runtime:

{code}
16/01/18 14:36:22 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.ClassCastException: scala.runtime.BoxedUnit cannot be cast to 
org.apache.spark.sql.catalyst.InternalRow
at 
org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getStruct(rows.scala:51)
at 
org.apache.spark.sql.catalyst.expressions.GenericMutableRow.getStruct(rows.scala:248)
at 
org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown
 Source)
at 
org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:51)
at 
org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:49)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at 
scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at 
scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at 
scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at 
org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at ...
{code}

[jira] [Commented] (SPARK-19498) Discussion: Making MLlib APIs extensible for 3rd party libraries

2018-06-22 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-19498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16520666#comment-16520666
 ] 

Joseph K. Bradley commented on SPARK-19498:
---

Sure, comments are welcome!  Or links to JIRAs, whichever are easier.

> Discussion: Making MLlib APIs extensible for 3rd party libraries
> 
>
> Key: SPARK-19498
> URL: https://issues.apache.org/jira/browse/SPARK-19498
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Per the recent discussion on the dev list, this JIRA is for discussing how we 
> can make MLlib DataFrame-based APIs more extensible, especially for the 
> purpose of writing 3rd-party libraries with APIs extended from the MLlib APIs 
> (for custom Transformers, Estimators, etc.).
> * For people who have written such libraries, what issues have you run into?
> * What APIs are not public or extensible enough?  Do they require changes 
> before being made more public?
> * Are APIs for non-Scala languages such as Java and Python friendly or 
> extensive enough?
> The easy answer is to make everything public, but that would be terrible of 
> course in the long-term.  Let's discuss what is needed and how we can present 
> stable, sufficient, and easy-to-use APIs for 3rd-party developers.






[jira] [Issue Comment Deleted] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2018-06-22 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-17025:
--
Comment: was deleted

(was: Thank you for your e-mail. I am on business travel until Monday 2nd July 
so expect delays in my response.

Many thanks,

Pete

Dr Peter Knight
Sr Staff Analytics Engineer| UK Data Science
GE Aviation
AutoExtReply
)

> Cannot persist PySpark ML Pipeline model that includes custom Transformer
> -
>
> Key: SPARK-17025
> URL: https://issues.apache.org/jira/browse/SPARK-17025
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Assignee: Ajay Saini
>Priority: Minor
> Fix For: 2.3.0
>
>
> Following the example in [this Databricks blog 
> post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html]
>  under "Python tuning", I'm trying to save an ML Pipeline model.
> This pipeline, however, includes a custom transformer. When I try to save the 
> model, the operation fails because the custom transformer doesn't have a 
> {{_to_java}} attribute.
> {code}
> Traceback (most recent call last):
>   File ".../file.py", line 56, in 
> model.bestModel.save('model')
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 222, in save
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 217, in write
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py",
>  line 93, in __init__
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 254, in _to_java
> AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
> {code}
> Looking at the source code for 
> [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py],
>  I see that not even the base Transformer class has such an attribute.
> I'm assuming this is missing functionality that is intended to be patched up 
> (i.e. [like 
> this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).
> I'm not sure if there is an existing JIRA for this (my searches didn't turn 
> up clear results).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2018-06-22 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-17025:
-

Assignee: Ajay Saini

> Cannot persist PySpark ML Pipeline model that includes custom Transformer
> -
>
> Key: SPARK-17025
> URL: https://issues.apache.org/jira/browse/SPARK-17025
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Assignee: Ajay Saini
>Priority: Minor
> Fix For: 2.3.0
>
>
> Following the example in [this Databricks blog 
> post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html]
>  under "Python tuning", I'm trying to save an ML Pipeline model.
> This pipeline, however, includes a custom transformer. When I try to save the 
> model, the operation fails because the custom transformer doesn't have a 
> {{_to_java}} attribute.
> {code}
> Traceback (most recent call last):
>   File ".../file.py", line 56, in 
> model.bestModel.save('model')
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 222, in save
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 217, in write
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py",
>  line 93, in __init__
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 254, in _to_java
> AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
> {code}
> Looking at the source code for 
> [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py],
>  I see that not even the base Transformer class has such an attribute.
> I'm assuming this is missing functionality that is intended to be patched up 
> (i.e. [like 
> this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).
> I'm not sure if there is an existing JIRA for this (my searches didn't turn 
> up clear results).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2018-06-22 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-17025.
---
   Resolution: Fixed
Fix Version/s: 2.3.0

Fixed by linked JIRAs

> Cannot persist PySpark ML Pipeline model that includes custom Transformer
> -
>
> Key: SPARK-17025
> URL: https://issues.apache.org/jira/browse/SPARK-17025
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
> Fix For: 2.3.0
>
>
> Following the example in [this Databricks blog 
> post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html]
>  under "Python tuning", I'm trying to save an ML Pipeline model.
> This pipeline, however, includes a custom transformer. When I try to save the 
> model, the operation fails because the custom transformer doesn't have a 
> {{_to_java}} attribute.
> {code}
> Traceback (most recent call last):
>   File ".../file.py", line 56, in 
> model.bestModel.save('model')
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 222, in save
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 217, in write
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py",
>  line 93, in __init__
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 254, in _to_java
> AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
> {code}
> Looking at the source code for 
> [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py],
>  I see that not even the base Transformer class has such an attribute.
> I'm assuming this is missing functionality that is intended to be patched up 
> (i.e. [like 
> this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).
> I'm not sure if there is an existing JIRA for this (my searches didn't turn 
> up clear results).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24632) Allow 3rd-party libraries to use pyspark.ml abstractions for Java wrappers for persistence

2018-06-22 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-24632:
-

 Summary: Allow 3rd-party libraries to use pyspark.ml abstractions 
for Java wrappers for persistence
 Key: SPARK-24632
 URL: https://issues.apache.org/jira/browse/SPARK-24632
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 2.4.0
Reporter: Joseph K. Bradley


This is a follow-up for [SPARK-17025], which allowed users to implement Python 
PipelineStages in 3rd-party libraries, include them in Pipelines, and use 
Pipeline persistence.  This task is to make it easier for 3rd-party libraries 
to have PipelineStages written in Java and then to use pyspark.ml abstractions 
to create wrappers around those Java classes.  This is currently possible, 
except that users hit bugs around persistence.

One fix we'll need is an overridable method for converting between Python and 
Java classpaths. See 
https://github.com/apache/spark/blob/b56e9c613fb345472da3db1a567ee129621f6bf3/python/pyspark/ml/util.py#L284

One unusual thing for this task will be to write unit tests which test a custom 
PipelineStage written outside of the pyspark package.
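
As a rough illustration of the pattern this targets (a sketch, not a recipe: the package and Java class names are hypothetical), a 3rd-party Java-wrapper stage today looks roughly like the code below.  Loading such a stage fails because JavaMLReader derives the Java class name by rewriting the Python module path (pyspark -> org.apache.spark), which is wrong for classes outside the pyspark package; making that mapping overridable is the fix referenced above.

{code:python}
from pyspark.ml.util import JavaMLReadable, JavaMLWritable
from pyspark.ml.wrapper import JavaTransformer

class ThirdPartyFeaturizer(JavaTransformer, JavaMLReadable, JavaMLWritable):
    """Python wrapper for a Java PipelineStage living outside org.apache.spark (hypothetical)."""
    def __init__(self):
        super(ThirdPartyFeaturizer, self).__init__()
        # Backing Java object; "com.example.ml.ThirdPartyFeaturizer" is a made-up class name.
        self._java_obj = self._new_java_obj("com.example.ml.ThirdPartyFeaturizer", self.uid)
{code}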



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17025) Cannot persist PySpark ML Pipeline model that includes custom Transformer

2018-06-22 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-17025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16520646#comment-16520646
 ] 

Joseph K. Bradley commented on SPARK-17025:
---

We've tested it with Python-only implementations, and it works.  You end up 
hitting problems if you use the (sort of private) Python abstractions for Java 
wrappers like JavaMLWritable.  I'll create a follow-up issue for that and close 
this one as fixed.
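
For reference, a minimal sketch of the Python-only pattern that works (assumes Spark 2.3+, where DefaultParamsReadable/DefaultParamsWritable exist in pyspark.ml.util, and an active SparkSession named {{spark}}; the class and column names are made up):

{code:python}
from pyspark.ml import Pipeline, Transformer
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.sql.functions import col

class ColumnCopier(Transformer, DefaultParamsReadable, DefaultParamsWritable):
    """Python-only Transformer: copies column 'x' into 'x_copy'."""
    def _transform(self, dataset):
        return dataset.withColumn("x_copy", col("x"))

# Persisting a PipelineModel containing the custom stage works because
# DefaultParamsWritable supplies the write() implementation.
df = spark.createDataFrame([(1.0,), (2.0,)], ["x"])
model = Pipeline(stages=[ColumnCopier()]).fit(df)
model.write().overwrite().save("/tmp/model_with_custom_stage")
{code}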

> Cannot persist PySpark ML Pipeline model that includes custom Transformer
> -
>
> Key: SPARK-17025
> URL: https://issues.apache.org/jira/browse/SPARK-17025
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.0.0
>Reporter: Nicholas Chammas
>Priority: Minor
>
> Following the example in [this Databricks blog 
> post|https://databricks.com/blog/2016/05/31/apache-spark-2-0-preview-machine-learning-model-persistence.html]
>  under "Python tuning", I'm trying to save an ML Pipeline model.
> This pipeline, however, includes a custom transformer. When I try to save the 
> model, the operation fails because the custom transformer doesn't have a 
> {{_to_java}} attribute.
> {code}
> Traceback (most recent call last):
>   File ".../file.py", line 56, in 
> model.bestModel.save('model')
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 222, in save
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 217, in write
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/util.py",
>  line 93, in __init__
>   File 
> "/usr/local/Cellar/apache-spark/2.0.0/libexec/python/lib/pyspark.zip/pyspark/ml/pipeline.py",
>  line 254, in _to_java
> AttributeError: 'PeoplePairFeaturizer' object has no attribute '_to_java'
> {code}
> Looking at the source code for 
> [ml/base.py|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/base.py],
>  I see that not even the base Transformer class has such an attribute.
> I'm assuming this is missing functionality that is intended to be patched up 
> (i.e. [like 
> this|https://github.com/apache/spark/blob/acaf2a81ad5238fd1bc81e7be2c328f40c07e755/python/pyspark/ml/classification.py#L1421-L1433]).
> I'm not sure if there is an existing JIRA for this (my searches didn't turn 
> up clear results).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4591) Algorithm/model parity for spark.ml (Scala)

2018-06-22 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16520635#comment-16520635
 ] 

Joseph K. Bradley commented on SPARK-4591:
--

There are still a few contained tasks which are incomplete.  I'd like to leave 
this open for now.

> Algorithm/model parity for spark.ml (Scala)
> ---
>
> Key: SPARK-4591
> URL: https://issues.apache.org/jira/browse/SPARK-4591
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Xiangrui Meng
>Priority: Critical
>
> This is an umbrella JIRA for porting spark.mllib implementations to use the 
> DataFrame-based API defined under spark.ml.  We want to achieve critical 
> feature parity for the next release.
> h3. Instructions for 3 subtask types
> *Review tasks*: detailed review of a subpackage to identify feature gaps 
> between spark.mllib and spark.ml.
> * Should be listed as a subtask of this umbrella.
> * Review subtasks cover major algorithm groups.  To pick up a review subtask, 
> please:
> ** Comment that you are working on it.
> ** Compare the public APIs of spark.ml vs. spark.mllib.
> ** Comment on all missing items within spark.ml: algorithms, models, methods, 
> features, etc.
> ** Check for existing JIRAs covering those items.  If there is no existing 
> JIRA, create one, and link it to your comment.
> *Critical tasks*: higher priority missing features which are required for 
> this umbrella JIRA.
> * Should be linked as "requires" links.
> *Other tasks*: lower priority missing features which can be completed after 
> the critical tasks.
> * Should be linked as "contains" links.
> h4. Excluded items
> This does *not* include:
> * Python: We can compare Scala vs. Python in spark.ml itself.
> * Moving linalg to spark.ml: [SPARK-13944]
> * Streaming ML: Requires stabilizing some internal APIs of structured 
> streaming first
> h3. TODO list
> *Critical issues*
> * [SPARK-14501]: Frequent Pattern Mining
> * [SPARK-14709]: linear SVM
> * [SPARK-15784]: Power Iteration Clustering (PIC)
> *Lower priority issues*
> * Missing methods within algorithms (see Issue Links below)
> * evaluation submodule
> * stat submodule (should probably be covered in DataFrames)
> * Developer-facing submodules:
> ** optimization (including [SPARK-17136])
> ** random, rdd
> ** util
> *To be prioritized*
> * single-instance prediction: [SPARK-10413]
> * pmml [SPARK-11171]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11107) spark.ml should support more input column types: umbrella

2018-06-22 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16520634#comment-16520634
 ] 

Joseph K. Bradley commented on SPARK-11107:
---

There are still lots of Transformers and Estimators which should support more 
types, but I"m OK closing this since I don't have time to work on it.

> spark.ml should support more input column types: umbrella
> -
>
> Key: SPARK-11107
> URL: https://issues.apache.org/jira/browse/SPARK-11107
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Major
>
> This is an umbrella for expanding the set of data types which spark.ml 
> Pipeline stages can take.  This should not involve breaking APIs, but merely 
> involve slight changes such as supporting all Numeric types instead of just 
> Double.
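
As a small illustrative sketch of the kind of change this umbrella asks for (accept any NumericType rather than only DoubleType and cast internally; the helper and column names are made up):

{code:python}
from pyspark.sql.functions import col
from pyspark.sql.types import DoubleType, NumericType

def as_double(df, c):
    """Accept any numeric input column and cast it to double for the algorithm."""
    dt = df.schema[c].dataType
    if not isinstance(dt, NumericType):
        raise TypeError("Column %s must be numeric, got %s" % (c, dt))
    return df.withColumn(c, col(c).cast(DoubleType()))
{code}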



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11107) spark.ml should support more input column types: umbrella

2018-06-22 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-11107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-11107.
---
Resolution: Done

> spark.ml should support more input column types: umbrella
> -
>
> Key: SPARK-11107
> URL: https://issues.apache.org/jira/browse/SPARK-11107
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Major
>
> This is an umbrella for expanding the set of data types which spark.ml 
> Pipeline stages can take.  This should not involve breaking APIs, but merely 
> involve slight changes such as supporting all Numeric types instead of just 
> Double.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24467) VectorAssemblerEstimator

2018-06-13 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511852#comment-16511852
 ] 

Joseph K. Bradley commented on SPARK-24467:
---

True, we would have to make VectorAssembler inherit from Model.  Since 
VectorAssembler has public constructors and is not a final class, that would 
technically be a breaking change, so this might not work :( ...at least not until 
Spark 3.0.  That said, it seems like a benign enough breaking change to put in a 
major Spark release.

> VectorAssemblerEstimator
> 
>
> Key: SPARK-24467
> URL: https://issues.apache.org/jira/browse/SPARK-24467
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> In [SPARK-22346], I believe I made a wrong API decision: I recommended added 
> `VectorSizeHint` instead of making `VectorAssembler` into an Estimator since 
> I thought the latter option would break most workflows.  However, I should 
> have proposed:
> * Add a Param to VectorAssembler for specifying the sizes of Vectors in the 
> inputCols.  This Param can be optional.  If not given, then VectorAssembler 
> will behave as it does now.  If given, then VectorAssembler can use that info 
> instead of figuring out the Vector sizes via metadata or examining Rows in 
> the data (though it could do consistency checks).
> * Add a VectorAssemblerEstimator which gets the Vector lengths from data and 
> produces a VectorAssembler with the vector lengths Param specified.
> This will not break existing workflows.  Migrating to 
> VectorAssemblerEstimator will be easier than adding VectorSizeHint since it 
> will not require users to manually input Vector lengths.
> Note: Even with this Estimator, VectorSizeHint might prove useful for other 
> things in the future which require vector length metadata, so we could 
> consider keeping it rather than deprecating it.
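
For context, the workaround added by SPARK-22346 looks like this today (a sketch; the column names and size are made up), which is the manual step the proposal above would let users avoid:

{code:python}
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, VectorSizeHint

# VectorSizeHint attaches size metadata so VectorAssembler can plan without scanning rows.
size_hint = VectorSizeHint(inputCol="user_features", size=3, handleInvalid="error")
assembler = VectorAssembler(inputCols=["user_features", "age"], outputCol="features")
pipeline = Pipeline(stages=[size_hint, assembler])
{code}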



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15882) Discuss distributed linear algebra in spark.ml package

2018-06-13 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-15882?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511845#comment-16511845
 ] 

Joseph K. Bradley commented on SPARK-15882:
---

I'm afraid I don't have time to prioritize this work now.  It'd be awesome if 
others could pick it up, but I'm having to step back a bit from this work at 
this time.

> Discuss distributed linear algebra in spark.ml package
> --
>
> Key: SPARK-15882
> URL: https://issues.apache.org/jira/browse/SPARK-15882
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Major
>
> This JIRA is for discussing how org.apache.spark.mllib.linalg.distributed.* 
> should be migrated to org.apache.spark.ml.
> Initial questions:
> * Should we use Datasets or RDDs underneath?
> * If Datasets, are there missing features needed for the migration?
> * Do we want to redesign any aspects of the distributed matrices during this 
> move?
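
For reference, this is the RDD-based API under discussion (a sketch; assumes an active SparkContext named {{sc}}):

{code:python}
from pyspark.mllib.linalg.distributed import RowMatrix

# Current spark.mllib distributed matrix: built from an RDD of rows.
mat = RowMatrix(sc.parallelize([[1.0, 2.0], [3.0, 4.0]]))
print(mat.numRows(), mat.numCols())
{code}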



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3723) DecisionTree, RandomForest: Add more instrumentation

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-3723?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-3723:
-
Shepherd:   (was: Joseph K. Bradley)

> DecisionTree, RandomForest: Add more instrumentation
> 
>
> Key: SPARK-3723
> URL: https://issues.apache.org/jira/browse/SPARK-3723
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> Some simple instrumentation would help advanced users understand performance, 
> and to check whether parameters (such as maxMemoryInMB) need to be tuned.
> Most important instrumentation (simple):
> * min, avg, max nodes per group
> * number of groups (passes over data)
> More advanced instrumentation:
> * For each tree (or averaged over trees), training set accuracy after 
> training each level.  This would be useful for visualizing learning behavior 
> (to convince oneself that model selection was being done correctly).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3727) Trees and ensembles: More prediction functionality

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-3727?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-3727:
-
Shepherd:   (was: Joseph K. Bradley)

> Trees and ensembles: More prediction functionality
> --
>
> Key: SPARK-3727
> URL: https://issues.apache.org/jira/browse/SPARK-3727
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Major
>
> DecisionTree and RandomForest currently predict the most likely label for 
> classification and the mean for regression.  Other info about predictions 
> would be useful.
> For classification: estimated probability of each possible label
> For regression: variance of estimate
> RandomForest could also create aggregate predictions in multiple ways:
> * Predict mean or median value for regression.
> * Compute variance of estimates (across all trees) for both classification 
> and regression.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5362) Gradient and Optimizer to support generic output (instead of label) and data batches

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-5362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-5362:
-
Shepherd:   (was: Joseph K. Bradley)

> Gradient and Optimizer to support generic output (instead of label) and data 
> batches
> 
>
> Key: SPARK-5362
> URL: https://issues.apache.org/jira/browse/SPARK-5362
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 1.2.0
>Reporter: Alexander Ulanov
>Priority: Major
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Currently, Gradient and Optimizer interfaces support data in form of 
> RDD[Double, Vector] which refers to label and features. This limits its 
> application to classification problems. For example, artificial neural 
> network demands Vector as output (instead of label: Double). Moreover, 
> current interface does not support data batches. I propose to replace label: 
> Double with output: Vector. It enables passing generic output instead of 
> label and also passing data and output batches stored in corresponding 
> vectors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5556) Latent Dirichlet Allocation (LDA) using Gibbs sampler

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-5556:
-
Shepherd:   (was: Joseph K. Bradley)

> Latent Dirichlet Allocation (LDA) using Gibbs sampler 
> --
>
> Key: SPARK-5556
> URL: https://issues.apache.org/jira/browse/SPARK-5556
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Guoqiang Li
>Assignee: Pedro Rodriguez
>Priority: Major
> Attachments: LDA_test.xlsx, spark-summit.pptx
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-4591) Algorithm/model parity for spark.ml (Scala)

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-4591:
-
Shepherd:   (was: Joseph K. Bradley)

> Algorithm/model parity for spark.ml (Scala)
> ---
>
> Key: SPARK-4591
> URL: https://issues.apache.org/jira/browse/SPARK-4591
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Xiangrui Meng
>Priority: Critical
>
> This is an umbrella JIRA for porting spark.mllib implementations to use the 
> DataFrame-based API defined under spark.ml.  We want to achieve critical 
> feature parity for the next release.
> h3. Instructions for 3 subtask types
> *Review tasks*: detailed review of a subpackage to identify feature gaps 
> between spark.mllib and spark.ml.
> * Should be listed as a subtask of this umbrella.
> * Review subtasks cover major algorithm groups.  To pick up a review subtask, 
> please:
> ** Comment that you are working on it.
> ** Compare the public APIs of spark.ml vs. spark.mllib.
> ** Comment on all missing items within spark.ml: algorithms, models, methods, 
> features, etc.
> ** Check for existing JIRAs covering those items.  If there is no existing 
> JIRA, create one, and link it to your comment.
> *Critical tasks*: higher priority missing features which are required for 
> this umbrella JIRA.
> * Should be linked as "requires" links.
> *Other tasks*: lower priority missing features which can be completed after 
> the critical tasks.
> * Should be linked as "contains" links.
> h4. Excluded items
> This does *not* include:
> * Python: We can compare Scala vs. Python in spark.ml itself.
> * Moving linalg to spark.ml: [SPARK-13944]
> * Streaming ML: Requires stabilizing some internal APIs of structured 
> streaming first
> h3. TODO list
> *Critical issues*
> * [SPARK-14501]: Frequent Pattern Mining
> * [SPARK-14709]: linear SVM
> * [SPARK-15784]: Power Iteration Clustering (PIC)
> *Lower priority issues*
> * Missing methods within algorithms (see Issue Links below)
> * evaluation submodule
> * stat submodule (should probably be covered in DataFrames)
> * Developer-facing submodules:
> ** optimization (including [SPARK-17136])
> ** random, rdd
> ** util
> *To be prioritized*
> * single-instance prediction: [SPARK-10413]
> * pmml [SPARK-11171]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8799) OneVsRestModel should extend ClassificationModel

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-8799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-8799:
-
Shepherd:   (was: Joseph K. Bradley)

> OneVsRestModel should extend ClassificationModel
> 
>
> Key: SPARK-8799
> URL: https://issues.apache.org/jira/browse/SPARK-8799
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Feynman Liang
>Priority: Minor
>
> Many parts of `OneVsRestModel` can be generalized to `ClassificationModel`. 
> For example:
>  * `accColName` can be used to populate `ClassificationModel#predictRaw` and 
> share implementations of `transform`
>  * SPARK-8092 adds `setFeaturesCol` and `setPredictionCol` which could be 
> gotten for free through subclassing
> `ClassificationModel` is the correct supertype (e.g. not `PredictionModel`) 
> because the labels for a `OneVsRest` will always be discrete and finite.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9120) Add multivariate regression (or prediction) interface

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9120:
-
Shepherd:   (was: Joseph K. Bradley)

> Add multivariate regression (or prediction) interface
> -
>
> Key: SPARK-9120
> URL: https://issues.apache.org/jira/browse/SPARK-9120
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Alexander Ulanov
>Priority: Major
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> org.apache.spark.ml.regression.RegressionModel supports prediction only for a 
> single variable with a method "predict:Double" by extending the Predictor. 
> There is a need for multivariate prediction, at least for regression. I 
> propose to modify "RegressionModel" interface similarly to how it is done in 
> "ClassificationModel", which supports multiclass classification. It has 
> "predict:Double" and "predictRaw:Vector". Analogously, "RegressionModel" 
> should have something like "predictMultivariate:Vector".
> Update: After reading the design docs, adding "predictMultivariate" to 
> RegressionModel does not seem reasonable to me anymore. The issue is as 
> follows. RegressionModel has "predict:Double". Its "train" method uses 
> "predict:Double" for prediction, i.e. PredictionModel (and RegressionModel) 
> is hard-coded to have only one output. There exists a similar problem in MLlib 
> (https://issues.apache.org/jira/browse/SPARK-5362). 
> A possible solution might require redesigning the class hierarchy or adding a 
> separate interface that extends Model, though the latter means code duplication.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8767) Abstractions for InputColParam, OutputColParam

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-8767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-8767:
-
Shepherd:   (was: Joseph K. Bradley)

> Abstractions for InputColParam, OutputColParam
> --
>
> Key: SPARK-8767
> URL: https://issues.apache.org/jira/browse/SPARK-8767
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Major
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>
> I'd like to create Param subclasses for output and input columns.  These will 
> provide easier schema checking, which could even be done automatically in an 
> abstraction rather than in each class.  That should simplify things for 
> developers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8799) OneVsRestModel should extend ClassificationModel

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-8799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-8799:
-
Target Version/s:   (was: 3.0.0)

> OneVsRestModel should extend ClassificationModel
> 
>
> Key: SPARK-8799
> URL: https://issues.apache.org/jira/browse/SPARK-8799
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Feynman Liang
>Priority: Minor
>
> Many parts of `OneVsRestModel` can be generalized to `ClassificationModel`. 
> For example:
>  * `accColName` can be used to populate `ClassificationModel#predictRaw` and 
> share implementations of `transform`
>  * SPARK-8092 adds `setFeaturesCol` and `setPredictionCol` which could be 
> gotten for free through subclassing
> `ClassificationModel` is the correct supertype (e.g. not `PredictionModel`) 
> because the labels for a `OneVsRest` will always be discrete and finite.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-7424) spark.ml classification, regression abstractions should add metadata to output column

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-7424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-7424:
-
Shepherd:   (was: Joseph K. Bradley)

> spark.ml classification, regression abstractions should add metadata to 
> output column
> -
>
> Key: SPARK-7424
> URL: https://issues.apache.org/jira/browse/SPARK-7424
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Major
>
> Update ClassificationModel, ProbabilisticClassificationModel prediction to 
> include numClasses in output column metadata.
> Update RegressionModel to specify output column metadata as well.
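
A sketch of what attaching such metadata looks like from the Python side (Spark 2.2+ supports the {{metadata}} argument to {{Column.alias}}; {{predictions}} is assumed to be a prediction DataFrame, and the attribute keys below only approximate MLlib's ML attribute format):

{code:python}
from pyspark.sql.functions import col

# Attach numClasses-style metadata to the prediction column (keys are illustrative).
meta = {"ml_attr": {"type": "nominal", "num_vals": 3}}
with_meta = predictions.withColumn(
    "prediction", col("prediction").alias("prediction", metadata=meta))
{code}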



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21166) Automated ML persistence

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-21166:
--
Shepherd:   (was: Joseph K. Bradley)

> Automated ML persistence
> 
>
> Key: SPARK-21166
> URL: https://issues.apache.org/jira/browse/SPARK-21166
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> This JIRA is for discussing the possibility of automating ML persistence.  
> Currently, custom save/load methods are written for every Model.  However, we 
> could design a mixin which provides automated persistence, inspecting model 
> data and Params and reading/writing (known types) automatically.  This was 
> brought up in discussions with developers behind 
> https://github.com/azure/mmlspark
> Some issues we will need to consider:
> * Providing generic mixin usable in most or all cases
> * Handling corner cases (strange Param types, etc.)
> * Backwards compatibility (loading models saved by old Spark versions)
> Because of backwards compatibility in particular, it may make sense to 
> implement testing for that first, before we try to address automated 
> persistence: [SPARK-15573]
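
A very rough sketch of the kind of generic Param inspection such a mixin could build on (illustrative only, not a design; only scalar-valued Params are handled here):

{code:python}
import json

def params_as_json(stage):
    """Serialize the explicitly set and default Params of a pyspark.ml stage."""
    param_map = stage.extractParamMap()
    scalars = {p.name: v for p, v in param_map.items()
               if isinstance(v, (bool, int, float, str))}
    return json.dumps(scalars)
{code}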



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14585) Provide accessor methods for Pipeline stages

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-14585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-14585:
--
Shepherd:   (was: Joseph K. Bradley)

> Provide accessor methods for Pipeline stages
> 
>
> Key: SPARK-14585
> URL: https://issues.apache.org/jira/browse/SPARK-14585
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Major
>
> It is currently hard to access particular stages in a Pipeline or 
> PipelineModel.  Some accessor methods would help.
> Scala:
> {code}
> class Pipeline {
>   /** Returns stage at index i in Pipeline */
>   def getStage[T <: PipelineStage](i: Int): T
>   /** Returns all stages of this type */
>   def getStagesOfType[T <: PipelineStage]: Array[T]
> }
> class PipelineModel {
>   /** Returns stage at index i in PipelineModel */
>   def getStage[T <: Transformer](i: Int): T
>   /**
>* Returns stage given its parent or generating instance in PipelineModel.
>* E.g., if this PipelineModel was created from a Pipeline containing a 
> stage 
>* {{myStage}}, then passing {{myStage}} to this method will return the
>* corresponding stage in this PipelineModel.
>*/
>   def getStage[T <: Transformer][implicit E <: PipelineStage](stage: E): T
>   /** Returns all stages of this type */
>   def getStagesOfType[T <: Transformer]: Array[T]
> }
> {code}
> These methods should not be recursive for now.  I.e., if a Pipeline A 
> contains another Pipeline B, then calling {{getStage}} on the outer Pipeline 
> A should not search for the stage within Pipeline B.
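
For comparison, the workaround today in PySpark is positional access (a sketch; assumes a fitted PipelineModel named {{pipeline_model}}):

{code:python}
# Stages can only be reached by index or by scanning the list today.
first_stage = pipeline_model.stages[0]
last_stage = pipeline_model.stages[-1]
scalers = [s for s in pipeline_model.stages if type(s).__name__ == "StandardScalerModel"]
{code}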



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19591) Add sample weights to decision trees

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-19591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-19591:
--
Shepherd:   (was: Joseph K. Bradley)

> Add sample weights to decision trees
> 
>
> Key: SPARK-19591
> URL: https://issues.apache.org/jira/browse/SPARK-19591
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Seth Hendrickson
>Assignee: Seth Hendrickson
>Priority: Major
>
> Add sample weights to decision trees.  See [SPARK-9478] for details on the 
> design.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-9140) Replace TimeTracker by Stopwatch

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-9140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9140:
-
Shepherd:   (was: Joseph K. Bradley)

> Replace TimeTracker by Stopwatch
> 
>
> Key: SPARK-9140
> URL: https://issues.apache.org/jira/browse/SPARK-9140
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, MLlib
>Affects Versions: 1.5.0
>Reporter: Xiangrui Meng
>Priority: Minor
>
> We can replace TImeTracker in tree implementations by Stopwatch. The initial 
> PR could use local stopwatches only.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15573) Backwards-compatible persistence for spark.ml

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-15573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-15573:
--
Shepherd:   (was: Joseph K. Bradley)

> Backwards-compatible persistence for spark.ml
> -
>
> Key: SPARK-15573
> URL: https://issues.apache.org/jira/browse/SPARK-15573
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Major
>
> This JIRA is for imposing backwards-compatible persistence for the 
> DataFrames-based API for MLlib.  I.e., we want to be able to load models 
> saved in previous versions of Spark.  We will not require loading models 
> saved in later versions of Spark.
> This requires:
> * Putting unit tests in place to check loading models from previous versions
> * Notifying all committers active on MLlib to be aware of this requirement in 
> the future
> The unit tests could be written as in spark.mllib, where we essentially 
> copied and pasted the save() code every time it changed.  This happens 
> rarely, so it should be acceptable, though other designs are fine.
> Subtasks of this JIRA should cover checking and adding tests for existing 
> cases, such as KMeansModel (whose format changed between 1.6 and 2.0).
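
A sketch of what such a unit test could look like in PySpark (the fixture path and the choice of KMeansModel are illustrative; the fixture would be a model directory written by an older Spark release and checked into the test resources):

{code:python}
from pyspark.ml.clustering import KMeansModel

def test_load_kmeans_model_saved_by_older_spark():
    # Path to a checked-in model directory saved by a previous Spark version (hypothetical).
    model = KMeansModel.load("src/test/resources/ml-models/kmeans-saved-by-old-spark")
    assert len(model.clusterCenters()) > 0
{code}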



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-19498) Discussion: Making MLlib APIs extensible for 3rd party libraries

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-19498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-19498:
--
Shepherd:   (was: Joseph K. Bradley)

> Discussion: Making MLlib APIs extensible for 3rd party libraries
> 
>
> Key: SPARK-19498
> URL: https://issues.apache.org/jira/browse/SPARK-19498
> Project: Spark
>  Issue Type: Brainstorming
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Priority: Critical
>
> Per the recent discussion on the dev list, this JIRA is for discussing how we 
> can make MLlib DataFrame-based APIs more extensible, especially for the 
> purpose of writing 3rd-party libraries with APIs extended from the MLlib APIs 
> (for custom Transformers, Estimators, etc.).
> * For people who have written such libraries, what issues have you run into?
> * What APIs are not public or extensible enough?  Do they require changes 
> before being made more public?
> * Are APIs for non-Scala languages such as Java and Python friendly or 
> extensive enough?
> The easy answer is to make everything public, but that would of course be 
> terrible in the long term.  Let's discuss what is needed and how we can present 
> stable, sufficient, and easy-to-use APIs for 3rd-party developers.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24359) SPIP: ML Pipelines in R

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-24359:
--
Shepherd: Xiangrui Meng  (was: Joseph K. Bradley)

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R-v3.pdf, SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s 
> API is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in API, we will choose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity.
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  * *Being natural to R users:* Ultimate users of this package are R users and 
> they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as 
> the first argument of the method:  do_something(obj, arg1, arg2). All 
> functions are snake_case (e.g., {{spark_logistic_regression()}} and 
> {{set_max_iter()}}). If a constructor gets arguments, they will be named 
> arguments. For example:
> {code:java}
> > lr <- set_reg_param(set_max_iter(spark.logistic.regression(), 10), 0.1){code}
> When calls need to be chained, as in the above example, the syntax translates 
> naturally to a pipeline style with help from the very popular [magrittr 
> package|https://cran.r-project.org/web/packages/magrittr/index.html]. For 
> example:
> {code:java}
> > logistic_regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> 
> > lr{code}
> h2. Namespace
> All new API will be under a new CRAN package, named SparkML. The package 
> should be usable without needing SparkR in the namespace. The package will 
> introduce a number of S4 classes that inherit from four basic classes. Here 
> we will list the basic types with a few 

[jira] [Updated] (SPARK-24097) Instruments improvements - RandomForest and GradientBoostedTree

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-24097:
--
Shepherd:   (was: Joseph K. Bradley)

> Instruments improvements - RandomForest and GradientBoostedTree
> ---
>
> Key: SPARK-24097
> URL: https://issues.apache.org/jira/browse/SPARK-24097
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
>
> Instruments improvements - RandomForest and GradientBoostedTree



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21926) Compatibility between ML Transformers and Structured Streaming

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-21926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-21926:
--
Shepherd:   (was: Joseph K. Bradley)

> Compatibility between ML Transformers and Structured Streaming
> --
>
> Key: SPARK-21926
> URL: https://issues.apache.org/jira/browse/SPARK-21926
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, Structured Streaming
>Affects Versions: 2.2.0
>Reporter: Bago Amirbekian
>Priority: Major
>
> We've run into a few cases where ML components don't play nice with streaming 
> dataframes (for prediction). This ticket is meant to help aggregate these 
> known cases in one place and provide a place to discuss possible fixes.
> Failing cases:
> 1) VectorAssembler where one of the inputs is a VectorUDT column with no 
> metadata.
> Possible fixes:
> More details here SPARK-22346.
> 2) OneHotEncoder where the input is a column with no metadata.
> Possible fixes:
> a) Make OneHotEncoder an estimator (SPARK-13030).
> -b) Allow user to set the cardinality of OneHotEncoder.-
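
For context, the failing scenario is scoring a streaming DataFrame with a fitted pipeline, roughly as below (a sketch; the schema, paths, and {{pipeline_model}} are assumed):

{code:python}
# Score a stream with a fitted PipelineModel; stages such as VectorAssembler can fail
# when the incoming columns lack the metadata they expect.
stream_df = spark.readStream.schema(input_schema).parquet("/data/incoming")
scored = pipeline_model.transform(stream_df)
query = scored.writeStream.format("memory").queryName("scored").start()
{code}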



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24212) PrefixSpan in spark.ml: user guide section

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-24212:
--
Shepherd:   (was: Joseph K. Bradley)

> PrefixSpan in spark.ml: user guide section
> --
>
> Key: SPARK-24212
> URL: https://issues.apache.org/jira/browse/SPARK-24212
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> See linked JIRA for the PrefixSpan API for which we need to write a user 
> guide page.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14376) spark.ml parity for trees

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-14376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-14376.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

> spark.ml parity for trees
> -
>
> Key: SPARK-14376
> URL: https://issues.apache.org/jira/browse/SPARK-14376
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
> Fix For: 2.4.0
>
>
> Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all 
> functionality.  List all missing items.
> This only covers Scala since we can compare Scala vs. Python in spark.ml 
> itself.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-10817) ML abstraction umbrella

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-10817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-10817:
-

Assignee: (was: Joseph K. Bradley)

> ML abstraction umbrella
> ---
>
> Key: SPARK-10817
> URL: https://issues.apache.org/jira/browse/SPARK-10817
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Major
>
> This is an umbrella for discussing and creating ML abstractions.  This was 
> originally handled under [SPARK-1856] and [SPARK-3702], under which we 
> created the Pipelines API and some Developer APIs for classification and 
> regression.
> This umbrella is for future work, including:
> * Stabilizing the classification and regression APIs
> * Discussing traits vs. abstract classes for abstraction APIs
> * Creating other abstractions not yet covered (clustering, multilabel 
> prediction, etc.)
> Note that [SPARK-3702] still has useful discussion and design docs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-5572) LDA improvement listing

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-5572:


Assignee: (was: Joseph K. Bradley)

> LDA improvement listing
> ---
>
> Key: SPARK-5572
> URL: https://issues.apache.org/jira/browse/SPARK-5572
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Major
>
> This is an umbrella JIRA for listing planned improvements to Latent Dirichlet 
> Allocation (LDA).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-4285) Transpose RDD[Vector] to column store for ML

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-4285?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-4285:


Assignee: (was: Joseph K. Bradley)

> Transpose RDD[Vector] to column store for ML
> 
>
> Key: SPARK-4285
> URL: https://issues.apache.org/jira/browse/SPARK-4285
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> For certain ML algorithms, a column store is more efficient than a row store 
> (which is currently used everywhere).  E.g., deep decision trees can be 
> faster to train when partitioning by features.
> Proposal: Provide a method with the following API (probably in util/):
> ```
> def rowToColumnStore(data: RDD[Vector]): RDD[(Int, Vector)]
> ```
> The input Vectors will be data rows/instances, and the output Vectors will be 
> columns/features paired with column/feature indices.
> **Question**: Is it important to maintain matrix structure?  That is, should 
> output Vectors in the same partition be adjacent columns in the matrix?
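
A rough PySpark sketch of the proposed helper (it ignores the matrix-structure question above and does a dense scan of each row; names follow the proposal):

{code:python}
from pyspark.mllib.linalg import Vectors

def row_to_column_store(rows):
    """rows: RDD[Vector] of instances -> RDD[(columnIndex, Vector)] of feature columns."""
    indexed = rows.zipWithIndex()                      # (rowVector, rowIndex)
    num_rows = indexed.count()
    cells = indexed.flatMap(
        lambda vi: [(j, (vi[1], float(x))) for j, x in enumerate(vi[0].toArray())])
    # Group all (rowIndex, value) cells of a column and build a sparse column vector.
    return cells.groupByKey().mapValues(
        lambda entries: Vectors.sparse(num_rows, sorted(entries)))
{code}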



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5572) LDA improvement listing

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-5572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-5572:
-
Shepherd:   (was: Joseph K. Bradley)

> LDA improvement listing
> ---
>
> Key: SPARK-5572
> URL: https://issues.apache.org/jira/browse/SPARK-5572
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Reporter: Joseph K. Bradley
>Priority: Major
>
> This is an umbrella JIRA for listing planned improvements to Latent Dirichlet 
> Allocation (LDA).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7206) Gaussian Mixture Model (GMM) improvements

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-7206?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-7206:


Assignee: (was: Joseph K. Bradley)

> Gaussian Mixture Model (GMM) improvements
> -
>
> Key: SPARK-7206
> URL: https://issues.apache.org/jira/browse/SPARK-7206
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Major
>
> This is an umbrella JIRA for listing improvements for GMMs:
> * planned improvements
> * optional/experimental work
> * tests for verifying scalability



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14376) spark.ml parity for trees

2018-06-13 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-14376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511751#comment-16511751
 ] 

Joseph K. Bradley commented on SPARK-14376:
---

Thanks!  I'll close it.

> spark.ml parity for trees
> -
>
> Key: SPARK-14376
> URL: https://issues.apache.org/jira/browse/SPARK-14376
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
> Fix For: 2.4.0
>
>
> Review parity of spark.ml vs. spark.mllib to ensure spark.ml contains all 
> functionality.  List all missing items.
> This only covers Scala since we can compare Scala vs. Python in spark.ml 
> itself.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14604) Modify design of ML model summaries

2018-06-13 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-14604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-14604:
-

Assignee: (was: Joseph K. Bradley)

> Modify design of ML model summaries
> ---
>
> Key: SPARK-14604
> URL: https://issues.apache.org/jira/browse/SPARK-14604
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Major
>
> Several spark.ml models now have summaries containing evaluation metrics and 
> training info:
> * LinearRegressionModel
> * LogisticRegressionModel
> * GeneralizedLinearRegressionModel
> These summaries have unfortunately been added in an inconsistent way.  I 
> propose to reorganize them to have:
> * For each model, 1 summary (without training info) and 1 training summary 
> (with info from training).  The non-training summary can be produced for a 
> new dataset via {{evaluate}}.
> * A summary should not store the model itself as a public field.
> * A summary should provide a transient reference to the dataset used to 
> produce the summary.
> This task will involve reorganizing the GLM summary (which lacks a 
> training/non-training distinction) and deprecating the model method in the 
> LinearRegressionSummary.
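For illustration, a minimal Scala sketch of the shape this reorganization might take; the trait and method names below are hypothetical stand-ins for the proposed layout (one dataset-level summary, one training summary, no public model field, a transient reference to the summarized data):

{code:scala}
import org.apache.spark.sql.DataFrame

// Summary that can be computed for any dataset; it does not expose the model itself.
trait RegressionSummary {
  /** Reference to the predictions on the summarized dataset (held transiently). */
  def predictions: DataFrame
  def rootMeanSquaredError: Double
}

// Extra information that only exists when the summary was produced during training.
trait TrainingSummary extends RegressionSummary {
  def objectiveHistory: Array[Double]
}

// A model would then expose something like:
//   def summary: TrainingSummary                         // available after fitting
//   def evaluate(dataset: DataFrame): RegressionSummary  // recompute metrics on new data
{code}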






[jira] [Commented] (SPARK-22666) Spark datasource for image format

2018-06-13 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-22666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511745#comment-16511745
 ] 

Joseph K. Bradley commented on SPARK-22666:
---

Side note: The Java library we use for reading images does not support as many 
image types as PIL, which we used in the image loader in 
https://github.com/databricks/spark-deep-learning. This should not block this 
task, but it would be worth checking whether there are alternative libraries we 
could use from Java.

> Spark datasource for image format
> -
>
> Key: SPARK-22666
> URL: https://issues.apache.org/jira/browse/SPARK-22666
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Timothy Hunter
>Priority: Major
>
> The current API for the new image format is implemented as a standalone 
> feature, in order to make it reside within the mllib package. As discussed in 
> SPARK-21866, users should be able to load images through the more common 
> spark source reader interface.
> This ticket is concerned with adding image reading support in the spark 
> source API, through either of the following interfaces:
>  - {{spark.read.format("image")...}}
>  - {{spark.read.image}}
> The output is a dataframe that contains images (and the file names for 
> example), following the semantics discussed already in SPARK-21866.
> A few technical notes:
> * since the functionality is implemented in {{mllib}}, calling this function 
> may fail at runtime if users have not imported the {{spark-mllib}} dependency
> * How to deal with very flat directories? It is common to have millions of 
> files in a single "directory" (like in S3), which has caused issues for some 
> users. If this issue is too complex to handle in this ticket, it can be dealt 
> with separately.
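For illustration, a short Scala sketch of how the reader could be invoked once the source is registered under the short name "image" as proposed here (the path is a placeholder, and the nested column names follow the schema discussed in SPARK-21866):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("image-source-example").getOrCreate()

// Load images through the generic data source API; each row carries an "image"
// struct (origin, height, width, nChannels, mode, data).
val images = spark.read
  .format("image")
  .load("/path/to/images")  // placeholder path

images.printSchema()
images.select("image.origin", "image.height", "image.width").show(5, truncate = false)
{code}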






[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R

2018-06-07 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16505049#comment-16505049
 ] 

Joseph K. Bradley commented on SPARK-24359:
---

It sounds like everyone agrees that we need to be able to release SparkML more 
frequently than Spark (because of CRAN issues).

The only remaining point of disagreement is whether we should keep SparkML in a 
separate repo for now.  Here are some pros & cons I can think of.  What do 
y'all think?

Pros of keeping SparkML in a separate repo:
* Tagging releases would be easier.  If a set of SparkML releases (2.4.0.1, 
2.4.0.2) should be tested against a single Spark release (2.4.0), then separate 
repos could make this tracking easier.  With a single repo, we would have to be 
careful to keep the 2.4.0.1 tag from containing changes not in 2.4.0, which 
would require some careful handling of release branches.
* Testing could be simpler.  When testing SparkML and Spark on the master 
branch, should SparkML test against Spark master or against the latest Spark 
release tag?

Cons of keeping SparkML in a separate repo:
* Changes in Spark could break SparkML.  SparkML's CI / PR tests would test 
against Spark, but we should not have Spark PRs test against SparkML since that 
could create a loop blocking any change.
* This might require more work to put build systems and CI tests in place.

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R-v3.pdf, SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s 
> API is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in API, we will choose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity. 
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in 

[jira] [Created] (SPARK-24467) VectorAssemblerEstimator

2018-06-05 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-24467:
-

 Summary: VectorAssemblerEstimator
 Key: SPARK-24467
 URL: https://issues.apache.org/jira/browse/SPARK-24467
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.4.0
Reporter: Joseph K. Bradley


In [SPARK-22346], I believe I made a wrong API decision: I recommended adding 
`VectorSizeHint` instead of making `VectorAssembler` into an Estimator since I 
thought the latter option would break most workflows.  However, I should have 
proposed:
* Add a Param to VectorAssembler for specifying the sizes of Vectors in the 
inputCols.  This Param can be optional.  If not given, then VectorAssembler 
will behave as it does now.  If given, then VectorAssembler can use that info 
instead of figuring out the Vector sizes via metadata or examining Rows in the 
data (though it could do consistency checks).
* Add a VectorAssemblerEstimator which gets the Vector lengths from data and 
produces a VectorAssembler with the vector lengths Param specified.

This will not break existing workflows.  Migrating to VectorAssemblerEstimator 
will be easier than adding VectorSizeHint since it will not require users to 
manually input Vector lengths.

Note: Even with this Estimator, VectorSizeHint might prove useful for other 
things in the future which require vector length metadata, so we could consider 
keeping it rather than deprecating it.
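A rough Scala sketch of how the two proposed pieces might be used together; the estimator class and the sizes Param below are hypothetical (they do not exist in Spark) and only illustrate the intent:

{code:scala}
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.DataFrame

// Today: VectorAssembler is a Transformer and must discover input vector sizes
// from column metadata or by examining rows at transform time.
val assembler = new VectorAssembler()
  .setInputCols(Array("features1", "features2"))
  .setOutputCol("features")

// Proposed (hypothetical): an optional Param carrying the known vector sizes,
// plus an Estimator that learns those sizes from the data once and returns a
// fully configured VectorAssembler:
//
//   val configured: VectorAssembler = new VectorAssemblerEstimator()
//     .setInputCols(Array("features1", "features2"))
//     .setOutputCol("features")
//     .fit(trainingDF)   // learns the vector lengths here
//
// The returned VectorAssembler could then transform data (including streaming
// data) without inspecting rows, because the sizes Param is already set.
def assemble(df: DataFrame): DataFrame = assembler.transform(df)
{code}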






[jira] [Updated] (SPARK-24465) LSHModel should support Structured Streaming for transform

2018-06-04 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-24465:
--
Description: 
Locality Sensitive Hashing (LSH) Models (BucketedRandomProjectionLSHModel, 
MinHashLSHModel) are not compatible with Structured Streaming (and I believe 
are the final Transformers which are not compatible).  These do not work 
because Spark SQL does not support nested types containing UDTs; see 
[SPARK-12878].

This task is to add unit tests for streaming (as in [SPARK-22644]) for 
LSHModels after [SPARK-12878] has been fixed.

> LSHModel should support Structured Streaming for transform
> --
>
> Key: SPARK-24465
> URL: https://issues.apache.org/jira/browse/SPARK-24465
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> Locality Sensitive Hashing (LSH) Models (BucketedRandomProjectionLSHModel, 
> MinHashLSHModel) are not compatible with Structured Streaming (and I believe 
> are the final Transformers which are not compatible).  These do not work 
> because Spark SQL does not support nested types containing UDTs; see 
> [SPARK-12878].
> This task is to add unit tests for streaming (as in [SPARK-22644]) for 
> LSHModels after [SPARK-12878] has been fixed.
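For illustration, a standalone Scala sketch of the kind of end-to-end check such a unit test would encode once [SPARK-12878] is fixed (real tests would go through the shared ML streaming test helpers rather than a raw MemoryStream; today this still fails because of the nested-UDT limitation):

{code:scala}
import org.apache.spark.ml.feature.MinHashLSH
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.streaming.MemoryStream

val spark = SparkSession.builder().master("local[2]").appName("lsh-streaming").getOrCreate()
import spark.implicits._

// Fit an LSH model on a tiny batch DataFrame.
val train = Seq(
  (0, Vectors.sparse(6, Seq((0, 1.0), (1, 1.0)))),
  (1, Vectors.sparse(6, Seq((2, 1.0), (3, 1.0))))
).toDF("id", "features")
val model = new MinHashLSH().setInputCol("features").setOutputCol("hashes").fit(train)

// Feed a streaming DataFrame through model.transform and collect the output.
implicit val sqlCtx = spark.sqlContext
val input = MemoryStream[(Int, Vector)]
val query = model.transform(input.toDS().toDF("id", "features"))
  .writeStream.format("memory").queryName("lsh_out").start()

input.addData((2, Vectors.sparse(6, Seq((0, 1.0), (3, 1.0)))))
query.processAllAvailable()
spark.sql("SELECT * FROM lsh_out").show(truncate = false)
{code}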






[jira] [Created] (SPARK-24465) LSHModel should support Structured Streaming for transform

2018-06-04 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-24465:
-

 Summary: LSHModel should support Structured Streaming for transform
 Key: SPARK-24465
 URL: https://issues.apache.org/jira/browse/SPARK-24465
 Project: Spark
  Issue Type: New Feature
  Components: ML
Affects Versions: 2.4.0
 Environment: Locality Sensitive Hashing (LSH) Models 
(BucketedRandomProjectionLSHModel, MinHashLSHModel) are not compatible with 
Structured Streaming (and I believe are the final Transformers which are not 
compatible).  These do not work because Spark SQL does not support nested types 
containing UDTs; see [SPARK-12878].

This task is to add unit tests for streaming (as in [SPARK-22644]) for 
LSHModels after [SPARK-12878] has been fixed.
Reporter: Joseph K. Bradley









[jira] [Updated] (SPARK-24465) LSHModel should support Structured Streaming for transform

2018-06-04 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-24465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-24465:
--
Environment: (was: Locality Sensitive Hashing (LSH) Models 
(BucketedRandomProjectionLSHModel, MinHashLSHModel) are not compatible with 
Structured Streaming (and I believe are the final Transformers which are not 
compatible).  These do not work because Spark SQL does not support nested types 
containing UDTs; see [SPARK-12878].

This task is to add unit tests for streaming (as in [SPARK-22644]) for 
LSHModels after [SPARK-12878] has been fixed.)

> LSHModel should support Structured Streaming for transform
> --
>
> Key: SPARK-24465
> URL: https://issues.apache.org/jira/browse/SPARK-24465
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Priority: Major
>







[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R

2018-05-31 Thread Joseph K. Bradley (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16497203#comment-16497203
 ] 

Joseph K. Bradley commented on SPARK-24359:
---

Clarification question: [~falaki] did you mean to say that the CRAN package 
SparkML will be updated for every *minor* release (2.3, 2.4, etc.)?  (I assume 
you did not mean every major release (3.0, 4.0, etc.) since those only happen 
every 2 years or so.)

I'd recommend we follow the same pattern as for the SparkR package: Updates to 
SparkML and SparkR will require official Spark releases, limiting us to 
patching SparkML only when there is a new Spark patch release (2.3.1, 2.3.2, 
etc.).  I feel like that's a lesser evil than the only other option I know of: 
splitting off SparkR and/or SparkML into completely separate projects under 
different Apache or non-Apache oversight.  What do you think?

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R-v3.pdf, SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s 
> API is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in API, we will choose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity. 
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  * *Being natural to R users:* Ultimate users of this package are R users and 
> they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as 
> the first argument of the method:  do_something(obj, arg1, arg2). All 
> functions are snake_case (e.g., {{spark_logistic_regression()}} and 
> {{set_max_iter()}}). If a constructor gets arguments, they will be named 
> arguments. For example:
> {code:java}
> > lr <- 

[jira] [Updated] (SPARK-22666) Spark datasource for image format

2018-05-29 Thread Joseph K. Bradley (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-22666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-22666:
--
Summary: Spark datasource for image format  (was: Spark reader source for 
image format)

> Spark datasource for image format
> -
>
> Key: SPARK-22666
> URL: https://issues.apache.org/jira/browse/SPARK-22666
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Timothy Hunter
>Priority: Major
>
> The current API for the new image format is implemented as a standalone 
> feature, in order to make it reside within the mllib package. As discussed in 
> SPARK-21866, users should be able to load images through the more common 
> spark source reader interface.
> This ticket is concerned with adding image reading support in the spark 
> source API, through either of the following interfaces:
>  - {{spark.read.format("image")...}}
>  - {{spark.read.image}}
> The output is a dataframe that contains images (and the file names for 
> example), following the semantics discussed already in SPARK-21866.
> A few technical notes:
> * since the functionality is implemented in {{mllib}}, calling this function 
> may fail at runtime if users have not imported the {{spark-mllib}} dependency
> * How to deal with very flat directories? It is common to have millions of 
> files in a single "directory" (like in S3), which has caused issues for some 
> users. If this issue is too complex to handle in this ticket, it can be dealt 
> with separately.






[jira] [Commented] (SPARK-24359) SPIP: ML Pipelines in R

2018-05-25 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16491412#comment-16491412
 ] 

Joseph K. Bradley commented on SPARK-24359:
---

Regarding separating repos: What's the conclusion?  If feasible, I really hope 
this can be in the apache/spark repo to encourage contributors to add R 
wrappers whenever they add new MLlib APIs (just like it's pretty easy to add 
Python wrappers nowadays).

Regarding CRAN releases: I'd expect it to be well worth it to say SparkML minor 
releases correspond to Spark minor releases.  Users should not expect SparkML 
2.4 to work with Spark 2.3 (since R would encounter missing Java APIs).  I'm 
less sure about patch releases.  (Ideally, this would all be solved by us 
following semantic versioning, but that would require that we never add 
Experimental APIs to SparkML.)  If we can solve the maintainability issues with 
CRAN compatibility via integration tests, then I figure it'd be ideal to treat 
SparkML just like SparkR and PySpark, releasing in sync with the rest of Spark. 
 Thoughts?

> SPIP: ML Pipelines in R
> ---
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 3.0.0
>Reporter: Hossein Falaki
>Priority: Major
>  Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines 
> in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also we propose a code-gen approach for this 
> package to minimize work needed to expose future MLlib API, but sparklyr’s 
> API is manually written.
> h1. Target Personas
>  * Existing SparkR users who need more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independent from SparkR
>  * After setting up a Spark session R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to SparkML R package which do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in API, we will choose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  * *Semantic clarity*: We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity. 
>  * *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  * *Interoperability with rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  * *Being natural to R users:* Ultimate users of this package are R users and 
> they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as 
> the first argument of the method:  do_something(obj, arg1, arg2). All 
> functions are snake_case (e.g., 

[jira] [Updated] (SPARK-24359) SPIP: ML Pipelines in R

2018-05-25 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-24359:
--
Description: 
h1. Background and motivation

SparkR supports calling MLlib functionality with an [R-friendly 
API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
 Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
 has matured significantly. It allows users to build and maintain complicated 
machine learning pipelines. A lot of this functionality is difficult to expose 
using the simple formula-based API in SparkR.

We propose a new R package, _SparkML_, to be distributed along with SparkR as 
part of Apache Spark. This new package will be built on top of SparkR’s APIs to 
expose SparkML’s pipeline APIs and functionality.

*Why not SparkR?*

SparkR package contains ~300 functions. Many of these shadow functions in base 
and other popular CRAN packages. We think adding more functions to SparkR will 
degrade usability and make maintenance harder.

*Why not sparklyr?*

sparklyr is an R package developed by RStudio Inc. to expose Spark API to R 
users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
they are not comprehensive. Also we propose a code-gen approach for this 
package to minimize work needed to expose future MLlib API, but sparklyr’s API 
is manually written.
h1. Target Personas
 * Existing SparkR users who need more flexible SparkML API
 * R users (data scientists, statisticians) who wish to build Spark ML 
pipelines in R

h1. Goals
 * R users can install SparkML from CRAN
 * R users will be able to import SparkML independent from SparkR
 * After setting up a Spark session R users can
 ** create a pipeline by chaining individual components and specifying their 
parameters
 ** tune a pipeline in parallel, taking advantage of Spark
 ** inspect a pipeline’s parameters and evaluation metrics
 ** repeatedly apply a pipeline
 * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
Transformers

h1. Non-Goals
 * Adding new algorithms to SparkML R package which do not exist in Scala
 * Parallelizing existing CRAN packages
 * Changing existing SparkR ML wrapping API

h1. Proposed API Changes
h2. Design goals

When encountering trade-offs in API, we will choose based on the following list 
of priorities. The API choice that addresses a higher priority goal will be 
chosen.
 # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
of future ML algorithms difficult will be ruled out.

 * *Semantic clarity*: We attempt to minimize confusion with other packages. 
Between conciseness and clarity, we will choose clarity.

 * *Maintainability and testability:* API choices that require manual 
maintenance or make testing difficult should be avoided.

 * *Interoperability with rest of Spark components:* We will keep the R API as 
thin as possible and keep all functionality implementation in JVM/Scala.

 * *Being natural to R users:* Ultimate users of this package are R users and 
they should find it easy and natural to use.

The API will follow familiar R function syntax, where the object is passed as 
the first argument of the method:  do_something(obj, arg1, arg2). All functions 
are snake_case (e.g., {{spark_logistic_regression()}} and {{set_max_iter()}}). 
If a constructor gets arguments, they will be named arguments. For example:
{code:java}
> lr <- set_reg_param(set_max_iter(spark_logistic_regression(), 10), 0.1){code}
When calls need to be chained, as in the example above, the syntax translates 
nicely to a natural pipeline style with help from the very popular [magrittr 
package|https://cran.r-project.org/web/packages/magrittr/index.html]. For 
example:
{code:java}
> logistic_regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> lr{code}
h2. Namespace

All new API will be under a new CRAN package, named SparkML. The package should 
be usable without needing SparkR in the namespace. The package will introduce a 
number of S4 classes that inherit from four basic classes. Here we will list 
the basic types with a few examples. An object of any child class can be 
instantiated with a function call that starts with {{spark_}}.
h2. Pipeline & PipelineStage

A pipeline object contains one or more stages.  
{code:java}
> pipeline <- spark_pipeline() %>% set_stages(stage1, stage2, stage3){code}
Where stage1, stage2, etc. are S4 objects of type PipelineStage, and pipeline is 
an object of type Pipeline.
h2. Transformers

A Transformer is an algorithm that can transform one SparkDataFrame into 
another SparkDataFrame.

*Example API:*
{code:java}
> tokenizer <- spark_tokenizer() %>%
            set_input_col("text") %>%
            set_output_col("words")

> tokenized.df <- tokenizer %>% transform(df) {code}

[jira] [Commented] (SPARK-23455) Default Params in ML should be saved separately

2018-05-25 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16491354#comment-16491354
 ] 

Joseph K. Bradley commented on SPARK-23455:
---

Yep, thanks [~viirya] for answering!  It will affect R only if we add ways for 
people to write custom ML Transformers and Estimators in R.

> Default Params in ML should be saved separately
> ---
>
> Key: SPARK-23455
> URL: https://issues.apache.org/jira/browse/SPARK-23455
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.0
>
>
> We save ML's user-supplied params and default params as one entity in JSON. 
> When loading the saved models, we set all the loaded params on the created ML 
> model instances as user-supplied params.
> This causes problems: e.g., if we strictly disallow some params to be set 
> at the same time, a default param can fail the param check because it is 
> treated as a user-supplied param after loading.
> The loaded default params should not be set as user-supplied params. We 
> should save ML default params separately in JSON.
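To make the problem concrete, a small Scala sketch; the JSON fragments in the comments only illustrate the idea (one shared map versus a separate default map), not the exact on-disk metadata format:

{code:scala}
import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression().setMaxIter(5)  // user-supplied: maxIter; regParam stays at its default

// Saved as one entity (conceptually): after loading, regParam looks user-supplied
// even though the user never set it:
//   {"paramMap": {"maxIter": 5, "regParam": 0.0, ...}}
//
// Saved separately (conceptually): defaults keep their own section, so loading can
// restore them as defaults rather than as explicitly set values:
//   {"paramMap": {"maxIter": 5}, "defaultParamMap": {"regParam": 0.0, ...}}

println(lr.isSet(lr.regParam))         // false: regParam is only a default
println(lr.getOrDefault(lr.regParam))  // 0.0
{code}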






[jira] [Updated] (SPARK-24300) generateLDAData in ml.cluster.LDASuite didn't set seed correctly

2018-05-25 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-24300:
--
Shepherd: Joseph K. Bradley

> generateLDAData in ml.cluster.LDASuite didn't set seed correctly
> 
>
> Key: SPARK-24300
> URL: https://issues.apache.org/jira/browse/SPARK-24300
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Xiangrui Meng
>Assignee: Lu Wang
>Priority: Minor
>
> [https://github.com/apache/spark/blob/0d63ebd17df747fb41d7ba254718bb7af3ae/mllib/src/test/scala/org/apache/spark/ml/clustering/LDASuite.scala]
>  
> generateLDAData uses the same RNG in all partitions to generate random data. 
> This either causes duplicate rows in cluster mode or nondeterministic behavior 
> in local mode:
> {code:java}
> scala> val rng = new java.util.Random(10)
> rng: java.util.Random = java.util.Random@78c5ef58
> scala> sc.parallelize(1 to 10).map { i => Seq.fill(10)(rng.nextInt(10)) 
> }.collect().mkString("\n")
> res12: String =
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 9, 1, 8, 5, 0, 6, 3, 3, 8)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 0, 3, 0, 6, 6, 7, 8, 1, 4)
> List(3, 9, 1, 8, 5, 0, 6, 3, 3, 8){code}
> We should create one RNG per partition to make it safe.
>  
> cc: [~lu.DB] [~josephkb]
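A minimal Scala sketch of the suggested fix, seeding one RNG per partition (the partition index folded into the seed) so that partitions produce different rows while the job stays reproducible:

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[4]").appName("per-partition-rng").getOrCreate()
val sc = spark.sparkContext

val seed = 10L
val data = sc.parallelize(1 to 10, numSlices = 4).mapPartitionsWithIndex { (pid, iter) =>
  // One RNG per partition, derived from the global seed and the partition index,
  // instead of a single driver-side RNG captured in the closure.
  val rng = new java.util.Random(seed + pid)
  iter.map(_ => Seq.fill(10)(rng.nextInt(10)))
}
data.collect().foreach(println)
{code}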






[jira] [Created] (SPARK-24333) Add fit with validation set to spark.ml GBT: Python API

2018-05-21 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-24333:
-

 Summary: Add fit with validation set to spark.ml GBT: Python API
 Key: SPARK-24333
 URL: https://issues.apache.org/jira/browse/SPARK-24333
 Project: Spark
  Issue Type: New Feature
  Components: ML, PySpark
Affects Versions: 2.4.0
Reporter: Joseph K. Bradley


Python version of API added by [SPARK-7132]






[jira] [Resolved] (SPARK-7132) Add fit with validation set to spark.ml GBT

2018-05-21 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-7132.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21129
[https://github.com/apache/spark/pull/21129]

> Add fit with validation set to spark.ml GBT
> ---
>
> Key: SPARK-7132
> URL: https://issues.apache.org/jira/browse/SPARK-7132
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
>Priority: Minor
> Fix For: 2.4.0
>
>
> In spark.mllib GradientBoostedTrees, we have a method runWithValidation which 
> takes a validation set.  We should add that to the spark.ml API.
> This will require a bit of thinking about how the Pipelines API should handle 
> a validation set (since Transformers and Estimators only take 1 input 
> DataFrame).  The current plan is to include an extra column in the input 
> DataFrame which indicates whether the row is for training, validation, etc.
> Goals
> A  [P0] Support efficient validation during training
> B  [P1] Support early stopping based on validation metrics
> C  [P0] Ensure validation data are preprocessed identically to training data
> D  [P1] Support complex Pipelines with multiple models using validation data
> Proposal: column with indicator for train vs validation
> Include an extra column in the input DataFrame which indicates whether the 
> row is for training or validation.  Add a Param “validationFlagCol” used to 
> specify the extra column name.
> A, B, C are easy.
> D is doable.
> Each estimator would need to have its validationFlagCol Param set to the same 
> column.
> Complication: It would be ideal if we could prevent different estimators from 
> using different validation sets.  (Joseph: There is not an obvious way IMO.  
> Maybe we can address this later by, e.g., having Pipelines take a 
> validationFlagCol Param and pass that to the sub-models in the Pipeline.  
> Let’s not worry about this for now.)
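For illustration, a short Scala sketch of how a caller might prepare data for the proposed flag-column approach; setValidationFlagCol below follows the Param name proposed in this ticket and is hypothetical (the released API may differ):

{code:scala}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.rand

// Mark roughly 20% of rows as validation data inside the same DataFrame, so both
// splits flow through identical preprocessing (goal C above).
def withValidationFlag(df: DataFrame): DataFrame =
  df.withColumn("isValidation", rand(42L) < 0.2)

// With the proposed Param, a GBT estimator would be configured roughly as:
//   val gbt = new GBTClassifier()
//     .setLabelCol("label")
//     .setFeaturesCol("features")
//     .setValidationFlagCol("isValidation")  // hypothetical Param from this proposal
//   val model = gbt.fit(withValidationFlag(trainingData))
// and fit() would compute validation metrics after each iteration for early stopping.
{code}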






[jira] [Assigned] (SPARK-22884) ML test for StructuredStreaming: spark.ml.clustering

2018-05-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-22884:
-

Assignee: Sandor Murakozi

> ML test for StructuredStreaming: spark.ml.clustering
> 
>
> Key: SPARK-22884
> URL: https://issues.apache.org/jira/browse/SPARK-22884
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Sandor Murakozi
>Priority: Major
>
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843






  1   2   3   4   5   6   7   8   9   10   >