[jira] [Assigned] (SPARK-27051) Bump Jackson version to 2.9.8

2019-03-04 Thread Yanbo Liang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-27051:
---

Assignee: (was: Yanbo Liang)

> Bump Jackson version to 2.9.8
> -
>
> Key: SPARK-27051
> URL: https://issues.apache.org/jira/browse/SPARK-27051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yanbo Liang
>Priority: Major
>
> FasterXML Jackson versions before 2.9.8 are affected by multiple 
> [CVEs|https://github.com/FasterXML/jackson-databind/issues/2186], so we need 
> to bump the Jackson version to 2.9.8.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27051) Bump Jackson version to 2.9.8

2019-03-04 Thread Yanbo Liang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-27051:

Description: Fasterxml Jackson version before 2.9.8 is affected by multiple 
CVEs [[https://github.com/FasterXML/jackson-databind/issues/2186]], we need to 
fix bump the dependent Jackson to 2.9.8.  (was: Fasterxml Jackson version 
before 2.9.8 is affected by multiple CVEs 
[[https://github.com/FasterXML/jackson-databind/issues/2186]], we need to fix 
bump the dependent Jackson version to 2.9.8.)

> Bump Jackson version to 2.9.8
> -
>
> Key: SPARK-27051
> URL: https://issues.apache.org/jira/browse/SPARK-27051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yanbo Liang
>Priority: Major
>
> FasterXML Jackson versions before 2.9.8 are affected by multiple 
> [CVEs|https://github.com/FasterXML/jackson-databind/issues/2186], so we need to 
> bump the dependent Jackson version to 2.9.8.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27051) Bump Jackson version to 2.9.8

2019-03-04 Thread Yanbo Liang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-27051:

Description: Fasterxml Jackson version before 2.9.8 is affected by multiple 
CVEs [[https://github.com/FasterXML/jackson-databind/issues/2186]], we need to 
fix bump the dependent Jackson version to 2.9.8.  (was: Fasterxml Jackson 
version before 2.9.8 is affected by multiple [CVEs | 
[https://github.com/FasterXML/jackson-databind/issues/2186]], we need to fix 
bump the dependent Jackson version to 2.9.8.)

> Bump Jackson version to 2.9.8
> -
>
> Key: SPARK-27051
> URL: https://issues.apache.org/jira/browse/SPARK-27051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yanbo Liang
>Priority: Major
>
> FasterXML Jackson versions before 2.9.8 are affected by multiple 
> [CVEs|https://github.com/FasterXML/jackson-databind/issues/2186], so we need to 
> bump the dependent Jackson version to 2.9.8.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27051) Bump Jackson version to 2.9.8

2019-03-04 Thread Yanbo Liang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-27051:

Description: Fasterxml Jackson version before 2.9.8 is affected by multiple 
[CVEs|[https://github.com/FasterXML/jackson-databind/issues/2186],] we need to 
fix bump the dependent Jackson version to 2.9.8.  (was: Fasterxml Jackson 
version before 2.9.8 is affected by multiple 
[CVEs|[https://github.com/FasterXML/jackson-databind/issues/2186],] we need to 
fix bump the Jackson version to 2.9.8.)

> Bump Jackson version to 2.9.8
> -
>
> Key: SPARK-27051
> URL: https://issues.apache.org/jira/browse/SPARK-27051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yanbo Liang
>Priority: Major
>
> FasterXML Jackson versions before 2.9.8 are affected by multiple 
> [CVEs|https://github.com/FasterXML/jackson-databind/issues/2186], so we need 
> to bump the dependent Jackson version to 2.9.8.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27051) Bump Jackson version to 2.9.8

2019-03-04 Thread Yanbo Liang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-27051:

Description: Fasterxml Jackson version before 2.9.8 is affected by multiple 
[CVEs | [https://github.com/FasterXML/jackson-databind/issues/2186]], we need 
to fix bump the dependent Jackson version to 2.9.8.  (was: Fasterxml Jackson 
version before 2.9.8 is affected by multiple 
[CVEs|[https://github.com/FasterXML/jackson-databind/issues/2186]], we need to 
fix bump the dependent Jackson version to 2.9.8.)

> Bump Jackson version to 2.9.8
> -
>
> Key: SPARK-27051
> URL: https://issues.apache.org/jira/browse/SPARK-27051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yanbo Liang
>Priority: Major
>
> FasterXML Jackson versions before 2.9.8 are affected by multiple 
> [CVEs|https://github.com/FasterXML/jackson-databind/issues/2186], so we need to 
> bump the dependent Jackson version to 2.9.8.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-27051) Bump Jackson version to 2.9.8

2019-03-04 Thread Yanbo Liang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-27051:

Description: Fasterxml Jackson version before 2.9.8 is affected by multiple 
[CVEs|[https://github.com/FasterXML/jackson-databind/issues/2186]], we need to 
fix bump the dependent Jackson version to 2.9.8.  (was: Fasterxml Jackson 
version before 2.9.8 is affected by multiple 
[CVEs|[https://github.com/FasterXML/jackson-databind/issues/2186],] we need to 
fix bump the dependent Jackson version to 2.9.8.)

> Bump Jackson version to 2.9.8
> -
>
> Key: SPARK-27051
> URL: https://issues.apache.org/jira/browse/SPARK-27051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yanbo Liang
>Priority: Major
>
> FasterXML Jackson versions before 2.9.8 are affected by multiple 
> [CVEs|https://github.com/FasterXML/jackson-databind/issues/2186], so we need 
> to bump the dependent Jackson version to 2.9.8.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-27051) Bump Jackson version to 2.9.8

2019-03-04 Thread Yanbo Liang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-27051:
---

Assignee: Yanbo Liang

> Bump Jackson version to 2.9.8
> -
>
> Key: SPARK-27051
> URL: https://issues.apache.org/jira/browse/SPARK-27051
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.0.0
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Major
>
> FasterXML Jackson versions before 2.9.8 are affected by multiple 
> [CVEs|https://github.com/FasterXML/jackson-databind/issues/2186], so we need 
> to bump the Jackson version to 2.9.8.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-27051) Bump Jackson version to 2.9.8

2019-03-04 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-27051:
---

 Summary: Bump Jackson version to 2.9.8
 Key: SPARK-27051
 URL: https://issues.apache.org/jira/browse/SPARK-27051
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.0.0
Reporter: Yanbo Liang


FasterXML Jackson versions before 2.9.8 are affected by multiple 
[CVEs|https://github.com/FasterXML/jackson-databind/issues/2186], so we need to 
bump the Jackson version to 2.9.8.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23291) SparkR : substr : In SparkR dataframe , starting and ending position arguments in "substr" is giving wrong result when the position is greater than 1

2018-05-07 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-23291:

Fix Version/s: 2.3.1

> SparkR : substr : In SparkR dataframe , starting and ending position 
> arguments in "substr" is giving wrong result  when the position is greater 
> than 1
> --
>
> Key: SPARK-23291
> URL: https://issues.apache.org/jira/browse/SPARK-23291
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.2, 2.2.0, 2.2.1, 2.3.0
>Reporter: Narendra
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.3.1, 2.4.0
>
>
> Defect Description:
> -
> For example, an input string "2017-12-01" is read into a SparkR dataframe 
> "df" with column name "col1".
>  The target is to create a new column named "col2" with the value "12", which 
> is inside the string. "12" can be extracted with "starting position" 6 and 
> "ending position" 7
>  (the starting position of the first character is considered to be "1").
> But currently, the code that needs to be written is:
>  
>  df <- withColumn(df, "col2", substr(df$col1, 7, 8))
> Observe that the first argument of the "substr" API, which indicates the 
> 'starting position', is given as "7".
>  Also, observe that the second argument of the "substr" API, which indicates 
> the 'ending position', is given as "8".
> I.e., the position that has to be supplied is the "actual position + 1".
> Expected behavior:
> 
> The code that needs to be written should be:
>  
>  df <- withColumn(df, "col2", substr(df$col1, 6, 7))
> Note:
> ---
>  This defect is observed only when the starting position is greater than 
> 1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23291) SparkR : substr : In SparkR dataframe , starting and ending position arguments in "substr" is giving wrong result when the position is greater than 1

2018-05-04 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16464391#comment-16464391
 ] 

Yanbo Liang commented on SPARK-23291:
-

This should be backported to Spark 2.3, as this is a bug fix and we can't wait 
several months for the next release. [~hyukjin.kwon] Would you like to send a PR? 
Thanks.

> SparkR : substr : In SparkR dataframe , starting and ending position 
> arguments in "substr" is giving wrong result  when the position is greater 
> than 1
> --
>
> Key: SPARK-23291
> URL: https://issues.apache.org/jira/browse/SPARK-23291
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.1.2, 2.2.0, 2.2.1, 2.3.0
>Reporter: Narendra
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.0
>
>
> Defect Description:
> -
> For example, an input string "2017-12-01" is read into a SparkR dataframe 
> "df" with column name "col1".
>  The target is to create a new column named "col2" with the value "12", which 
> is inside the string. "12" can be extracted with "starting position" 6 and 
> "ending position" 7
>  (the starting position of the first character is considered to be "1").
> But currently, the code that needs to be written is:
>  
>  df <- withColumn(df, "col2", substr(df$col1, 7, 8))
> Observe that the first argument of the "substr" API, which indicates the 
> 'starting position', is given as "7".
>  Also, observe that the second argument of the "substr" API, which indicates 
> the 'ending position', is given as "8".
> I.e., the position that has to be supplied is the "actual position + 1".
> Expected behavior:
> 
> The code that needs to be written should be:
>  
>  df <- withColumn(df, "col2", substr(df$col1, 6, 7))
> Note:
> ---
>  This defect is observed only when the starting position is greater than 
> 1.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2018-04-03 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-15784:

Shepherd: Joseph K. Bradley  (was: Yanbo Liang)

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>Assignee: Miao Wang
>Priority: Major
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2018-04-03 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16424609#comment-16424609
 ] 

Yanbo Liang commented on SPARK-15784:
-

[~josephkb] Please take this over; I have been very busy recently and don't have 
time to shepherd it. Thanks very much.

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>Assignee: Miao Wang
>Priority: Major
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23107) ML, Graph 2.3 QA: API: New Scala APIs, docs

2018-01-31 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16347658#comment-16347658
 ] 

Yanbo Liang commented on SPARK-23107:
-

[~sameerag] Yes, the fix should be an API scope change, but I think we can get 
them merged in one or two days. When do you plan to cut the next RC? Thanks.

> ML, Graph 2.3 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-23107
> URL: https://issues.apache.org/jira/browse/SPARK-23107
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX. Take note of:
>  * Protected/public classes or methods. If access can be more private, then 
> it should be.
>  * Also look for non-sealed traits.
>  * Documentation: Missing? Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue. 
> For *user guide issues* link the new JIRAs to the relevant user guide QA 
> issue (SPARK-23111 for {{2.3}})



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23107) ML, Graph 2.3 QA: API: New Scala APIs, docs

2018-01-31 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16347386#comment-16347386
 ] 

Yanbo Liang commented on SPARK-23107:
-

[~mlnick] Sorry for the late response; I have been really busy recently. This task 
is almost finished, and I will submit the PR today or tomorrow. Thanks.

> ML, Graph 2.3 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-23107
> URL: https://issues.apache.org/jira/browse/SPARK-23107
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX. Take note of:
>  * Protected/public classes or methods. If access can be more private, then 
> it should be.
>  * Also look for non-sealed traits.
>  * Documentation: Missing? Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue. 
> For *user guide issues* link the new JIRAs to the relevant user guide QA 
> issue (SPARK-23111 for {{2.3}})



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23110) ML 2.3 QA: API: Java compatibility, docs

2018-01-31 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16347381#comment-16347381
 ] 

Yanbo Liang commented on SPARK-23110:
-

[~mlnick] Yes, we should make it {{private[ml]}}. I can fix it in SPARK-23107; 
thanks for picking it up.

> ML 2.3 QA: API: Java compatibility, docs
> 
>
> Key: SPARK-23110
> URL: https://issues.apache.org/jira/browse/SPARK-23110
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, Java API, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
>Priority: Blocker
> Attachments: 1_process_script.sh, added_ml_class, 
> different_methods_in_ML.diff
>
>
> Check Java compatibility for this release:
> * APIs in {{spark.ml}}
> * New APIs in {{spark.mllib}} (There should be few, if any.)
> Checking compatibility means:
> * Checking for differences in how Scala and Java handle types. Some items to 
> look out for are:
> ** Check for generic "Object" types where Java cannot understand complex 
> Scala types.
> *** *Note*: The Java docs do not always match the bytecode. If you find a 
> problem, please verify it using {{javap}}.
> ** Check Scala objects (especially with nesting!) carefully.  These may not 
> be understood in Java, or they may be accessible only via the weirdly named 
> Java types (with "$" or "#") which are generated by the Scala compiler.
> ** Check for uses of Scala and Java enumerations, which can show up oddly in 
> the other language's doc.  (In {{spark.ml}}, we have largely tried to avoid 
> using enumerations, and have instead favored plain strings.)
> * Check for differences in generated Scala vs Java docs.  E.g., one past 
> issue was that Javadocs did not respect Scala's package private modifier.
> If you find issues, please comment here, or for larger items, create separate 
> JIRAs and link here as "requires".
> * Remember that we should not break APIs from previous releases.  If you find 
> a problem, check if it was introduced in this Spark release (in which case we 
> can fix it) or in a previous one (in which case we can create a java-friendly 
> version of the API).
> * If needed for complex issues, create small Java unit tests which execute 
> each method.  (Algorithmic correctness can be checked in Scala.)
> Recommendations for how to complete this task:
> * There are not great tools.  In the past, this task has been done by:
> ** Generating API docs
> ** Building JAR and outputting the Java class signatures for MLlib
> ** Manually inspecting and searching the docs and class signatures for issues
> * If you do have ideas for better tooling, please say so, so that we can make 
> this task easier in the future!
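
A rough sketch of the "outputting the Java class signatures" recommendation above, 
assuming the JDK's {{javap}} tool is on the PATH; the jar path and class list below 
are hypothetical placeholders, not part of any existing tooling:

{code}
# Dump the public Java signatures of selected MLlib classes with javap so they
# can be diffed against the previous release or inspected by hand.
import subprocess

mllib_jar = "assembly/target/scala-2.11/jars/spark-mllib_2.11-2.3.0.jar"  # hypothetical path
classes = ["org.apache.spark.ml.classification.LogisticRegression"]       # classes to audit

for cls in classes:
    result = subprocess.run(
        ["javap", "-classpath", mllib_jar, cls],
        capture_output=True, text=True, check=True)
    print(result.stdout)
{code}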



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23154) Document backwards compatibility guarantees for ML persistence

2018-01-22 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23154?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16334757#comment-16334757
 ] 

Yanbo Liang commented on SPARK-23154:
-

Sounds good! It should be helpful to document backwards compatibility. Furthermore, 
I think we can write some tools to test backwards compatibility for ML persistence 
during release QA, just like the performance regression tests. Thanks.
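
A minimal sketch of what such a tool could check, assuming a PipelineModel saved by 
an earlier Spark release is available at a hypothetical path and that the loaded 
model is expected to still produce a "prediction" column; this is an illustration, 
not the actual QA tooling:

{code}
# Load a model saved by an older Spark release with the current release and
# verify that it can still transform a small fixture dataset.
from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel

spark = SparkSession.builder.appName("ml-persistence-compat-check").getOrCreate()

old_model_path = "/tmp/compat/pipeline_model_saved_by_spark_2_2"  # hypothetical path
fixture_path = "/tmp/compat/fixture_data.parquet"                 # hypothetical fixture

model = PipelineModel.load(old_model_path)      # should succeed across minor/patch versions
result = model.transform(spark.read.parquet(fixture_path))
assert "prediction" in result.columns, "loaded model no longer produces predictions"
{code}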

> Document backwards compatibility guarantees for ML persistence
> --
>
> Key: SPARK-23154
> URL: https://issues.apache.org/jira/browse/SPARK-23154
> Project: Spark
>  Issue Type: Documentation
>  Components: Documentation, ML
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
>
> We have (as far as I know) maintained backwards compatibility for ML 
> persistence, but this is not documented anywhere.  I'd like us to document it 
> (for spark.ml, not for spark.mllib).
> I'd recommend something like:
> {quote}
> In general, MLlib maintains backwards compatibility for ML persistence.  
> I.e., if you save an ML model or Pipeline in one version of Spark, then you 
> should be able to load it back and use it in a future version of Spark.  
> However, there are rare exceptions, described below.
> Model persistence: Is a model or Pipeline saved using Apache Spark ML 
> persistence in Spark version X loadable by Spark version Y?
> * Major versions: No guarantees, but best-effort.
> * Minor and patch versions: Yes; these are backwards compatible.
> * Note about the format: There are no guarantees for a stable persistence 
> format, but model loading itself is designed to be backwards compatible.
> Model behavior: Does a model or Pipeline in Spark version X behave 
> identically in Spark version Y?
> * Major versions: No guarantees, but best-effort.
> * Minor and patch versions: Identical behavior, except for bug fixes.
> For both model persistence and model behavior, any breaking changes across a 
> minor version or patch version are reported in the Spark version release 
> notes. If a breakage is not reported in release notes, then it should be 
> treated as a bug to be fixed.
> {quote}
> How does this sound?
> Note: We unfortunately don't have tests for backwards compatibility (which 
> has technical hurdles and can be discussed in [SPARK-15573]).  However, we 
> have made efforts to maintain it during PR review and Spark release QA, and 
> most users expect it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23107) ML, Graph 2.3 QA: API: New Scala APIs, docs

2018-01-16 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16328253#comment-16328253
 ] 

Yanbo Liang commented on SPARK-23107:
-

I'm going to work on this, assigned to myself. Thanks.

> ML, Graph 2.3 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-23107
> URL: https://issues.apache.org/jira/browse/SPARK-23107
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX.  Take note of:
> * Protected/public classes or methods.  If access can be more private, then 
> it should be.
> * Also look for non-sealed traits.
> * Documentation: Missing?  Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23107) ML, Graph 2.3 QA: API: New Scala APIs, docs

2018-01-16 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-23107:
---

Assignee: Yanbo Liang

> ML, Graph 2.3 QA: API: New Scala APIs, docs
> ---
>
> Key: SPARK-23107
> URL: https://issues.apache.org/jira/browse/SPARK-23107
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>Priority: Blocker
>
> Audit new public Scala APIs added to MLlib & GraphX.  Take note of:
> * Protected/public classes or methods.  If access can be more private, then 
> it should be.
> * Also look for non-sealed traits.
> * Documentation: Missing?  Bad links or formatting?
> *Make sure to check the object doc!*
> As you find issues, please create JIRAs and link them to this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22810) PySpark supports LinearRegression with huber loss

2017-12-20 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-22810.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> PySpark supports LinearRegression with huber loss
> -
>
> Key: SPARK-22810
> URL: https://issues.apache.org/jira/browse/SPARK-22810
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
> Fix For: 2.3.0
>
>
> Expose Python API for LinearRegression with huber loss.
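
A minimal sketch of the Python API exposed by this ticket, assuming the Spark 2.3.0 
parameter names {{loss}} and {{epsilon}} on {{pyspark.ml.regression.LinearRegression}}; 
the tiny dataset is illustrative only:

{code}
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("huber-linear-regression").getOrCreate()
df = spark.createDataFrame(
    [(1.0, Vectors.dense(0.0)), (2.0, Vectors.dense(1.0)), (3.5, Vectors.dense(2.0))],
    ["label", "features"])

# loss="huber" selects the robust Huber objective; epsilon controls where the
# loss switches from quadratic to linear.
lr = LinearRegression(loss="huber", epsilon=1.35, maxIter=10)
model = lr.fit(df)
print(model.coefficients, model.intercept)
{code}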



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22810) PySpark supports LinearRegression with huber loss

2017-12-15 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22810?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-22810:
---

Assignee: Yanbo Liang

> PySpark supports LinearRegression with huber loss
> -
>
> Key: SPARK-22810
> URL: https://issues.apache.org/jira/browse/SPARK-22810
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> Expose Python API for LinearRegression with huber loss.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22810) PySpark supports LinearRegression with huber loss

2017-12-15 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-22810:
---

 Summary: PySpark supports LinearRegression with huber loss
 Key: SPARK-22810
 URL: https://issues.apache.org/jira/browse/SPARK-22810
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 2.2.0
Reporter: Yanbo Liang


Expose Python API for LinearRegression with huber loss.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-3181) Add Robust Regression Algorithm with Huber Estimator

2017-12-13 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-3181.

   Resolution: Fixed
Fix Version/s: 2.3.0

> Add Robust Regression Algorithm with Huber Estimator
> 
>
> Key: SPARK-3181
> URL: https://issues.apache.org/jira/browse/SPARK-3181
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Fan Jiang
>Assignee: Yanbo Liang
>  Labels: features
> Fix For: 2.3.0
>
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Linear least squares estimates assume the errors have a normal distribution and 
> can behave badly when the errors are heavy-tailed. In practice we get 
> various types of data, so we need to include robust regression to employ a 
> fitting criterion that is not as vulnerable as least squares.
> In 1973, Huber introduced M-estimation for regression, which stands for 
> "maximum likelihood type". The method is resistant to outliers in the 
> response variable and has been widely used.
> The new feature for MLlib will contain 3 new files
> /main/scala/org/apache/spark/mllib/regression/RobustRegression.scala
> /test/scala/org/apache/spark/mllib/regression/RobustRegressionSuite.scala
> /main/scala/org/apache/spark/examples/mllib/HuberRobustRegression.scala
> and one new class HuberRobustGradient in 
> /main/scala/org/apache/spark/mllib/optimization/Gradient.scala



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-22289) Cannot save LogisticRegressionClassificationModel with bounds on coefficients

2017-12-12 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-22289.
-
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.2

> Cannot save LogisticRegressionClassificationModel with bounds on coefficients
> -
>
> Key: SPARK-22289
> URL: https://issues.apache.org/jira/browse/SPARK-22289
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Nic Eggert
>Assignee: yuhao yang
> Fix For: 2.2.2, 2.3.0
>
>
> I think this was introduced in SPARK-20047.
> Trying to call save on a logistic regression model trained with bounds on its 
> parameters throws an error. This seems to be because Spark doesn't know how 
> to serialize the Matrix parameter.
> Model is set up like this:
> {code}
> val calibrator = new LogisticRegression()
>   .setFeaturesCol("uncalibrated_probability")
>   .setLabelCol("label")
>   .setWeightCol("weight")
>   .setStandardization(false)
>   .setLowerBoundsOnCoefficients(new DenseMatrix(1, 1, Array(0.0)))
>   .setFamily("binomial")
>   .setProbabilityCol("probability")
>   .setPredictionCol("logistic_prediction")
>   .setRawPredictionCol("logistic_raw_prediction")
> {code}
> {code}
> 17/10/16 15:36:59 ERROR ApplicationMaster: User class threw exception: 
> scala.NotImplementedError: The default jsonEncode only supports string and 
> vector. org.apache.spark.ml.param.Param must override jsonEncode for 
> org.apache.spark.ml.linalg.DenseMatrix.
> scala.NotImplementedError: The default jsonEncode only supports string and 
> vector. org.apache.spark.ml.param.Param must override jsonEncode for 
> org.apache.spark.ml.linalg.DenseMatrix.
>   at org.apache.spark.ml.param.Param.jsonEncode(params.scala:98)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:296)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:295)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$.getMetadataToSave(ReadWrite.scala:295)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$.saveMetadata(ReadWrite.scala:277)
>   at 
> org.apache.spark.ml.classification.LogisticRegressionModel$LogisticRegressionModelWriter.saveImpl(LogisticRegression.scala:1182)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:254)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:253)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$.saveImpl(Pipeline.scala:253)
>   at 
> org.apache.spark.ml.PipelineModel$PipelineModelWriter.saveImpl(Pipeline.scala:337)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   -snip-
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22289) Cannot save LogisticRegressionClassificationModel with bounds on coefficients

2017-11-06 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-22289:

Shepherd: Yanbo Liang

> Cannot save LogisticRegressionClassificationModel with bounds on coefficients
> -
>
> Key: SPARK-22289
> URL: https://issues.apache.org/jira/browse/SPARK-22289
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Nic Eggert
>Assignee: yuhao yang
>
> I think this was introduced in SPARK-20047.
> Trying to call save on a logistic regression model trained with bounds on its 
> parameters throws an error. This seems to be because Spark doesn't know how 
> to serialize the Matrix parameter.
> Model is set up like this:
> {code}
> val calibrator = new LogisticRegression()
>   .setFeaturesCol("uncalibrated_probability")
>   .setLabelCol("label")
>   .setWeightCol("weight")
>   .setStandardization(false)
>   .setLowerBoundsOnCoefficients(new DenseMatrix(1, 1, Array(0.0)))
>   .setFamily("binomial")
>   .setProbabilityCol("probability")
>   .setPredictionCol("logistic_prediction")
>   .setRawPredictionCol("logistic_raw_prediction")
> {code}
> {code}
> 17/10/16 15:36:59 ERROR ApplicationMaster: User class threw exception: 
> scala.NotImplementedError: The default jsonEncode only supports string and 
> vector. org.apache.spark.ml.param.Param must override jsonEncode for 
> org.apache.spark.ml.linalg.DenseMatrix.
> scala.NotImplementedError: The default jsonEncode only supports string and 
> vector. org.apache.spark.ml.param.Param must override jsonEncode for 
> org.apache.spark.ml.linalg.DenseMatrix.
>   at org.apache.spark.ml.param.Param.jsonEncode(params.scala:98)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:296)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:295)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$.getMetadataToSave(ReadWrite.scala:295)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$.saveMetadata(ReadWrite.scala:277)
>   at 
> org.apache.spark.ml.classification.LogisticRegressionModel$LogisticRegressionModelWriter.saveImpl(LogisticRegression.scala:1182)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:254)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:253)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$.saveImpl(Pipeline.scala:253)
>   at 
> org.apache.spark.ml.PipelineModel$PipelineModelWriter.saveImpl(Pipeline.scala:337)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   -snip-
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22289) Cannot save LogisticRegressionClassificationModel with bounds on coefficients

2017-11-06 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-22289:
---

Assignee: yuhao yang

> Cannot save LogisticRegressionClassificationModel with bounds on coefficients
> -
>
> Key: SPARK-22289
> URL: https://issues.apache.org/jira/browse/SPARK-22289
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Nic Eggert
>Assignee: yuhao yang
>
> I think this was introduced in SPARK-20047.
> Trying to call save on a logistic regression model trained with bounds on its 
> parameters throws an error. This seems to be because Spark doesn't know how 
> to serialize the Matrix parameter.
> Model is set up like this:
> {code}
> val calibrator = new LogisticRegression()
>   .setFeaturesCol("uncalibrated_probability")
>   .setLabelCol("label")
>   .setWeightCol("weight")
>   .setStandardization(false)
>   .setLowerBoundsOnCoefficients(new DenseMatrix(1, 1, Array(0.0)))
>   .setFamily("binomial")
>   .setProbabilityCol("probability")
>   .setPredictionCol("logistic_prediction")
>   .setRawPredictionCol("logistic_raw_prediction")
> {code}
> {code}
> 17/10/16 15:36:59 ERROR ApplicationMaster: User class threw exception: 
> scala.NotImplementedError: The default jsonEncode only supports string and 
> vector. org.apache.spark.ml.param.Param must override jsonEncode for 
> org.apache.spark.ml.linalg.DenseMatrix.
> scala.NotImplementedError: The default jsonEncode only supports string and 
> vector. org.apache.spark.ml.param.Param must override jsonEncode for 
> org.apache.spark.ml.linalg.DenseMatrix.
>   at org.apache.spark.ml.param.Param.jsonEncode(params.scala:98)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:296)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:295)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$.getMetadataToSave(ReadWrite.scala:295)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$.saveMetadata(ReadWrite.scala:277)
>   at 
> org.apache.spark.ml.classification.LogisticRegressionModel$LogisticRegressionModelWriter.saveImpl(LogisticRegression.scala:1182)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:254)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:253)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$.saveImpl(Pipeline.scala:253)
>   at 
> org.apache.spark.ml.PipelineModel$PipelineModelWriter.saveImpl(Pipeline.scala:337)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   -snip-
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22289) Cannot save LogisticRegressionClassificationModel with bounds on coefficients

2017-10-17 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16208601#comment-16208601
 ] 

Yanbo Liang commented on SPARK-22289:
-

+1 for option 2. Please feel free to send a PR. Thanks.

> Cannot save LogisticRegressionClassificationModel with bounds on coefficients
> -
>
> Key: SPARK-22289
> URL: https://issues.apache.org/jira/browse/SPARK-22289
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Nic Eggert
>
> I think this was introduced in SPARK-20047.
> Trying to call save on a logistic regression model trained with bounds on its 
> parameters throws an error. This seems to be because Spark doesn't know how 
> to serialize the Matrix parameter.
> Model is set up like this:
> {code}
> val calibrator = new LogisticRegression()
>   .setFeaturesCol("uncalibrated_probability")
>   .setLabelCol("label")
>   .setWeightCol("weight")
>   .setStandardization(false)
>   .setLowerBoundsOnCoefficients(new DenseMatrix(1, 1, Array(0.0)))
>   .setFamily("binomial")
>   .setProbabilityCol("probability")
>   .setPredictionCol("logistic_prediction")
>   .setRawPredictionCol("logistic_raw_prediction")
> {code}
> {code}
> 17/10/16 15:36:59 ERROR ApplicationMaster: User class threw exception: 
> scala.NotImplementedError: The default jsonEncode only supports string and 
> vector. org.apache.spark.ml.param.Param must override jsonEncode for 
> org.apache.spark.ml.linalg.DenseMatrix.
> scala.NotImplementedError: The default jsonEncode only supports string and 
> vector. org.apache.spark.ml.param.Param must override jsonEncode for 
> org.apache.spark.ml.linalg.DenseMatrix.
>   at org.apache.spark.ml.param.Param.jsonEncode(params.scala:98)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:296)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:295)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$.getMetadataToSave(ReadWrite.scala:295)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$.saveMetadata(ReadWrite.scala:277)
>   at 
> org.apache.spark.ml.classification.LogisticRegressionModel$LogisticRegressionModelWriter.saveImpl(LogisticRegression.scala:1182)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:254)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:253)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$.saveImpl(Pipeline.scala:253)
>   at 
> org.apache.spark.ml.PipelineModel$PipelineModelWriter.saveImpl(Pipeline.scala:337)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   -snip-
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21981) Python API for ClusteringEvaluator

2017-09-21 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-21981:
---

Assignee: Marco Gaido

> Python API for ClusteringEvaluator
> --
>
> Key: SPARK-21981
> URL: https://issues.apache.org/jira/browse/SPARK-21981
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>Assignee: Marco Gaido
> Fix For: 2.3.0
>
>
> We have implemented {{ClusteringEvaluator}} in SPARK-14516, we should expose 
> API for PySpark.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21981) Python API for ClusteringEvaluator

2017-09-21 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-21981.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Python API for ClusteringEvaluator
> --
>
> Key: SPARK-21981
> URL: https://issues.apache.org/jira/browse/SPARK-21981
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
> Fix For: 2.3.0
>
>
> We have implemented {{ClusteringEvaluator}} in SPARK-14516, we should expose 
> API for PySpark.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21854) Python interface for MLOR summary

2017-09-13 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21854?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-21854.
-
   Resolution: Fixed
 Assignee: Ming Jiang
Fix Version/s: 2.3.0

> Python interface for MLOR summary
> -
>
> Key: SPARK-21854
> URL: https://issues.apache.org/jira/browse/SPARK-21854
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>Assignee: Ming Jiang
> Fix For: 2.3.0
>
>
> Python interface for MLOR summary



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21981) Python API for ClusteringEvaluator

2017-09-12 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16162759#comment-16162759
 ] 

Yanbo Liang commented on SPARK-21981:
-

[~mgaido] Would you like to work on this?

> Python API for ClusteringEvaluator
> --
>
> Key: SPARK-21981
> URL: https://issues.apache.org/jira/browse/SPARK-21981
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>
> We have implemented {{ClusteringEvaluator}} in SPARK-14516, we should expose 
> API for PySpark.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21981) Python API for ClusteringEvaluator

2017-09-12 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-21981:
---

 Summary: Python API for ClusteringEvaluator
 Key: SPARK-21981
 URL: https://issues.apache.org/jira/browse/SPARK-21981
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 2.2.0
Reporter: Yanbo Liang


We have implemented {{ClusteringEvaluator}} in SPARK-14516, we should expose 
API for PySpark.
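
A minimal sketch of the PySpark API being requested here, assuming the Spark 2.3.0 
{{pyspark.ml.evaluation.ClusteringEvaluator}} with its default silhouette metric; 
the toy data and KMeans settings are illustrative only:

{code}
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

spark = SparkSession.builder.appName("clustering-evaluator-example").getOrCreate()
df = spark.createDataFrame(
    [(Vectors.dense(0.0, 0.0),), (Vectors.dense(0.1, 0.1),),
     (Vectors.dense(9.0, 9.0),), (Vectors.dense(9.1, 9.1),)],
    ["features"])

predictions = KMeans(k=2, seed=1).fit(df).transform(df)
evaluator = ClusteringEvaluator()       # silhouette with squared Euclidean distance by default
print(evaluator.evaluate(predictions))  # values near 1.0 indicate well-separated clusters
{code}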



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-14516) Clustering evaluator

2017-09-12 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-14516.
-
  Resolution: Fixed
   Fix Version/s: 2.3.0
Target Version/s: 2.3.0

> Clustering evaluator
> 
>
> Key: SPARK-14516
> URL: https://issues.apache.org/jira/browse/SPARK-14516
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>Assignee: Marco Gaido
> Fix For: 2.3.0
>
>
> MLlib does not have any general-purpose clustering metrics with a ground 
> truth.
> In 
> [Scikit-Learn](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics),
>  there are several kinds of metrics for this.
> It may be meaningful to add some clustering metrics to MLlib.
> This should be added as a {{ClusteringEvaluator}} class extending 
> {{Evaluator}} in spark.ml.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21940) Support timezone for timestamps in SparkR

2017-09-11 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16160936#comment-16160936
 ] 

Yanbo Liang commented on SPARK-21940:
-

[~falaki] AFAIK, Spark SQL timestamps are normalized to UTC based on the available 
time zone information and stored as UTC. I think R's as.double(time) does the same 
as Spark SQL, so any data of timestamp type will be interpreted according to the 
timezone the user operates in. If users must interpret times in a specific 
timezone, they can set their local timezone with {{Sys.setenv(TZ='GMT')}}. 
However, if users would like to bind a timezone to a timestamp, I would recommend 
storing the timestamp as a string and using a UDF to operate on it. What do you 
think? Thanks.

> Support timezone for timestamps in SparkR
> -
>
> Key: SPARK-21940
> URL: https://issues.apache.org/jira/browse/SPARK-21940
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Hossein Falaki
>
> {{SparkR::createDataFrame()}} wipes timezone attribute from POSIXct and 
> POSIXlt. See following example:
> {code}
> > x <- data.frame(x = c(Sys.time()))
> > x
> x
> 1 2017-09-06 19:17:16
> > attr(x$x, "tzone") <- "Europe/Paris"
> > x
> x
> 1 2017-09-07 04:17:16
> > collect(createDataFrame(x))
> x
> 1 2017-09-06 19:17:16
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-21965) Add createOrReplaceGlobalTempView and dropGlobalTempView for SparkR

2017-09-10 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang closed SPARK-21965.
---
Resolution: Duplicate

> Add createOrReplaceGlobalTempView and dropGlobalTempView for SparkR
> ---
>
> Key: SPARK-21965
> URL: https://issues.apache.org/jira/browse/SPARK-21965
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> Add createOrReplaceGlobalTempView and dropGlobalTempView for SparkR.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21965) Add createOrReplaceGlobalTempView and dropGlobalTempView for SparkR

2017-09-09 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-21965:
---

Assignee: Yanbo Liang

> Add createOrReplaceGlobalTempView and dropGlobalTempView for SparkR
> ---
>
> Key: SPARK-21965
> URL: https://issues.apache.org/jira/browse/SPARK-21965
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>
> Add createOrReplaceGlobalTempView and dropGlobalTempView for SparkR.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21965) Add createOrReplaceGlobalTempView and dropGlobalTempView for SparkR

2017-09-09 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-21965:
---

 Summary: Add createOrReplaceGlobalTempView and dropGlobalTempView 
for SparkR
 Key: SPARK-21965
 URL: https://issues.apache.org/jira/browse/SPARK-21965
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Affects Versions: 2.2.0
Reporter: Yanbo Liang


Add createOrReplaceGlobalTempView and dropGlobalTempView for SparkR.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21919) inconsistent behavior of AFTsurvivalRegression algorithm

2017-09-07 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16157027#comment-16157027
 ] 

Yanbo Liang commented on SPARK-21919:
-

[~srowen] You are right, that is caused by a line search bug. The error log in 
2.2.0 can tell us what happened. Thanks for digging into it.

> inconsistent behavior of AFTsurvivalRegression algorithm
> 
>
> Key: SPARK-21919
> URL: https://issues.apache.org/jira/browse/SPARK-21919
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.2.0
> Environment: Spark Version: 2.2.0
> Cluster setup: Standalone single node
> Python version: 3.5.2
>Reporter: Ashish Chopra
>
> Took the direct example from spark ml documentation.
> {code}
> training = spark.createDataFrame([
> (1.218, 1.0, Vectors.dense(1.560, -0.605)),
> (2.949, 0.0, Vectors.dense(0.346, 2.158)),
> (3.627, 0.0, Vectors.dense(1.380, 0.231)),
> (0.273, 1.0, Vectors.dense(0.520, 1.151)),
> (4.199, 0.0, Vectors.dense(0.795, -0.226))], ["label", "censor", 
> "features"])
> quantileProbabilities = [0.3, 0.6]
> aft = AFTSurvivalRegression(quantileProbabilities=quantileProbabilities,
> quantilesCol="quantiles")
> #aft = AFTSurvivalRegression()
> model = aft.fit(training)
> 
> # Print the coefficients, intercept and scale parameter for AFT survival 
> regression
> print("Coefficients: " + str(model.coefficients))
> print("Intercept: " + str(model.intercept))
> print("Scale: " + str(model.scale))
> model.transform(training).show(truncate=False)
> {code}
> result is:
> Coefficients: [-0.496304411053,0.198452172529]
> Intercept: 2.6380898963056327
> Scale: 1.5472363533632303
> ||label||censor||features  ||prediction   || quantiles ||
> |1.218|1.0   |[1.56,-0.605] |5.718985621018951 | 
> [1.160322990805951,4.99546058340675]|
> |2.949|0.0   |[0.346,2.158] |18.07678210850554 
> |[3.66759199449632,15.789837303662042]|
> |3.627|0.0   |[1.38,0.231]  |7.381908879359964 
> |[1.4977129086101573,6.4480027195054905]|
> |0.273|1.0   |[0.52,1.151]  
> |13.577717814884505|[2.754778414791513,11.859962351993202]|
> |4.199|0.0   |[0.795,-0.226]|9.013087597344805 
> |[1.828662187733188,7.8728164067854856]|
> But if we change the value of all labels as label + 20. as:
> {code}
> training = spark.createDataFrame([
> (21.218, 1.0, Vectors.dense(1.560, -0.605)),
> (22.949, 0.0, Vectors.dense(0.346, 2.158)),
> (23.627, 0.0, Vectors.dense(1.380, 0.231)),
> (20.273, 1.0, Vectors.dense(0.520, 1.151)),
> (24.199, 0.0, Vectors.dense(0.795, -0.226))], ["label", "censor", 
> "features"])
> quantileProbabilities = [0.3, 0.6]
> aft = AFTSurvivalRegression(quantileProbabilities=quantileProbabilities,
>  quantilesCol="quantiles")
> #aft = AFTSurvivalRegression()
> model = aft.fit(training)
> 
> # Print the coefficients, intercept and scale parameter for AFT survival 
> regression
> print("Coefficients: " + str(model.coefficients))
> print("Intercept: " + str(model.intercept))
> print("Scale: " + str(model.scale))
> model.transform(training).show(truncate=False)
> {code}
> result changes to:
> Coefficients: [23.9932020748,3.18105314757]
> Intercept: 7.35052273751137
> Scale: 7698609960.724161
> ||label ||censor||features  ||prediction   ||quantiles||
> |21.218|1.0   |[1.56,-0.605] |4.0912442688237169E18|[0.0,0.0]|
> |22.949|0.0   |[0.346,2.158] |6.011158613411288E9  |[0.0,0.0]|
> |23.627|0.0   |[1.38,0.231]  |7.7835948690311181E17|[0.0,0.0]|
> |20.273|1.0   |[0.52,1.151]  |1.5880852723124176E10|[0.0,0.0]|
> |24.199|0.0   |[0.795,-0.226]|1.4590190884193677E11|[0.0,0.0]|
> Can someone please explain this exponential blow up in prediction, as per my 
> understanding prediction in AFT is a prediction of the time when the failure 
> event will occur, not able to understand why it will change exponentially 
> against the value of the label.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21919) inconsistent behavior of AFTsurvivalRegression algorithm

2017-09-07 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16156979#comment-16156979
 ] 

Yanbo Liang edited comment on SPARK-21919 at 9/7/17 2:02 PM:
-

[~ashishchopra0308] [~srowen] I can't reproduce this issue; I get the correct
result, which is consistent with R {{survreg}}. I just pasted your code into
the {{bin/pyspark}} console and got:
{code}
>>> from pyspark.ml.regression import AFTSurvivalRegression
>>> from pyspark.ml.linalg import Vectors
>>> training = spark.createDataFrame([
... (21.218, 1.0, Vectors.dense(1.560, -0.605)),
... (22.949, 0.0, Vectors.dense(0.346, 2.158)),
... (23.627, 0.0, Vectors.dense(1.380, 0.231)),
... (20.273, 1.0, Vectors.dense(0.520, 1.151)),
... (24.199, 0.0, Vectors.dense(0.795, -0.226))], ["label", "censor",
... "features"])
>>> quantileProbabilities = [0.3, 0.6]
>>> aft = AFTSurvivalRegression(quantileProbabilities=quantileProbabilities,
...  quantilesCol="quantiles")
>>> model = aft.fit(training)
>>> print("Coefficients: " + str(model.coefficients))
Coefficients: [-0.065814695216,0.00326705958509]
>>> print("Intercept: " + str(model.intercept))
Intercept: 3.29140205698
>>> print("Scale: " + str(model.scale))
Scale: 0.109856123692
>>> model.transform(training).show(truncate=False)
+--+--+--+--+---+
|label |censor|features  |prediction|quantiles  
|
+--+--+--+--+---+
|21.218|1.0   |[1.56,-0.605] |24.20972861807431 
|[21.617443110471118,23.97833624826161] |
|22.949|0.0   |[0.346,2.158] 
|26.461225875981285|[23.627858619625105,26.208314087493857]|
|23.627|0.0   |[1.38,0.231]  
|24.565240805031497|[21.934888406858644,24.330450511651165]|
|20.273|1.0   |[0.52,1.151]  
|26.074003958175602|[23.28209894956245,25.82479316934075]  |
|24.199|0.0   
|[0.795,-0.226]|25.491396901107077|[22.761875236582238,25.247754569057985]|
+--+--+--+--+---+
{code}
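
For reference, a minimal R sketch of the {{survreg}} comparison on the shifted
labels (this is an assumption about how the cross-check was done; Spark's AFT
parametrization can differ slightly in sign and scale conventions):
{code}
library(survival)

label  <- c(21.218, 22.949, 23.627, 20.273, 24.199)
censor <- c(1, 0, 0, 1, 0)   # 1 = event observed, 0 = censored
x1     <- c(1.560, 0.346, 1.380, 0.520, 0.795)
x2     <- c(-0.605, 2.158, 0.231, 1.151, -0.226)

# Weibull AFT model, the same family that AFTSurvivalRegression fits.
fit <- survreg(Surv(label, censor) ~ x1 + x2, dist = "weibull")
summary(fit)
{code}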


was (Author: yanboliang):
[~ashishchopra0308] [~srowen] I can't reproduce this issue, I can get correct 
result which is consistent with R {{survreg}}.
{code}
>>> from pyspark.ml.regression import AFTSurvivalRegression
>>> from pyspark.ml.linalg import Vectors
>>> training = spark.createDataFrame([
... (21.218, 1.0, Vectors.dense(1.560, -0.605)),
... (22.949, 0.0, Vectors.dense(0.346, 2.158)),
... (23.627, 0.0, Vectors.dense(1.380, 0.231)),
... (20.273, 1.0, Vectors.dense(0.520, 1.151)),
... (24.199, 0.0, Vectors.dense(0.795, -0.226))], ["label", "censor",
... "features"])
>>> quantileProbabilities = [0.3, 0.6]
>>> aft = AFTSurvivalRegression(quantileProbabilities=quantileProbabilities,
...  quantilesCol="quantiles")
>>> model = aft.fit(training)
>>> print("Coefficients: " + str(model.coefficients))
Coefficients: [-0.065814695216,0.00326705958509]
>>> print("Intercept: " + str(model.intercept))
Intercept: 3.29140205698
>>> print("Scale: " + str(model.scale))
Scale: 0.109856123692
>>> model.transform(training).show(truncate=False)
+--+--+--+--+---+
|label |censor|features  |prediction|quantiles  
|
+--+--+--+--+---+
|21.218|1.0   |[1.56,-0.605] |24.20972861807431 
|[21.617443110471118,23.97833624826161] |
|22.949|0.0   |[0.346,2.158] 
|26.461225875981285|[23.627858619625105,26.208314087493857]|
|23.627|0.0   |[1.38,0.231]  
|24.565240805031497|[21.934888406858644,24.330450511651165]|
|20.273|1.0   |[0.52,1.151]  
|26.074003958175602|[23.28209894956245,25.82479316934075]  |
|24.199|0.0   
|[0.795,-0.226]|25.491396901107077|[22.761875236582238,25.247754569057985]|
+--+--+--+--+---+
{code}

> inconsistent behavior of AFTsurvivalRegression algorithm
> 
>
> Key: SPARK-21919
> URL: https://issues.apache.org/jira/browse/SPARK-21919
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.2.0
> Environment: Spark Version: 2.2.0
> Cluster setup: Standalone single node
> Python version: 3.5.2
>Reporter: Ashish Chopra
>
> Took the direct example from spark ml documentation.
> {code}
> training = spark.createDataFrame([
> (1.218, 1.0, Vectors.dense(1.560, -0.605)),
> (2.949, 0.0, Vectors.dense(0.346, 2.158)),
> (3.627, 0.0, Vectors.dense(1.380, 0.231)),

[jira] [Comment Edited] (SPARK-21919) inconsistent behavior of AFTsurvivalRegression algorithm

2017-09-07 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16156979#comment-16156979
 ] 

Yanbo Liang edited comment on SPARK-21919 at 9/7/17 2:02 PM:
-

[~ashishchopra0308] [~srowen] I can't reproduce this issue; I get the correct
result, which is consistent with R {{survreg}}. I just pasted your code into
{{bin/pyspark}} and got:
{code}
>>> from pyspark.ml.regression import AFTSurvivalRegression
>>> from pyspark.ml.linalg import Vectors
>>> training = spark.createDataFrame([
... (21.218, 1.0, Vectors.dense(1.560, -0.605)),
... (22.949, 0.0, Vectors.dense(0.346, 2.158)),
... (23.627, 0.0, Vectors.dense(1.380, 0.231)),
... (20.273, 1.0, Vectors.dense(0.520, 1.151)),
... (24.199, 0.0, Vectors.dense(0.795, -0.226))], ["label", "censor",
... "features"])
>>> quantileProbabilities = [0.3, 0.6]
>>> aft = AFTSurvivalRegression(quantileProbabilities=quantileProbabilities,
...  quantilesCol="quantiles")
>>> model = aft.fit(training)
>>> print("Coefficients: " + str(model.coefficients))
Coefficients: [-0.065814695216,0.00326705958509]
>>> print("Intercept: " + str(model.intercept))
Intercept: 3.29140205698
>>> print("Scale: " + str(model.scale))
Scale: 0.109856123692
>>> model.transform(training).show(truncate=False)
+--+--+--+--+---+
|label |censor|features  |prediction|quantiles  
|
+--+--+--+--+---+
|21.218|1.0   |[1.56,-0.605] |24.20972861807431 
|[21.617443110471118,23.97833624826161] |
|22.949|0.0   |[0.346,2.158] 
|26.461225875981285|[23.627858619625105,26.208314087493857]|
|23.627|0.0   |[1.38,0.231]  
|24.565240805031497|[21.934888406858644,24.330450511651165]|
|20.273|1.0   |[0.52,1.151]  
|26.074003958175602|[23.28209894956245,25.82479316934075]  |
|24.199|0.0   
|[0.795,-0.226]|25.491396901107077|[22.761875236582238,25.247754569057985]|
+--+--+--+--+---+
{code}


was (Author: yanboliang):
[~ashishchopra0308] [~srowen] I can't reproduce this issue, I can get correct 
result which is consistent with R {{survreg}}. I just paste your code into 
console of {{bin/pyspark}} and got:
{code}
>>> from pyspark.ml.regression import AFTSurvivalRegression
>>> from pyspark.ml.linalg import Vectors
>>> training = spark.createDataFrame([
... (21.218, 1.0, Vectors.dense(1.560, -0.605)),
... (22.949, 0.0, Vectors.dense(0.346, 2.158)),
... (23.627, 0.0, Vectors.dense(1.380, 0.231)),
... (20.273, 1.0, Vectors.dense(0.520, 1.151)),
... (24.199, 0.0, Vectors.dense(0.795, -0.226))], ["label", "censor",
... "features"])
>>> quantileProbabilities = [0.3, 0.6]
>>> aft = AFTSurvivalRegression(quantileProbabilities=quantileProbabilities,
...  quantilesCol="quantiles")
>>> model = aft.fit(training)
>>> print("Coefficients: " + str(model.coefficients))
Coefficients: [-0.065814695216,0.00326705958509]
>>> print("Intercept: " + str(model.intercept))
Intercept: 3.29140205698
>>> print("Scale: " + str(model.scale))
Scale: 0.109856123692
>>> model.transform(training).show(truncate=False)
+--+--+--+--+---+
|label |censor|features  |prediction|quantiles  
|
+--+--+--+--+---+
|21.218|1.0   |[1.56,-0.605] |24.20972861807431 
|[21.617443110471118,23.97833624826161] |
|22.949|0.0   |[0.346,2.158] 
|26.461225875981285|[23.627858619625105,26.208314087493857]|
|23.627|0.0   |[1.38,0.231]  
|24.565240805031497|[21.934888406858644,24.330450511651165]|
|20.273|1.0   |[0.52,1.151]  
|26.074003958175602|[23.28209894956245,25.82479316934075]  |
|24.199|0.0   
|[0.795,-0.226]|25.491396901107077|[22.761875236582238,25.247754569057985]|
+--+--+--+--+---+
{code}

> inconsistent behavior of AFTsurvivalRegression algorithm
> 
>
> Key: SPARK-21919
> URL: https://issues.apache.org/jira/browse/SPARK-21919
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.2.0
> Environment: Spark Version: 2.2.0
> Cluster setup: Standalone single node
> Python version: 3.5.2
>Reporter: Ashish Chopra
>
> Took the direct example from spark ml documentation.
> {code}
> training = spark.createDataFrame([
> (1.218, 1.0, Vectors.dense(1.560, -0.605)),
> (2.949, 0.0, Vectors.dense(0.346, 2.158)

[jira] [Comment Edited] (SPARK-21919) inconsistent behavior of AFTsurvivalRegression algorithm

2017-09-07 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16156979#comment-16156979
 ] 

Yanbo Liang edited comment on SPARK-21919 at 9/7/17 1:59 PM:
-

[~ashishchopra0308] [~srowen] I can't reproduce this issue; I get the correct
result, which is consistent with R {{survreg}}.
{code}
>>> from pyspark.ml.regression import AFTSurvivalRegression
>>> from pyspark.ml.linalg import Vectors
>>> training = spark.createDataFrame([
... (21.218, 1.0, Vectors.dense(1.560, -0.605)),
... (22.949, 0.0, Vectors.dense(0.346, 2.158)),
... (23.627, 0.0, Vectors.dense(1.380, 0.231)),
... (20.273, 1.0, Vectors.dense(0.520, 1.151)),
... (24.199, 0.0, Vectors.dense(0.795, -0.226))], ["label", "censor",
... "features"])
>>> quantileProbabilities = [0.3, 0.6]
>>> aft = AFTSurvivalRegression(quantileProbabilities=quantileProbabilities,
...  quantilesCol="quantiles")
>>> model = aft.fit(training)
>>> print("Coefficients: " + str(model.coefficients))
Coefficients: [-0.065814695216,0.00326705958509]
>>> print("Intercept: " + str(model.intercept))
Intercept: 3.29140205698
>>> print("Scale: " + str(model.scale))
Scale: 0.109856123692
>>> model.transform(training).show(truncate=False)
+--+--+--+--+---+
|label |censor|features  |prediction|quantiles  
|
+--+--+--+--+---+
|21.218|1.0   |[1.56,-0.605] |24.20972861807431 
|[21.617443110471118,23.97833624826161] |
|22.949|0.0   |[0.346,2.158] 
|26.461225875981285|[23.627858619625105,26.208314087493857]|
|23.627|0.0   |[1.38,0.231]  
|24.565240805031497|[21.934888406858644,24.330450511651165]|
|20.273|1.0   |[0.52,1.151]  
|26.074003958175602|[23.28209894956245,25.82479316934075]  |
|24.199|0.0   
|[0.795,-0.226]|25.491396901107077|[22.761875236582238,25.247754569057985]|
+--+--+--+--+---+
{code}


was (Author: yanboliang):
[~ashishchopra0308] [~srowen] I can't reproduce this issue, I can get correct 
result which is consistent with R {{survreg}}.
{code}
>>> from pyspark.ml.regression import AFTSurvivalRegression
>>> from pyspark.ml.linalg import Vectors
>>> training = spark.createDataFrame([
... (21.218, 1.0, Vectors.dense(1.560, -0.605)),
... (22.949, 0.0, Vectors.dense(0.346, 2.158)),
... (23.627, 0.0, Vectors.dense(1.380, 0.231)),
... (20.273, 1.0, Vectors.dense(0.520, 1.151)),
... (24.199, 0.0, Vectors.dense(0.795, -0.226))], ["label", "censor",
... "features"])
>>> quantileProbabilities = [0.3, 0.6]
>>> aft = AFTSurvivalRegression(quantileProbabilities=quantileProbabilities,
...  quantilesCol="quantiles")
>>> model = aft.fit(training)
>>> print("Coefficients: " + str(model.coefficients))
Coefficients: [-0.065814695216,0.00326705958509]
>>> print("Intercept: " + str(model.intercept))
Intercept: 3.29140205698
>>> print("Scale: " + str(model.scale))
Scale: 0.109856123692
>>> model.transform(training).show(truncate=False)
17/09/07 21:55:05 WARN BLAS: Failed to load implementation from: 
com.github.fommil.netlib.NativeSystemBLAS
17/09/07 21:55:05 WARN BLAS: Failed to load implementation from: 
com.github.fommil.netlib.NativeRefBLAS
+--+--+--+--+---+
|label |censor|features  |prediction|quantiles  
|
+--+--+--+--+---+
|21.218|1.0   |[1.56,-0.605] |24.20972861807431 
|[21.617443110471118,23.97833624826161] |
|22.949|0.0   |[0.346,2.158] 
|26.461225875981285|[23.627858619625105,26.208314087493857]|
|23.627|0.0   |[1.38,0.231]  
|24.565240805031497|[21.934888406858644,24.330450511651165]|
|20.273|1.0   |[0.52,1.151]  
|26.074003958175602|[23.28209894956245,25.82479316934075]  |
|24.199|0.0   
|[0.795,-0.226]|25.491396901107077|[22.761875236582238,25.247754569057985]|
+--+--+--+--+---+
{code}

> inconsistent behavior of AFTsurvivalRegression algorithm
> 
>
> Key: SPARK-21919
> URL: https://issues.apache.org/jira/browse/SPARK-21919
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.2.0
> Environment: Spark Version: 2.2.0
> Cluster setup: Standalone single node
> Python version: 3.5.2
>Reporter: Ashish Chopra
>
> Took the direct example from spark ml documentation.
> {code}
> training = spark.createDataFrame([
> (

[jira] [Comment Edited] (SPARK-21919) inconsistent behavior of AFTsurvivalRegression algorithm

2017-09-07 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16156979#comment-16156979
 ] 

Yanbo Liang edited comment on SPARK-21919 at 9/7/17 1:58 PM:
-

[~ashishchopra0308] [~srowen] I can't reproduce this issue; I get the correct
result, which is consistent with R {{survreg}}.
{code}
>>> from pyspark.ml.regression import AFTSurvivalRegression
>>> from pyspark.ml.linalg import Vectors
>>> training = spark.createDataFrame([
... (21.218, 1.0, Vectors.dense(1.560, -0.605)),
... (22.949, 0.0, Vectors.dense(0.346, 2.158)),
... (23.627, 0.0, Vectors.dense(1.380, 0.231)),
... (20.273, 1.0, Vectors.dense(0.520, 1.151)),
... (24.199, 0.0, Vectors.dense(0.795, -0.226))], ["label", "censor",
... "features"])
>>> quantileProbabilities = [0.3, 0.6]
>>> aft = AFTSurvivalRegression(quantileProbabilities=quantileProbabilities,
...  quantilesCol="quantiles")
>>> model = aft.fit(training)
>>> print("Coefficients: " + str(model.coefficients))
Coefficients: [-0.065814695216,0.00326705958509]
>>> print("Intercept: " + str(model.intercept))
Intercept: 3.29140205698
>>> print("Scale: " + str(model.scale))
Scale: 0.109856123692
>>> model.transform(training).show(truncate=False)
17/09/07 21:55:05 WARN BLAS: Failed to load implementation from: 
com.github.fommil.netlib.NativeSystemBLAS
17/09/07 21:55:05 WARN BLAS: Failed to load implementation from: 
com.github.fommil.netlib.NativeRefBLAS
+--+--+--+--+---+
|label |censor|features  |prediction|quantiles  
|
+--+--+--+--+---+
|21.218|1.0   |[1.56,-0.605] |24.20972861807431 
|[21.617443110471118,23.97833624826161] |
|22.949|0.0   |[0.346,2.158] 
|26.461225875981285|[23.627858619625105,26.208314087493857]|
|23.627|0.0   |[1.38,0.231]  
|24.565240805031497|[21.934888406858644,24.330450511651165]|
|20.273|1.0   |[0.52,1.151]  
|26.074003958175602|[23.28209894956245,25.82479316934075]  |
|24.199|0.0   
|[0.795,-0.226]|25.491396901107077|[22.761875236582238,25.247754569057985]|
+--+--+--+--+---+
{code}


was (Author: yanboliang):
[~ashishchopra0308] [~srowen] I can't reproduce this issue, I can get correct 
result which is consistent with R {{survreg}}.
{code}
>>> from pyspark.ml.regression import AFTSurvivalRegression
>>> from pyspark.ml.linalg import Vectors
>>> training = spark.createDataFrame([
... (21.218, 1.0, Vectors.dense(1.560, -0.605)),
... (22.949, 0.0, Vectors.dense(0.346, 2.158)),
... (23.627, 0.0, Vectors.dense(1.380, 0.231)),
... (20.273, 1.0, Vectors.dense(0.520, 1.151)),
... (24.199, 0.0, Vectors.dense(0.795, -0.226))], ["label", "censor",
... "features"])
>>> quantileProbabilities = [0.3, 0.6]
>>> aft = AFTSurvivalRegression(quantileProbabilities=quantileProbabilities,
...  quantilesCol="quantiles")
>>> model = aft.fit(training)
17/09/07 21:54:31 ERROR StrongWolfeLineSearch: Encountered bad values in 
function evaluation. Decreasing step size to 0.5
17/09/07 21:54:31 ERROR StrongWolfeLineSearch: Encountered bad values in 
function evaluation. Decreasing step size to 0.25
17/09/07 21:54:31 ERROR StrongWolfeLineSearch: Encountered bad values in 
function evaluation. Decreasing step size to 0.5
17/09/07 21:54:31 ERROR StrongWolfeLineSearch: Encountered bad values in 
function evaluation. Decreasing step size to 0.25
17/09/07 21:54:31 ERROR StrongWolfeLineSearch: Encountered bad values in 
function evaluation. Decreasing step size to 0.125
>>> print("Coefficients: " + str(model.coefficients))
Coefficients: [-0.065814695216,0.00326705958509]
>>> print("Intercept: " + str(model.intercept))
Intercept: 3.29140205698
>>> print("Scale: " + str(model.scale))
Scale: 0.109856123692
>>> model.transform(training).show(truncate=False)
17/09/07 21:55:05 WARN BLAS: Failed to load implementation from: 
com.github.fommil.netlib.NativeSystemBLAS
17/09/07 21:55:05 WARN BLAS: Failed to load implementation from: 
com.github.fommil.netlib.NativeRefBLAS
+--+--+--+--+---+
|label |censor|features  |prediction|quantiles  
|
+--+--+--+--+---+
|21.218|1.0   |[1.56,-0.605] |24.20972861807431 
|[21.617443110471118,23.97833624826161] |
|22.949|0.0   |[0.346,2.158] 
|26.461225875981285|[23.627858619625105,26.208314087493857]|
|23.627|0.0   |[1.38,0.231]  
|24.565240805031497|[21.934888406858644,24.330450511651165]|
|20.273|1.0   |[0.52,1.151]  
|26.074003958175602|[23.28209

[jira] [Commented] (SPARK-21919) inconsistent behavior of AFTsurvivalRegression algorithm

2017-09-07 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16156979#comment-16156979
 ] 

Yanbo Liang commented on SPARK-21919:
-

[~ashishchopra0308] [~srowen] I can't reproduce this issue; I get the correct
result, which is consistent with R {{survreg}}.
{code}
>>> from pyspark.ml.regression import AFTSurvivalRegression
>>> from pyspark.ml.linalg import Vectors
>>> training = spark.createDataFrame([
... (21.218, 1.0, Vectors.dense(1.560, -0.605)),
... (22.949, 0.0, Vectors.dense(0.346, 2.158)),
... (23.627, 0.0, Vectors.dense(1.380, 0.231)),
... (20.273, 1.0, Vectors.dense(0.520, 1.151)),
... (24.199, 0.0, Vectors.dense(0.795, -0.226))], ["label", "censor",
... "features"])
>>> quantileProbabilities = [0.3, 0.6]
>>> aft = AFTSurvivalRegression(quantileProbabilities=quantileProbabilities,
...  quantilesCol="quantiles")
>>> model = aft.fit(training)
17/09/07 21:54:31 ERROR StrongWolfeLineSearch: Encountered bad values in 
function evaluation. Decreasing step size to 0.5
17/09/07 21:54:31 ERROR StrongWolfeLineSearch: Encountered bad values in 
function evaluation. Decreasing step size to 0.25
17/09/07 21:54:31 ERROR StrongWolfeLineSearch: Encountered bad values in 
function evaluation. Decreasing step size to 0.5
17/09/07 21:54:31 ERROR StrongWolfeLineSearch: Encountered bad values in 
function evaluation. Decreasing step size to 0.25
17/09/07 21:54:31 ERROR StrongWolfeLineSearch: Encountered bad values in 
function evaluation. Decreasing step size to 0.125
>>> print("Coefficients: " + str(model.coefficients))
Coefficients: [-0.065814695216,0.00326705958509]
>>> print("Intercept: " + str(model.intercept))
Intercept: 3.29140205698
>>> print("Scale: " + str(model.scale))
Scale: 0.109856123692
>>> model.transform(training).show(truncate=False)
17/09/07 21:55:05 WARN BLAS: Failed to load implementation from: 
com.github.fommil.netlib.NativeSystemBLAS
17/09/07 21:55:05 WARN BLAS: Failed to load implementation from: 
com.github.fommil.netlib.NativeRefBLAS
+--+--+--+--+---+
|label |censor|features  |prediction|quantiles  
|
+--+--+--+--+---+
|21.218|1.0   |[1.56,-0.605] |24.20972861807431 
|[21.617443110471118,23.97833624826161] |
|22.949|0.0   |[0.346,2.158] 
|26.461225875981285|[23.627858619625105,26.208314087493857]|
|23.627|0.0   |[1.38,0.231]  
|24.565240805031497|[21.934888406858644,24.330450511651165]|
|20.273|1.0   |[0.52,1.151]  
|26.074003958175602|[23.28209894956245,25.82479316934075]  |
|24.199|0.0   
|[0.795,-0.226]|25.491396901107077|[22.761875236582238,25.247754569057985]|
+--+--+--+--+---+
{code}

> inconsistent behavior of AFTsurvivalRegression algorithm
> 
>
> Key: SPARK-21919
> URL: https://issues.apache.org/jira/browse/SPARK-21919
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.2.0
> Environment: Spark Version: 2.2.0
> Cluster setup: Standalone single node
> Python version: 3.5.2
>Reporter: Ashish Chopra
>
> Took the direct example from spark ml documentation.
> {code}
> training = spark.createDataFrame([
> (1.218, 1.0, Vectors.dense(1.560, -0.605)),
> (2.949, 0.0, Vectors.dense(0.346, 2.158)),
> (3.627, 0.0, Vectors.dense(1.380, 0.231)),
> (0.273, 1.0, Vectors.dense(0.520, 1.151)),
> (4.199, 0.0, Vectors.dense(0.795, -0.226))], ["label", "censor", 
> "features"])
> quantileProbabilities = [0.3, 0.6]
> aft = AFTSurvivalRegression(quantileProbabilities=quantileProbabilities,
> quantilesCol="quantiles")
> #aft = AFTSurvivalRegression()
> model = aft.fit(training)
> 
> # Print the coefficients, intercept and scale parameter for AFT survival 
> regression
> print("Coefficients: " + str(model.coefficients))
> print("Intercept: " + str(model.intercept))
> print("Scale: " + str(model.scale))
> model.transform(training).show(truncate=False)
> {code}
> result is:
> Coefficients: [-0.496304411053,0.198452172529]
> Intercept: 2.6380898963056327
> Scale: 1.5472363533632303
> ||label||censor||features  ||prediction   || quantiles ||
> |1.218|1.0   |[1.56,-0.605] |5.718985621018951 | 
> [1.160322990805951,4.99546058340675]|
> |2.949|0.0   |[0.346,2.158] |18.07678210850554 
> |[3.66759199449632,15.789837303662042]|
> |3.627|0.0   |[1.38,0.231]  |7.381908879359964 
> |[1.4977129086101573,6.4480027195054905]|
> |0.273|1.0   |[0.

[jira] [Commented] (SPARK-21919) inconsistent behavior of AFTsurvivalRegression algorithm

2017-09-06 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16156579#comment-16156579
 ] 

Yanbo Liang commented on SPARK-21919:
-

[~srowen] I will take a look at it. Thanks.

> inconsistent behavior of AFTsurvivalRegression algorithm
> 
>
> Key: SPARK-21919
> URL: https://issues.apache.org/jira/browse/SPARK-21919
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 2.2.0
> Environment: Spark Version: 2.2.0
> Cluster setup: Standalone single node
> Python version: 3.5.2
>Reporter: Ashish Chopra
>
> Took the direct example from spark ml documentation.
> {code}
> training = spark.createDataFrame([
> (1.218, 1.0, Vectors.dense(1.560, -0.605)),
> (2.949, 0.0, Vectors.dense(0.346, 2.158)),
> (3.627, 0.0, Vectors.dense(1.380, 0.231)),
> (0.273, 1.0, Vectors.dense(0.520, 1.151)),
> (4.199, 0.0, Vectors.dense(0.795, -0.226))], ["label", "censor", 
> "features"])
> quantileProbabilities = [0.3, 0.6]
> aft = AFTSurvivalRegression(quantileProbabilities=quantileProbabilities,
> quantilesCol="quantiles")
> #aft = AFTSurvivalRegression()
> model = aft.fit(training)
> 
> # Print the coefficients, intercept and scale parameter for AFT survival 
> regression
> print("Coefficients: " + str(model.coefficients))
> print("Intercept: " + str(model.intercept))
> print("Scale: " + str(model.scale))
> model.transform(training).show(truncate=False)
> {code}
> result is:
> Coefficients: [-0.496304411053,0.198452172529]
> Intercept: 2.6380898963056327
> Scale: 1.5472363533632303
> ||label||censor||features  ||prediction   || quantiles ||
> |1.218|1.0   |[1.56,-0.605] |5.718985621018951 | 
> [1.160322990805951,4.99546058340675]|
> |2.949|0.0   |[0.346,2.158] |18.07678210850554 
> |[3.66759199449632,15.789837303662042]|
> |3.627|0.0   |[1.38,0.231]  |7.381908879359964 
> |[1.4977129086101573,6.4480027195054905]|
> |0.273|1.0   |[0.52,1.151]  
> |13.577717814884505|[2.754778414791513,11.859962351993202]|
> |4.199|0.0   |[0.795,-0.226]|9.013087597344805 
> |[1.828662187733188,7.8728164067854856]|
> But if we change the value of all labels as label + 20. as:
> {code}
> training = spark.createDataFrame([
> (21.218, 1.0, Vectors.dense(1.560, -0.605)),
> (22.949, 0.0, Vectors.dense(0.346, 2.158)),
> (23.627, 0.0, Vectors.dense(1.380, 0.231)),
> (20.273, 1.0, Vectors.dense(0.520, 1.151)),
> (24.199, 0.0, Vectors.dense(0.795, -0.226))], ["label", "censor", 
> "features"])
> quantileProbabilities = [0.3, 0.6]
> aft = AFTSurvivalRegression(quantileProbabilities=quantileProbabilities,
>  quantilesCol="quantiles")
> #aft = AFTSurvivalRegression()
> model = aft.fit(training)
> 
> # Print the coefficients, intercept and scale parameter for AFT survival 
> regression
> print("Coefficients: " + str(model.coefficients))
> print("Intercept: " + str(model.intercept))
> print("Scale: " + str(model.scale))
> model.transform(training).show(truncate=False)
> {code}
> result changes to:
> Coefficients: [23.9932020748,3.18105314757]
> Intercept: 7.35052273751137
> Scale: 7698609960.724161
> ||label ||censor||features  ||prediction   ||quantiles||
> |21.218|1.0   |[1.56,-0.605] |4.0912442688237169E18|[0.0,0.0]|
> |22.949|0.0   |[0.346,2.158] |6.011158613411288E9  |[0.0,0.0]|
> |23.627|0.0   |[1.38,0.231]  |7.7835948690311181E17|[0.0,0.0]|
> |20.273|1.0   |[0.52,1.151]  |1.5880852723124176E10|[0.0,0.0]|
> |24.199|0.0   |[0.795,-0.226]|1.4590190884193677E11|[0.0,0.0]|
> Can someone please explain this exponential blow up in prediction, as per my 
> understanding prediction in AFT is a prediction of the time when the failure 
> event will occur, not able to understand why it will change exponentially 
> against the value of the label.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark

2017-09-06 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16155422#comment-16155422
 ] 

Yanbo Liang commented on SPARK-21866:
-

[~timhunter] Fair enough.

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Targets users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
> {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
> * StructField("mode", StringType(), False),
> ** The exact representation of the data.
> ** The values are described in the following OpenCV convention. Basically, 
> the type has both "depth" and "number of channels" info: in particular, type 
> "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 
> (value 32 in the table) with the channel order specified by convention.
> ** The exact channel ordering and meaning of each channel is dictated by 
> convention. By default, the order is RGB (3 channels) and BGRA (4 channels).
> If the image failed to load, the value is the empty string "".
> * StructField("origin", StringType(), True),
> ** Some informa

[jira] [Commented] (SPARK-15689) Data source API v2

2017-09-05 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153709#comment-16153709
 ] 

Yanbo Liang commented on SPARK-15689:
-

[~cloud_fan] Great doc! I like the direction this is going in, especially the
columnar read interface, the schema inference interface, and schema evolution.
Thanks.

> Data source API v2
> --
>
> Key: SPARK-15689
> URL: https://issues.apache.org/jira/browse/SPARK-15689
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Reynold Xin
>  Labels: SPIP, releasenotes
> Attachments: SPIP Data Source API V2.pdf
>
>
> This ticket tracks progress in creating the v2 of data source API. This new 
> API should focus on:
> 1. Have a small surface so it is easy to freeze and maintain compatibility 
> for a long time. Ideally, this API should survive architectural rewrites and 
> user-facing API revamps of Spark.
> 2. Have a well-defined column batch interface for high performance. 
> Convenience methods should exist to convert row-oriented formats into column 
> batches for data source developers.
> 3. Still support filter push down, similar to the existing API.
> 4. Nice-to-have: support additional common operators, including limit and 
> sampling.
> Note that both 1 and 2 are problems that the current data source API (v1) 
> suffers. The current data source API has a wide surface with dependency on 
> DataFrame/SQLContext, making the data source API compatibility depending on 
> the upper level API. The current data source API is also only row oriented 
> and has to go through an expensive external data type conversion to internal 
> data type.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21866) SPIP: Image support in Spark

2017-09-05 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153647#comment-16153647
 ] 

Yanbo Liang edited comment on SPARK-21866 at 9/5/17 1:39 PM:
-

I generally support this effort. Providing a common image storage format and
data source in Spark is good to have, since it lets users try different deep
neural network models conveniently. AFAIK, lots of users would be interested in
applying existing deep neural network models to their own datasets, that is to
say, model inference, which Spark can run in a distributed fashion. Thanks for
this proposal.
[~timhunter] I have two questions regarding this SPIP:
1. As you describe above, {{org.apache.spark.image}} is the proposed package
structure, under the MLlib project. If this package will only contain the
common image storage format and data source support, should we organize it as
{{org.apache.spark.ml.image}} or {{org.apache.spark.ml.source.image}}? We
already have {{libsvm}} support under {{org.apache.spark.ml.source}}.
2. From the API's perspective, could we follow the other Spark SQL data sources
as closely as possible? Even if we don't use a UDT, a familiar API would
encourage more users to adopt it. For example, the following API would be
friendlier to Spark users. Is there any obstacle to implementing it like this?
{code}
spark.read.image(path, recursive, numPartitions, dropImageFailures, sampleRatio)
spark.write.image(path, ...)
{code}
If I have misunderstood anything, please feel free to correct me. Thanks.


was (Author: yanboliang):
I would support this effort generally. For Spark, to provide a general image 
storage format and data source is good to have. This can let users try 
different deep neural network models convenience. AFAIK, lots of users would be 
interested in applying existing deep neural models to their own dataset, that 
is to say, model inference, which can be distributed running by Spark. Thanks 
for this proposal.
[~timhunter] I have two questions regarding this SPIP:
1, As you describe above: {{org.apache.spark.image}} is the package structure, 
under the MLlib project.
If this package would only contain the common image storage format and data 
source support, should we organize the package structure as 
{{org.apache.spark.ml.image}} or {{org.apache.spark.ml.source.image}}? We 
already have {{libsvm}} support under {{org.apache.spark.ml.source}}.
2, From the API's perspective, could we follow other Spark SQL data source to 
the greatest extent? Even we don't use UDT, a familiar API would make more 
users to adopt it. For example, the following API would be more friendly to 
Spark users.
{code}
spark.read.image(path, recursive, numPartitions, dropImageFailures, sampleRatio)
spark.write.image(path, ...)
{code}
If I have misunderstand, please feel free to correct me. Thanks.

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compres

[jira] [Comment Edited] (SPARK-21866) SPIP: Image support in Spark

2017-09-05 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153647#comment-16153647
 ] 

Yanbo Liang edited comment on SPARK-21866 at 9/5/17 1:38 PM:
-

I generally support this effort. Providing a common image storage format and
data source in Spark is good to have, since it lets users try different deep
neural network models conveniently. AFAIK, lots of users would be interested in
applying existing deep neural network models to their own datasets, that is to
say, model inference, which Spark can run in a distributed fashion. Thanks for
this proposal.
[~timhunter] I have two questions regarding this SPIP:
1. As you describe above, {{org.apache.spark.image}} is the proposed package
structure, under the MLlib project. If this package will only contain the
common image storage format and data source support, should we organize it as
{{org.apache.spark.ml.image}} or {{org.apache.spark.ml.source.image}}? We
already have {{libsvm}} support under {{org.apache.spark.ml.source}}.
2. From the API's perspective, could we follow the other Spark SQL data sources
as closely as possible? Even if we don't use a UDT, a familiar API would
encourage more users to adopt it. For example, the following API would be
friendlier to Spark users.
{code}
spark.read.image(path, recursive, numPartitions, dropImageFailures, sampleRatio)
spark.write.image(path, ...)
{code}
If I have misunderstood anything, please feel free to correct me. Thanks.


was (Author: yanboliang):
I would support this effort generally. For Spark, to provide a general image 
storage format and data source is good to have. This can let users try 
different deep neural network models convenience. AFAIK, lots of users would be 
interested in applying existing deep neural models to their own dataset, that 
is to say, model inference, which can be distributed running by Spark. Thanks 
for this proposal.
[~timhunter] I have two questions regarding this SPIP:
1, As you describe above: {{org.apache.spark.image}} is the package structure, 
under the MLlib project.
If this package would only contain the common image storage format and data 
source support, should we organize the package structure as 
{{org.apache.spark.ml.image}} or {{org.apache.spark.ml.source.image}}? We 
already have {{libsvm}} support under {{org.apache.spark.ml.source}}.
2, From the API's perspective, could we follow other Spark SQL data source to 
the greatest extent? Even we don't use UDT, a familiar API would make more 
users to adopt it. For example, an API like following would be very friendly to 
Spark users.
{code}
spark.read.image(path, recursive, numPartitions, dropImageFailures, sampleRatio)
spark.write.image(path, ...)
{code}
If I have misunderstand, please feel free to correct me. Thanks.

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PN

[jira] [Comment Edited] (SPARK-21866) SPIP: Image support in Spark

2017-09-05 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153647#comment-16153647
 ] 

Yanbo Liang edited comment on SPARK-21866 at 9/5/17 1:37 PM:
-

I generally support this effort. Providing a common image storage format and
data source in Spark is good to have, since it lets users try different deep
neural network models conveniently. AFAIK, lots of users would be interested in
applying existing deep neural network models to their own datasets, that is to
say, model inference, which Spark can run in a distributed fashion. Thanks for
this proposal.
[~timhunter] I have two questions regarding this SPIP:
1. As you describe above, {{org.apache.spark.image}} is the proposed package
structure, under the MLlib project. If this package will only contain the
common image storage format and data source support, should we organize it as
{{org.apache.spark.ml.image}} or {{org.apache.spark.ml.source.image}}? We
already have {{libsvm}} support under {{org.apache.spark.ml.source}}.
2. From the API's perspective, could we follow the other Spark SQL data sources
as closely as possible? Even if we don't use a UDT, a familiar API would
encourage more users to adopt it. For example, an API like the following would
be very friendly to Spark users.
{code}
spark.read.image(path, recursive, numPartitions, dropImageFailures, sampleRatio)
spark.write.image(path, ...)
{code}
If I have misunderstood anything, please feel free to correct me. Thanks.


was (Author: yanboliang):
I would support this effort generally. For Spark, to provide a general image 
storage format and data source is good to have. This can let users try 
different deep neural network models convenience. AFAIK, lots of users would be 
interested in applying existing deep neural models to their own dataset, that 
is to say, model inference, which can be distributed running by Spark. Thanks 
for this proposal.
[~timhunter] I have two questions regarding this SPIP:
1, As you describe above: {{org.apache.spark.image}} is the package structure, 
under the MLlib project.
If this package would only contain the common image storage format and data 
source support, should we organize the package structure as 
{{org.apache.spark.ml.image}} or {{org.apache.spark.ml.source.image}}? We 
already have {{libsvm}} support under {{org.apache.spark.ml.source}}.
2, From the API's perspective, I'd support to follow other Spark SQL data 
source to the greatest extent. Even we don't use UDT, a familiar API would make 
more users to adopt it.
{code}
spark.read.image(path, recursive, numPartitions, dropImageFailures, sampleRatio)
spark.write.image(path, ...)
{code}
If I have misunderstand, please feel free to correct me. Thanks.

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image p

[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark

2017-09-05 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153647#comment-16153647
 ] 

Yanbo Liang commented on SPARK-21866:
-

I generally support this effort. Providing a common image storage format and
data source in Spark is good to have, since it lets users try different deep
neural network models conveniently. AFAIK, lots of users would be interested in
applying existing deep neural network models to their own datasets, that is to
say, model inference, which Spark can run in a distributed fashion. Thanks for
this proposal.
[~timhunter] I have two questions regarding this SPIP:
1. As you describe above, {{org.apache.spark.image}} is the proposed package
structure, under the MLlib project. If this package will only contain the
common image storage format and data source support, should we organize it as
{{org.apache.spark.ml.image}} or {{org.apache.spark.ml.source.image}}? We
already have {{libsvm}} support under {{org.apache.spark.ml.source}}.
2. From the API's perspective, I'd support following the other Spark SQL data
sources as closely as possible. Even if we don't use a UDT, a familiar API would
encourage more users to adopt it.
{code}
spark.read.image(path, recursive, numPartitions, dropImageFailures, sampleRatio)
spark.write.image(path, ...)
{code}
If I have misunderstood anything, please feel free to correct me. Thanks.

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Targets users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the sta

[jira] [Commented] (SPARK-21727) Operating on an ArrayType in a SparkR DataFrame throws error

2017-09-05 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153194#comment-16153194
 ] 

Yanbo Liang commented on SPARK-21727:
-

[~neilalex] Please feel free to take this task. Thanks.

> Operating on an ArrayType in a SparkR DataFrame throws error
> 
>
> Key: SPARK-21727
> URL: https://issues.apache.org/jira/browse/SPARK-21727
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Neil McQuarrie
>
> Previously 
> [posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements]
>  this as a stack overflow question but it seems to be a bug.
> If I have an R data.frame where one of the column data types is an integer 
> *list* -- i.e., each of the elements in the column embeds an entire R list of 
> integers -- then it seems I can convert this data.frame to a SparkR DataFrame 
> just fine... SparkR treats the column as ArrayType(Double). 
> However, any subsequent operation on this SparkR DataFrame appears to throw 
> an error.
> Create an example R data.frame:
> {code}
> indices <- 1:4
> myDf <- data.frame(indices)
> myDf$data <- list(rep(0, 20))
> {code}
> Examine it to make sure it looks okay:
> {code}
> > str(myDf) 
> 'data.frame':   4 obs. of  2 variables:  
>  $ indices: int  1 2 3 4  
>  $ data   :List of 4
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
> > head(myDf)   
>   indices   data 
> 1   1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 2   2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 3   3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 4   4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
> {code}
> Convert it to a SparkR DataFrame:
> {code}
> library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib"))
> sparkR.session(master = "local[*]")
> mySparkDf <- as.DataFrame(myDf)
> {code}
> Examine the SparkR DataFrame schema; notice that the list column was 
> successfully converted to ArrayType:
> {code}
> > schema(mySparkDf)
> StructType
> |-name = "indices", type = "IntegerType", nullable = TRUE
> |-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE
> {code}
> However, operating on the SparkR DataFrame throws an error:
> {code}
> > collect(mySparkDf)
> 17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
> (TID 1)
> java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
> java.lang.Double is not a valid external type for schema of array
> if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
> else validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
> org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0
> ... long stack trace ...
> {code}
> Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21727) Operating on an ArrayType in a SparkR DataFrame throws error

2017-09-04 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16152639#comment-16152639
 ] 

Yanbo Liang commented on SPARK-21727:
-

[~felixcheung] What do you mean by this comment?
{quote}
But with that said, I think we could and should make a minor change to support 
that implicitly
https://github.com/apache/spark/blob/master/R/pkg/R/serialize.R#L39
{quote}
How can we get the SerDe type of an atomic vector? As I mentioned above,
{code}
> class(rep(0, 20))
[1] "numeric"
> class(as.list(rep(0, 20)))
[1] "list"
{code}
The _class_ function can't return the type _vector_, so how can we determine whether an 
object is a _vector_ or _numeric_? Thanks.


> Operating on an ArrayType in a SparkR DataFrame throws error
> 
>
> Key: SPARK-21727
> URL: https://issues.apache.org/jira/browse/SPARK-21727
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Neil McQuarrie
>
> Previously 
> [posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements]
>  this as a stack overflow question but it seems to be a bug.
> If I have an R data.frame where one of the column data types is an integer 
> *list* -- i.e., each of the elements in the column embeds an entire R list of 
> integers -- then it seems I can convert this data.frame to a SparkR DataFrame 
> just fine... SparkR treats the column as ArrayType(Double). 
> However, any subsequent operation on this SparkR DataFrame appears to throw 
> an error.
> Create an example R data.frame:
> {code}
> indices <- 1:4
> myDf <- data.frame(indices)
> myDf$data <- list(rep(0, 20))
> {code}
> Examine it to make sure it looks okay:
> {code}
> > str(myDf) 
> 'data.frame':   4 obs. of  2 variables:  
>  $ indices: int  1 2 3 4  
>  $ data   :List of 4
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
> > head(myDf)   
>   indices   data 
> 1   1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 2   2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 3   3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 4   4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
> {code}
> Convert it to a SparkR DataFrame:
> {code}
> library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib"))
> sparkR.session(master = "local[*]")
> mySparkDf <- as.DataFrame(myDf)
> {code}
> Examine the SparkR DataFrame schema; notice that the list column was 
> successfully converted to ArrayType:
> {code}
> > schema(mySparkDf)
> StructType
> |-name = "indices", type = "IntegerType", nullable = TRUE
> |-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE
> {code}
> However, operating on the SparkR DataFrame throws an error:
> {code}
> > collect(mySparkDf)
> 17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
> (TID 1)
> java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
> java.lang.Double is not a valid external type for schema of array
> if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
> else validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
> org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0
> ... long stack trace ...
> {code}
> Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21727) Operating on an ArrayType in a SparkR DataFrame throws error

2017-09-01 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16150657#comment-16150657
 ] 

Yanbo Liang commented on SPARK-21727:
-

I can run this successfully with a minor change:
{code}
indices <- 1:4
myDf <- data.frame(indices)
myDf$data <- list(as.list(rep(0, 20)))
mySparkDf <- as.DataFrame(myDf)
collect(mySparkDf)
{code}
This is because rep(0, 20) is not of type list; we should convert it to a list 
explicitly.
{code}
> class(rep(0, 20))
[1] "numeric"
> class(as.list(rep(0, 20)))
[1] "list"
{code}

> Operating on an ArrayType in a SparkR DataFrame throws error
> 
>
> Key: SPARK-21727
> URL: https://issues.apache.org/jira/browse/SPARK-21727
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Neil McQuarrie
>
> Previously 
> [posted|https://stackoverflow.com/questions/45056973/sparkr-dataframe-with-r-lists-as-elements]
>  this as a stack overflow question but it seems to be a bug.
> If I have an R data.frame where one of the column data types is an integer 
> *list* -- i.e., each of the elements in the column embeds an entire R list of 
> integers -- then it seems I can convert this data.frame to a SparkR DataFrame 
> just fine... SparkR treats the column as ArrayType(Double). 
> However, any subsequent operation on this SparkR DataFrame appears to throw 
> an error.
> Create an example R data.frame:
> {code}
> indices <- 1:4
> myDf <- data.frame(indices)
> myDf$data <- list(rep(0, 20))
> {code}
> Examine it to make sure it looks okay:
> {code}
> > str(myDf) 
> 'data.frame':   4 obs. of  2 variables:  
>  $ indices: int  1 2 3 4  
>  $ data   :List of 4
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
>..$ : num  0 0 0 0 0 0 0 0 0 0 ...
> > head(myDf)   
>   indices   data 
> 1   1 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 2   2 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 3   3 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 
> 4   4 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
> {code}
> Convert it to a SparkR DataFrame:
> {code}
> library(SparkR, lib.loc=paste0(Sys.getenv("SPARK_HOME"),"/R/lib"))
> sparkR.session(master = "local[*]")
> mySparkDf <- as.DataFrame(myDf)
> {code}
> Examine the SparkR DataFrame schema; notice that the list column was 
> successfully converted to ArrayType:
> {code}
> > schema(mySparkDf)
> StructType
> |-name = "indices", type = "IntegerType", nullable = TRUE
> |-name = "data", type = "ArrayType(DoubleType,true)", nullable = TRUE
> {code}
> However, operating on the SparkR DataFrame throws an error:
> {code}
> > collect(mySparkDf)
> 17/07/13 17:23:00 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 
> (TID 1)
> java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: 
> java.lang.Double is not a valid external type for schema of array
> if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null 
> else validateexternaltype(getexternalrowfield(assertnotnull(input[0, 
> org.apache.spark.sql.Row, true]), 0, indices), IntegerType) AS indices#0
> ... long stack trace ...
> {code}
> Using Spark 2.2.0, R 3.4.0, Java 1.8.0_131, Windows 10.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21770) ProbabilisticClassificationModel: Improve normalization of all-zero raw predictions

2017-08-23 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16138680#comment-16138680
 ] 

Yanbo Liang commented on SPARK-21770:
-

[~srowen] Of course, we should understand what outputs [0, 0, 0]. If an all-zero 
vector is impossible, then there is nothing to do. 

> ProbabilisticClassificationModel: Improve normalization of all-zero raw 
> predictions
> ---
>
> Key: SPARK-21770
> URL: https://issues.apache.org/jira/browse/SPARK-21770
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Siddharth Murching
>Priority: Minor
>
> Given an n-element raw prediction vector of all-zeros, 
> ProbabilisticClassifierModel.normalizeToProbabilitiesInPlace() should output 
> a probability vector of all-equal 1/n entries



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21770) ProbabilisticClassificationModel: Improve normalization of all-zero raw predictions

2017-08-23 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16138211#comment-16138211
 ] 

Yanbo Liang commented on SPARK-21770:
-

I think we are talking about the same thing: changing the probability prediction from 
[0.0, 0.0, 0.0, 0.0] to [0.25, 0.25, 0.25, 0.25], right?

> ProbabilisticClassificationModel: Improve normalization of all-zero raw 
> predictions
> ---
>
> Key: SPARK-21770
> URL: https://issues.apache.org/jira/browse/SPARK-21770
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Siddharth Murching
>Priority: Minor
>
> Given an n-element raw prediction vector of all-zeros, 
> ProbabilisticClassifierModel.normalizeToProbabilitiesInPlace() should output 
> a probability vector of all-equal 1/n entries



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21770) ProbabilisticClassificationModel: Improve normalization of all-zero raw predictions

2017-08-23 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16138202#comment-16138202
 ] 

Yanbo Liang commented on SPARK-21770:
-

Yeah, it will confuse users if they get an all-zero probability prediction.

> ProbabilisticClassificationModel: Improve normalization of all-zero raw 
> predictions
> ---
>
> Key: SPARK-21770
> URL: https://issues.apache.org/jira/browse/SPARK-21770
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Siddharth Murching
>Priority: Minor
>
> Given an n-element raw prediction vector of all-zeros, 
> ProbabilisticClassifierModel.normalizeToProbabilitiesInPlace() should output 
> a probability vector of all-equal 1/n entries



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21770) ProbabilisticClassificationModel: Improve normalization of all-zero raw predictions

2017-08-23 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16138188#comment-16138188
 ] 

Yanbo Liang commented on SPARK-21770:
-

[~srowen] Which bug?

> ProbabilisticClassificationModel: Improve normalization of all-zero raw 
> predictions
> ---
>
> Key: SPARK-21770
> URL: https://issues.apache.org/jira/browse/SPARK-21770
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Siddharth Murching
>Priority: Minor
>
> Given an n-element raw prediction vector of all-zeros, 
> ProbabilisticClassifierModel.normalizeToProbabilitiesInPlace() should output 
> a probability vector of all-equal 1/n entries



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21770) ProbabilisticClassificationModel: Improve normalization of all-zero raw predictions

2017-08-22 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16137958#comment-16137958
 ] 

Yanbo Liang commented on SPARK-21770:
-

+1 for the uniform distribution. This will not change the prediction result, but 
it is reasonable to provide a valid probability distribution. 
Furthermore, the output of {{normalizeToProbabilitiesInPlace}} sums to 1 in the 
non-{{all-zero}} case, so it makes sense to keep the same behavior for the 
{{all-zero}} case. Thanks.
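A minimal sketch of that behavior as a standalone helper (illustrative only, not the 
actual {{ProbabilisticClassificationModel}} internals):
{code}
import org.apache.spark.ml.linalg.{DenseVector, Vector, Vectors}

// Normalize raw scores into probabilities; fall back to a uniform 1/n
// distribution when every raw score is zero, so the result always sums to 1.
def normalizeToProbabilities(raw: DenseVector): Vector = {
  val sum = raw.values.sum
  if (sum == 0.0) Vectors.dense(Array.fill(raw.size)(1.0 / raw.size))
  else Vectors.dense(raw.values.map(_ / sum))
}

normalizeToProbabilities(Vectors.dense(0.0, 0.0, 0.0, 0.0).toDense)
// -> [0.25, 0.25, 0.25, 0.25]
{code}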

> ProbabilisticClassificationModel: Improve normalization of all-zero raw 
> predictions
> ---
>
> Key: SPARK-21770
> URL: https://issues.apache.org/jira/browse/SPARK-21770
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Siddharth Murching
>Priority: Minor
>
> Given an n-element raw prediction vector of all-zeros, 
> ProbabilisticClassifierModel.normalizeToProbabilitiesInPlace() should output 
> a probability vector of all-equal 1/n entries



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21690) one-pass imputer

2017-08-17 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-21690:
---

Assignee: zhengruifeng

> one-pass imputer
> 
>
> Key: SPARK-21690
> URL: https://issues.apache.org/jira/browse/SPARK-21690
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>
> {code}
> val surrogates = $(inputCols).map { inputCol =>
>   val ic = col(inputCol)
>   val filtered = dataset.select(ic.cast(DoubleType))
> .filter(ic.isNotNull && ic =!= $(missingValue) && !ic.isNaN)
>   if(filtered.take(1).length == 0) {
> throw new SparkException(s"surrogate cannot be computed. " +
>   s"All the values in $inputCol are Null, Nan or 
> missingValue(${$(missingValue)})")
>   }
>   val surrogate = $(strategy) match {
> case Imputer.mean => filtered.select(avg(inputCol)).as[Double].first()
> case Imputer.median => filtered.stat.approxQuantile(inputCol, 
> Array(0.5), 0.001).head
>   }
>   surrogate
> }
> {code}
> The current impl of {{Imputer}} processes one column after another. Instead, we 
> should parallelize the processing in a more efficient way.
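> A rough sketch of the one-pass idea, using a single {{select}} with one conditional 
> aggregate per input column instead of one job per column (illustrative only; 
> {{inputCols}}, {{dataset}} and the mean strategy are assumed to be in scope):
> {code}
> import org.apache.spark.sql.functions._
> 
> val missingValue = -1.0  // placeholder for the configured missing-value marker
> // One aggregate per column, all computed in a single pass over the data.
> val aggs = inputCols.map { c =>
>   avg(when(col(c) =!= missingValue && !col(c).isNaN, col(c))).as(c)
> }
> val surrogates = dataset.select(aggs: _*).head()
> {code}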



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21690) one-pass imputer

2017-08-17 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-21690:

Shepherd: Yanbo Liang

> one-pass imputer
> 
>
> Key: SPARK-21690
> URL: https://issues.apache.org/jira/browse/SPARK-21690
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>
> {code}
> val surrogates = $(inputCols).map { inputCol =>
>   val ic = col(inputCol)
>   val filtered = dataset.select(ic.cast(DoubleType))
> .filter(ic.isNotNull && ic =!= $(missingValue) && !ic.isNaN)
>   if(filtered.take(1).length == 0) {
> throw new SparkException(s"surrogate cannot be computed. " +
>   s"All the values in $inputCol are Null, Nan or 
> missingValue(${$(missingValue)})")
>   }
>   val surrogate = $(strategy) match {
> case Imputer.mean => filtered.select(avg(inputCol)).as[Double].first()
> case Imputer.median => filtered.stat.approxQuantile(inputCol, 
> Array(0.5), 0.001).head
>   }
>   surrogate
> }
> {code}
> The current impl of {{Imputer}} processes one column after another. Instead, we 
> should parallelize the processing in a more efficient way.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21108) convert LinearSVC to aggregator framework

2017-08-15 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-21108:
---

Assignee: yuhao yang

> convert LinearSVC to aggregator framework
> -
>
> Key: SPARK-21108
> URL: https://issues.apache.org/jira/browse/SPARK-21108
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Assignee: yuhao yang
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-14516) Clustering evaluator

2017-08-15 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-14516:
---

Assignee: Marco Gaido

> Clustering evaluator
> 
>
> Key: SPARK-14516
> URL: https://issues.apache.org/jira/browse/SPARK-14516
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>Assignee: Marco Gaido
>
> MLlib does not have any general-purpose clustering metrics with a ground 
> truth.
> In 
> [Scikit-Learn](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics),
>  there are several kinds of metrics for this.
> It may be meaningful to add some clustering metrics into MLlib.
> This should be added as a {{ClusteringEvaluator}} class extending 
> {{Evaluator}} in spark.ml.
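> A brief sketch of how such an evaluator might be called, assuming it follows the 
> existing {{Evaluator}} contract and a silhouette-style metric (the class and setter 
> names below are assumptions of this sketch, since the evaluator is only proposed here):
> {code}
> import org.apache.spark.ml.evaluation.ClusteringEvaluator  // proposed class
> 
> // predictions: DataFrame with "features" and "prediction" columns from a clustering model
> val evaluator = new ClusteringEvaluator()
>   .setFeaturesCol("features")
>   .setPredictionCol("prediction")
> val silhouette = evaluator.evaluate(predictions)  // higher is better, in [-1, 1]
> {code}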



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14516) Clustering evaluator

2017-08-15 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-14516:

 Shepherd: Yanbo Liang
Affects Version/s: 2.2.0

> Clustering evaluator
> 
>
> Key: SPARK-14516
> URL: https://issues.apache.org/jira/browse/SPARK-14516
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>
> MLlib does not have any general-purpose clustering metrics with a ground 
> truth.
> In 
> [Scikit-Learn](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics),
>  there are several kinds of metrics for this.
> It may be meaningful to add some clustering metrics into MLlib.
> This should be added as a {{ClusteringEvaluator}} class extending 
> {{Evaluator}} in spark.ml.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21741) Python API for DataFrame-based multivariate summarizer

2017-08-15 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-21741:
---

 Summary: Python API for DataFrame-based multivariate summarizer
 Key: SPARK-21741
 URL: https://issues.apache.org/jira/browse/SPARK-21741
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 2.2.0
Reporter: Yanbo Liang


We added a multivariate summarizer for the DataFrame API in SPARK-19634; we should 
also support it in PySpark.
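For reference, a short sketch of the Scala/DataFrame API whose PySpark counterpart is 
requested here (method names follow the SPARK-19634 design and should be treated as 
assumptions of this sketch):
{code}
import org.apache.spark.ml.stat.Summarizer
import org.apache.spark.sql.functions.col

// df: DataFrame with a Vector column named "features"
val stats = df.select(
  Summarizer.metrics("mean", "variance").summary(col("features")).as("stats")
)
stats.select("stats.mean", "stats.variance").show()
{code}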



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21741) Python API for DataFrame-based multivariate summarizer

2017-08-15 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128223#comment-16128223
 ] 

Yanbo Liang commented on SPARK-21741:
-

[~WeichenXu123] Would you like to work on this? Thanks.

> Python API for DataFrame-based multivariate summarizer
> --
>
> Key: SPARK-21741
> URL: https://issues.apache.org/jira/browse/SPARK-21741
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>
> We added a multivariate summarizer for the DataFrame API in SPARK-19634; we 
> should also support it in PySpark.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19634) Feature parity for descriptive statistics in MLlib

2017-08-15 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-19634.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Feature parity for descriptive statistics in MLlib
> --
>
> Key: SPARK-19634
> URL: https://issues.apache.org/jira/browse/SPARK-19634
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>Assignee: Timothy Hunter
> Fix For: 2.3.0
>
>
> This ticket tracks porting the functionality of 
> spark.mllib.MultivariateOnlineSummarizer over to spark.ml.
> A design has been discussed in SPARK-19208. Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21481) Add indexOf method in ml.feature.HashingTF similar to mllib.feature.HashingTF

2017-08-15 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126914#comment-16126914
 ] 

Yanbo Liang commented on SPARK-21481:
-

See my comments at https://github.com/apache/spark/pull/18736 .

> Add indexOf method in ml.feature.HashingTF similar to mllib.feature.HashingTF
> -
>
> Key: SPARK-21481
> URL: https://issues.apache.org/jira/browse/SPARK-21481
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Aseem Bansal
>
> If we want to find the index of any input based on the hashing trick, it is 
> possible in 
> https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.mllib.feature.HashingTF
>  but not in 
> https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.ml.feature.HashingTF.
> We should allow that for feature parity.
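> For reference, a minimal sketch of the RDD-based API that already exposes this; the 
> ml-side equivalent is what this issue requests:
> {code}
> import org.apache.spark.mllib.feature.{HashingTF => OldHashingTF}
> 
> val tf = new OldHashingTF(numFeatures = 1 << 18)
> val idx = tf.indexOf("spark")  // bucket index produced by the hashing trick
> {code}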



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12664) Expose raw prediction scores in MultilayerPerceptronClassificationModel

2017-08-14 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16126743#comment-16126743
 ] 

Yanbo Liang commented on SPARK-12664:
-

[~josephkb] Please go ahead. Thanks for taking over shepherding. I really have 
lots of PRs waiting for my review.

> Expose raw prediction scores in MultilayerPerceptronClassificationModel
> ---
>
> Key: SPARK-12664
> URL: https://issues.apache.org/jira/browse/SPARK-12664
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Robert Dodier
>Assignee: Weichen Xu
>
> In 
> org.apache.spark.ml.classification.MultilayerPerceptronClassificationModel, 
> there isn't any way to get raw prediction scores; only an integer output 
> (from 0 to #classes - 1) is available via the `predict` method. 
> `mlpModel.predict` is called within the class to get the raw score, but 
> `mlpModel` is private so that isn't available to outside callers.
> The raw score is useful when the user wants to interpret the classifier 
> output as a probability. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21523) Fix bug of strong wolfe linesearch `init` parameter lose effectiveness

2017-08-08 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-21523.
-
   Resolution: Fixed
 Assignee: Weichen Xu
Fix Version/s: 2.3.0
   2.2.1

> Fix bug of strong wolfe linesearch `init` parameter lose effectiveness
> --
>
> Key: SPARK-21523
> URL: https://issues.apache.org/jira/browse/SPARK-21523
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.2.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Critical
> Fix For: 2.2.1, 2.3.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> We need to merge this breeze bugfix into Spark because it influences a series of 
> algorithms in MLlib which use LBFGS.
> https://github.com/scalanlp/breeze/pull/651



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21306) OneVsRest Conceals Columns That May Be Relevant To Underlying Classifier

2017-08-07 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-21306:

Fix Version/s: 2.1.2
   2.0.3

> OneVsRest Conceals Columns That May Be Relevant To Underlying Classifier
> 
>
> Key: SPARK-21306
> URL: https://issues.apache.org/jira/browse/SPARK-21306
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Cathal Garvey
>Assignee: Yan Facai (颜发才)
>Priority: Critical
>  Labels: classification, ml
> Fix For: 2.0.3, 2.1.2, 2.2.1, 2.3.0
>
>
> Hi folks, thanks for Spark! :)
> I've been learning to use `ml` and `mllib`, and I've encountered a block 
> while trying to use `ml.classification.OneVsRest` with 
> `ml.classification.LogisticRegression`. Basically, [here in the 
> code|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala#L320],
>  only two columns are being extracted and fed to the underlying classifiers; 
> however, with some configurations, more than two columns are required.
> Specifically: I want to do multiclass learning with Logistic Regression, on a 
> very imbalanced dataset. In my dataset, I have lots of imbalances, so I was 
> planning to use weights. I set a column, `"weight"`, as the inverse frequency 
> of each field, and I configured my `LogisticRegression` class to use this 
> column, then put it in a `OneVsRest` wrapper.
> However, `OneVsRest` strips all but two columns out of a dataset before 
> training, so I get an error from within `LogisticRegression` that it can't 
> find the `"weight"` column.
> It would be nice to have this fixed! I can see a few ways, but a very 
> conservative fix would be to include a parameter in `OneVsRest.fit` for 
> additional columns to `select` before passing to the underlying model.
> Thanks!
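> A minimal sketch of the configuration that triggers this (the column and dataset 
> names are just for illustration):
> {code}
> import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}
> 
> val lr = new LogisticRegression()
>   .setWeightCol("weight")  // per-row weights for the imbalanced classes
> val ovr = new OneVsRest().setClassifier(lr)
> 
> // Fails at fit time: OneVsRest only forwards the label and features columns,
> // so the inner LogisticRegression cannot find the "weight" column.
> val model = ovr.fit(trainingDf)
> {code}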



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21591) Implement treeAggregate on Dataset API

2017-08-02 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16110869#comment-16110869
 ] 

Yanbo Liang commented on SPARK-21591:
-

[~viirya] I agree there are lots of performance bottlenecks, such as the 
serialization/deserialization cost between {{UnsafeRow}} and JVM objects, reducing 
data copies between different formats where applicable, etc. There is discussion 
about the bottlenecks at SPARK-19634 and its corresponding PR. This JIRA is just 
used to track the {{treeAggregate}}-related issue, which only has a significant 
impact when we handle vectors of large dimension. Thanks.
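For context, a minimal sketch of the RDD-side primitive being discussed, summing large 
gradient-like vectors with a two-level tree (the {{vecs}} RDD and {{dim}} are assumed to 
be in scope):
{code}
// vecs: RDD[Array[Double]] of identically sized vectors, e.g. per-sample gradients
val summed = vecs.treeAggregate(new Array[Double](dim))(
  (acc, v) => { var i = 0; while (i < dim) { acc(i) += v(i); i += 1 }; acc },  // seqOp
  (a, b)   => { var i = 0; while (i < dim) { a(i) += b(i); i += 1 }; a },      // combOp
  depth = 2  // intermediate combine level reduces pressure on the driver
)
{code}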

> Implement treeAggregate on Dataset API
> --
>
> Key: SPARK-21591
> URL: https://issues.apache.org/jira/browse/SPARK-21591
> Project: Spark
>  Issue Type: Brainstorming
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>
> The Tungsten execution engine substantially improved the memory and CPU 
> efficiency of Spark applications. However, MLlib has still not migrated its 
> internal computing workload from {{RDD}} to {{DataFrame}}.
> There are lots of blocking issues for the migration; the lack of 
> {{treeAggregate}} on {{DataFrame}} is one of them. {{treeAggregate}} is very 
> important for MLlib algorithms, since they aggregate over {{Vector}}s which 
> may have millions of elements. As we all know, {{RDD}}-based {{treeAggregate}} 
> reduces the aggregation time by an order of magnitude for lots of MLlib 
> algorithms (https://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html).
> I opened this JIRA to discuss implementing {{treeAggregate}} on the {{DataFrame}} 
> API and to do the related performance benchmarking. I think other 
> scenarios besides MLlib will also benefit from this improvement if we get 
> it done.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-20601) Python API Changes for Constrained Logistic Regression Params

2017-08-02 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-20601.
-
  Resolution: Fixed
Assignee: Maciej Szymkiewicz
   Fix Version/s: 2.3.0
Target Version/s: 2.3.0

> Python API Changes for Constrained Logistic Regression Params
> -
>
> Key: SPARK-20601
> URL: https://issues.apache.org/jira/browse/SPARK-20601
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Bryan Cutler
>Assignee: Maciej Szymkiewicz
> Fix For: 2.3.0
>
>
> With the addition of SPARK-20047 for constrained logistic regression, there 
> are 4 new params not yet in the PySpark LogisticRegression:
> * lowerBoundsOnCoefficients
> * upperBoundsOnCoefficients
> * lowerBoundsOnIntercepts
> * upperBoundsOnIntercepts
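> For reference, a sketch of the corresponding Scala setters that these Python params 
> would mirror (the matrix/vector shapes below are illustrative for a binary problem 
> with two features):
> {code}
> import org.apache.spark.ml.classification.LogisticRegression
> import org.apache.spark.ml.linalg.{Matrices, Vectors}
> 
> val lr = new LogisticRegression()
>   .setLowerBoundsOnCoefficients(Matrices.dense(1, 2, Array(0.0, 0.0)))
>   .setUpperBoundsOnCoefficients(Matrices.dense(1, 2, Array(10.0, 10.0)))
>   .setLowerBoundsOnIntercepts(Vectors.dense(-1.0))
>   .setUpperBoundsOnIntercepts(Vectors.dense(1.0))
> {code}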



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21591) Implement treeAggregate on Dataset API

2017-08-01 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16110196#comment-16110196
 ] 

Yanbo Liang commented on SPARK-21591:
-

[~viirya] Yep, this is the way we are using it today, but we want to take 
advantage of the Tungsten execution engine. :)

> Implement treeAggregate on Dataset API
> --
>
> Key: SPARK-21591
> URL: https://issues.apache.org/jira/browse/SPARK-21591
> Project: Spark
>  Issue Type: Brainstorming
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>
> The Tungsten execution engine substantially improved the memory and CPU 
> efficiency of Spark applications. However, MLlib has still not migrated its 
> internal computing workload from {{RDD}} to {{DataFrame}}.
> There are lots of blocking issues for the migration; the lack of 
> {{treeAggregate}} on {{DataFrame}} is one of them. {{treeAggregate}} is very 
> important for MLlib algorithms, since they aggregate over {{Vector}}s which 
> may have millions of elements. As we all know, {{RDD}}-based {{treeAggregate}} 
> reduces the aggregation time by an order of magnitude for lots of MLlib 
> algorithms (https://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html).
> I opened this JIRA to discuss implementing {{treeAggregate}} on the {{DataFrame}} 
> API and to do the related performance benchmarking. I think other 
> scenarios besides MLlib will also benefit from this improvement if we get 
> it done.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21388) GBT inherit from HasStepSize & LInearSVC/Binarizer from HasThreshold

2017-08-01 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-21388.
-
   Resolution: Fixed
 Assignee: zhengruifeng
Fix Version/s: 2.3.0

> GBT inherit from HasStepSize & LInearSVC/Binarizer from HasThreshold
> 
>
> Key: SPARK-21388
> URL: https://issues.apache.org/jira/browse/SPARK-21388
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.1
>Reporter: zhengruifeng
>Assignee: zhengruifeng
> Fix For: 2.3.0
>
>
> make GBTs inherit from {{HasStepSize}} and LInearSVC/Binarizer inherit from 
> {{HasThreshold}}
> The desc for param {{StepSize}} of GBTs in Pydoc is wrong, so I also override 
> it in the python side.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14516) Clustering evaluator

2017-08-01 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-14516:

Priority: Major  (was: Minor)

> Clustering evaluator
> 
>
> Key: SPARK-14516
> URL: https://issues.apache.org/jira/browse/SPARK-14516
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: zhengruifeng
>
> MLlib does not have any general-purpose clustering metrics with a ground 
> truth.
> In 
> [Scikit-Learn](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics),
>  there are several kinds of metrics for this.
> It may be meaningful to add some clustering metrics into MLlib.
> This should be added as a {{ClusteringEvaluator}} class extending 
> {{Evaluator}} in spark.ml.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14516) Clustering evaluator

2017-08-01 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-14516:

Issue Type: New Feature  (was: Brainstorming)

> Clustering evaluator
> 
>
> Key: SPARK-14516
> URL: https://issues.apache.org/jira/browse/SPARK-14516
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: zhengruifeng
>Priority: Minor
>
> MLlib does not have any general-purpose clustering metrics with a ground 
> truth.
> In 
> [Scikit-Learn](http://scikit-learn.org/stable/modules/classes.html#clustering-metrics),
>  there are several kinds of metrics for this.
> It may be meaningful to add some clustering metrics into MLlib.
> This should be added as a {{ClusteringEvaluator}} class extending 
> {{Evaluator}} in spark.ml.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21591) Implement treeAggregate on Dataset API

2017-08-01 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-21591:

Description: 
The Tungsten execution engine substantially improved the efficiency of memory 
and CPU for Spark application. However, in MLlib we still not migrate the 
internal computing workload from {{RDD}} to {{DataFrame}}.
There are lots of blocking issues for the migration, lack of {{treeAggregate}} 
on {{DataFrame}} is one of them. {{treeAggregate}} is very important for MLlib 
algorithms, since they do aggregate on {{Vector}} which may has millions of 
elements. As we all know, {{RDD}} based {{treeAggregate}} reduces the 
aggregation time by an order of magnitude for  lots of MLlib 
algorithms(https://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html).
I open this JIRA to discuss to implement {{treeAggregate}} on {{DataFrame}} API 
and do the performance benchmark related issues. And I think other scenarios 
except for MLlib will also benefit from this improvement if we get it done.

  was:
The Tungsten execution engine substantially improved the efficiency of memory 
and CPU for Spark application. However, in MLlib we still not migrate the 
internal computing workload from {{RDD}} to {{DataFrame}}.
There are lots of blocking issues, lack of {{treeAggregate}} on {{DataFrame}} 
is one of them. {{treeAggregate}} is very important for MLlib algorithms, since 
they do aggregate on {{Vector}} which may has millions of elements. As we all 
know, {{RDD}} based {{treeAggregate}} reduces the aggregation time by an order 
of magnitude for  lots of MLlib 
algorithms(https://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html).
I open this JIRA to discuss to implement {{treeAggregate}} on {{DataFrame}} API 
and do the performance benchmark related issues. And I think other scenarios 
except for MLlib will also benefit from this improvement if we get it done.


> Implement treeAggregate on Dataset API
> --
>
> Key: SPARK-21591
> URL: https://issues.apache.org/jira/browse/SPARK-21591
> Project: Spark
>  Issue Type: Brainstorming
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>
> The Tungsten execution engine substantially improved the efficiency of memory 
> and CPU for Spark application. However, in MLlib we still not migrate the 
> internal computing workload from {{RDD}} to {{DataFrame}}.
> There are lots of blocking issues for the migration, lack of 
> {{treeAggregate}} on {{DataFrame}} is one of them. {{treeAggregate}} is very 
> important for MLlib algorithms, since they do aggregate on {{Vector}} which 
> may has millions of elements. As we all know, {{RDD}} based {{treeAggregate}} 
> reduces the aggregation time by an order of magnitude for  lots of MLlib 
> algorithms(https://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html).
> I open this JIRA to discuss to implement {{treeAggregate}} on {{DataFrame}} 
> API and do the performance benchmark related issues. And I think other 
> scenarios except for MLlib will also benefit from this improvement if we get 
> it done.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21591) Implement treeAggregate on Dataset API

2017-08-01 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-21591:

Description: 
The Tungsten execution engine substantially improved the efficiency of memory 
and CPU for Spark application. However, in MLlib we still not migrate the 
internal computing workload from {{RDD}} to {{DataFrame}}.
There are lots of blocking issues, lack of {{treeAggregate}} on {{DataFrame}} 
is one of them. {{treeAggregate}} is very important for MLlib algorithms, since 
they do aggregate on {{Vector}} which may has millions of elements. As we all 
know, {{RDD}} based {{treeAggregate}} reduces the aggregation time by an order 
of magnitude for  lots of MLlib 
algorithms(https://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html).
I open this JIRA to discuss to implement {{treeAggregate}} on {{DataFrame}} API 
and do the performance benchmark related issues. And I think other scenarios 
except for MLlib will also benefit from this improvement if we get it done.

  was:
The Tungsten execution engine substantially improved the efficiency of memory 
and CPU for Spark application. However, in MLlib we still not migrate the 
internal computing workload from {{RDD}} to {{DataFrame}}.
There are lots of blocking issues, lack of {{treeAggregate}} on {{DataFrame}} 
is one of them. It's very important for MLlib algorithms, since they do 
aggregate on {{Vector}} which may has millions of elements. As we all know, 
{{RDD}} based {{treeAggregate}} reduces the aggregation time by an order of 
magnitude for  lots of MLlib 
algorithms(https://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html).
I open this JIRA to discuss to implement {{treeAggregate}} on {{DataFrame}} API 
and do the performance benchmark related issues. And I think other scenarios 
except for MLlib will also benefit from this improvement if we get it done.


> Implement treeAggregate on Dataset API
> --
>
> Key: SPARK-21591
> URL: https://issues.apache.org/jira/browse/SPARK-21591
> Project: Spark
>  Issue Type: Brainstorming
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>
> The Tungsten execution engine substantially improved the efficiency of memory 
> and CPU for Spark application. However, in MLlib we still not migrate the 
> internal computing workload from {{RDD}} to {{DataFrame}}.
> There are lots of blocking issues, lack of {{treeAggregate}} on {{DataFrame}} 
> is one of them. {{treeAggregate}} is very important for MLlib algorithms, 
> since they do aggregate on {{Vector}} which may has millions of elements. As 
> we all know, {{RDD}} based {{treeAggregate}} reduces the aggregation time by 
> an order of magnitude for  lots of MLlib 
> algorithms(https://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html).
> I open this JIRA to discuss to implement {{treeAggregate}} on {{DataFrame}} 
> API and do the performance benchmark related issues. And I think other 
> scenarios except for MLlib will also benefit from this improvement if we get 
> it done.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21591) Implement treeAggregate on Dataset API

2017-08-01 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-21591:

Description: 
The Tungsten execution engine substantially improved the efficiency of memory 
and CPU for Spark application. However, in MLlib we still not migrate the 
internal computing workload from {{RDD}} to {{DataFrame}}.
There are lots of blocking issues, lack of {{treeAggregate}} on {{DataFrame}} 
is one of them. It's very important for MLlib algorithms, since they do 
aggregate on {{Vector}} which may has millions of elements. As we all know, 
{{RDD}} based {{treeAggregate}} reduces the aggregation time by an order of 
magnitude for  lots of MLlib 
algorithms(https://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html).
I open this JIRA to discuss to implement {{treeAggregate}} on {{DataFrame}} API 
and do the performance benchmark related issues. And I think other scenarios 
except for MLlib will also benefit from this improvement if we get it done.

  was:
The Tungsten execution engine substantially improved the efficiency of memory 
and CPU for Spark application. However, in MLlib we still not migrate the 
internal computing workload from {{RDD}} to {{DataFrame}}.
One of the block issue is there is no {{treeAggregate}} on {{DataFrame}}. It's 
very important for MLlib algorithms, since they do aggregate on {{Vector}} 
which may has millions of elements. As we all know, {{RDD}} based 
{{treeAggregate}} reduces the aggregation time by an order of magnitude for  
lots of MLlib 
algorithms(https://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html).
I open this JIRA to discuss to implement {{treeAggregate}} on {{DataFrame}} API 
and do the performance benchmark related issues. And I think other scenarios 
except for MLlib will also benefit from this improvement if we get it done.


> Implement treeAggregate on Dataset API
> --
>
> Key: SPARK-21591
> URL: https://issues.apache.org/jira/browse/SPARK-21591
> Project: Spark
>  Issue Type: Brainstorming
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>
> The Tungsten execution engine substantially improved the efficiency of memory 
> and CPU for Spark application. However, in MLlib we still not migrate the 
> internal computing workload from {{RDD}} to {{DataFrame}}.
> There are lots of blocking issues, lack of {{treeAggregate}} on {{DataFrame}} 
> is one of them. It's very important for MLlib algorithms, since they do 
> aggregate on {{Vector}} which may has millions of elements. As we all know, 
> {{RDD}} based {{treeAggregate}} reduces the aggregation time by an order of 
> magnitude for  lots of MLlib 
> algorithms(https://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html).
> I open this JIRA to discuss to implement {{treeAggregate}} on {{DataFrame}} 
> API and do the performance benchmark related issues. And I think other 
> scenarios except for MLlib will also benefit from this improvement if we get 
> it done.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21591) Implement treeAggregate on Dataset API

2017-08-01 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-21591:

Description: 
The Tungsten execution engine substantially improved the efficiency of memory 
and CPU for Spark application. However, in MLlib we still not migrate the 
internal computing workload from {{RDD}} to {{DataFrame}}.
One of the block issue is there is no {{treeAggregate}} on {{DataFrame}}. It's 
very important for MLlib algorithms, since they do aggregate on {{Vector}} 
which may has millions of elements. As we all know, {{RDD}} based 
{{treeAggregate}} reduces the aggregation time by an order of magnitude for  
lots of MLlib 
algorithms(https://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html).
I open this JIRA to discuss to implement {{treeAggregate}} on {{DataFrame}} API 
and do the performance benchmark related issues. And I think other scenarios 
except for MLlib will also benefit from this improvement if we get it done.

  was:
The Tungsten execution engine substantially improved the efficiency of memory 
and CPU for Spark application. However, in MLlib we still not migrate the 
internal computing workload from {{RDD}} to {{DataFrame}}.
The main block issue is there is no {{treeAggregate}} on {{DataFrame}}. It's 
very important for MLlib algorithms, since they do aggregate on {{Vector}} 
which may has millions of elements. As we all know, {{RDD}} based 
{{treeAggregate}} reduces the aggregation time by an order of magnitude for  
lots of MLlib 
algorithms(https://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html).
I open this JIRA to discuss to implement {{treeAggregate}} on {{DataFrame}} API 
and do the performance benchmark related issues. And I think other scenarios 
except for MLlib will also benefit from this improvement if we get it done.


> Implement treeAggregate on Dataset API
> --
>
> Key: SPARK-21591
> URL: https://issues.apache.org/jira/browse/SPARK-21591
> Project: Spark
>  Issue Type: Brainstorming
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>
> The Tungsten execution engine substantially improved the efficiency of memory 
> and CPU for Spark application. However, in MLlib we still not migrate the 
> internal computing workload from {{RDD}} to {{DataFrame}}.
> One of the block issue is there is no {{treeAggregate}} on {{DataFrame}}. 
> It's very important for MLlib algorithms, since they do aggregate on 
> {{Vector}} which may has millions of elements. As we all know, {{RDD}} based 
> {{treeAggregate}} reduces the aggregation time by an order of magnitude for  
> lots of MLlib 
> algorithms(https://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html).
> I open this JIRA to discuss to implement {{treeAggregate}} on {{DataFrame}} 
> API and do the performance benchmark related issues. And I think other 
> scenarios except for MLlib will also benefit from this improvement if we get 
> it done.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21591) Implement treeAggregate on Dataset API

2017-08-01 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-21591:

Description: 
The Tungsten execution engine substantially improved the efficiency of memory 
and CPU for Spark application. However, in MLlib we still not migrate the 
internal computing workload from {{RDD}} to {{DataFrame}}.
The main block issue is there is no {{treeAggregate}} on {{DataFrame}}. It's 
very important for MLlib algorithms, since they do aggregate on vector who may 
has millions of elements. As we all know, {{RDD}} based {{treeAggregate}} 
reduces the aggregation time by an order of magnitude for  lots of MLlib 
algorithms(https://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html).
I open this JIRA to discuss to implement {{treeAggregate}} on {{DataFrame}} API 
and do the performance benchmark related issues. And I think other scenarios 
except MLlib will also benefit from this improvement if we get it done.

  was:
The Tungsten execution engine substantially improved the efficiency of memory 
and CPU for Spark application. However, in MLlib we still not migrate the 
internal computing workload from {{RDD}} to {{DataFrame}}.
The main block issue is there is no {{treeAggregate}} on {{DataFrame}}. As we 
all know, {{RDD}} based {{treeAggregate}} reduces the aggregation time by an 
order of magnitude for  lots of MLlib 
algorithms(https://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html).
I open this JIRA to discuss to implement {{treeAggregate}} on {{DataFrame}} API 
and do the performance benchmark related issues. And I think other scenarios 
except MLlib will also benefit from this improvement if we get it done.


> Implement treeAggregate on Dataset API
> --
>
> Key: SPARK-21591
> URL: https://issues.apache.org/jira/browse/SPARK-21591
> Project: Spark
>  Issue Type: Brainstorming
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>
> The Tungsten execution engine substantially improved the efficiency of memory 
> and CPU for Spark application. However, in MLlib we still not migrate the 
> internal computing workload from {{RDD}} to {{DataFrame}}.
> The main block issue is there is no {{treeAggregate}} on {{DataFrame}}. It's 
> very important for MLlib algorithms, since they do aggregate on vector who 
> may has millions of elements. As we all know, {{RDD}} based {{treeAggregate}} 
> reduces the aggregation time by an order of magnitude for  lots of MLlib 
> algorithms(https://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html).
> I open this JIRA to discuss to implement {{treeAggregate}} on {{DataFrame}} 
> API and do the performance benchmark related issues. And I think other 
> scenarios except MLlib will also benefit from this improvement if we get it 
> done.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21591) Implement treeAggregate on Dataset API

2017-08-01 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-21591:

Description: 
The Tungsten execution engine substantially improved the efficiency of memory 
and CPU for Spark application. However, in MLlib we still not migrate the 
internal computing workload from {{RDD}} to {{DataFrame}}.
The main block issue is there is no {{treeAggregate}} on {{DataFrame}}. It's 
very important for MLlib algorithms, since they do aggregate on {{Vector}} 
which may has millions of elements. As we all know, {{RDD}} based 
{{treeAggregate}} reduces the aggregation time by an order of magnitude for  
lots of MLlib 
algorithms(https://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html).
I open this JIRA to discuss to implement {{treeAggregate}} on {{DataFrame}} API 
and do the performance benchmark related issues. And I think other scenarios 
except for MLlib will also benefit from this improvement if we get it done.

  was:
The Tungsten execution engine substantially improved the efficiency of memory 
and CPU for Spark application. However, in MLlib we still not migrate the 
internal computing workload from {{RDD}} to {{DataFrame}}.
The main block issue is there is no {{treeAggregate}} on {{DataFrame}}. It's 
very important for MLlib algorithms, since they do aggregate on {{Vector}} 
which may has millions of elements. As we all know, {{RDD}} based 
{{treeAggregate}} reduces the aggregation time by an order of magnitude for  
lots of MLlib 
algorithms(https://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html).
I open this JIRA to discuss to implement {{treeAggregate}} on {{DataFrame}} API 
and do the performance benchmark related issues. And I think other scenarios 
except MLlib will also benefit from this improvement if we get it done.


> Implement treeAggregate on Dataset API
> --
>
> Key: SPARK-21591
> URL: https://issues.apache.org/jira/browse/SPARK-21591
> Project: Spark
>  Issue Type: Brainstorming
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>
> The Tungsten execution engine substantially improved the memory and CPU 
> efficiency of Spark applications. However, in MLlib we still have not migrated 
> the internal computing workload from {{RDD}} to {{DataFrame}}.
> The main blocking issue is that there is no {{treeAggregate}} on {{DataFrame}}. 
> It's very important for MLlib algorithms, since they aggregate over {{Vector}}s 
> that may have millions of elements. As we all know, {{RDD}}-based 
> {{treeAggregate}} reduces the aggregation time by an order of magnitude for 
> lots of MLlib algorithms 
> (https://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html).
> I open this JIRA to discuss implementing {{treeAggregate}} on the {{DataFrame}} 
> API and doing the related performance benchmarks. I think other scenarios 
> besides MLlib will also benefit from this improvement if we get it done.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21591) Implement treeAggregate on Dataset API

2017-08-01 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-21591:

Description: 
The Tungsten execution engine substantially improved the memory and CPU 
efficiency of Spark applications. However, in MLlib we still have not migrated the 
internal computing workload from {{RDD}} to {{DataFrame}}.
The main blocking issue is that there is no {{treeAggregate}} on {{DataFrame}}. It's 
very important for MLlib algorithms, since they aggregate over {{Vector}}s that may 
have millions of elements. As we all know, {{RDD}}-based {{treeAggregate}} reduces 
the aggregation time by an order of magnitude for lots of MLlib algorithms 
(https://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html).
I open this JIRA to discuss implementing {{treeAggregate}} on the {{DataFrame}} API 
and doing the related performance benchmarks. I think other scenarios besides MLlib 
will also benefit from this improvement if we get it done.

  was:
The Tungsten execution engine substantially improved the efficiency of memory 
and CPU for Spark application. However, in MLlib we still not migrate the 
internal computing workload from {{RDD}} to {{DataFrame}}.
The main block issue is there is no {{treeAggregate}} on {{DataFrame}}. It's 
very important for MLlib algorithms, since they do aggregate on vector who may 
has millions of elements. As we all know, {{RDD}} based {{treeAggregate}} 
reduces the aggregation time by an order of magnitude for  lots of MLlib 
algorithms(https://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html).
I open this JIRA to discuss to implement {{treeAggregate}} on {{DataFrame}} API 
and do the performance benchmark related issues. And I think other scenarios 
except MLlib will also benefit from this improvement if we get it done.


> Implement treeAggregate on Dataset API
> --
>
> Key: SPARK-21591
> URL: https://issues.apache.org/jira/browse/SPARK-21591
> Project: Spark
>  Issue Type: Brainstorming
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>
> The Tungsten execution engine substantially improved the memory and CPU 
> efficiency of Spark applications. However, in MLlib we still have not migrated 
> the internal computing workload from {{RDD}} to {{DataFrame}}.
> The main blocking issue is that there is no {{treeAggregate}} on {{DataFrame}}. 
> It's very important for MLlib algorithms, since they aggregate over {{Vector}}s 
> that may have millions of elements. As we all know, {{RDD}}-based 
> {{treeAggregate}} reduces the aggregation time by an order of magnitude for 
> lots of MLlib algorithms 
> (https://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html).
> I open this JIRA to discuss implementing {{treeAggregate}} on the {{DataFrame}} 
> API and doing the related performance benchmarks. I think other scenarios 
> besides MLlib will also benefit from this improvement if we get it done.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-21591) Implement treeAggregate on Dataset API

2017-07-31 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16108476#comment-16108476
 ] 

Yanbo Liang edited comment on SPARK-21591 at 8/1/17 6:50 AM:
-

cc [~cloud_fan] [~WeichenXu123]


was (Author: yanboliang):
cc [~cloud_fan]

> Implement treeAggregate on Dataset API
> --
>
> Key: SPARK-21591
> URL: https://issues.apache.org/jira/browse/SPARK-21591
> Project: Spark
>  Issue Type: Brainstorming
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>
> The Tungsten execution engine substantially improved the memory and CPU 
> efficiency of Spark applications. However, in MLlib we still have not migrated 
> the internal computing workload from {{RDD}} to {{DataFrame}}.
> The main blocking issue is that there is no {{treeAggregate}} on {{DataFrame}}. 
> As we all know, {{RDD}}-based {{treeAggregate}} reduces the aggregation time by 
> an order of magnitude for lots of MLlib algorithms 
> (https://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html).
> I open this JIRA to discuss implementing {{treeAggregate}} on the {{DataFrame}} 
> API and doing the related performance benchmarks. I think other scenarios 
> besides MLlib will also benefit from this improvement if we get it done.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21591) Implement treeAggregate on Dataset API

2017-07-31 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-21591:

Description: 
The Tungsten execution engine substantially improved the memory and CPU 
efficiency of Spark applications. However, in MLlib we still have not migrated the 
internal computing workload from {{RDD}} to {{DataFrame}}.
The main blocking issue is that there is no {{treeAggregate}} on {{DataFrame}}. As we 
all know, {{RDD}}-based {{treeAggregate}} reduces the aggregation time by an 
order of magnitude for lots of MLlib algorithms 
(https://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html).
I open this JIRA to discuss implementing {{treeAggregate}} on the {{DataFrame}} API 
and doing the related performance benchmarks. I think other scenarios besides MLlib 
will also benefit from this improvement if we get it done.

  was:
The Tungsten execution engine substantially improved the efficiency of memory 
and CPU for Spark application. However, in MLlib we still not migrate the 
internal computing workload from {{RDD}} to {{DataFrame}}.
The main block issue is there is no {{treeAggregate}} on {{DataFrame}}. As we 
all know, {{RDD}} based {{treeAggregate}} reduces the aggregation time by an 
order of magnitude for  lots of MLlib 
algorithms(https://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html).
I open this JIRA to discuss to implement {{treeAggregate}} on {{DataFrame}} API 
and do the performance benchmark related issues.


> Implement treeAggregate on Dataset API
> --
>
> Key: SPARK-21591
> URL: https://issues.apache.org/jira/browse/SPARK-21591
> Project: Spark
>  Issue Type: Brainstorming
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>
> The Tungsten execution engine substantially improved the memory and CPU 
> efficiency of Spark applications. However, in MLlib we still have not migrated 
> the internal computing workload from {{RDD}} to {{DataFrame}}.
> The main blocking issue is that there is no {{treeAggregate}} on {{DataFrame}}. 
> As we all know, {{RDD}}-based {{treeAggregate}} reduces the aggregation time by 
> an order of magnitude for lots of MLlib algorithms 
> (https://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html).
> I open this JIRA to discuss implementing {{treeAggregate}} on the {{DataFrame}} 
> API and doing the related performance benchmarks. I think other scenarios 
> besides MLlib will also benefit from this improvement if we get it done.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21591) Implement treeAggregate on Dataset API

2017-07-31 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21591?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-21591:

Issue Type: Brainstorming  (was: New Feature)

> Implement treeAggregate on Dataset API
> --
>
> Key: SPARK-21591
> URL: https://issues.apache.org/jira/browse/SPARK-21591
> Project: Spark
>  Issue Type: Brainstorming
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>
> The Tungsten execution engine substantially improved the memory and CPU 
> efficiency of Spark applications. However, in MLlib we still have not migrated 
> the internal computing workload from {{RDD}} to {{DataFrame}}.
> The main blocking issue is that there is no {{treeAggregate}} on {{DataFrame}}. 
> As we all know, {{RDD}}-based {{treeAggregate}} reduces the aggregation time by 
> an order of magnitude for lots of MLlib algorithms 
> (https://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html).
> I open this JIRA to discuss implementing {{treeAggregate}} on the {{DataFrame}} 
> API and doing the related performance benchmarks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21591) Implement treeAggregate on Dataset API

2017-07-31 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16108476#comment-16108476
 ] 

Yanbo Liang commented on SPARK-21591:
-

cc [~cloud_fan]

> Implement treeAggregate on Dataset API
> --
>
> Key: SPARK-21591
> URL: https://issues.apache.org/jira/browse/SPARK-21591
> Project: Spark
>  Issue Type: Brainstorming
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>
> The Tungsten execution engine substantially improved the memory and CPU 
> efficiency of Spark applications. However, in MLlib we still have not migrated 
> the internal computing workload from {{RDD}} to {{DataFrame}}.
> The main blocking issue is that there is no {{treeAggregate}} on {{DataFrame}}. 
> As we all know, {{RDD}}-based {{treeAggregate}} reduces the aggregation time by 
> an order of magnitude for lots of MLlib algorithms 
> (https://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html).
> I open this JIRA to discuss implementing {{treeAggregate}} on the {{DataFrame}} 
> API and doing the related performance benchmarks.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21591) Implement treeAggregate on Dataset API

2017-07-31 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-21591:
---

 Summary: Implement treeAggregate on Dataset API
 Key: SPARK-21591
 URL: https://issues.apache.org/jira/browse/SPARK-21591
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.2.0
Reporter: Yanbo Liang


The Tungsten execution engine substantially improved the memory and CPU 
efficiency of Spark applications. However, in MLlib we still have not migrated the 
internal computing workload from {{RDD}} to {{DataFrame}}.
The main blocking issue is that there is no {{treeAggregate}} on {{DataFrame}}. As we 
all know, {{RDD}}-based {{treeAggregate}} reduces the aggregation time by an 
order of magnitude for lots of MLlib algorithms 
(https://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html).
I open this JIRA to discuss implementing {{treeAggregate}} on the {{DataFrame}} API 
and doing the related performance benchmarks.
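Until a native {{treeAggregate}} exists on {{DataFrame}}/{{Dataset}}, the tree pattern can be emulated by hand. Below is only a rough sketch of that idea (the element type, fan-in and driver-side merge are illustrative), not a proposed implementation:

{code}
import org.apache.spark.sql.Dataset

// Sketch: emulate a two-level "tree" reduction over a Dataset[Array[Double]]
// by summing within partitions, shuffling the partial sums into a few
// partitions, summing again, and merging only a handful of arrays on the driver.
object TreeAggregateOnDatasetSketch {
  private def sumVectors(dim: Int)(it: Iterator[Array[Double]]): Iterator[Array[Double]] = {
    val acc = new Array[Double](dim)
    it.foreach { v => var j = 0; while (j < dim) { acc(j) += v(j); j += 1 } }
    Iterator.single(acc)
  }

  def treeSum(ds: Dataset[Array[Double]], dim: Int, fanIn: Int): Array[Double] = {
    import ds.sparkSession.implicits._
    val partials = ds.mapPartitions(sumVectors(dim))   // level 1: per-partition sums
    val merged = partials.repartition(fanIn)            // shuffle a few small rows
      .mapPartitions(sumVectors(dim))                   // level 2: combine partials
    merged.collect().foldLeft(new Array[Double](dim)) { (a, b) =>
      var j = 0; while (j < dim) { a(j) += b(j); j += 1 }; a  // final small merge on the driver
    }
  }
}
{code}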



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21575) Eliminate needless synchronization in java-R serialization

2017-07-30 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-21575.
-
   Resolution: Fixed
 Assignee: Iurii Antykhovych
Fix Version/s: 2.3.0

> Eliminate needless synchronization in java-R serialization
> --
>
> Key: SPARK-21575
> URL: https://issues.apache.org/jira/browse/SPARK-21575
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.2.0
>Reporter: Iurii Antykhovych
>Assignee: Iurii Antykhovych
>Priority: Trivial
> Fix For: 2.3.0
>
>
> As long as {{org.apache.spark.api.r.JVMObjectTracker}} is backed by 
> {{ConcurrentHashMap}}, synchronized blocks in {{get(..)}} and {{remove(..)}} 
> methods can be safely removed.
> This would eliminate lock contention in {{org.apache.spark.api.r.SerDe}}
>  and {{org.apache.spark.api.r.RBackendHandler}}.
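For illustration only (this is not the actual Spark source), a tracker of this shape shows why the extra locking is unnecessary once the backing store is a {{ConcurrentHashMap}}:

{code}
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicInteger

// Illustrative sketch, not Spark's JVMObjectTracker: with a ConcurrentHashMap
// backing store, get/remove are already thread-safe and need no `synchronized`.
class ObjectTrackerSketch {
  private val objMap = new ConcurrentHashMap[String, Object]()
  private val counter = new AtomicInteger(0)

  def put(obj: Object): String = {
    val id = counter.getAndIncrement().toString
    objMap.put(id, obj)
    id
  }

  def get(id: String): Option[Object] = Option(objMap.get(id))       // no lock needed
  def remove(id: String): Option[Object] = Option(objMap.remove(id)) // atomic on the map
}
{code}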



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21306) OneVsRest Conceals Columns That May Be Relevant To Underlying Classifier

2017-07-28 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-21306:

Fix Version/s: (was: 2.1.2)
   (was: 2.0.3)

> OneVsRest Conceals Columns That May Be Relevant To Underlying Classifier
> 
>
> Key: SPARK-21306
> URL: https://issues.apache.org/jira/browse/SPARK-21306
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Cathal Garvey
>Assignee: Yan Facai (颜发才)
>Priority: Critical
>  Labels: classification, ml
> Fix For: 2.2.1, 2.3.0
>
>
> Hi folks, thanks for Spark! :)
> I've been learning to use `ml` and `mllib`, and I've hit a blocker while 
> trying to use `ml.classification.OneVsRest` with 
> `ml.classification.LogisticRegression`. Basically, [here in the 
> code|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala#L320],
>  only two columns are extracted and fed to the underlying classifiers; 
> however, with some configurations more than two columns are required.
> Specifically: I want to do multiclass learning with Logistic Regression, on a 
> very imbalanced dataset. In my dataset, I have lots of imbalances, so I was 
> planning to use weights. I set a column, `"weight"`, as the inverse frequency 
> of each field, and I configured my `LogisticRegression` class to use this 
> column, then put it in a `OneVsRest` wrapper.
> However, `OneVsRest` strips all but two columns out of a dataset before 
> training, so I get an error from within `LogisticRegression` that it can't 
> find the `"weight"` column.
> It would be nice to have this fixed! I can see a few ways, but a very 
> conservative fix would be to include a parameter in `OneVsRest.fit` for 
> additional columns to `select` before passing to the underlying model.
> Thanks!
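A sketch of the reported scenario (column names and the data path are illustrative). In the affected versions, {{OneVsRest}} selected only the label and features columns before fitting each binary classifier, so the configured weight column was no longer visible to {{LogisticRegression}}:

{code}
import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}
import org.apache.spark.sql.SparkSession

object OneVsRestWeightSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ovr-weight-sketch").getOrCreate()

    // Hypothetical training data with columns: label, features, weight.
    val train = spark.read.parquet("/path/to/train")

    val lr = new LogisticRegression()
      .setWeightCol("weight")        // the column the wrapper used to drop

    val ovr = new OneVsRest().setClassifier(lr)
    val model = ovr.fit(train)       // before the fix, LogisticRegression could not find "weight"
    model.transform(train).show(5)
    spark.stop()
  }
}
{code}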



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21306) OneVsRest Conceals Columns That May Be Relevant To Underlying Classifier

2017-07-27 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-21306.
-
   Resolution: Fixed
Fix Version/s: 2.3.0
   2.2.1
   2.1.2
   2.0.3

> OneVsRest Conceals Columns That May Be Relevant To Underlying Classifier
> 
>
> Key: SPARK-21306
> URL: https://issues.apache.org/jira/browse/SPARK-21306
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Cathal Garvey
>Assignee: Yan Facai (颜发才)
>Priority: Critical
>  Labels: classification, ml
> Fix For: 2.0.3, 2.1.2, 2.2.1, 2.3.0
>
>
> Hi folks, thanks for Spark! :)
> I've been learning to use `ml` and `mllib`, and I've hit a blocker while 
> trying to use `ml.classification.OneVsRest` with 
> `ml.classification.LogisticRegression`. Basically, [here in the 
> code|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala#L320],
>  only two columns are extracted and fed to the underlying classifiers; 
> however, with some configurations more than two columns are required.
> Specifically: I want to do multiclass learning with Logistic Regression, on a 
> very imbalanced dataset. In my dataset, I have lots of imbalances, so I was 
> planning to use weights. I set a column, `"weight"`, as the inverse frequency 
> of each field, and I configured my `LogisticRegression` class to use this 
> column, then put it in a `OneVsRest` wrapper.
> However, `OneVsRest` strips all but two columns out of a dataset before 
> training, so I get an error from within `LogisticRegression` that it can't 
> find the `"weight"` column.
> It would be nice to have this fixed! I can see a few ways, but a very 
> conservative fix would be to include a parameter in `OneVsRest.fit` for 
> additional columns to `select` before passing to the underlying model.
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-19270) Add summary table to GLM summary

2017-07-27 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-19270.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Add summary table to GLM summary
> 
>
> Key: SPARK-19270
> URL: https://issues.apache.org/jira/browse/SPARK-19270
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Wayne Zhang
>Assignee: Wayne Zhang
>Priority: Minor
> Fix For: 2.3.0
>
>
> Add an R-like summary table to the GLM summary, which includes feature names (if 
> they exist), parameter estimates, standard errors, t-statistics and p-values. This 
> allows Scala users to easily gather these commonly used inference results.
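A minimal sketch of reading those quantities from a fitted GLM; the sample data path is the one shipped in Spark's {{data/}} directory and is only illustrative:

{code}
import org.apache.spark.ml.regression.GeneralizedLinearRegression
import org.apache.spark.sql.SparkSession

object GlmSummarySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("glm-summary-sketch").getOrCreate()
    val df = spark.read.format("libsvm").load("data/mllib/sample_linear_regression_data.txt")

    val model = new GeneralizedLinearRegression()
      .setFamily("gaussian")
      .setLink("identity")
      .fit(df)

    // The building blocks of an R-like summary table: estimates plus their
    // standard errors, t-statistics and p-values.
    val s = model.summary
    println("coefficients:    " + model.coefficients.toArray.mkString(", "))
    println("standard errors: " + s.coefficientStandardErrors.mkString(", "))
    println("t-values:        " + s.tValues.mkString(", "))
    println("p-values:        " + s.pValues.mkString(", "))
    spark.stop()
  }
}
{code}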



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20307) SparkR: pass on setHandleInvalid to spark.mllib functions that use StringIndexer

2017-07-17 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16090792#comment-16090792
 ] 

Yanbo Liang commented on SPARK-20307:
-

[~wangmiao1981] I think [~podongfeng] already added it in 
https://github.com/apache/spark/pull/18582 . Thanks.

> SparkR: pass on setHandleInvalid to spark.mllib functions that use 
> StringIndexer
> 
>
> Key: SPARK-20307
> URL: https://issues.apache.org/jira/browse/SPARK-20307
> Project: Spark
>  Issue Type: Improvement
>  Components: SparkR
>Affects Versions: 2.1.0
>Reporter: Anne Rutten
>Assignee: Miao Wang
>Priority: Minor
> Fix For: 2.3.0
>
>
> when training a model in SparkR with string variables (tested with 
> spark.randomForest, but I assume this is valid for all spark.xx functions that 
> apply a StringIndexer under the hood), testing on a new dataset with factor 
> levels that are not in the training set will throw an "Unseen label" error. 
> I think this can be solved if there's a method to pass setHandleInvalid on to 
> the StringIndexers when calling spark.randomForest.
> code snippet:
> {code}
> # (i've run this in Zeppelin which already has SparkR and the context loaded)
> #library(SparkR)
> #sparkR.session(master = "local[*]") 
> data = data.frame(clicked = base::sample(c(0,1),100,replace=TRUE),
>   someString = base::sample(c("this", "that"), 
> 100, replace=TRUE), stringsAsFactors=FALSE)
> trainidxs = base::sample(nrow(data), nrow(data)*0.7)
> traindf = as.DataFrame(data[trainidxs,])
> testdf = as.DataFrame(rbind(data[-trainidxs,],c(0,"the other")))
> rf = spark.randomForest(traindf, clicked~., type="classification", 
> maxDepth=10, 
> maxBins=41,
> numTrees = 100)
> predictions = predict(rf, testdf)
> SparkR::collect(predictions)
> {code}
> stack trace:
> {quote}
> Error in handleErrors(returnStatus, conn): org.apache.spark.SparkException: 
> Job aborted due to stage failure: Task 0 in stage 607.0 failed 1 times, most 
> recent failure: Lost task 0.0 in stage 607.0 (TID 1581, localhost, executor 
> driver): org.apache.spark.SparkException: Failed to execute user defined 
> function($anonfun$4: (string) => double)
> at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
> at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
> at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
> at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.SparkException: Unseen label: the other.
> at 
> org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:170)
> at 
> org.apache.spark.ml.feature.StringIndexerModel$$anonfun$4.apply(StringIndexer.scala:166)
> ... 16 more
> Driver stacktrace:
> at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
> at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
> at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
> at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
> at 
> org.apache.spark.scheduler.DAGSche

[jira] [Resolved] (SPARK-18619) Make QuantileDiscretizer/Bucketizer/StringIndexer inherit from HasHandleInvalid

2017-07-12 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-18619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang resolved SPARK-18619.
-
   Resolution: Fixed
 Assignee: zhengruifeng
Fix Version/s: 2.3.0

> Make QuantileDiscretizer/Bucketizer/StringIndexer inherit from 
> HasHandleInvalid
> ---
>
> Key: SPARK-18619
> URL: https://issues.apache.org/jira/browse/SPARK-18619
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Yanbo Liang
>Assignee: zhengruifeng
>Priority: Minor
> Fix For: 2.3.0
>
>
> {{QuantileDiscretizer}}, {{Bucketizer}} and {{StringIndexer}} have the same 
> param {{handleInvalid}}, but with different supported options and docs. After 
> SPARK-16151 is resolved, we can make all of them inherit from 
> {{HasHandleInvalid}}, while keeping the supported options and docs specific to 
> each subclass.
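For reference, a usage sketch of the shared param on the three classes; the options shown are examples of what each class supports, not an exhaustive list:

{code}
import org.apache.spark.ml.feature.{Bucketizer, QuantileDiscretizer, StringIndexer}

// Usage sketch: each class exposes handleInvalid, but the supported option
// sets and their meanings differ per class.
object HandleInvalidSketch {
  val indexer = new StringIndexer()
    .setInputCol("category")
    .setOutputCol("categoryIndex")
    .setHandleInvalid("skip")            // drop rows with labels unseen during fitting

  val discretizer = new QuantileDiscretizer()
    .setInputCol("hour")
    .setOutputCol("hourBucket")
    .setNumBuckets(4)
    .setHandleInvalid("error")           // fail on invalid (NaN) values

  val bucketizer = new Bucketizer()
    .setInputCol("amount")
    .setOutputCol("amountBucket")
    .setSplits(Array(Double.NegativeInfinity, 0.0, 10.0, Double.PositiveInfinity))
    .setHandleInvalid("keep")            // route invalid values to an extra bucket
}
{code}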



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21386) ML LinearRegression supports warm start from user provided initial model

2017-07-12 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-21386:

Summary: ML LinearRegression supports warm start from user provided initial 
model  (was: ML LinearRegression supports warm start by user provided initial 
model)

> ML LinearRegression supports warm start from user provided initial model
> 
>
> Key: SPARK-21386
> URL: https://issues.apache.org/jira/browse/SPARK-21386
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>
> Allow the user to set the initial model when training linear regression. This 
> is the first step toward warm-start support in ML; we can distribute tasks for 
> other algorithms after this gets merged.
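To make the proposal concrete, a purely hypothetical sketch of what warm starting could look like; {{setInitialModel}} is not an existing Spark API and appears below only as a commented-out placeholder:

{code}
import org.apache.spark.ml.regression.{LinearRegression, LinearRegressionModel}
import org.apache.spark.sql.DataFrame

// Hypothetical warm-start sketch: `setInitialModel` is NOT an existing Spark
// method; it only illustrates the API shape this issue proposes.
object WarmStartSketch {
  def retrain(previous: LinearRegressionModel, newData: DataFrame): LinearRegressionModel = {
    val lr = new LinearRegression()
      .setMaxIter(20)
      .setRegParam(0.1)
    // lr.setInitialModel(previous)   // proposed: seed optimization with `previous` instead of zeros
    lr.fit(newData)
  }
}
{code}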



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-21386) ML LinearRegression supports warm start by user provided initial model

2017-07-12 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-21386:
---

 Summary: ML LinearRegression supports warm start by user provided 
initial model
 Key: SPARK-21386
 URL: https://issues.apache.org/jira/browse/SPARK-21386
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.2.0
Reporter: Yanbo Liang


Allow the user to set the initial model when training linear regression. This 
is the first step toward warm-start support in ML; we can distribute tasks for 
other algorithms after this gets merged.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-19270) Add summary table to GLM summary

2017-07-11 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-19270:
---

Assignee: Wayne Zhang

> Add summary table to GLM summary
> 
>
> Key: SPARK-19270
> URL: https://issues.apache.org/jira/browse/SPARK-19270
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Wayne Zhang
>Assignee: Wayne Zhang
>Priority: Minor
>
> Add an R-like summary table to the GLM summary, which includes feature names (if 
> they exist), parameter estimates, standard errors, t-statistics and p-values. This 
> allows Scala users to easily gather these commonly used inference results.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21306) OneVsRest Conceals Columns That May Be Relevant To Underlying Classifier

2017-07-11 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-21306:

Priority: Critical  (was: Minor)

> OneVsRest Conceals Columns That May Be Relevant To Underlying Classifier
> 
>
> Key: SPARK-21306
> URL: https://issues.apache.org/jira/browse/SPARK-21306
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Cathal Garvey
>Priority: Critical
>  Labels: classification, ml
>
> Hi folks, thanks for Spark! :)
> I've been learning to use `ml` and `mllib`, and I've hit a blocker while 
> trying to use `ml.classification.OneVsRest` with 
> `ml.classification.LogisticRegression`. Basically, [here in the 
> code|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala#L320],
>  only two columns are extracted and fed to the underlying classifiers; 
> however, with some configurations more than two columns are required.
> Specifically: I want to do multiclass learning with Logistic Regression, on a 
> very imbalanced dataset. In my dataset, I have lots of imbalances, so I was 
> planning to use weights. I set a column, `"weight"`, as the inverse frequency 
> of each field, and I configured my `LogisticRegression` class to use this 
> column, then put it in a `OneVsRest` wrapper.
> However, `OneVsRest` strips all but two columns out of a dataset before 
> training, so I get an error from within `LogisticRegression` that it can't 
> find the `"weight"` column.
> It would be nice to have this fixed! I can see a few ways, but a very 
> conservative fix would be to include a parameter in `OneVsRest.fit` for 
> additional columns to `select` before passing to the underlying model.
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21306) OneVsRest Conceals Columns That May Be Relevant To Underlying Classifier

2017-07-11 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang reassigned SPARK-21306:
---

Assignee: Yan Facai (颜发才)

> OneVsRest Conceals Columns That May Be Relevant To Underlying Classifier
> 
>
> Key: SPARK-21306
> URL: https://issues.apache.org/jira/browse/SPARK-21306
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Cathal Garvey
>Assignee: Yan Facai (颜发才)
>Priority: Critical
>  Labels: classification, ml
>
> Hi folks, thanks for Spark! :)
> I've been learning to use `ml` and `mllib`, and I've hit a blocker while 
> trying to use `ml.classification.OneVsRest` with 
> `ml.classification.LogisticRegression`. Basically, [here in the 
> code|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala#L320],
>  only two columns are extracted and fed to the underlying classifiers; 
> however, with some configurations more than two columns are required.
> Specifically: I want to do multiclass learning with Logistic Regression, on a 
> very imbalanced dataset. In my dataset, I have lots of imbalances, so I was 
> planning to use weights. I set a column, `"weight"`, as the inverse frequency 
> of each field, and I configured my `LogisticRegression` class to use this 
> column, then put it in a `OneVsRest` wrapper.
> However, `OneVsRest` strips all but two columns out of a dataset before 
> training, so I get an error from within `LogisticRegression` that it can't 
> find the `"weight"` column.
> It would be nice to have this fixed! I can see a few ways, but a very 
> conservative fix would be to include a parameter in `OneVsRest.fit` for 
> additional columns to `select` before passing to the underlying model.
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21306) OneVsRest Conceals Columns That May Be Relevant To Underlying Classifier

2017-07-11 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-21306:

Shepherd: Yanbo Liang
Target Version/s: 2.3.0

> OneVsRest Conceals Columns That May Be Relevant To Underlying Classifier
> 
>
> Key: SPARK-21306
> URL: https://issues.apache.org/jira/browse/SPARK-21306
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Cathal Garvey
>Assignee: Yan Facai (颜发才)
>Priority: Critical
>  Labels: classification, ml
>
> Hi folks, thanks for Spark! :)
> I've been learning to use `ml` and `mllib`, and I've hit a blocker while 
> trying to use `ml.classification.OneVsRest` with 
> `ml.classification.LogisticRegression`. Basically, [here in the 
> code|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala#L320],
>  only two columns are extracted and fed to the underlying classifiers; 
> however, with some configurations more than two columns are required.
> Specifically: I want to do multiclass learning with Logistic Regression, on a 
> very imbalanced dataset. In my dataset, I have lots of imbalances, so I was 
> planning to use weights. I set a column, `"weight"`, as the inverse frequency 
> of each field, and I configured my `LogisticRegression` class to use this 
> column, then put it in a `OneVsRest` wrapper.
> However, `OneVsRest` strips all but two columns out of a dataset before 
> training, so I get an error from within `LogisticRegression` that it can't 
> find the `"weight"` column.
> It would be nice to have this fixed! I can see a few ways, but a very 
> conservative fix would be to include a parameter in `OneVsRest.fit` for 
> additional columns to `select` before passing to the underlying model.
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-21306) OneVsRest Conceals Columns That May Be Relevant To Underlying Classifier

2017-07-11 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-21306:

Issue Type: Bug  (was: Improvement)

> OneVsRest Conceals Columns That May Be Relevant To Underlying Classifier
> 
>
> Key: SPARK-21306
> URL: https://issues.apache.org/jira/browse/SPARK-21306
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Cathal Garvey
>Priority: Critical
>  Labels: classification, ml
>
> Hi folks, thanks for Spark! :)
> I've been learning to use `ml` and `mllib`, and I've hit a blocker while 
> trying to use `ml.classification.OneVsRest` with 
> `ml.classification.LogisticRegression`. Basically, [here in the 
> code|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala#L320],
>  only two columns are extracted and fed to the underlying classifiers; 
> however, with some configurations more than two columns are required.
> Specifically: I want to do multiclass learning with Logistic Regression, on a 
> very imbalanced dataset. In my dataset, I have lots of imbalances, so I was 
> planning to use weights. I set a column, `"weight"`, as the inverse frequency 
> of each field, and I configured my `LogisticRegression` class to use this 
> column, then put it in a `OneVsRest` wrapper.
> However, `OneVsRest` strips all but two columns out of a dataset before 
> training, so I get an error from within `LogisticRegression` that it can't 
> find the `"weight"` column.
> It would be nice to have this fixed! I can see a few ways, but a very 
> conservative fix would be to include a parameter in `OneVsRest.fit` for 
> additional columns to `select` before passing to the underlying model.
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


