[jira] [Issue Comment Deleted] (SPARK-24114) improve instrumentation for spark.ml.recommendation

2018-05-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-24114:
--
Comment: was deleted

(was: User 'MrBago' has created a pull request for this issue:
https://github.com/apache/spark/pull/21344)

> improve instrumentation for spark.ml.recommendation
> ---
>
> Key: SPARK-24114
> URL: https://issues.apache.org/jira/browse/SPARK-24114
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: yogesh garg
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24310) Instrumentation for frequent pattern mining

2018-05-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-24310.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

> Instrumentation for frequent pattern mining
> ---
>
> Key: SPARK-24310
> URL: https://issues.apache.org/jira/browse/SPARK-24310
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Bago Amirbekian
>Priority: Major
> Fix For: 2.4.0
>
>
> See parent JIRA






[jira] [Commented] (SPARK-24310) Instrumentation for frequent pattern mining

2018-05-17 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479684#comment-16479684
 ] 

Joseph K. Bradley commented on SPARK-24310:
---

The PR for this was linked to the wrong JIRA, but I'm adding the link here for 
the record.

> Instrumentation for frequent pattern mining
> ---
>
> Key: SPARK-24310
> URL: https://issues.apache.org/jira/browse/SPARK-24310
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Bago Amirbekian
>Priority: Major
> Fix For: 2.4.0
>
>
> See parent JIRA






[jira] [Created] (SPARK-24310) Instrumentation for frequent pattern mining

2018-05-17 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-24310:
-

 Summary: Instrumentation for frequent pattern mining
 Key: SPARK-24310
 URL: https://issues.apache.org/jira/browse/SPARK-24310
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 2.3.0
Reporter: Joseph K. Bradley
Assignee: Bago Amirbekian


See parent JIRA






[jira] [Assigned] (SPARK-24114) improve instrumentation for spark.ml.recommendation

2018-05-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-24114:
-

Assignee: (was: Bago Amirbekian)

> improve instrumentation for spark.ml.recommendation
> ---
>
> Key: SPARK-24114
> URL: https://issues.apache.org/jira/browse/SPARK-24114
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: yogesh garg
>Priority: Major
>







[jira] [Updated] (SPARK-24114) improve instrumentation for spark.ml.recommendation

2018-05-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-24114:
--
Shepherd:   (was: Joseph K. Bradley)

> improve instrumentation for spark.ml.recommendation
> ---
>
> Key: SPARK-24114
> URL: https://issues.apache.org/jira/browse/SPARK-24114
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: yogesh garg
>Priority: Major
>







[jira] [Updated] (SPARK-24114) improve instrumentation for spark.ml.recommendation

2018-05-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-24114:
--
Shepherd: Joseph K. Bradley

> improve instrumentation for spark.ml.recommendation
> ---
>
> Key: SPARK-24114
> URL: https://issues.apache.org/jira/browse/SPARK-24114
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: yogesh garg
>Assignee: Bago Amirbekian
>Priority: Major
>







[jira] [Assigned] (SPARK-24114) improve instrumentation for spark.ml.recommendation

2018-05-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-24114:
-

Assignee: Bago Amirbekian

> improve instrumentation for spark.ml.recommendation
> ---
>
> Key: SPARK-24114
> URL: https://issues.apache.org/jira/browse/SPARK-24114
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: yogesh garg
>Assignee: Bago Amirbekian
>Priority: Major
>







[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2018-05-16 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478328#comment-16478328
 ] 

Joseph K. Bradley commented on SPARK-15784:
---

[~shahid] Thanks for offering!  If [~wm624] wants to (and has time to) take 
this, then I'd suggest that.  But if not, then please go ahead, thanks!

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>Assignee: Miao Wang
>Priority: Major
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.






[jira] [Resolved] (SPARK-22210) Online LDA variationalTopicInference should use random seed to have stable behavior

2018-05-16 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-22210.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21183
[https://github.com/apache/spark/pull/21183]

> Online LDA variationalTopicInference  should use random seed to have stable 
> behavior
> 
>
> Key: SPARK-22210
> URL: https://issues.apache.org/jira/browse/SPARK-22210
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Assignee: Lu Wang
>Priority: Minor
> Fix For: 2.4.0
>
>
> https://github.com/apache/spark/blob/16fab6b0ef3dcb33f92df30e17680922ad5fb672/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L582
> Gamma distribution should use random seed to have consistent behavior.
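
For illustration, a minimal sketch in plain Scala of why a seed gives stable behavior. It is not the code from LDAOptimizer.scala or from the pull request: the object name, the helper names, and the shape/scale values are all hypothetical, and the sampler only covers integer shapes (a sum of exponential draws), which is enough to show the point.

```scala
import java.util.Random

object SeededGammaSketch {
  // Draw from Gamma(shape = k, scale = theta) for a positive integer k by
  // summing k independent Exponential(theta) draws (the Erlang special case).
  def sampleGamma(k: Int, theta: Double, rng: Random): Double = {
    require(k > 0 && theta > 0.0, "k and theta must be positive")
    // 1.0 - nextDouble() lies in (0, 1], so the logarithm is always finite.
    (1 to k).map(_ => -theta * math.log(1.0 - rng.nextDouble())).sum
  }

  // All randomness flows from the seed, so equal seeds give equal draws.
  // The shape and scale used here are arbitrary example values.
  def draws(seed: Long, n: Int): Seq[Double] = {
    val rng = new Random(seed)
    Seq.fill(n)(sampleGamma(k = 100, theta = 0.01, rng))
  }
}
```

Calling `SeededGammaSketch.draws` twice with the same seed returns the same sequence, whereas an unseeded generator would give a different initialization on every run.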






[jira] [Resolved] (SPARK-24058) Default Params in ML should be saved separately: Python API

2018-05-15 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-24058.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21153
[https://github.com/apache/spark/pull/21153]

> Default Params in ML should be saved separately: Python API
> ---
>
> Key: SPARK-24058
> URL: https://issues.apache.org/jira/browse/SPARK-24058
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.0
>
>
> See [SPARK-23455] for reference.  Since DefaultParamsReader has been changed 
> in Scala, we must change it for Python for Spark 2.4.0 as well in order to 
> keep the 2 in sync.
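
A minimal sketch, in plain Scala, of what saving default Params separately buys. This is not the DefaultParamsReader/DefaultParamsWriter code; `SavedParams`, `mergedSave`, and `separateSave` are hypothetical names, and the two map fields are only an assumption about how such metadata could be laid out.

```scala
object DefaultParamsSketch {
  // Saved metadata with explicitly set values kept apart from default values.
  final case class SavedParams(
      paramMap: Map[String, String],
      defaultParamMap: Map[String, String]) {
    // True only for values the user set explicitly before saving.
    def isSet(name: String): Boolean = paramMap.contains(name)

    // Effective value: an explicit setting wins over a default.
    def get(name: String): Option[String] =
      paramMap.get(name).orElse(defaultParamMap.get(name))
  }

  // Merged layout: after a round trip, defaults look like user-set values.
  def mergedSave(userSet: Map[String, String], defaults: Map[String, String]): SavedParams =
    SavedParams(defaults ++ userSet, Map.empty)

  // Separate layout: the distinction survives the round trip.
  def separateSave(userSet: Map[String, String], defaults: Map[String, String]): SavedParams =
    SavedParams(userSet, defaults)
}
```

Both layouts return the same effective values, but only the separate layout can still tell, after loading, which values the user chose and which were defaults.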






[jira] [Assigned] (SPARK-24058) Default Params in ML should be saved separately: Python API

2018-05-15 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-24058:
-

Assignee: Liang-Chi Hsieh

> Default Params in ML should be saved separately: Python API
> ---
>
> Key: SPARK-24058
> URL: https://issues.apache.org/jira/browse/SPARK-24058
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.0
>
>
> See [SPARK-23455] for reference.  Since DefaultParamsReader has been changed 
> in Scala, we must change it for Python for Spark 2.4.0 as well in order to 
> keep the 2 in sync.






[jira] [Commented] (SPARK-24213) Power Iteration Clustering in the SparkML throws exception, when the ID is IntType

2018-05-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470705#comment-16470705
 ] 

Joseph K. Bradley commented on SPARK-24213:
---

On the topic of eating my words, please check out my new comment here: 
[SPARK-15784].  We may need to rework the API.

> Power Iteration Clustering in the SparkML throws exception, when the ID is 
> IntType
> --
>
> Key: SPARK-24213
> URL: https://issues.apache.org/jira/browse/SPARK-24213
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: spark_user
>Priority: Major
> Fix For: 2.4.0
>
>
> Running the code below, PowerIterationClustering in spark.ml throws an exception.
> {code:scala}
> import org.apache.spark.ml.clustering.PowerIterationClustering
>
> // The IDs are Scala Ints, so the "id" column is IntegerType rather than LongType.
> val data = spark.createDataFrame(Seq(
>   (0, Array(1), Array(0.9)),
>   (1, Array(2), Array(0.9)),
>   (2, Array(3), Array(0.9)),
>   (3, Array(4), Array(0.1)),
>   (4, Array(5), Array(0.9))
> )).toDF("id", "neighbors", "similarities")
> val result = new PowerIterationClustering()
>   .setK(2)
>   .setMaxIter(10)
>   .setInitMode("random")
>   .transform(data)
>   .select("id", "prediction")
> {code}
> {code:java}
> org.apache.spark.sql.AnalysisException: cannot resolve '`prediction`' given input columns: [id, neighbors, similarities];;
> 'Project [id#215, 'prediction]
> +- AnalysisBarrier
>       +- Project [id#215, neighbors#216, similarities#217]
>          +- Join Inner, (id#215 = id#234)
>             :- Project [_1#209 AS id#215, _2#210 AS neighbors#216, _3#211 AS similarities#217]
>             :  +- LocalRelation [_1#209, _2#210, _3#211]
>             +- Project [cast(id#230L as int) AS id#234]
>                +- LogicalRDD [id#230L, prediction#231], false
>   at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:88)
>   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
> {code}






[jira] [Commented] (SPARK-24217) Power Iteration Clustering is not displaying cluster indices corresponding to some vertices.

2018-05-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470704#comment-16470704
 ] 

Joseph K. Bradley commented on SPARK-24217:
---

On the topic of eating my words, please check out my new comment here: 
[SPARK-15784].  We may need to rework the API.

> Power Iteration Clustering is not displaying cluster indices corresponding to 
> some vertices.
> 
>
> Key: SPARK-24217
> URL: https://issues.apache.org/jira/browse/SPARK-24217
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: spark_user
>Priority: Major
> Fix For: 2.4.0
>
>
> We should display the prediction and id for every node.
> Currently, PIC does not return cluster indices for neighbor IDs that do not
> appear in the ID column.
> As per the definition of PIC clustering given in the code:
> PIC takes an affinity matrix between items (or vertices) as input. An
> affinity matrix is a symmetric matrix whose entries are non-negative
> similarities between items. PIC takes this matrix (or graph) as an adjacency
> matrix. Specifically, each input row includes:
>  * {{idCol}}: vertex ID
>  * {{neighborsCol}}: neighbors of vertex in {{idCol}}
>  * {{similaritiesCol}}: non-negative weights (similarities) of edges between
>    the vertex in {{idCol}} and each neighbor in {{neighborsCol}}
>  * *"PIC returns a cluster assignment for each input vertex."* It appends a
>    new column {{predictionCol}} containing the cluster assignment in
>    {{[0,k)}} for each row (vertex).
>  






[jira] [Comment Edited] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2018-05-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470701#comment-16470701
 ] 

Joseph K. Bradley edited comment on SPARK-15784 at 5/10/18 4:45 PM:


So... we originally agreed to make this a Transformer (in the discussion 
above), but [SPARK-24213] and [SPARK-24217] brought up the issue that we can't 
have this be a Row -> Row Transformer:
* The input data need to have one graph edge pair (i,j) for each edge, not 
duplicated ones (i,j) and (j,i).
* That means that there could be between 0 and numVertices/2 vertices which do 
not have corresponding Rows.

This greatly lessens the value of presenting this as a Transformer.  I 
recommend we rewrite the API before Spark 2.4 and make PIC a utility, not a 
Transformer.  We can have it inherit from Params but not make it a Transformer.

How does this sound?


was (Author: josephkb):
So... we originally agreed to make this a Transformer (in the discussion 
above), but [SPARK-24213] and [SPARK-24217] brought up the issue that we can't 
have this be a Row -> Row Transformer:
* The input data need to have one graph edge pair (i,j) for each edge, not 
duplicated ones (i,j) and (j,i).
* That means that there could be between 0 and numVertices/2 vertices which do 
not have corresponding Rows.

This greatly lessens the value of presenting this as a Transformer.  I 
recommend we rewrite the API before Spark 2.4 and make PIC a utility in 
spark.ml.stat.  We can have it inherit from Params but not make it a 
Transformer.

How does this sound?

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>Assignee: Miao Wang
>Priority: Major
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.






[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml

2018-05-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470701#comment-16470701
 ] 

Joseph K. Bradley commented on SPARK-15784:
---

So... we originally agreed to make this a Transformer (in the discussion 
above), but [SPARK-24213] and [SPARK-24217] brought up the issue that we can't 
have this be a Row -> Row Transformer:
* The input data need to have one graph edge pair (i,j) for each edge, not 
duplicated ones (i,j) and (j,i).
* That means that there could be between 0 and numVertices/2 vertices which do 
not have corresponding Rows.

This greatly lessens the value of presenting this as a Transformer.  I 
recommend we rewrite the API before Spark 2.4 and make PIC a utility in 
spark.ml.stat.  We can have it inherit from Params but not make it a 
Transformer.

How does this sound?
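
To make the second bullet concrete, here is a minimal sketch in plain Scala (no Spark dependency) that uses the example data from the SPARK-24213 report. The names `AdjacencyRow`, `allVertices`, and `verticesWithoutRows` are hypothetical and exist only for this illustration.

```scala
object PicInputSketch {
  // One input row in the format from SPARK-24213: a vertex ID, its
  // neighbors, and the similarity to each neighbor.
  final case class AdjacencyRow(id: Long, neighbors: Seq[Long], similarities: Seq[Double])

  // Every vertex mentioned anywhere in the input, as a row ID or a neighbor.
  def allVertices(rows: Seq[AdjacencyRow]): Set[Long] =
    rows.flatMap(r => r.id +: r.neighbors).toSet

  // Vertices that appear only as neighbors. A Row -> Row transformation has
  // no row on which to attach their cluster assignment.
  def verticesWithoutRows(rows: Seq[AdjacencyRow]): Set[Long] =
    allVertices(rows) -- rows.map(_.id).toSet

  // The example data from SPARK-24213: each edge (i, j) is listed once.
  val example: Seq[AdjacencyRow] = Seq(
    AdjacencyRow(0L, Seq(1L), Seq(0.9)),
    AdjacencyRow(1L, Seq(2L), Seq(0.9)),
    AdjacencyRow(2L, Seq(3L), Seq(0.9)),
    AdjacencyRow(3L, Seq(4L), Seq(0.1)),
    AdjacencyRow(4L, Seq(5L), Seq(0.9))
  )
}
```

For this input, `PicInputSketch.verticesWithoutRows(PicInputSketch.example)` contains vertex 5: it is part of the graph but has no row of its own, so a per-row output cannot carry its prediction.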

> Add Power Iteration Clustering to spark.ml
> --
>
> Key: SPARK-15784
> URL: https://issues.apache.org/jira/browse/SPARK-15784
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Xinh Huynh
>Assignee: Miao Wang
>Priority: Major
>
> Adding this algorithm is required as part of SPARK-4591: Algorithm/model 
> parity for spark.ml. The review JIRA for clustering is SPARK-14380.






[jira] [Comment Edited] (SPARK-24217) Power Iteration Clustering is not displaying cluster indices corresponding to some vertices.

2018-05-10 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16469562#comment-16469562
 ] 

Joseph K. Bradley edited comment on SPARK-24217 at 5/10/18 4:37 PM:


Update: I'll eat my words!  I should have read the docs more carefully (where I 
missed the note that there should be exactly 1 reference from one node to 
another).  This is actually a major problem with our design for PIC, which 
can't really be a Row -> Row Transformer.  Will think more about this and 
re-post.


was (Author: josephkb):
But the reason that the IDs are missing from the "id" column is that the input 
is not symmetric.  If it were made symmetric, then there could not be any 
missing IDs.

> Power Iteration Clustering is not displaying cluster indices corresponding to 
> some vertices.
> 
>
> Key: SPARK-24217
> URL: https://issues.apache.org/jira/browse/SPARK-24217
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: spark_user
>Priority: Major
> Fix For: 2.4.0
>
>
> We should display the prediction and id for every node.
> Currently, PIC does not return cluster indices for neighbor IDs that do not
> appear in the ID column.
> As per the definition of PIC clustering given in the code:
> PIC takes an affinity matrix between items (or vertices) as input. An
> affinity matrix is a symmetric matrix whose entries are non-negative
> similarities between items. PIC takes this matrix (or graph) as an adjacency
> matrix. Specifically, each input row includes:
>  * {{idCol}}: vertex ID
>  * {{neighborsCol}}: neighbors of vertex in {{idCol}}
>  * {{similaritiesCol}}: non-negative weights (similarities) of edges between
>    the vertex in {{idCol}} and each neighbor in {{neighborsCol}}
>  * *"PIC returns a cluster assignment for each input vertex."* It appends a
>    new column {{predictionCol}} containing the cluster assignment in
>    {{[0,k)}} for each row (vertex).
>  






[jira] [Commented] (SPARK-24217) Power Iteration Clustering is not displaying cluster indices corresponding to some vertices.

2018-05-09 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16469562#comment-16469562
 ] 

Joseph K. Bradley commented on SPARK-24217:
---

But the reason that the IDs are missing from the "id" column is that the input 
is not symmetric.  If it were made symmetric, then there could not be any 
missing IDs.

> Power Iteration Clustering is not displaying cluster indices corresponding to 
> some vertices.
> 
>
> Key: SPARK-24217
> URL: https://issues.apache.org/jira/browse/SPARK-24217
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: spark_user
>Priority: Major
> Fix For: 2.4.0
>
>
> We should display the prediction and id for every node.
> Currently, PIC does not return cluster indices for neighbor IDs that do not
> appear in the ID column.
> As per the definition of PIC clustering given in the code:
> PIC takes an affinity matrix between items (or vertices) as input. An
> affinity matrix is a symmetric matrix whose entries are non-negative
> similarities between items. PIC takes this matrix (or graph) as an adjacency
> matrix. Specifically, each input row includes:
>  * {{idCol}}: vertex ID
>  * {{neighborsCol}}: neighbors of vertex in {{idCol}}
>  * {{similaritiesCol}}: non-negative weights (similarities) of edges between
>    the vertex in {{idCol}} and each neighbor in {{neighborsCol}}
>  * *"PIC returns a cluster assignment for each input vertex."* It appends a
>    new column {{predictionCol}} containing the cluster assignment in
>    {{[0,k)}} for each row (vertex).
>  






[jira] [Commented] (SPARK-24217) Power Iteration Clustering is not displaying cluster indices corresponding to some vertices.

2018-05-09 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16469230#comment-16469230
 ] 

Joseph K. Bradley commented on SPARK-24217:
---

I don't really think this is a bug.  PIC's documentation says pretty clearly 
that the input data has to represent a symmetric matrix, and this example seems 
to be failing because the input data is invalid.  I do think it could be 
valuable to throw a better error when the input is not symmetric, though we 
should make sure that any check we do for this is not too expensive.
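
As a rough sketch of the kind of check this comment has in mind, the plain-Scala code below looks for edges whose reverse edge is missing or has a different weight. It assumes the symmetric reading of the input format used in this comment; `AdjacencyRow`, `asymmetricEdges`, and `requireSymmetric` are hypothetical names, and nothing here is Spark's implementation.

```scala
object SymmetryCheckSketch {
  final case class AdjacencyRow(id: Long, neighbors: Seq[Long], similarities: Seq[Double])

  // Flatten adjacency rows into a map from directed edge (src, dst) to weight.
  def directedEdges(rows: Seq[AdjacencyRow]): Map[(Long, Long), Double] =
    rows.flatMap { r =>
      r.neighbors.zip(r.similarities).map { case (n, w) => (r.id, n) -> w }
    }.toMap

  // Edges (i, j) whose reverse (j, i) is absent or carries a different weight.
  // Weights are compared exactly, which is fine for a sketch.
  def asymmetricEdges(rows: Seq[AdjacencyRow]): Seq[(Long, Long)] = {
    val edges = directedEdges(rows)
    edges.toSeq.collect {
      case ((i, j), w) if !edges.get((j, i)).contains(w) => (i, j)
    }.sorted
  }

  // Fail with a descriptive message instead of returning partial results.
  def requireSymmetric(rows: Seq[AdjacencyRow]): Unit = {
    val bad = asymmetricEdges(rows)
    require(bad.isEmpty, s"Input is not symmetric; unmatched edges: ${bad.take(5).mkString(", ")}")
  }
}
```

On a local collection this is one pass plus hash lookups. On a distributed dataset the same check would need a join or shuffle over the edges, which is the cost concern raised above.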

> Power Iteration Clustering is not displaying cluster indices corresponding to 
> some vertices.
> 
>
> Key: SPARK-24217
> URL: https://issues.apache.org/jira/browse/SPARK-24217
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: spark_user
>Priority: Major
> Fix For: 2.4.0
>
>
> We should display the prediction and id for every node.
> As per the definition of PIC clustering given in the code:
> PIC takes an affinity matrix between items (or vertices) as input. An
> affinity matrix is a symmetric matrix whose entries are non-negative
> similarities between items. PIC takes this matrix (or graph) as an adjacency
> matrix. Specifically, each input row includes:
>  * {{idCol}}: vertex ID
>  * {{neighborsCol}}: neighbors of vertex in {{idCol}}
>  * {{similaritiesCol}}: non-negative weights (similarities) of edges between
>    the vertex in {{idCol}} and each neighbor in {{neighborsCol}}
>  * *"PIC returns a cluster assignment for each input vertex."* It appends a
>    new column {{predictionCol}} containing the cluster assignment in
>    {{[0,k)}} for each row (vertex).
>  






[jira] [Resolved] (SPARK-14682) Provide evaluateEachIteration method or equivalent for spark.ml GBTs

2018-05-09 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-14682.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21097
[https://github.com/apache/spark/pull/21097]

> Provide evaluateEachIteration method or equivalent for spark.ml GBTs
> 
>
> Key: SPARK-14682
> URL: https://issues.apache.org/jira/browse/SPARK-14682
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
>Priority: Minor
> Fix For: 2.4.0
>
>
> spark.mllib GradientBoostedTrees provide an evaluateEachIteration method.  We 
> should provide that or an equivalent for spark.ml.
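
For readers unfamiliar with the idea, a minimal sketch in plain Scala of what evaluating each iteration means for a boosted ensemble: the error of the partial ensemble after 1 tree, after 2 trees, and so on. This is an illustration under simplifying assumptions (toy function-valued trees, squared error only), not the spark.mllib or spark.ml implementation; the `Tree` alias and the object name are hypothetical.

```scala
object EvaluateEachIterationSketch {
  // A toy "tree": any function from features to a real-valued prediction.
  type Tree = Array[Double] => Double

  // Mean squared error of the partial ensemble after each boosting iteration.
  // Element m of the result uses only the first m + 1 trees.
  def evaluateEachIteration(
      trees: Seq[Tree],
      treeWeights: Seq[Double],
      data: Seq[(Array[Double], Double)]): Seq[Double] = {
    require(trees.length == treeWeights.length, "one weight per tree")
    require(data.nonEmpty, "need at least one labeled point")
    // Running prediction for every point, updated one tree at a time.
    val running = Array.fill(data.length)(0.0)
    trees.zip(treeWeights).toVector.map { case (tree, weight) =>
      data.zipWithIndex.foreach { case ((features, _), i) =>
        running(i) += weight * tree(features)
      }
      data.zipWithIndex.map { case ((_, label), i) =>
        val err = running(i) - label
        err * err
      }.sum / data.length
    }
  }
}
```

Keeping a running prediction per point means each tree is evaluated once, rather than re-evaluating the whole ensemble for every iteration.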






[jira] [Assigned] (SPARK-14682) Provide evaluateEachIteration method or equivalent for spark.ml GBTs

2018-05-09 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-14682:
-

Assignee: Weichen Xu

> Provide evaluateEachIteration method or equivalent for spark.ml GBTs
> 
>
> Key: SPARK-14682
> URL: https://issues.apache.org/jira/browse/SPARK-14682
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
>Priority: Minor
>
> spark.mllib GradientBoostedTrees provide an evaluateEachIteration method.  We 
> should provide that or an equivalent for spark.ml.






[jira] [Updated] (SPARK-7132) Add fit with validation set to spark.ml GBT

2018-05-09 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-7132:
-
Shepherd: Joseph K. Bradley

> Add fit with validation set to spark.ml GBT
> ---
>
> Key: SPARK-7132
> URL: https://issues.apache.org/jira/browse/SPARK-7132
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> In spark.mllib GradientBoostedTrees, we have a method runWithValidation which 
> takes a validation set.  We should add that to the spark.ml API.
> This will require a bit of thinking about how the Pipelines API should handle 
> a validation set (since Transformers and Estimators only take 1 input 
> DataFrame).  The current plan is to include an extra column in the input 
> DataFrame which indicates whether the row is for training, validation, etc.
> Goals
> A  [P0] Support efficient validation during training
> B  [P1] Support early stopping based on validation metrics
> C  [P0] Ensure validation data are preprocessed identically to training data
> D  [P1] Support complex Pipelines with multiple models using validation data
> Proposal: column with indicator for train vs validation
> Include an extra column in the input DataFrame which indicates whether the 
> row is for training or validation.  Add a Param “validationFlagCol” used to 
> specify the extra column name.
> A, B, C are easy.
> D is doable.
> Each estimator would need to have its validationFlagCol Param set to the same 
> column.
> Complication: It would be ideal if we could prevent different estimators from 
> using different validation sets.  (Joseph: There is not an obvious way IMO.  
> Maybe we can address this later by, e.g., having Pipelines take a 
> validationFlagCol Param and pass that to the sub-models in the Pipeline.  
> Let’s not worry about this for now.)
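
The indicator-column proposal can be sketched in plain Scala as follows. The `isValidation` field stands in for the column named by the proposed "validationFlagCol" Param; `LabeledRow`, `splitByFlag`, and `iterationsToKeep` are hypothetical names, and the stopping rule is an assumed simple one, not the rule used by runWithValidation.

```scala
object ValidationFlagSketch {
  // One input row; isValidation plays the role of the indicator column.
  final case class LabeledRow(features: Array[Double], label: Double, isValidation: Boolean)

  // Split a single input dataset into (training, validation) using the flag.
  // Both parts come from the same preprocessed input (goal C).
  def splitByFlag(rows: Seq[LabeledRow]): (Seq[LabeledRow], Seq[LabeledRow]) = {
    val (validation, training) = rows.partition(_.isValidation)
    (training, validation)
  }

  // Early stopping (goal B): given the validation error after each iteration,
  // return how many iterations to keep. Stops at the first iteration that
  // fails to improve on the best error so far by more than tol.
  def iterationsToKeep(validationErrors: Seq[Double], tol: Double): Int = {
    var best = Double.PositiveInfinity
    var kept = 0
    var improving = true
    val it = validationErrors.iterator
    while (improving && it.hasNext) {
      val err = it.next()
      if (best - err > tol) { best = err; kept += 1 } else improving = false
    }
    kept
  }
}
```

Because the flag travels with each row, a single input DataFrame is enough, which is what lets the idea fit Estimators that take only one input.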






[jira] [Assigned] (SPARK-7132) Add fit with validation set to spark.ml GBT

2018-05-09 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-7132:


Assignee: Weichen Xu

> Add fit with validation set to spark.ml GBT
> ---
>
> Key: SPARK-7132
> URL: https://issues.apache.org/jira/browse/SPARK-7132
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
>Priority: Minor
>
> In spark.mllib GradientBoostedTrees, we have a method runWithValidation which 
> takes a validation set.  We should add that to the spark.ml API.
> This will require a bit of thinking about how the Pipelines API should handle 
> a validation set (since Transformers and Estimators only take 1 input 
> DataFrame).  The current plan is to include an extra column in the input 
> DataFrame which indicates whether the row is for training, validation, etc.
> Goals
> A  [P0] Support efficient validation during training
> B  [P1] Support early stopping based on validation metrics
> C  [P0] Ensure validation data are preprocessed identically to training data
> D  [P1] Support complex Pipelines with multiple models using validation data
> Proposal: column with indicator for train vs validation
> Include an extra column in the input DataFrame which indicates whether the 
> row is for training or validation.  Add a Param “validationFlagCol” used to 
> specify the extra column name.
> A, B, C are easy.
> D is doable.
> Each estimator would need to have its validationFlagCol Param set to the same 
> column.
> Complication: It would be ideal if we could prevent different estimators from 
> using different validation sets.  (Joseph: There is not an obvious way IMO.  
> Maybe we can address this later by, e.g., having Pipelines take a 
> validationFlagCol Param and pass that to the sub-models in the Pipeline.  
> Let’s not worry about this for now.)
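
A minimal sketch of the indicator-column proposal, in plain Python rather than the spark.ml API. Only the idea of a flag column named by a `validationFlagCol` Param comes from the description above; the row layout, the helper names `split_by_flag` and `fit_with_early_stopping`, and the toy constant-prediction "boosting" model are assumptions made for illustration.

```python
def split_by_flag(rows, validation_flag_col):
    """Split one input table into (training, validation) rows via a boolean flag column."""
    train = [r for r in rows if not r[validation_flag_col]]
    valid = [r for r in rows if r[validation_flag_col]]
    return train, valid


def mse(rows, prediction):
    """Mean squared error of a constant prediction against the 'label' field."""
    return sum((r["label"] - prediction) ** 2 for r in rows) / len(rows)


def fit_with_early_stopping(rows, validation_flag_col, max_iter=50, step=0.5, tol=1e-6):
    """Toy boosting loop (hypothetical): each iteration moves a constant
    prediction toward the training mean, and training stops as soon as the
    validation error no longer improves by more than `tol`."""
    train, valid = split_by_flag(rows, validation_flag_col)
    target = sum(r["label"] for r in train) / len(train)
    prediction = 0.0
    best_err = mse(valid, prediction)
    iterations = 0
    for _ in range(max_iter):
        candidate = prediction + step * (target - prediction)
        err = mse(valid, candidate)
        if best_err - err <= tol:
            break  # early stopping: validation metric stopped improving
        prediction, best_err = candidate, err
        iterations += 1
    return prediction, iterations
```

Because training and validation rows travel in the same table, any preprocessing applied upstream is applied identically to both, which is goal C.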



[jira] [Commented] (SPARK-24213) Power Iteration Clustering in the SparkML throws exception, when the ID is IntType

2018-05-08 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468018#comment-16468018
 ] 

Joseph K. Bradley commented on SPARK-24213:
---

Thanks for reporting this issue!  There is actually a much simpler fix we can 
make.  Also, the existing unit tests should have caught this bug, so the tests 
themselves need fixing.  I hope you don't mind, but I'd like to go ahead and 
send a patch I wrote while reviewing your PR.

> Power Iteration Clustering in the SparkML throws exception, when the ID is 
> IntType
> --
>
> Key: SPARK-24213
> URL: https://issues.apache.org/jira/browse/SPARK-24213
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: spark_user
>Priority: Major
> Fix For: 2.4.0
>
>
> Running the following code, PowerIterationClustering in spark.ml throws an exception.
> {code:scala}
> import org.apache.spark.ml.clustering.PowerIterationClustering
>
> val data = spark.createDataFrame(Seq(
>   (0, Array(1), Array(0.9)),
>   (1, Array(2), Array(0.9)),
>   (2, Array(3), Array(0.9)),
>   (3, Array(4), Array(0.1)),
>   (4, Array(5), Array(0.9))
> )).toDF("id", "neighbors", "similarities")
> val result = new PowerIterationClustering()
>   .setK(2)
>   .setMaxIter(10)
>   .setInitMode("random")
>   .transform(data)
>   .select("id", "prediction")
> {code}
> {code:java}
> org.apache.spark.sql.AnalysisException: cannot resolve '`prediction`' given 
> input columns: [id, neighbors, similarities];;
> 'Project [id#215, 'prediction]
> +- AnalysisBarrier
>   +- Project [id#215, neighbors#216, similarities#217]
>  +- Join Inner, (id#215 = id#234)
> :- Project [_1#209 AS id#215, _2#210 AS neighbors#216, _3#211 AS 
> similarities#217]
> :  +- LocalRelation [_1#209, _2#210, _3#211]
> +- Project [cast(id#230L as int) AS id#234]
>+- LogicalRDD [id#230L, prediction#231], false
>   at 
> org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:88)
>   at 
> org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289)
>   at 
> org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
>   at 
> org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288)
> {code}



[jira] [Created] (SPARK-24212) PrefixSpan in spark.ml: user guide section

2018-05-08 Thread Joseph K. Bradley (JIRA)

Joseph K. Bradley created SPARK-24212:
-

 Summary: PrefixSpan in spark.ml: user guide section
 Key: SPARK-24212
 URL: https://issues.apache.org/jira/browse/SPARK-24212
 Project: Spark
  Issue Type: Documentation
  Components: Documentation, ML
Affects Versions: 2.4.0
Reporter: Joseph K. Bradley


See linked JIRA for the PrefixSpan API for which we need to write a user guide 
page.



[jira] [Closed] (SPARK-24145) spark.ml parity for sequential pattern mining - PrefixSpan: Python API

2018-05-08 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley closed SPARK-24145.
-

> spark.ml parity for sequential pattern mining - PrefixSpan: Python API
> --
>
> Key: SPARK-24145
> URL: https://issues.apache.org/jira/browse/SPARK-24145
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>Priority: Major
>
> spark.ml parity for sequential pattern mining - PrefixSpan: Python API



[jira] [Resolved] (SPARK-24145) spark.ml parity for sequential pattern mining - PrefixSpan: Python API

2018-05-08 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-24145.
---
Resolution: Duplicate

> spark.ml parity for sequential pattern mining - PrefixSpan: Python API
> --
>
> Key: SPARK-24145
> URL: https://issues.apache.org/jira/browse/SPARK-24145
> Project: Spark
>  Issue Type: New Feature
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>Priority: Major
>
> spark.ml parity for sequential pattern mining - PrefixSpan: Python API



[jira] [Resolved] (SPARK-20114) spark.ml parity for sequential pattern mining - PrefixSpan

2018-05-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-20114.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20973
[https://github.com/apache/spark/pull/20973]

> spark.ml parity for sequential pattern mining - PrefixSpan
> --
>
> Key: SPARK-20114
> URL: https://issues.apache.org/jira/browse/SPARK-20114
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 2.4.0
>
>
> Creating this JIRA to track feature parity for PrefixSpan and sequential 
> pattern mining in spark.ml with the DataFrame API. 
> First we list a few design issues to be discussed; subtasks for the Scala, 
> Python and R APIs will be created afterwards.
> # Wrapping the MLlib PrefixSpan and providing a generic fit() should be 
> straightforward. Yet PrefixSpan only extracts frequent sequential patterns, 
> which are not well suited to direct use for prediction on new records. Please 
> read  
> http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/
>  for some background knowledge. Thanks to Philippe Fournier-Viger for 
> providing insights. If we want to keep using the Estimator/Transformer 
> pattern, the options are:
>  #*  Implement a dummy transform for PrefixSpanModel, which will not add a 
> new column to the input Dataset. The PrefixSpanModel is only used to provide 
> access to the frequent sequential patterns.
>  #*  Add a feature to extract sequential rules from sequential 
> patterns, then use the sequential rules in transform, as FPGrowthModel does.  
> The rules extracted are of the form X -> Y where X and Y are sequential 
> patterns. But in practice, these rules are not very good as they are too 
> precise and thus not noise tolerant.
> #  Unlike association rules and frequent itemsets, sequential rules 
> can be extracted from the original dataset more efficiently using algorithms 
> like RuleGrowth and ERMiner. The rules are X -> Y where X is unordered and Y 
> is unordered, but X must appear before Y, which is more general and can work 
> better in practice for prediction. 
> I'd like to hear more from users to see which kind of sequential rules 
> is more practical. 
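
To make the terms in the description concrete, here is a small plain-Python sketch of pattern support and of the confidence of a rule X -> Y between sequential patterns. It is not PrefixSpan itself and not the spark.ml API; the function names and the representation of a sequence as a list of itemsets are assumptions made for illustration.

```python
def contains_pattern(sequence, pattern):
    """True if `pattern` occurs in `sequence` as an ordered, not necessarily
    contiguous, subsequence: each pattern itemset must be a subset of a
    distinct itemset of the sequence, in order."""
    pos = 0
    for wanted in pattern:
        # Greedily advance to the earliest itemset that covers `wanted`.
        while pos < len(sequence) and not wanted <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next pattern itemset must match strictly later
    return True


def support(database, pattern):
    """Fraction of sequences in the database that contain the pattern."""
    return sum(contains_pattern(s, pattern) for s in database) / len(database)


def rule_confidence(database, antecedent, consequent):
    """Confidence of X -> Y where X and Y are sequential patterns:
    support(X followed by Y) / support(X)."""
    base = support(database, antecedent)
    if base == 0:
        return 0.0
    return support(database, list(antecedent) + list(consequent)) / base
```

With sequences such as `[{"a"}, {"b"}, {"c"}]`, `support` gives the quantity a frequent-pattern miner thresholds on, while `rule_confidence` corresponds to the second option above, where rules are derived from patterns.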



[jira] [Resolved] (SPARK-22885) ML test for StructuredStreaming: spark.ml.tuning

2018-05-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-22885.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20261
[https://github.com/apache/spark/pull/20261]

> ML test for StructuredStreaming: spark.ml.tuning
> 
>
> Key: SPARK-22885
> URL: https://issues.apache.org/jira/browse/SPARK-22885
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 2.4.0
>
>
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843



[jira] [Resolved] (SPARK-15750) Constructing FPGrowth fails when no numPartitions specified in pyspark

2018-05-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-15750.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 13493
[https://github.com/apache/spark/pull/13493]

> Constructing FPGrowth fails when no numPartitions specified in pyspark
> --
>
> Key: SPARK-15750
> URL: https://issues.apache.org/jira/browse/SPARK-15750
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
>Priority: Major
> Fix For: 2.4.0
>
>
> {code}
> >>> model1 = FPGrowth.train(rdd, 0.6)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/jzhang/github/spark-2/python/pyspark/mllib/fpm.py", line 96, 
> in train
> model = callMLlibFunc("trainFPGrowthModel", data, float(minSupport), 
> int(numPartitions))
>   File "/Users/jzhang/github/spark-2/python/pyspark/mllib/common.py", line 
> 130, in callMLlibFunc
> return callJavaFunc(sc, api, *args)
>   File "/Users/jzhang/github/spark-2/python/pyspark/mllib/common.py", line 
> 123, in callJavaFunc
> return _java2py(sc, func(*args))
>   File 
> "/Users/jzhang/github/spark-2/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
>  line 933, in __call__
>   File "/Users/jzhang/github/spark-2/python/pyspark/sql/utils.py", line 79, 
> in deco
> raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
> pyspark.sql.utils.IllegalArgumentException: u'requirement failed: Number of 
> partitions must be positive but got -1'
> {code}
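
The traceback shows a default of -1 for numPartitions reaching a check that requires a positive value. The sketch below reproduces that failure mode in plain Python and shows one possible way to handle the sentinel. It is not necessarily how the referenced pull request fixed the issue, and `train_backend`, `train`, and `input_partitions` are hypothetical names.

```python
def train_backend(num_partitions):
    """Stand-in for the backend trainer: rejects non-positive partition counts."""
    if num_partitions <= 0:
        raise ValueError(
            "requirement failed: Number of partitions must be positive but got %d"
            % num_partitions
        )
    return num_partitions


def train(input_partitions, min_support=0.3, num_partitions=-1):
    """Wrapper that treats -1 as "use the partition count of the input"
    and resolves it before the backend check runs."""
    resolved = input_partitions if num_partitions == -1 else num_partitions
    return train_backend(resolved)
```

Passing the sentinel straight through, as in `train_backend(-1)`, raises the same kind of error as in the report; resolving it in the wrapper keeps the default usable.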



[jira] [Commented] (SPARK-24152) SparkR CRAN feasibility check server problem

2018-05-07 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466513#comment-16466513
 ] 

Joseph K. Bradley commented on SPARK-24152:
---

Thank you all!

> SparkR CRAN feasibility check server problem
> 
>
> Key: SPARK-24152
> URL: https://issues.apache.org/jira/browse/SPARK-24152
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Dongjoon Hyun
>Assignee: Liang-Chi Hsieh
>Priority: Critical
>
> The PR builder and master branch tests fail with the following SparkR error 
> for an unknown reason. The error message is:
> {code}
> * this is package 'SparkR' version '2.4.0'
> * checking CRAN incoming feasibility ...Error in 
> .check_package_CRAN_incoming(pkgdir) : 
>   dims [product 24] do not match the length of object [0]
> Execution halted
> {code}
> *PR BUILDER*
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90039/
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89983/
> - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89998/
> *MASTER BRANCH*
> - 
> https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4458/
>  (Fail with no failures)
> This is critical because we have already started merging PRs while ignoring 
> this **known unknown** SparkR failure.
> - https://github.com/apache/spark/pull/21175



[jira] [Updated] (SPARK-24097) Instruments improvements - RandomForest and GradientBoostedTree

2018-05-02 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-24097:
--
Shepherd: Joseph K. Bradley

> Instruments improvements - RandomForest and GradientBoostedTree
> ---
>
> Key: SPARK-24097
> URL: https://issues.apache.org/jira/browse/SPARK-24097
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>Priority: Major
>
> Instruments improvements - RandomForest and GradientBoostedTree



[jira] [Assigned] (SPARK-24097) Instruments improvements - RandomForest and GradientBoostedTree

2018-05-02 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-24097:
-

Assignee: Weichen Xu

> Instruments improvements - RandomForest and GradientBoostedTree
> ---
>
> Key: SPARK-24097
> URL: https://issues.apache.org/jira/browse/SPARK-24097
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
>
> Instruments improvements - RandomForest and GradientBoostedTree



[jira] [Comment Edited] (SPARK-23686) Make better usage of org.apache.spark.ml.util.Instrumentation

2018-05-01 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16460164#comment-16460164
 ] 

Joseph K. Bradley edited comment on SPARK-23686 at 5/2/18 12:21 AM:


[~yogeshgarg] and [~WeichenXu123] made the good point that some logging occurs 
on executors.  This brings up two questions:
* Should we use Instrumentation on executors?
* What levels of logging should we use on executors (in MLlib algorithms)?

I figure it's safe to assume that executor logs are more for developers than 
for users.  (Current use in MLlib seems consistent with this, e.g., for 
training of trees in https://github.com/apache/spark/pull/21163 )  These all 
seem to be at the DEBUG level, which is not really useful for users.

(UPDATED BELOW)
Since it'd be handy to have prefixes on executor logs too (to link them with 
Estimators), let's use Instrumentation on executors.
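
A minimal sketch of the prefixing idea, written with Python's standard `logging` module rather than the Scala Instrumentation class. The names `PrefixedLogger` and `make_instrumentation`, the logger name, and the example identifier are assumptions made for illustration.

```python
import logging


class PrefixedLogger(logging.LoggerAdapter):
    """Prepends an estimator identifier to every message so that log lines
    from different fit() calls can be told apart."""

    def process(self, msg, kwargs):
        return "[%s] %s" % (self.extra["prefix"], msg), kwargs


def make_instrumentation(estimator_uid, logger_name="ml.sketch"):
    """Build a logger whose only per-run state is a plain string prefix."""
    return PrefixedLogger(logging.getLogger(logger_name), {"prefix": estimator_uid})
```

For example, `make_instrumentation("rf_1234").debug("training tree 3")` would emit `[rf_1234] training tree 3` at DEBUG level, which lets a developer match an executor log line to the Estimator that produced it.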



was (Author: josephkb):
[~yogeshgarg] and [~WeichenXu123] made the good point that some logging occurs 
on executors.  This brings up the question:
* Should we use Instrumentation on executors?
* What levels of logging should we use on executors (in MLlib algorithms)?

I figure it's safe to assume that executor logs should be more for developers 
than for users.  (Current use in MLlib seems like this, e.g., for training of 
trees in https://github.com/apache/spark/pull/21163 )  These all seem to be at 
the DEBUG level, which is not really useful for users.

Given that, I recommend:
* We leave Instrumentation non-Serializable to avoid use on executors
* We use regular Logging on executors.

Developers who are debugging algorithms will presumably be running pretty 
isolated tests anyways.

> Make better usage of org.apache.spark.ml.util.Instrumentation
> -
>
> Key: SPARK-23686
> URL: https://issues.apache.org/jira/browse/SPARK-23686
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Priority: Major
>
> This Jira is a bit high level and might require subtasks or other jiras for 
> more specific tasks.
> I've noticed that we don't make the best usage of the instrumentation class. 
> Specifically sometimes we bypass the instrumentation class and use the 
> debugger instead. For example, 
> [https://github.com/apache/spark/blob/9b9827759af2ca3eea146a6032f9165f640ce152/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L143]
> Also there are some things that might be useful to log in the instrumentation 
> class that we currently don't. For example:
> number of training examples
> mean/var of label (regression)
> I know computing these things can be expensive in some cases, but especially 
> when this data is already available we can log it for free. For example, 
> Logistic Regression Summarizer computes some useful data including numRows 
> that we don't log.
>  
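
As a rough illustration of how cheap the statistics mentioned in the description can be when the labels are already being scanned, the sketch below computes the number of examples and the mean and variance of the label in a single pass. It is plain Python, not MLlib code; the function name and the dictionary keys are hypothetical.

```python
def label_summary(labels):
    """Single-pass count, mean and population variance (Welford's update)."""
    n, mean, m2 = 0, 0.0, 0.0
    for y in labels:
        n += 1
        delta = y - mean
        mean += delta / n
        m2 += delta * (y - mean)  # uses the already-updated mean
    variance = m2 / n if n > 0 else float("nan")
    return {"numExamples": n, "meanOfLabels": mean, "varianceOfLabels": variance}
```

The returned values are the kind of named quantities that could be handed to an instrumentation logger once per training run.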



[jira] [Comment Edited] (SPARK-23686) Make better usage of org.apache.spark.ml.util.Instrumentation

2018-05-01 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16460164#comment-16460164
 ] 

Joseph K. Bradley edited comment on SPARK-23686 at 5/1/18 9:52 PM:
---

[~yogeshgarg] and [~WeichenXu123] made the good point that some logging occurs 
on executors.  This brings up the question:
* Should we use Instrumentation on executors?
* What levels of logging should we use on executors (in MLlib algorithms)?

I figure it's safe to assume that executor logs should be more for developers 
than for users.  (Current use in MLlib seems like this, e.g., for training of 
trees in https://github.com/apache/spark/pull/21163 )  These all seem to be at 
the DEBUG level, which is not really useful for users.

Given that, I recommend:
* We leave Instrumentation non-Serializable to avoid use on executors
* We use regular Logging on executors.

Developers who are debugging algorithms will presumably be running pretty 
isolated tests anyways.


was (Author: josephkb):
[~yogeshgarg] made the good point that we should not convert all uses of 
Logging to use Instrumentation: if logging happens on executors, then we should 
not use the (non-serializable) Instrumentation class.  E.g.: 
https://github.com/apache/spark/blob/6782359a04356e4cde32940861bf2410ef37f445/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L1587
Also, these instances all seem to be at the DEBUG level, which is not really 
useful for users.

> Make better usage of org.apache.spark.ml.util.Instrumentation
> -
>
> Key: SPARK-23686
> URL: https://issues.apache.org/jira/browse/SPARK-23686
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Priority: Major
>
> This Jira is a bit high level and might require subtasks or other jiras for 
> more specific tasks.
> I've noticed that we don't make the best usage of the instrumentation class. 
> Specifically sometimes we bypass the instrumentation class and use the 
> debugger instead. For example, 
> [https://github.com/apache/spark/blob/9b9827759af2ca3eea146a6032f9165f640ce152/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L143]
> Also there are some things that might be useful to log in the instrumentation 
> class that we currently don't. For example:
> number of training examples
> mean/var of label (regression)
> I know computing these things can be expensive in some cases, but especially 
> when this data is already available we can log it for free. For example, 
> Logistic Regression Summarizer computes some useful data including numRows 
> that we don't log.
>  



[jira] [Assigned] (SPARK-15750) Constructing FPGrowth fails when no numPartitions specified in pyspark

2018-05-01 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-15750:
-

Assignee: Jeff Zhang

> Constructing FPGrowth fails when no numPartitions specified in pyspark
> --
>
> Key: SPARK-15750
> URL: https://issues.apache.org/jira/browse/SPARK-15750
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
>Priority: Major
>
> {code}
> >>> model1 = FPGrowth.train(rdd, 0.6)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/jzhang/github/spark-2/python/pyspark/mllib/fpm.py", line 96, 
> in train
> model = callMLlibFunc("trainFPGrowthModel", data, float(minSupport), 
> int(numPartitions))
>   File "/Users/jzhang/github/spark-2/python/pyspark/mllib/common.py", line 
> 130, in callMLlibFunc
> return callJavaFunc(sc, api, *args)
>   File "/Users/jzhang/github/spark-2/python/pyspark/mllib/common.py", line 
> 123, in callJavaFunc
> return _java2py(sc, func(*args))
>   File 
> "/Users/jzhang/github/spark-2/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
>  line 933, in __call__
>   File "/Users/jzhang/github/spark-2/python/pyspark/sql/utils.py", line 79, 
> in deco
> raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
> pyspark.sql.utils.IllegalArgumentException: u'requirement failed: Number of 
> partitions must be positive but got -1'
> {code}



[jira] [Updated] (SPARK-15750) Constructing FPGrowth fails when no numPartitions specified in pyspark

2018-05-01 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-15750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-15750:
--
Shepherd: Joseph K. Bradley

> Constructing FPGrowth fails when no numPartitions specified in pyspark
> --
>
> Key: SPARK-15750
> URL: https://issues.apache.org/jira/browse/SPARK-15750
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib, PySpark
>Affects Versions: 2.0.0
>Reporter: Jeff Zhang
>Assignee: Jeff Zhang
>Priority: Major
>
> {code}
> >>> model1 = FPGrowth.train(rdd, 0.6)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Users/jzhang/github/spark-2/python/pyspark/mllib/fpm.py", line 96, 
> in train
> model = callMLlibFunc("trainFPGrowthModel", data, float(minSupport), 
> int(numPartitions))
>   File "/Users/jzhang/github/spark-2/python/pyspark/mllib/common.py", line 
> 130, in callMLlibFunc
> return callJavaFunc(sc, api, *args)
>   File "/Users/jzhang/github/spark-2/python/pyspark/mllib/common.py", line 
> 123, in callJavaFunc
> return _java2py(sc, func(*args))
>   File 
> "/Users/jzhang/github/spark-2/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py",
>  line 933, in __call__
>   File "/Users/jzhang/github/spark-2/python/pyspark/sql/utils.py", line 79, 
> in deco
> raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
> pyspark.sql.utils.IllegalArgumentException: u'requirement failed: Number of 
> partitions must be positive but got -1'
> {code}



[jira] [Commented] (SPARK-23686) Make better usage of org.apache.spark.ml.util.Instrumentation

2018-05-01 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16460164#comment-16460164
 ] 

Joseph K. Bradley commented on SPARK-23686:
---

[~yogeshgarg] made the good point that we should not convert all uses of 
Logging to use Instrumentation: if logging happens on executors, then we should 
not use the (non-serializable) Instrumentation class.  E.g.: 
https://github.com/apache/spark/blob/6782359a04356e4cde32940861bf2410ef37f445/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L1587
Also, these instances all seem to be at the DEBUG level, which is not really 
useful for users.

> Make better usage of org.apache.spark.ml.util.Instrumentation
> -
>
> Key: SPARK-23686
> URL: https://issues.apache.org/jira/browse/SPARK-23686
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Priority: Major
>
> This Jira is a bit high level and might require subtasks or other jiras for 
> more specific tasks.
> I've noticed that we don't make the best usage of the instrumentation class. 
> Specifically sometimes we bypass the instrumentation class and use the 
> debugger instead. For example, 
> [https://github.com/apache/spark/blob/9b9827759af2ca3eea146a6032f9165f640ce152/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L143]
> Also there are some things that might be useful to log in the instrumentation 
> class that we currently don't. For example:
> number of training examples
> mean/var of label (regression)
> I know computing these things can be expensive in some cases, but especially 
> when this data is already available we can log it for free. For example, 
> Logistic Regression Summarizer computes some useful data including numRows 
> that we don't log.
>  



[jira] [Updated] (SPARK-22885) ML test for StructuredStreaming: spark.ml.tuning

2018-05-01 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-22885:
--
Shepherd: Joseph K. Bradley

> ML test for StructuredStreaming: spark.ml.tuning
> 
>
> Key: SPARK-22885
> URL: https://issues.apache.org/jira/browse/SPARK-22885
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843



[jira] [Assigned] (SPARK-22885) ML test for StructuredStreaming: spark.ml.tuning

2018-05-01 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-22885:
-

Assignee: Weichen Xu

> ML test for StructuredStreaming: spark.ml.tuning
> 
>
> Key: SPARK-22885
> URL: https://issues.apache.org/jira/browse/SPARK-22885
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
>Priority: Major
>
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843



[jira] [Commented] (SPARK-24115) improve instrumentation for spark.ml.tuning

2018-04-30 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16459033#comment-16459033
 ] 

Joseph K. Bradley commented on SPARK-24115:
---

Sounds good; go ahead.

> improve instrumentation for spark.ml.tuning
> ---
>
> Key: SPARK-24115
> URL: https://issues.apache.org/jira/browse/SPARK-24115
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: yogesh garg
>Priority: Major
>




[jira] [Assigned] (SPARK-22210) Online LDA variationalTopicInference should use random seed to have stable behavior

2018-04-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-22210:
-

Assignee: Lu Wang

> Online LDA variationalTopicInference  should use random seed to have stable 
> behavior
> 
>
> Key: SPARK-22210
> URL: https://issues.apache.org/jira/browse/SPARK-22210
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Assignee: Lu Wang
>Priority: Minor
>
> https://github.com/apache/spark/blob/16fab6b0ef3dcb33f92df30e17680922ad5fb672/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L582
> Gamma distribution should use random seed to have consistent behavior.



[jira] [Updated] (SPARK-22210) Online LDA variationalTopicInference should use random seed to have stable behavior

2018-04-27 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-22210:
--
Shepherd: Joseph K. Bradley

> Online LDA variationalTopicInference  should use random seed to have stable 
> behavior
> 
>
> Key: SPARK-22210
> URL: https://issues.apache.org/jira/browse/SPARK-22210
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Priority: Minor
>
> https://github.com/apache/spark/blob/16fab6b0ef3dcb33f92df30e17680922ad5fb672/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L582
> Gamma distribution should use random seed to have consistent behavior.



[jira] [Commented] (SPARK-22210) Online LDA variationalTopicInference should use random seed to have stable behavior

2018-04-25 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453228#comment-16453228
 ] 

Joseph K. Bradley commented on SPARK-22210:
---

[~lu.DB] Would you like to do this?  It should be a matter of taking the "seed" 
Param passed to LDA and making sure it (or a seed generated from it) is passed 
down to this method.  Thanks!
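
A minimal sketch of the idea in plain Python (not the actual LDAOptimizer code): a parent generator built from the user-supplied seed hands out one child seed per document, so the Gamma draws are reproducible across runs. The function name, the shape/scale values and the per-document layout are assumptions chosen for illustration.

```python
import random
from typing import List


def init_gamma(seed: int, num_docs: int, k: int,
               shape: float = 100.0, scale: float = 0.01) -> List[List[float]]:
    """Draw a num_docs x k matrix of Gamma variates, reproducibly.

    `seed` plays the role of the "seed" Param; every document gets its own
    generator seeded from the parent, so results do not depend on global state.
    """
    parent = random.Random(seed)
    rows = []
    for _ in range(num_docs):
        # A seed generated from the parent seed, as suggested in the comment.
        child = random.Random(parent.getrandbits(64))
        rows.append([child.gammavariate(shape, scale) for _ in range(k)])
    return rows
```

Calling `init_gamma` twice with the same seed returns identical matrices, which is the stable behavior this issue asks for.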

> Online LDA variationalTopicInference  should use random seed to have stable 
> behavior
> 
>
> Key: SPARK-22210
> URL: https://issues.apache.org/jira/browse/SPARK-22210
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Priority: Minor
>
> https://github.com/apache/spark/blob/16fab6b0ef3dcb33f92df30e17680922ad5fb672/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L582
> Gamma distribution should use random seed to have consistent behavior.






[jira] [Resolved] (SPARK-23824) Make impurityStats publicly accessible in ml.tree.Node

2018-04-25 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-23824.
---
Resolution: Duplicate

> Make impurityStats publicly accessible in ml.tree.Node
> --
>
> Key: SPARK-23824
> URL: https://issues.apache.org/jira/browse/SPARK-23824
> Project: Spark
>  Issue Type: Wish
>  Components: ML
>Affects Versions: 2.1.1
>Reporter: Barry Becker
>Priority: Minor
>
> This is minor, but it is also a very easy fix.
> I would like to visualize the structure of a decision tree model, but 
> currently the only means of obtaining the label distribution data at each 
> node of the tree is hidden within each ml.tree.Node inside the impurityStats.
> I'm pretty sure that the fix for this is as easy as removing the private[ml] 
> qualifier from occurrences of
> private[ml] def impurityStats: ImpurityCalculator
> and
> override private[ml] val impurityStats: ImpurityCalculator
>  
> As a workaround, I've put the class that needs access into an 
> org.apache.spark.ml.tree package in my own repository, but I would really 
> like to not have to do that.






[jira] [Assigned] (SPARK-20114) spark.ml parity for sequential pattern mining - PrefixSpan

2018-04-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-20114:
-

Assignee: Weichen Xu

> spark.ml parity for sequential pattern mining - PrefixSpan
> --
>
> Key: SPARK-20114
> URL: https://issues.apache.org/jira/browse/SPARK-20114
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Assignee: Weichen Xu
>Priority: Major
>
> Creating this JIRA to track feature parity for PrefixSpan and sequential 
> pattern mining in spark.ml with the DataFrame API. 
> First, a few design issues to be discussed are listed; subtasks for the Scala, 
> Python and R APIs will be created afterwards.
> # Wrapping the MLlib PrefixSpan and providing a generic fit() should be 
> straightforward. Yet PrefixSpan only extracts frequent sequential patterns, 
> which are not well suited to predicting on new records directly. Please 
> read  
> http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/
>  for some background knowledge. Thanks to Philippe Fournier-Viger for providing 
> insights. If we want to keep using the Estimator/Transformer pattern, the options 
> are:
>  #*  Implement a dummy transform for PrefixSpanModel, which will not add a 
> new column to the input Dataset. The PrefixSpanModel is only used to provide 
> access to the frequent sequential patterns.
>  #*  Add a feature to extract sequential rules from sequential 
> patterns, then use the sequential rules in transform as FPGrowthModel does.  
> The rules extracted are of the form X -> Y where X and Y are sequential 
> patterns. But in practice, these rules are not very good as they are too 
> precise and thus not noise tolerant.
> #  Unlike association rules and frequent itemsets, sequential rules 
> can be extracted from the original dataset more efficiently using algorithms 
> like RuleGrowth and ERMiner. The rules are of the form X -> Y where X is 
> unordered and Y is unordered, but X must appear before Y, which is more 
> general and can work better in practice for prediction. 
> I'd like to hear more from users to see which kind of sequential rules 
> is more practical. 
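
To make the second kind of rule concrete, here is a minimal sketch in plain Python of how support and confidence for a rule X -> Y (X unordered, Y unordered, X before Y) could be counted. It is not RuleGrowth or ERMiner, it simplifies each sequence to a list of single items rather than itemsets, and all function names are hypothetical.

```python
from typing import Iterable, List, Optional, Sequence, Tuple


def first_complete(seq: Sequence[str], items: Iterable[str]) -> Optional[int]:
    """Index at which every item in `items` has been seen, or None."""
    remaining = set(items)
    for i, event in enumerate(seq):
        remaining.discard(event)
        if not remaining:
            return i
    return None


def matches(seq: Sequence[str], x: Iterable[str], y: Iterable[str]) -> bool:
    """True if all of X occurs (in any order) before all of Y (in any order)."""
    p = first_complete(seq, x)
    if p is None:
        return False
    return first_complete(seq[p + 1:], y) is not None


def rule_stats(seqs: List[Sequence[str]], x: Iterable[str],
               y: Iterable[str]) -> Tuple[float, float]:
    """Return (support, confidence) of the sequential rule X -> Y."""
    x, y = set(x), set(y)
    n_x = sum(first_complete(s, x) is not None for s in seqs)
    n_xy = sum(matches(s, x, y) for s in seqs)
    support = n_xy / len(seqs)
    confidence = n_xy / n_x if n_x else 0.0
    return support, confidence
```

Because the order inside X and inside Y is ignored, the sequences a, b, c, d and b, a, d, c both count towards {a, b} -> {c, d}, which is what makes these rules more noise tolerant than rules built from strict sequential patterns.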






[jira] [Updated] (SPARK-20114) spark.ml parity for sequential pattern mining - PrefixSpan

2018-04-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20114:
--
Shepherd: Joseph K. Bradley

> spark.ml parity for sequential pattern mining - PrefixSpan
> --
>
> Key: SPARK-20114
> URL: https://issues.apache.org/jira/browse/SPARK-20114
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Priority: Major
>
> Creating this JIRA to track feature parity for PrefixSpan and sequential 
> pattern mining in spark.ml with the DataFrame API. 
> First, a few design issues to be discussed are listed; subtasks for the Scala, 
> Python and R APIs will be created afterwards.
> # Wrapping the MLlib PrefixSpan and providing a generic fit() should be 
> straightforward. Yet PrefixSpan only extracts frequent sequential patterns, 
> which are not well suited to predicting on new records directly. Please 
> read  
> http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/
>  for some background knowledge. Thanks to Philippe Fournier-Viger for providing 
> insights. If we want to keep using the Estimator/Transformer pattern, the options 
> are:
>  #*  Implement a dummy transform for PrefixSpanModel, which will not add a 
> new column to the input Dataset. The PrefixSpanModel is only used to provide 
> access to the frequent sequential patterns.
>  #*  Add a feature to extract sequential rules from sequential 
> patterns, then use the sequential rules in transform as FPGrowthModel does.  
> The rules extracted are of the form X -> Y where X and Y are sequential 
> patterns. But in practice, these rules are not very good as they are too 
> precise and thus not noise tolerant.
> #  Unlike association rules and frequent itemsets, sequential rules 
> can be extracted from the original dataset more efficiently using algorithms 
> like RuleGrowth and ERMiner. The rules are of the form X -> Y where X is 
> unordered and Y is unordered, but X must appear before Y, which is more 
> general and can work better in practice for prediction. 
> I'd like to hear more from users to see which kind of sequential rules 
> is more practical. 






[jira] [Updated] (SPARK-20114) spark.ml parity for sequential pattern mining - PrefixSpan

2018-04-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-20114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-20114:
--
Target Version/s: 2.4.0

> spark.ml parity for sequential pattern mining - PrefixSpan
> --
>
> Key: SPARK-20114
> URL: https://issues.apache.org/jira/browse/SPARK-20114
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: yuhao yang
>Priority: Major
>
> Creating this JIRA to track feature parity for PrefixSpan and sequential 
> pattern mining in spark.ml with the DataFrame API. 
> First, a few design issues to be discussed are listed; subtasks for the Scala, 
> Python and R APIs will be created afterwards.
> # Wrapping the MLlib PrefixSpan and providing a generic fit() should be 
> straightforward. Yet PrefixSpan only extracts frequent sequential patterns, 
> which are not well suited to predicting on new records directly. Please 
> read  
> http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/
>  for some background knowledge. Thanks to Philippe Fournier-Viger for providing 
> insights. If we want to keep using the Estimator/Transformer pattern, the options 
> are:
>  #*  Implement a dummy transform for PrefixSpanModel, which will not add a 
> new column to the input Dataset. The PrefixSpanModel is only used to provide 
> access to the frequent sequential patterns.
>  #*  Add a feature to extract sequential rules from sequential 
> patterns, then use the sequential rules in transform as FPGrowthModel does.  
> The rules extracted are of the form X -> Y where X and Y are sequential 
> patterns. But in practice, these rules are not very good as they are too 
> precise and thus not noise tolerant.
> #  Unlike association rules and frequent itemsets, sequential rules 
> can be extracted from the original dataset more efficiently using algorithms 
> like RuleGrowth and ERMiner. The rules are of the form X -> Y where X is 
> unordered and Y is unordered, but X must appear before Y, which is more 
> general and can work better in practice for prediction. 
> I'd like to hear more from users to see which kind of sequential rules 
> is more practical. 






[jira] [Resolved] (SPARK-23990) Instruments logging improvements - ML regression package

2018-04-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-23990.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21078
[https://github.com/apache/spark/pull/21078]

> Instruments logging improvements - ML regression package
> 
>
> Key: SPARK-23990
> URL: https://issues.apache.org/jira/browse/SPARK-23990
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
> Environment: Instruments logging improvements - ML regression package
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 2.4.0
>
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>







[jira] [Resolved] (SPARK-23455) Default Params in ML should be saved separately

2018-04-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-23455.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20633
[https://github.com/apache/spark/pull/20633]

> Default Params in ML should be saved separately
> ---
>
> Key: SPARK-23455
> URL: https://issues.apache.org/jira/browse/SPARK-23455
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.4.0
>
>
> We save ML's user-supplied params and default params as one entity in JSON. 
> When loading saved models, we set all the loaded params on the created ML 
> model instances as user-supplied params.
> This causes some problems, e.g., if we strictly disallow some params from 
> being set at the same time, a default param can fail the param check because 
> it is treated as a user-supplied param after loading.
> The loaded default params should not be set as user-supplied params. We 
> should save ML default params separately in JSON.
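
A minimal sketch of the proposed behavior in plain Python, not the Spark reader/writer code: user-supplied and default params go into separate JSON maps, so a default never comes back as user-supplied after loading. The `Params` class and the JSON key names `paramMap` and `defaultParamMap` are used here for illustration only.

```python
import json
from typing import Any, Dict


class Params:
    """Toy param holder that tracks defaults and user-set values separately."""

    def __init__(self, defaults: Dict[str, Any]) -> None:
        self._defaults = dict(defaults)
        self._user: Dict[str, Any] = {}

    def set(self, name: str, value: Any) -> None:
        self._user[name] = value

    def is_set(self, name: str) -> bool:
        # True only for params the user set explicitly.
        return name in self._user

    def get(self, name: str) -> Any:
        if name in self._user:
            return self._user[name]
        return self._defaults[name]


def save(params: Params) -> str:
    """Serialize the two kinds of params under separate keys."""
    return json.dumps({"paramMap": params._user,
                       "defaultParamMap": params._defaults}, sort_keys=True)


def load(text: str) -> Params:
    """Restore defaults as defaults and user-supplied values as user-supplied."""
    meta = json.loads(text)
    params = Params(meta["defaultParamMap"])
    for name, value in meta["paramMap"].items():
        params.set(name, value)
    return params
```

After a save/load round trip, `is_set` is still false for every param that was only a default, so a check that forbids two params from being set together sees the same state as before saving.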






[jira] [Commented] (SPARK-23975) Allow Clustering to take Arrays of Double as input features

2018-04-24 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16450161#comment-16450161
 ] 

Joseph K. Bradley commented on SPARK-23975:
---

I merged https://github.com/apache/spark/pull/21081 for KMeans, and [~lu.DB] 
will follow up for the other algs.

> Allow Clustering to take Arrays of Double as input features
> ---
>
> Key: SPARK-23975
> URL: https://issues.apache.org/jira/browse/SPARK-23975
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Lu Wang
>Priority: Major
>
> Clustering algorithms should accept Arrays in addition to Vectors as input 
> features. The Python interface should also be changed accordingly, which 
> would make PySpark a lot easier to use. 






[jira] [Assigned] (SPARK-23975) Allow Clustering to take Arrays of Double as input features

2018-04-24 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-23975:
-

Assignee: Lu Wang

> Allow Clustering to take Arrays of Double as input features
> ---
>
> Key: SPARK-23975
> URL: https://issues.apache.org/jira/browse/SPARK-23975
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Lu Wang
>Assignee: Lu Wang
>Priority: Major
>
> Clustering algorithms should accept Arrays in addition to Vectors as input 
> features. The Python interface should also be changed accordingly, which 
> would make PySpark a lot easier to use. 






[jira] [Updated] (SPARK-23455) Default Params in ML should be saved separately

2018-04-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-23455:
--
Target Version/s: 2.4.0

> Default Params in ML should be saved separately
> ---
>
> Key: SPARK-23455
> URL: https://issues.apache.org/jira/browse/SPARK-23455
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
>
> We save ML's user-supplied params and default params as one entity in JSON. 
> When loading saved models, we set all the loaded params on the created ML 
> model instances as user-supplied params.
> This causes some problems, e.g., if we strictly disallow some params from 
> being set at the same time, a default param can fail the param check because 
> it is treated as a user-supplied param after loading.
> The loaded default params should not be set as user-supplied params. We 
> should save ML default params separately in JSON.






[jira] [Assigned] (SPARK-23455) Default Params in ML should be saved separately

2018-04-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-23455:
-

Assignee: Liang-Chi Hsieh

> Default Params in ML should be saved separately
> ---
>
> Key: SPARK-23455
> URL: https://issues.apache.org/jira/browse/SPARK-23455
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
>
> We save ML's user-supplied params and default params as one entity in JSON. 
> When loading saved models, we set all the loaded params on the created ML 
> model instances as user-supplied params.
> This causes some problems, e.g., if we strictly disallow some params from 
> being set at the same time, a default param can fail the param check because 
> it is treated as a user-supplied param after loading.
> The loaded default params should not be set as user-supplied params. We 
> should save ML default params separately in JSON.






[jira] [Commented] (SPARK-24058) Default Params in ML should be saved separately: Python API

2018-04-23 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448994#comment-16448994
 ] 

Joseph K. Bradley commented on SPARK-24058:
---

CCing [~viirya] since you're the natural one to take this.  Thanks!

> Default Params in ML should be saved separately: Python API
> ---
>
> Key: SPARK-24058
> URL: https://issues.apache.org/jira/browse/SPARK-24058
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> See [SPARK-23455] for reference.  Since DefaultParamsReader has been changed 
> in Scala, we must change it for Python for Spark 2.4.0 as well in order to 
> keep the 2 in sync.






[jira] [Created] (SPARK-24058) Default Params in ML should be saved separately: Python API

2018-04-23 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-24058:
-

 Summary: Default Params in ML should be saved separately: Python 
API
 Key: SPARK-24058
 URL: https://issues.apache.org/jira/browse/SPARK-24058
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 2.4.0
Reporter: Joseph K. Bradley


See [SPARK-23455] for reference.  Since DefaultParamsReader has been changed in 
Scala, we must change it for Python for Spark 2.4.0 as well in order to keep 
the 2 in sync.






[jira] [Updated] (SPARK-23990) Instruments logging improvements - ML regression package

2018-04-19 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-23990:
--
Shepherd: Joseph K. Bradley

> Instruments logging improvements - ML regression package
> 
>
> Key: SPARK-23990
> URL: https://issues.apache.org/jira/browse/SPARK-23990
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
> Environment: Instruments logging improvements - ML regression package
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>







[jira] [Assigned] (SPARK-23990) Instruments logging improvements - ML regression package

2018-04-19 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-23990:
-

Assignee: Weichen Xu

> Instruments logging improvements - ML regression package
> 
>
> Key: SPARK-23990
> URL: https://issues.apache.org/jira/browse/SPARK-23990
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
> Environment: Instruments logging improvements - ML regression package
>Reporter: Weichen Xu
>Assignee: Weichen Xu
>Priority: Major
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>







[jira] [Resolved] (SPARK-24026) spark.ml Scala/Java API for PIC

2018-04-19 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-24026.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21090
[https://github.com/apache/spark/pull/21090]

> spark.ml Scala/Java API for PIC
> ---
>
> Key: SPARK-24026
> URL: https://issues.apache.org/jira/browse/SPARK-24026
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Miao Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> See parent JIRA






[jira] [Created] (SPARK-24026) spark.ml Scala/Java API for PIC

2018-04-19 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-24026:
-

 Summary: spark.ml Scala/Java API for PIC
 Key: SPARK-24026
 URL: https://issues.apache.org/jira/browse/SPARK-24026
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 2.3.0
Reporter: Joseph K. Bradley
Assignee: Miao Wang


See parent JIRA






[jira] [Commented] (SPARK-18693) BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator should use sample weight data

2018-04-17 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441713#comment-16441713
 ] 

Joseph K. Bradley commented on SPARK-18693:
---

[~imatiach] Would you mind creating JIRA subtasks so that we have 1 PR per 
JIRA?  That helps with tracking.  Thanks!

> BinaryClassificationEvaluator, RegressionEvaluator, and 
> MulticlassClassificationEvaluator should use sample weight data
> ---
>
> Key: SPARK-18693
> URL: https://issues.apache.org/jira/browse/SPARK-18693
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.0.2
>Reporter: Devesh Parekh
>Priority: Major
>
> The LogisticRegression and LinearRegression models support training with a 
> weight column, but the corresponding evaluators do not support computing 
> metrics using those weights. This breaks model selection using CrossValidator.
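
A minimal sketch in plain Python of why this matters for model selection: the same predictions can score very differently once sample weights are taken into account. The function is a generic weighted RMSE, not the RegressionEvaluator implementation, and its name and signature are assumptions.

```python
import math
from typing import Optional, Sequence


def weighted_rmse(labels: Sequence[float], preds: Sequence[float],
                  weights: Optional[Sequence[float]] = None) -> float:
    """Root mean squared error where each row counts in proportion to its weight.

    With weights=None every row gets weight 1.0, which is the unweighted metric.
    """
    if weights is None:
        weights = [1.0] * len(labels)
    total = sum(weights)
    squared = sum(w * (y - p) ** 2 for y, p, w in zip(labels, preds, weights))
    return math.sqrt(squared / total)
```

If a model was trained with a weight column that gives some rows zero weight, an unweighted evaluator still penalizes errors on those rows, so CrossValidator can end up choosing a model the weighted training objective would not prefer.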






[jira] [Commented] (SPARK-23990) Instruments logging improvements - ML regression package

2018-04-17 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441701#comment-16441701
 ] 

Joseph K. Bradley commented on SPARK-23990:
---

A complication was brought up by this PR: Some logging occurs in classes which 
are not Estimators (WeightedLeastSquares, IterativelyReweightedLeastSquares) 
and in static objects (RandomForest, GradientBoostedTrees).  These may have an 
Instrumentation instance available (when used from an Estimator) or may not 
(when used in a unit test).  Options include:
1. Make these require Instrumentation instances.  This would require slightly 
awkward changes to unit tests.
2. Create something similar to Instrumentation or Logging which can store an 
Optional Instrumentation instance.  If the Instrumentation is available, it can 
log via that; otherwise, it can call into regular Logging.
2a. This could be a trait like Logging.  This is nice in that it requires fewer 
changes to existing logging code.
2b. This could be a class like Instrumentation.  This is nice in that it 
standardizes all of MLlib around Instrumentation instead of Logging.

I'd vote for 2b to standardize what we do in MLlib.  Thoughts?
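
For concreteness, a minimal sketch of option 2b in plain Python rather than Scala: a small class that holds an optional Instrumentation instance and falls back to regular logging when none is available. All class and function names here are hypothetical stand-ins, not the MLlib API.

```python
import logging
from typing import List, Optional


class Instrumentation:
    """Toy stand-in for an Estimator-scoped instrumentation object."""

    def __init__(self, prefix: str) -> None:
        self.prefix = prefix
        self.messages: List[str] = []

    def log_info(self, msg: str) -> None:
        self.messages.append(f"[{self.prefix}] {msg}")


class OptionalInstrumentation:
    """Logs via Instrumentation when present, otherwise via regular logging."""

    def __init__(self, instr: Optional[Instrumentation] = None) -> None:
        self._instr = instr
        self._logger = logging.getLogger("mllib.fallback")

    def log_info(self, msg: str) -> None:
        if self._instr is not None:
            self._instr.log_info(msg)
        else:
            self._logger.info(msg)


def solve_least_squares(n_rows: int, instr: OptionalInstrumentation) -> None:
    """Hypothetical non-Estimator helper that only needs something to log to."""
    instr.log_info(f"solving with {n_rows} rows")
```

An Estimator would pass `OptionalInstrumentation(instr)`, while a unit test could pass `OptionalInstrumentation()` and avoid the awkward test changes mentioned under option 1.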

> Instruments logging improvements - ML regression package
> 
>
> Key: SPARK-23990
> URL: https://issues.apache.org/jira/browse/SPARK-23990
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
> Environment: Instruments logging improvements - ML regression package
>Reporter: Weichen Xu
>Priority: Major
>   Original Estimate: 120h
>  Remaining Estimate: 120h
>







[jira] [Updated] (SPARK-22884) ML test for StructuredStreaming: spark.ml.clustering

2018-04-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-22884:
--
Shepherd: Joseph K. Bradley

> ML test for StructuredStreaming: spark.ml.clustering
> 
>
> Key: SPARK-22884
> URL: https://issues.apache.org/jira/browse/SPARK-22884
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843






[jira] [Updated] (SPARK-8799) OneVsRestModel should extend ClassificationModel

2018-04-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-8799:
-
Shepherd: Joseph K. Bradley

> OneVsRestModel should extend ClassificationModel
> 
>
> Key: SPARK-8799
> URL: https://issues.apache.org/jira/browse/SPARK-8799
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Feynman Liang
>Priority: Minor
>
> Many parts of `OneVsRestModel` can be generalized to `ClassificationModel`. 
> For example:
>  * `accColName` can be used to populate `ClassificationModel#predictRaw` and 
> share implementations of `transform`
>  * SPARK-8092 adds `setFeaturesCol` and `setPredictionCol` which could be 
> gotten for free through subclassing
> `ClassificationModel` is the correct supertype (e.g. not `PredictionModel`) 
> because the labels for a `OneVsRest` will always be discrete and finite.






[jira] [Commented] (SPARK-8799) OneVsRestModel should extend ClassificationModel

2018-04-17 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441207#comment-16441207
 ] 

Joseph K. Bradley commented on SPARK-8799:
--

The missing functionality was added in [SPARK-9312], but we cannot fix this 
JIRA until 3.0.0 since it will require breaking APIs (changing OneVsRest's 
inheritance structure and supported FeatureTypes).  Let's target this fix for 
3.0.0, for which I'll recommend:
* Rename current OneVsRest to GenericOneVsRest or something like that.  Have it 
inherit from Classifier and take a type parameter for FeaturesType.
* Add a specialization of GenericOneVsRest with fixed FeaturesType = VectorUDT, 
and call this new one OneVsRest.
I _think_ that will avoid breaking most user code (but I have not thought it 
through carefully).

> OneVsRestModel should extend ClassificationModel
> 
>
> Key: SPARK-8799
> URL: https://issues.apache.org/jira/browse/SPARK-8799
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Feynman Liang
>Priority: Minor
>
> Many parts of `OneVsRestModel` can be generalized to `ClassificationModel`. 
> For example:
>  * `accColName` can be used to populate `ClassificationModel#predictRaw` and 
> share implementations of `transform`
>  * SPARK-8092 adds `setFeaturesCol` and `setPredictionCol` which could be 
> gotten for free through subclassing
> `ClassificationModel` is the correct supertype (e.g. not `PredictionModel`) 
> because the labels for a `OneVsRest` will always be discrete and finite.






[jira] [Updated] (SPARK-8799) OneVsRestModel should extend ClassificationModel

2018-04-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-8799:
-
Target Version/s: 3.0.0

> OneVsRestModel should extend ClassificationModel
> 
>
> Key: SPARK-8799
> URL: https://issues.apache.org/jira/browse/SPARK-8799
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Feynman Liang
>Priority: Minor
>
> Many parts of `OneVsRestModel` can be generalized to `ClassificationModel`. 
> For example:
>  * `accColName` can be used to populate `ClassificationModel#predictRaw` and 
> share implementations of `transform`
>  * SPARK-8092 adds `setFeaturesCol` and `setPredictionCol` which could be 
> gotten for free through subclassing
> `ClassificationModel` is the correct supertype (e.g. not `PredictionModel`) 
> because the labels for a `OneVsRest` will always be discrete and finite.






[jira] [Assigned] (SPARK-21741) Python API for DataFrame-based multivariate summarizer

2018-04-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-21741:
-

Assignee: Weichen Xu

> Python API for DataFrame-based multivariate summarizer
> --
>
> Key: SPARK-21741
> URL: https://issues.apache.org/jira/browse/SPARK-21741
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 2.4.0
>
>
> The multivariate summarizer for the DataFrame API was added in SPARK-19634; 
> we should also make PySpark support it. 






[jira] [Resolved] (SPARK-21741) Python API for DataFrame-based multivariate summarizer

2018-04-17 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-21741.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20695
[https://github.com/apache/spark/pull/20695]

> Python API for DataFrame-based multivariate summarizer
> --
>
> Key: SPARK-21741
> URL: https://issues.apache.org/jira/browse/SPARK-21741
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Yanbo Liang
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 2.4.0
>
>
> The multivariate summarizer for the DataFrame API was added in SPARK-19634; 
> PySpark should support it as well. 






[jira] [Updated] (SPARK-23975) Allow Clustering to take Arrays of Double as input features

2018-04-16 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-23975:
--
Shepherd: Joseph K. Bradley

> Allow Clustering to take Arrays of Double as input features
> ---
>
> Key: SPARK-23975
> URL: https://issues.apache.org/jira/browse/SPARK-23975
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Lu Wang
>Priority: Major
>
> Clustering algorithms should accept Arrays in addition to Vectors as input 
> features. The Python interface should be changed as well, which would make 
> PySpark a lot easier to use. 






[jira] [Resolved] (SPARK-21088) CrossValidator, TrainValidationSplit should collect all models when fitting: Python API

2018-04-16 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-21088.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 19627
[https://github.com/apache/spark/pull/19627]

> CrossValidator, TrainValidationSplit should collect all models when fitting: 
> Python API
> ---
>
> Key: SPARK-21088
> URL: https://issues.apache.org/jira/browse/SPARK-21088
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 2.4.0
>
>
> In PySpark:
>  * Add a parameter controlling whether to collect the full model list during 
> CrossValidator/TrainValidationSplit training (off by default, so that the 
> change does not cause OOM)
>  * Add a method to CrossValidatorModel/TrainValidationSplitModel that lets 
> users get the model list
>  * Add an option to CrossValidatorModelWriter that lets users control whether 
> the model list is persisted to disk
> Note: when persisting the model list, use indices as the sub-model path
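
The behaviour described above can be sketched in plain Python (this is not the PySpark implementation). The function `cross_validate`, the flag `collect_sub_models`, and the path layout in `sub_model_path` are hypothetical names chosen for the example; the only points taken from the issue text are that collection is off by default and that sub-models are addressed by index.

```python
def cross_validate(data, param_grid, fit, evaluate, num_folds=3,
                   collect_sub_models=False):
    """Toy k-fold cross-validation that can optionally keep every fitted model."""
    # Toy split: fold i holds every num_folds-th row starting at i.
    folds = [data[i::num_folds] for i in range(num_folds)]
    avg_metrics = [0.0] * len(param_grid)
    # Only allocate the model grid when asked, since it may use a lot of memory.
    sub_models = (
        [[None] * len(param_grid) for _ in range(num_folds)]
        if collect_sub_models else None
    )
    for f in range(num_folds):
        validation = folds[f]
        training = [row for g, fold in enumerate(folds) if g != f for row in fold]
        for p, params in enumerate(param_grid):
            model = fit(training, params)
            avg_metrics[p] += evaluate(model, validation) / num_folds
            if collect_sub_models:
                sub_models[f][p] = model  # indexed by (fold, param index)
    best = max(range(len(param_grid)), key=lambda p: avg_metrics[p])
    best_model = fit(data, param_grid[best])
    return best_model, avg_metrics, sub_models


def sub_model_path(base, fold_index, param_index):
    """Hypothetical on-disk layout that uses indices as the sub-model path."""
    return f"{base}/subModels/fold{fold_index}/{param_index}"


# Toy "model": the training mean plus a shift; the metric is negative mean error.
fit = lambda rows, params: sum(rows) / len(rows) + params["shift"]
evaluate = lambda model, rows: -sum(abs(model - r) for r in rows) / len(rows)
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
grid = [{"shift": 0.0}, {"shift": 10.0}]
```

Keeping the flag off by default means existing callers pay no extra memory cost, which matches the OOM concern in the description.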






[jira] [Assigned] (SPARK-21088) CrossValidator, TrainValidationSplit should collect all models when fitting: Python API

2018-04-16 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-21088:
-

Assignee: Weichen Xu

> CrossValidator, TrainValidationSplit should collect all models when fitting: 
> Python API
> ---
>
> Key: SPARK-21088
> URL: https://issues.apache.org/jira/browse/SPARK-21088
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.2.0
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
>Priority: Major
>
> In PySpark:
>  * Add a parameter controlling whether to collect the full model list during 
> CrossValidator/TrainValidationSplit training (off by default, so that the 
> change does not cause OOM)
>  * Add a method to CrossValidatorModel/TrainValidationSplitModel that lets 
> users get the model list
>  * Add an option to CrossValidatorModelWriter that lets users control whether 
> the model list is persisted to disk
> Note: when persisting the model list, use indices as the sub-model path






[jira] [Updated] (SPARK-9312) The OneVsRest model does not provide rawPrediction

2018-04-16 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-9312:
-
Summary: The OneVsRest model does not provide rawPrediction  (was: The 
OneVsRest model does not provide confidence factor(not probability) along with 
the prediction)

> The OneVsRest model does not provide rawPrediction
> --
>
> Key: SPARK-9312
> URL: https://issues.apache.org/jira/browse/SPARK-9312
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.4.0, 1.4.1
>Reporter: Badari Madhav
>Assignee: Lu Wang
>Priority: Major
>  Labels: features
> Fix For: 2.4.0
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>







[jira] [Assigned] (SPARK-9312) The OneVsRest model does not provide confidence factor(not probability) along with the prediction

2018-04-16 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-9312:


Assignee: Lu Wang

> The OneVsRest model does not provide confidence factor(not probability) along 
> with the prediction
> -
>
> Key: SPARK-9312
> URL: https://issues.apache.org/jira/browse/SPARK-9312
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.4.0, 1.4.1
>Reporter: Badari Madhav
>Assignee: Lu Wang
>Priority: Major
>  Labels: features
> Fix For: 2.4.0
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>







[jira] [Resolved] (SPARK-9312) The OneVsRest model does not provide confidence factor(not probability) along with the prediction

2018-04-16 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-9312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-9312.
--
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21044
[https://github.com/apache/spark/pull/21044]

> The OneVsRest model does not provide confidence factor(not probability) along 
> with the prediction
> -
>
> Key: SPARK-9312
> URL: https://issues.apache.org/jira/browse/SPARK-9312
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, MLlib
>Affects Versions: 1.4.0, 1.4.1
>Reporter: Badari Madhav
>Assignee: Lu Wang
>Priority: Major
>  Labels: features
> Fix For: 2.4.0
>
>   Original Estimate: 72h
>  Remaining Estimate: 72h
>







[jira] [Resolved] (SPARK-22883) ML test for StructuredStreaming: spark.ml.feature, A-M

2018-04-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-22883.
---
   Resolution: Fixed
Fix Version/s: 2.3.1

Issue resolved by pull request 21042
[https://github.com/apache/spark/pull/21042]

> ML test for StructuredStreaming: spark.ml.feature, A-M
> --
>
> Key: SPARK-22883
> URL: https://issues.apache.org/jira/browse/SPARK-22883
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
> Fix For: 2.4.0, 2.3.1
>
>
> *For featurizers with names from A - M*
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843






[jira] [Updated] (SPARK-22883) ML test for StructuredStreaming: spark.ml.feature, A-M

2018-04-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-22883:
--
Fix Version/s: 2.4.0

> ML test for StructuredStreaming: spark.ml.feature, A-M
> --
>
> Key: SPARK-22883
> URL: https://issues.apache.org/jira/browse/SPARK-22883
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
> Fix For: 2.4.0
>
>
> *For featurizers with names from A - M*
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843






[jira] [Updated] (SPARK-22883) ML test for StructuredStreaming: spark.ml.feature, A-M

2018-04-11 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-22883:
--
Target Version/s: 2.3.1, 2.4.0

> ML test for StructuredStreaming: spark.ml.feature, A-M
> --
>
> Key: SPARK-22883
> URL: https://issues.apache.org/jira/browse/SPARK-22883
> Project: Spark
>  Issue Type: Test
>  Components: ML, Tests
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Major
> Fix For: 2.4.0
>
>
> *For featurizers with names from A - M*
> Task for adding Structured Streaming tests for all Models/Transformers in a 
> sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in 
> https://github.com/apache/spark/pull/19843






[jira] [Resolved] (SPARK-19947) RFormulaModel always throws Exception on transforming data with NULL or Unseen labels

2018-04-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-19947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-19947.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

I'll mark this as complete.  Those earlier PRs fixed some issues, and 
[SPARK-23562] should fix the rest.

> RFormulaModel always throws Exception on transforming data with NULL or 
> Unseen labels
> -
>
> Key: SPARK-19947
> URL: https://issues.apache.org/jira/browse/SPARK-19947
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Andrey Yatsuk
>Priority: Major
> Fix For: 2.4.0
>
>
> I have a trained ML model and a large data table in Parquet. I want to add a 
> new column of predicted values to this table. I can't lose any data, but the 
> table may contain null values.
> The RFormula.fit() method creates a new StringIndexer with the default 
> (handleInvalid="error") parameter. VectorAssembler also throws an Exception 
> on NULL values. So I must call df.na.drop() before transforming this 
> DataFrame, and I don't want to do this.
> RFormula needs a new parameter like handleInvalid in StringIndexer.
> Alternatively, add a transform(Seq): Vector method that users can call as a 
> UDF in df.withColumn("predicted", functions.callUDF(rFormulaModel::transform, 
> Seq))
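
For readers unfamiliar with the handleInvalid modes mentioned above, the following plain-Python sketch mimics the documented StringIndexer semantics: "error" raises, "skip" drops the row, and "keep" puts unseen labels in an extra bucket at index numLabels. The function name `index_labels` is hypothetical, and treating null (None) the same as an unseen label is a simplifying assumption of this sketch.

```python
def index_labels(fitted_labels, values, handle_invalid="error"):
    """Map string labels to indices, mimicking the three handleInvalid modes."""
    index = {label: i for i, label in enumerate(fitted_labels)}
    out = []
    for value in values:
        if value in index:
            out.append(index[value])
        elif handle_invalid == "skip":
            continue  # drop the row entirely
        elif handle_invalid == "keep":
            out.append(len(fitted_labels))  # extra bucket at index numLabels
        else:
            raise ValueError(f"Unseen or null label: {value!r}")
    return out
```

With "skip" the output can be shorter than the input, which is exactly the data loss the reporter wants to avoid; "keep" preserves every row.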






[jira] [Resolved] (SPARK-23562) RFormula handleInvalid should handle invalid values in non-string columns.

2018-04-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-23562.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

I think everything has been fixed, so I'll close this.  Thanks [~yogeshgarg] 
and [~huaxingao]!

> RFormula handleInvalid should handle invalid values in non-string columns.
> --
>
> Key: SPARK-23562
> URL: https://issues.apache.org/jira/browse/SPARK-23562
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently when handleInvalid is set to 'keep' or 'skip' this only applies to 
> String fields. Numeric fields that are null will either cause the transformer 
> to fail or might be null in the resulting label column.
> I'm not sure what the semantics of keep might be for numeric columns with 
> null values, but we should be able to at least support skip for these types.
> --> Discussed offline: null values can be converted to NaN values for "keep"
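
The semantics discussed above for numeric columns can be sketched in plain Python (not the Spark implementation): "skip" drops rows whose value is null, and "keep" converts null to NaN. The helper name `handle_numeric_nulls` and the dict-based row representation are assumptions made for the example.

```python
import math


def handle_numeric_nulls(rows, column, handle_invalid="error"):
    """Apply skip/keep/error handling to null values in one numeric column."""
    out = []
    for row in rows:
        if row[column] is not None:
            out.append(dict(row))
        elif handle_invalid == "skip":
            continue  # drop rows with a null in this column
        elif handle_invalid == "keep":
            out.append({**row, column: float("nan")})  # null becomes NaN
        else:
            raise ValueError(f"Null value in numeric column {column!r}")
    return out
```

NaN is a natural choice for "keep" because it stays inside the numeric type, so downstream vector assembly does not need a separate null representation.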






[jira] [Updated] (SPARK-23562) RFormula handleInvalid should handle invalid values in non-string columns.

2018-04-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-23562:
--
Shepherd: Joseph K. Bradley

> RFormula handleInvalid should handle invalid values in non-string columns.
> --
>
> Key: SPARK-23562
> URL: https://issues.apache.org/jira/browse/SPARK-23562
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently when handleInvalid is set to 'keep' or 'skip' this only applies to 
> String fields. Numeric fields that are null will either cause the transformer 
> to fail or might be null in the resulting label column.
> I'm not sure what the semantics of keep might be for numeric columns with 
> null values, but we should be able to at least support skip for these types.
> --> Discussed offline: null values can be converted to NaN values for "keep"






[jira] [Resolved] (SPARK-23944) Add Param set functions to LSHModel types

2018-04-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-23944.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21015
[https://github.com/apache/spark/pull/21015]

> Add Param set functions to LSHModel types
> -
>
> Key: SPARK-23944
> URL: https://issues.apache.org/jira/browse/SPARK-23944
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.3.0
>Reporter: Lu Wang
>Assignee: Lu Wang
>Priority: Major
> Fix For: 2.4.0
>
>
> Two Param setter methods (setInputCol, setOutputCol) are added to the two 
> LSHModel types, for MinHash and random projections.






[jira] [Assigned] (SPARK-23944) Add Param set functions to LSHModel types

2018-04-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-23944:
-

Assignee: Lu Wang

> Add Param set functions to LSHModel types
> -
>
> Key: SPARK-23944
> URL: https://issues.apache.org/jira/browse/SPARK-23944
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.3.0
>Reporter: Lu Wang
>Assignee: Lu Wang
>Priority: Major
>
> Two Param setter methods (setInputCol, setOutputCol) are added to the two 
> LSHModel types, for MinHash and random projections.






[jira] [Resolved] (SPARK-23871) add python api for VectorAssembler handleInvalid

2018-04-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-23871.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21003
[https://github.com/apache/spark/pull/21003]

> add python api for VectorAssembler handleInvalid
> 
>
> Key: SPARK-23871
> URL: https://issues.apache.org/jira/browse/SPARK-23871
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: yogesh garg
>Assignee: Huaxin Gao
>Priority: Minor
> Fix For: 2.4.0
>
>







[jira] [Updated] (SPARK-23871) add python api for VectorAssembler handleInvalid

2018-04-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-23871:
--
Shepherd: Joseph K. Bradley

> add python api for VectorAssembler handleInvalid
> 
>
> Key: SPARK-23871
> URL: https://issues.apache.org/jira/browse/SPARK-23871
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: yogesh garg
>Assignee: Huaxin Gao
>Priority: Minor
>







[jira] [Assigned] (SPARK-23871) add python api for VectorAssembler handleInvalid

2018-04-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-23871:
-

Assignee: Huaxin Gao

> add python api for VectorAssembler handleInvalid
> 
>
> Key: SPARK-23871
> URL: https://issues.apache.org/jira/browse/SPARK-23871
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: yogesh garg
>Assignee: Huaxin Gao
>Priority: Minor
>







[jira] [Updated] (SPARK-21856) Update Python API for MultilayerPerceptronClassifierModel

2018-04-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-21856:
--
Fix Version/s: 2.3.0

> Update Python API for MultilayerPerceptronClassifierModel
> -
>
> Key: SPARK-21856
> URL: https://issues.apache.org/jira/browse/SPARK-21856
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.3.0
>Reporter: Weichen Xu
>Assignee: Chunsheng Ji
>Priority: Minor
> Fix For: 2.3.0
>
>
> SPARK-12664 has exposed probability in MultilayerPerceptronClassifier, so 
> the Python API also needs an update.






[jira] [Assigned] (SPARK-23751) Kolmogorov-Smirnoff test Python API in pyspark.ml

2018-04-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-23751:
-

Assignee: Weichen Xu

> Kolmogorov-Smirnoff test Python API in pyspark.ml
> -
>
> Key: SPARK-23751
> URL: https://issues.apache.org/jira/browse/SPARK-23751
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 2.4.0
>
>
> Python wrapper for new DataFrame-based API for KS test






[jira] [Resolved] (SPARK-23751) Kolmogorov-Smirnoff test Python API in pyspark.ml

2018-04-10 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-23751.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20904
[https://github.com/apache/spark/pull/20904]

> Kolmogorov-Smirnoff test Python API in pyspark.ml
> -
>
> Key: SPARK-23751
> URL: https://issues.apache.org/jira/browse/SPARK-23751
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 2.4.0
>
>
> Python wrapper for new DataFrame-based API for KS test






[jira] [Updated] (SPARK-23944) Add Param set functions to LSHModel types

2018-04-09 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-23944:
--
Fix Version/s: (was: 2.4.0)

> Add Param set functions to LSHModel types
> -
>
> Key: SPARK-23944
> URL: https://issues.apache.org/jira/browse/SPARK-23944
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib
>Affects Versions: 2.3.0
>Reporter: Lu Wang
>Priority: Major
>
> Two Param setter methods (setInputCol, setOutputCol) are added to the two 
> LSHModel types, for MinHash and random projections.






[jira] [Resolved] (SPARK-14681) Provide label/impurity stats for spark.ml decision tree nodes

2018-04-09 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-14681.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20786
[https://github.com/apache/spark/pull/20786]

> Provide label/impurity stats for spark.ml decision tree nodes
> -
>
> Key: SPARK-14681
> URL: https://issues.apache.org/jira/browse/SPARK-14681
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 2.4.0
>
>
> Currently, spark.ml decision trees provide all node info except for the 
> aggregated stats about labels and impurities.  This task is to provide those 
> publicly.  We need to choose a good API for it, so we should discuss the 
> design on this issue before implementing it.






[jira] [Assigned] (SPARK-14681) Provide label/impurity stats for spark.ml decision tree nodes

2018-04-09 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-14681:
-

Assignee: Weichen Xu

> Provide label/impurity stats for spark.ml decision tree nodes
> -
>
> Key: SPARK-14681
> URL: https://issues.apache.org/jira/browse/SPARK-14681
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
>Priority: Major
>
> Currently, spark.ml decision trees provide all node info except for the 
> aggregated stats about labels and impurities.  This task is to provide those 
> publicly.  We need to choose a good API for it, so we should discuss the 
> design on this issue before implementing it.






[jira] [Commented] (SPARK-21005) VectorIndexerModel does not prepare output column field correctly

2018-04-09 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431079#comment-16431079
 ] 

Joseph K. Bradley commented on SPARK-21005:
---

I don't actually see why this is a problem: If a feature is categorical, we 
should not silently convert it to continuous.  To use a high-arity categorical 
feature in a decision tree, one should convert it to a different representation 
first, such as hashing to a set of bins with HashingTF.

That said, I do think we should clarify this behavior in the VectorIndexer 
docstring.  I know it's been a long time since you sent your PR, but would you 
want to update it to simply update the docs?  If you're busy now, I'd be happy 
to take it over though.  Thanks!
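
To illustrate the suggestion of hashing a high-arity categorical feature into a fixed set of bins, here is a minimal plain-Python sketch. It uses MD5 from the standard library purely as a deterministic stand-in; it is not the hash function HashingTF uses, and the name `hash_to_bin` is hypothetical.

```python
import hashlib


def hash_to_bin(category: str, num_bins: int) -> int:
    """Map an arbitrary category string to one of num_bins buckets."""
    digest = hashlib.md5(category.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce modulo num_bins.
    return int.from_bytes(digest[:8], "big") % num_bins


# A feature with 1000 distinct values collapses to at most 32 bins.
bins = [hash_to_bin(f"user_{i}", 32) for i in range(1000)]
```

The trade-off is that distinct categories can collide in the same bin, but the number of values seen by the tree is bounded by the number of bins.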

> VectorIndexerModel does not prepare output column field correctly
> -
>
> Key: SPARK-21005
> URL: https://issues.apache.org/jira/browse/SPARK-21005
> Project: Spark
>  Issue Type: Bug
>  Components: MLlib
>Affects Versions: 2.1.1
>Reporter: Chen Lin
>Priority: Major
>
> From my understanding of the documentation, VectorIndexer 
> decides which features should be categorical based on the number of distinct 
> values: features with at most maxCategories distinct values are declared 
> categorical, while features that exceed maxCategories are declared continuous. 
> Currently, VectorIndexerModel works all right with a dataset which has an 
> empty schema. However, when VectorIndexerModel transforms a dataset with 
> `ML_ATTR` metadata, it may not output the expected result. For example, a 
> feature with a nominal attribute whose number of distinct values exceeds 
> maxCategories will not be treated as a continuous feature, as we expected, 
> but still as a categorical feature. Thus, it may cause all the tree-based 
> algorithms (like Decision Tree, Random Forest, GBDT, etc.) to throw errors 
> such as "DecisionTree 
> requires maxBins (= $maxPossibleBins) to be at least as large as the number 
> of values in each categorical feature, but categorical feature $maxCategory 
> has $maxCategoriesPerFeature values. Considering remove this and other 
> categorical features with a large number of values, or add more training 
> examples.".
> Correct me if my understanding is wrong.
> I will submit a PR soon to resolve this issue.
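
The decision rule the reporter describes (a feature is categorical when it has at most maxCategories distinct values, continuous otherwise) can be sketched in plain Python as follows. This is a simplified illustration, not the VectorIndexer implementation; the function name `categorical_feature_indices` is hypothetical and it ignores details such as existing column metadata, which is the subject of this issue.

```python
def categorical_feature_indices(rows, max_categories):
    """Return {feature index: {value: category index}} for categorical features.

    Features with more than max_categories distinct values are left out,
    meaning they are treated as continuous.
    """
    num_features = len(rows[0])
    result = {}
    for j in range(num_features):
        distinct = sorted({row[j] for row in rows})
        if len(distinct) <= max_categories:
            result[j] = {value: idx for idx, value in enumerate(distinct)}
    return result


rows = [[0.0, 1.5], [1.0, 2.5], [0.0, 3.5], [1.0, 4.5]]
```

In this toy data, feature 0 has two distinct values and feature 1 has four, so with `max_categories=2` only feature 0 is indexed as categorical.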






[jira] [Commented] (SPARK-18092) add type cast to avoid error "Column prediction must be of type DoubleType but was actually FloatType"

2018-04-06 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16429134#comment-16429134
 ] 

Joseph K. Bradley commented on SPARK-18092:
---

Can you please add a description and make the title more specific?  Currently, 
it's unclear what this JIRA is addressing without looking at the code.  Thanks!

> add type cast to avoid error "Column prediction must be of type DoubleType 
> but was actually FloatType"
> --
>
> Key: SPARK-18092
> URL: https://issues.apache.org/jira/browse/SPARK-18092
> Project: Spark
>  Issue Type: Bug
>Reporter: albert fang
>Priority: Major
>







[jira] [Updated] (SPARK-23751) Kolmogorov-Smirnoff test Python API in pyspark.ml

2018-04-06 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-23751:
--
Shepherd: Joseph K. Bradley

> Kolmogorov-Smirnoff test Python API in pyspark.ml
> -
>
> Key: SPARK-23751
> URL: https://issues.apache.org/jira/browse/SPARK-23751
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> Python wrapper for new DataFrame-based API for KS test






[jira] [Resolved] (SPARK-23859) Initial PR for Instrumentation improvements: UUID and logging levels

2018-04-06 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-23859.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Resolved with https://github.com/apache/spark/pull/20982

> Initial PR for Instrumentation improvements: UUID and logging levels
> 
>
> Key: SPARK-23859
> URL: https://issues.apache.org/jira/browse/SPARK-23859
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Joseph K. Bradley
>Assignee: Weichen Xu
>Priority: Major
> Fix For: 2.4.0
>
>
> This is a subtask for an initial PR to improve MLlib's Instrumentation class 
> for logging.  It will address a couple of issues and use the changes in 
> LogisticRegression as an example class.
> Issues:
> * The UUID is currently generated from an atomic integer.  This is a problem 
> since the integer is reset whenever a persisted Estimator is loaded on a new 
> cluster.  We should just use a random UUID to get a new UUID each time with 
> high probability.
> * We use both Instrumentation and Logging to log stuff.  Let's standardize 
> around Instrumentation in MLlib since it can associate logs with the 
> Estimator or Transformer which produced the logs (via a prefix with the 
> algorithm's name or UUID).
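
The first issue above can be illustrated with a small sketch. This is not
MLlib's Instrumentation code; the names CounterIds and random_id are
hypothetical and only show why a process-local counter can repeat identifiers
across clusters while a random UUID almost never does.

```python
import itertools
import uuid


class CounterIds:
    """Hypothetical ID source backed by a process-local counter."""

    def __init__(self):
        # The counter starts from zero in every new process.
        self._counter = itertools.count()

    def next_id(self, name):
        return f"{name}_{next(self._counter)}"


def random_id(name):
    """Hypothetical ID source backed by a random (version 4) UUID."""
    return f"{name}_{uuid.uuid4().hex}"


# Two independent counters stand in for two clusters (or two restarts).
cluster_a = CounterIds()
cluster_b = CounterIds()

# Both clusters hand out the same first identifier.
collision = cluster_a.next_id("logreg") == cluster_b.next_id("logreg")

# Random UUIDs need no shared state and are distinct with high probability.
distinct = random_id("logreg") != random_id("logreg")
```

Here `collision` is true because each counter restarts from zero, which is the
reset behaviour described in the issue.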






[jira] [Commented] (SPARK-23686) Make better usage of org.apache.spark.ml.util.Instrumentation

2018-04-06 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16428612#comment-16428612
 ] 

Joseph K. Bradley commented on SPARK-23686:
---

I wanted to ping some other active MLlib committers since this will change 
logging in MLlib.  The main change will be to prefix logged messages with a 
string that includes a unique identifier for the algorithm.  That will make it 
easier to associate log messages with Pipeline stages; this is hard right now, 
e.g., if there are multiple StringIndexers in the same Pipeline.fit() call.
CC [~mlnick], [~holdenk], [~dbtsai], [~yanboliang], [~sethah]
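
A minimal sketch of the prefixing idea, written with Python's standard logging
module rather than Spark's own classes. StagePrefixAdapter and stage_logger are
hypothetical names, and the prefix format is an assumption made for
illustration only.

```python
import io
import logging
import uuid


class StagePrefixAdapter(logging.LoggerAdapter):
    """Hypothetical adapter that prefixes each message with a stage identifier."""

    def process(self, msg, kwargs):
        return f"[{self.extra['stage']}] {msg}", kwargs


def stage_logger(base, algorithm):
    # Algorithm name plus a short random suffix identifies one stage instance.
    stage = f"{algorithm}-{uuid.uuid4().hex[:8]}"
    return StagePrefixAdapter(base, {"stage": stage})


# Capture the output in memory so the example is self-contained.
buffer = io.StringIO()
base = logging.getLogger("pipeline_demo")
base.setLevel(logging.INFO)
base.propagate = False
base.addHandler(logging.StreamHandler(buffer))

# Two stages of the same type, as in a Pipeline with two StringIndexers.
first = stage_logger(base, "StringIndexer")
second = stage_logger(base, "StringIndexer")
first.info("fitting on column 'country'")
second.info("fitting on column 'device'")

lines = buffer.getvalue().splitlines()
```

Each entry in `lines` starts with a different bracketed prefix, so the two
messages can be told apart even though both stages share an algorithm name.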

> Make better usage of org.apache.spark.ml.util.Instrumentation
> -
>
> Key: SPARK-23686
> URL: https://issues.apache.org/jira/browse/SPARK-23686
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Priority: Major
>
> This Jira is a bit high level and might require subtasks or other jiras for 
> more specific tasks.
> I've noticed that we don't make the best use of the instrumentation class. 
> Specifically, we sometimes bypass the instrumentation class and use the 
> debugger instead. For example, 
> [https://github.com/apache/spark/blob/9b9827759af2ca3eea146a6032f9165f640ce152/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L143]
> Also there are some things that might be useful to log in the instrumentation 
> class that we currently don't. For example:
> * number of training examples
> * mean/var of label (regression)
> I know computing these things can be expensive in some cases, but especially 
> when this data is already available we can log it for free. For example, 
> Logistic Regression Summarizer computes some useful data including numRows 
> that we don't log.
>  
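
As a rough sketch of the "log it for free" point, the statistics listed above
can be accumulated in the same pass that already reads the labels. LabelSummary
and summarize are hypothetical names and not part of Spark; the update rule is
Welford's single-pass method, chosen here for illustration.

```python
class LabelSummary:
    """Hypothetical single-pass accumulator for count, mean and variance."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def add(self, label):
        # Welford's update: no second pass over the data is needed.
        self.count += 1
        delta = label - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (label - self.mean)

    @property
    def variance(self):
        # Sample variance; defined as 0.0 when fewer than two labels were seen.
        return self._m2 / (self.count - 1) if self.count > 1 else 0.0


def summarize(labels):
    """Return the values that could be handed to a logger after training."""
    summary = LabelSummary()
    for label in labels:
        summary.add(label)
    return {
        "numExamples": summary.count,
        "labelMean": summary.mean,
        "labelVariance": summary.variance,
    }


stats = summarize([1.0, 2.0, 3.0, 4.0])
```

The resulting dictionary holds the number of examples and the label mean and
variance, ready to be written out alongside the other training logs.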






[jira] [Resolved] (SPARK-23870) Forward RFormula handleInvalid Param to VectorAssembler

2018-04-05 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-23870.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

Resolved via https://github.com/apache/spark/pull/20970

>  Forward RFormula handleInvalid Param to VectorAssembler
> 
>
> Key: SPARK-23870
> URL: https://issues.apache.org/jira/browse/SPARK-23870
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: yogesh garg
>Assignee: yogesh garg
>Priority: Major
> Fix For: 2.4.0
>
>







[jira] [Assigned] (SPARK-23870) Forward RFormula handleInvalid Param to VectorAssembler

2018-04-05 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-23870:
-

Assignee: yogesh garg

>  Forward RFormula handleInvalid Param to VectorAssembler
> 
>
> Key: SPARK-23870
> URL: https://issues.apache.org/jira/browse/SPARK-23870
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: yogesh garg
>Assignee: yogesh garg
>Priority: Major
>







[jira] [Updated] (SPARK-23870) Forward RFormula handleInvalid Param to VectorAssembler

2018-04-05 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-23870:
--
Fix Version/s: (was: 2.4.0)

>  Forward RFormula handleInvalid Param to VectorAssembler
> 
>
> Key: SPARK-23870
> URL: https://issues.apache.org/jira/browse/SPARK-23870
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: yogesh garg
>Priority: Major
>







[jira] [Resolved] (SPARK-22667) Fix model-specific optimization support for ML tuning: Python API

2018-04-03 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-22667.
---
   Resolution: Duplicate
Fix Version/s: 2.3.0

> Fix model-specific optimization support for ML tuning: Python API
> -
>
> Key: SPARK-22667
> URL: https://issues.apache.org/jira/browse/SPARK-22667
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Weichen Xu
>Priority: Major
> Fix For: 2.3.0
>
>
> Fix model-specific optimization support for ML tuning: Python API
> See explanation here
> https://docs.google.com/document/d/1xw5M4sp1e0eQie75yIt-r6-GTuD5vpFf_I6v-AFBM3M/edit





