[jira] [Issue Comment Deleted] (SPARK-24114) improve instrumentation for spark.ml.recommendation
[ https://issues.apache.org/jira/browse/SPARK-24114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-24114: -- Comment: was deleted (was: User 'MrBago' has created a pull request for this issue: https://github.com/apache/spark/pull/21344) > improve instrumentation for spark.ml.recommendation > --- > > Key: SPARK-24114 > URL: https://issues.apache.org/jira/browse/SPARK-24114 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.3.0 >Reporter: yogesh garg >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24310) Instrumentation for frequent pattern mining
[ https://issues.apache.org/jira/browse/SPARK-24310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-24310. --- Resolution: Fixed Fix Version/s: 2.4.0 > Instrumentation for frequent pattern mining > --- > > Key: SPARK-24310 > URL: https://issues.apache.org/jira/browse/SPARK-24310 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Bago Amirbekian >Priority: Major > Fix For: 2.4.0 > > > See parent JIRA
[jira] [Commented] (SPARK-24310) Instrumentation for frequent pattern mining
[ https://issues.apache.org/jira/browse/SPARK-24310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16479684#comment-16479684 ] Joseph K. Bradley commented on SPARK-24310: --- The PR for this was linked to the wrong JIRA, but I'm adding the link here for the record. > Instrumentation for frequent pattern mining > --- > > Key: SPARK-24310 > URL: https://issues.apache.org/jira/browse/SPARK-24310 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Bago Amirbekian >Priority: Major > Fix For: 2.4.0 > > > See parent JIRA
[jira] [Created] (SPARK-24310) Instrumentation for frequent pattern mining
Joseph K. Bradley created SPARK-24310: - Summary: Instrumentation for frequent pattern mining Key: SPARK-24310 URL: https://issues.apache.org/jira/browse/SPARK-24310 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 2.3.0 Reporter: Joseph K. Bradley Assignee: Bago Amirbekian See parent JIRA
[jira] [Assigned] (SPARK-24114) improve instrumentation for spark.ml.recommendation
[ https://issues.apache.org/jira/browse/SPARK-24114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-24114: - Assignee: (was: Bago Amirbekian) > improve instrumentation for spark.ml.recommendation > --- > > Key: SPARK-24114 > URL: https://issues.apache.org/jira/browse/SPARK-24114 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.3.0 >Reporter: yogesh garg >Priority: Major >
[jira] [Updated] (SPARK-24114) improve instrumentation for spark.ml.recommendation
[ https://issues.apache.org/jira/browse/SPARK-24114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-24114: -- Shepherd: (was: Joseph K. Bradley) > improve instrumentation for spark.ml.recommendation > --- > > Key: SPARK-24114 > URL: https://issues.apache.org/jira/browse/SPARK-24114 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.3.0 >Reporter: yogesh garg >Priority: Major >
[jira] [Updated] (SPARK-24114) improve instrumentation for spark.ml.recommendation
[ https://issues.apache.org/jira/browse/SPARK-24114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-24114: -- Shepherd: Joseph K. Bradley > improve instrumentation for spark.ml.recommendation > --- > > Key: SPARK-24114 > URL: https://issues.apache.org/jira/browse/SPARK-24114 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.3.0 >Reporter: yogesh garg >Assignee: Bago Amirbekian >Priority: Major >
[jira] [Assigned] (SPARK-24114) improve instrumentation for spark.ml.recommendation
[ https://issues.apache.org/jira/browse/SPARK-24114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-24114: - Assignee: Bago Amirbekian > improve instrumentation for spark.ml.recommendation > --- > > Key: SPARK-24114 > URL: https://issues.apache.org/jira/browse/SPARK-24114 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.3.0 >Reporter: yogesh garg >Assignee: Bago Amirbekian >Priority: Major >
[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16478328#comment-16478328 ] Joseph K. Bradley commented on SPARK-15784: --- [~shahid] Thanks for offering! If [~wm624] wants to (and has time to) take this, then I'd suggest that. But if not, then please go ahead, thanks! > Add Power Iteration Clustering to spark.ml > -- > > Key: SPARK-15784 > URL: https://issues.apache.org/jira/browse/SPARK-15784 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xinh Huynh >Assignee: Miao Wang >Priority: Major > > Adding this algorithm is required as part of SPARK-4591: Algorithm/model > parity for spark.ml. The review JIRA for clustering is SPARK-14380.
[jira] [Resolved] (SPARK-22210) Online LDA variationalTopicInference should use random seed to have stable behavior
[ https://issues.apache.org/jira/browse/SPARK-22210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-22210. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21183 [https://github.com/apache/spark/pull/21183] > Online LDA variationalTopicInference should use random seed to have stable > behavior > > > Key: SPARK-22210 > URL: https://issues.apache.org/jira/browse/SPARK-22210 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang >Assignee: Lu Wang >Priority: Minor > Fix For: 2.4.0 > > > https://github.com/apache/spark/blob/16fab6b0ef3dcb33f92df30e17680922ad5fb672/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L582 > Gamma distribution should use random seed to have consistent behavior.
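The point of the fix above is determinism: a random draw (here, from a Gamma distribution) is only reproducible when the generator is explicitly seeded. A minimal plain-Python sketch of the idea (illustrative only; Spark's LDAOptimizer uses Breeze's Gamma distribution, not this code):

```python
import random

def gamma_draws(seed, shape=2.0, scale=1.0, n=3):
    # A dedicated, seeded generator makes the draws reproducible,
    # which is what the JIRA asks of variationalTopicInference.
    rng = random.Random(seed)
    return [rng.gammavariate(shape, scale) for _ in range(n)]

# Same seed -> identical samples; an unseeded global generator would
# give different results on every run, making tests and runs unstable.
assert gamma_draws(42) == gamma_draws(42)
```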
[jira] [Resolved] (SPARK-24058) Default Params in ML should be saved separately: Python API
[ https://issues.apache.org/jira/browse/SPARK-24058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-24058. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21153 [https://github.com/apache/spark/pull/21153] > Default Params in ML should be saved separately: Python API > --- > > Key: SPARK-24058 > URL: https://issues.apache.org/jira/browse/SPARK-24058 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Joseph K. Bradley >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 2.4.0 > > > See [SPARK-23455] for reference. Since DefaultParamsReader has been changed > in Scala, we must change it for Python for Spark 2.4.0 as well in order to > keep the 2 in sync.
[jira] [Assigned] (SPARK-24058) Default Params in ML should be saved separately: Python API
[ https://issues.apache.org/jira/browse/SPARK-24058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-24058: - Assignee: Liang-Chi Hsieh > Default Params in ML should be saved separately: Python API > --- > > Key: SPARK-24058 > URL: https://issues.apache.org/jira/browse/SPARK-24058 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.4.0 >Reporter: Joseph K. Bradley >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 2.4.0 > > > See [SPARK-23455] for reference. Since DefaultParamsReader has been changed > in Scala, we must change it for Python for Spark 2.4.0 as well in order to > keep the 2 in sync.
[jira] [Commented] (SPARK-24213) Power Iteration Clustering in the SparkML throws exception, when the ID is IntType
[ https://issues.apache.org/jira/browse/SPARK-24213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470705#comment-16470705 ] Joseph K. Bradley commented on SPARK-24213: --- On the topic of eating my words, please check out my new comment here: [SPARK-15784]. We may need to rework the API. > Power Iteration Clustering in the SparkML throws exception, when the ID is > IntType > -- > > Key: SPARK-24213 > URL: https://issues.apache.org/jira/browse/SPARK-24213 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.4.0 >Reporter: spark_user >Priority: Major > Fix For: 2.4.0 > > > While running the code, PowerIterationClustering in spark ML throws exception. > {code:scala} > val data = spark.createDataFrame(Seq( > (0, Array(1), Array(0.9)), > (1, Array(2), Array(0.9)), > (2, Array(3), Array(0.9)), > (3, Array(4), Array(0.1)), > (4, Array(5), Array(0.9)) > )).toDF("id", "neighbors", "similarities") > val result = new PowerIterationClustering() > .setK(2) > .setMaxIter(10) > .setInitMode("random") > .transform(data) > .select("id","prediction") > {code} > {code:java} > org.apache.spark.sql.AnalysisException: cannot resolve '`prediction`' given > input columns: [id, neighbors, similarities];; > 'Project [id#215, 'prediction] > +- AnalysisBarrier > +- Project [id#215, neighbors#216, similarities#217] > +- Join Inner, (id#215 = id#234) > :- Project [_1#209 AS id#215, _2#210 AS neighbors#216, _3#211 AS > similarities#217] > : +- LocalRelation [_1#209, _2#210, _3#211] > +- Project [cast(id#230L as int) AS id#234] >+- LogicalRDD [id#230L, prediction#231], false > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:88) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85) > at > 
org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288) > {code}
[jira] [Commented] (SPARK-24217) Power Iteration Clustering is not displaying cluster indices corresponding to some vertices.
[ https://issues.apache.org/jira/browse/SPARK-24217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470704#comment-16470704 ] Joseph K. Bradley commented on SPARK-24217: --- On the topic of eating my words, please check out my new comment here: [SPARK-15784]. We may need to rework the API. > Power Iteration Clustering is not displaying cluster indices corresponding to > some vertices. > > > Key: SPARK-24217 > URL: https://issues.apache.org/jira/browse/SPARK-24217 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.4.0 >Reporter: spark_user >Priority: Major > Fix For: 2.4.0 > > > We should display prediction and id corresponding to all the nodes. > Currently PIC is not returning the cluster indices of neighbour IDs which are > not there in the ID column. > As per the definition of PIC clustering, given in the code, > PIC takes an affinity matrix between items (or vertices) as input. An > affinity matrix > is a symmetric matrix whose entries are non-negative similarities between > items. > PIC takes this matrix (or graph) as an adjacency matrix. Specifically, each > input row includes: > * {{idCol}}: vertex ID > * {{neighborsCol}}: neighbors of vertex in {{idCol}} > * {{similaritiesCol}}: non-negative weights (similarities) of edges between > the vertex > in {{idCol}} and each neighbor in {{neighborsCol}} > * *"PIC returns a cluster assignment for each input vertex."* It appends a > new column {{predictionCol}} > containing the cluster assignment in {{[0,k)}} for each row (vertex). >
[jira] [Comment Edited] (SPARK-15784) Add Power Iteration Clustering to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470701#comment-16470701 ] Joseph K. Bradley edited comment on SPARK-15784 at 5/10/18 4:45 PM: So... we originally agreed to make this a Transformer (in the discussion above), but [SPARK-24213] and [SPARK-24217] brought up the issue that we can't have this be a Row -> Row Transformer: * The input data need to have one graph edge pair (i,j) for each edge, not duplicated ones (i,j) and (j,i). * That means that there could be between 0 and numVertices/2 vertices which do not have corresponding Rows. This greatly lessens the value of presenting this as a Transformer. I recommend we rewrite the API before Spark 2.4 and make PIC a utility, not a Transformer. We can have it inherit from Params but not make it a Transformer. How does this sound? was (Author: josephkb): So... we originally agreed to make this a Transformer (in the discussion above), but [SPARK-24213] and [SPARK-24217] brought up the issue that we can't have this be a Row -> Row Transformer: * The input data need to have one graph edge pair (i,j) for each edge, not duplicated ones (i,j) and (j,i). * That means that there could be between 0 and numVertices/2 vertices which do not have corresponding Rows. This greatly lessens the value of presenting this as a Transformer. I recommend we rewrite the API before Spark 2.4 and make PIC a utility in spark.ml.stat. We can have it inherit from Params but not make it a Transformer. How does this sound? > Add Power Iteration Clustering to spark.ml > -- > > Key: SPARK-15784 > URL: https://issues.apache.org/jira/browse/SPARK-15784 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xinh Huynh >Assignee: Miao Wang >Priority: Major > > Adding this algorithm is required as part of SPARK-4591: Algorithm/model > parity for spark.ml. The review JIRA for clustering is SPARK-14380. 
[jira] [Commented] (SPARK-15784) Add Power Iteration Clustering to spark.ml
[ https://issues.apache.org/jira/browse/SPARK-15784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16470701#comment-16470701 ] Joseph K. Bradley commented on SPARK-15784: --- So... we originally agreed to make this a Transformer (in the discussion above), but [SPARK-24213] and [SPARK-24217] brought up the issue that we can't have this be a Row -> Row Transformer: * The input data need to have one graph edge pair (i,j) for each edge, not duplicated ones (i,j) and (j,i). * That means that there could be between 0 and numVertices/2 vertices which do not have corresponding Rows. This greatly lessens the value of presenting this as a Transformer. I recommend we rewrite the API before Spark 2.4 and make PIC a utility in spark.ml.stat. We can have it inherit from Params but not make it a Transformer. How does this sound? > Add Power Iteration Clustering to spark.ml > -- > > Key: SPARK-15784 > URL: https://issues.apache.org/jira/browse/SPARK-15784 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Xinh Huynh >Assignee: Miao Wang >Priority: Major > > Adding this algorithm is required as part of SPARK-4591: Algorithm/model > parity for spark.ml. The review JIRA for clustering is SPARK-14380.
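The Row -> Row problem raised in the comment above is easy to reproduce outside Spark. A plain-Python sketch (illustrative only, not Spark API): when each undirected edge is stored once as (i, neighbors) rather than in both directions, some vertices appear only inside a neighbors list and have no row of their own, so a row-wise transform has no row on which to emit their cluster assignment.

```python
# Each row: (id, neighbors). Undirected edges stored once,
# as in the SPARK-24213 reproduction data.
rows = [(0, [1]), (1, [2]), (2, [3]), (3, [4]), (4, [5])]

ids = {i for i, _ in rows}
neighbors = {j for _, nbrs in rows for j in nbrs}

# Vertices that a Row -> Row transform could never assign a prediction to,
# because they have no input row:
missing = neighbors - ids
print(sorted(missing))  # [5]
```

This is why the comment proposes a utility-style API that returns an assignment per vertex rather than appending a column to the input rows.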
[jira] [Comment Edited] (SPARK-24217) Power Iteration Clustering is not displaying cluster indices corresponding to some vertices.
[ https://issues.apache.org/jira/browse/SPARK-24217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16469562#comment-16469562 ] Joseph K. Bradley edited comment on SPARK-24217 at 5/10/18 4:37 PM: Update: I'll eat my words! I should have read the docs more carefully (where I missed the note that there should be exactly 1 reference from one node to another). This is actually a major problem with our design for PIC, which can't really be a Row -> Row Transformer. Will think more about this and re-post. was (Author: josephkb): But the reason that the IDs are missing from the "id" column is that the input is not symmetric. If it were made symmetric, then there could not be any missing IDs. > Power Iteration Clustering is not displaying cluster indices corresponding to > some vertices. > > > Key: SPARK-24217 > URL: https://issues.apache.org/jira/browse/SPARK-24217 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.4.0 >Reporter: spark_user >Priority: Major > Fix For: 2.4.0 > > > We should display prediction and id corresponding to all the nodes. > Currently PIC is not returning the cluster indices of neighbour IDs which are > not there in the ID column. > As per the definition of PIC clustering, given in the code, > PIC takes an affinity matrix between items (or vertices) as input. An > affinity matrix > is a symmetric matrix whose entries are non-negative similarities between > items. > PIC takes this matrix (or graph) as an adjacency matrix. Specifically, each > input row includes: > * {{idCol}}: vertex ID > * {{neighborsCol}}: neighbors of vertex in {{idCol}} > * {{similaritiesCol}}: non-negative weights (similarities) of edges between > the vertex > in {{idCol}} and each neighbor in {{neighborsCol}} > * *"PIC returns a cluster assignment for each input vertex."* It appends a > new column {{predictionCol}} > containing the cluster assignment in {{[0,k)}} for each row (vertex). 
[jira] [Commented] (SPARK-24217) Power Iteration Clustering is not displaying cluster indices corresponding to some vertices.
[ https://issues.apache.org/jira/browse/SPARK-24217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16469562#comment-16469562 ] Joseph K. Bradley commented on SPARK-24217: --- But the reason that the IDs are missing from the "id" column is that the input is not symmetric. If it were made symmetric, then there could not be any missing IDs. > Power Iteration Clustering is not displaying cluster indices corresponding to > some vertices. > > > Key: SPARK-24217 > URL: https://issues.apache.org/jira/browse/SPARK-24217 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.4.0 >Reporter: spark_user >Priority: Major > Fix For: 2.4.0 > > > We should display prediction and id corresponding to all the nodes. > Currently PIC is not returning the cluster indices of neighbour IDs which are > not there in the ID column. > As per the definition of PIC clustering, given in the code, > PIC takes an affinity matrix between items (or vertices) as input. An > affinity matrix > is a symmetric matrix whose entries are non-negative similarities between > items. > PIC takes this matrix (or graph) as an adjacency matrix. Specifically, each > input row includes: > * {{idCol}}: vertex ID > * {{neighborsCol}}: neighbors of vertex in {{idCol}} > * {{similaritiesCol}}: non-negative weights (similarities) of edges between > the vertex > in {{idCol}} and each neighbor in {{neighborsCol}} > * *"PIC returns a cluster assignment for each input vertex."* It appends a > new column {{predictionCol}} > containing the cluster assignment in {{[0,k)}} for each row (vertex). >
[jira] [Commented] (SPARK-24217) Power Iteration Clustering is not displaying cluster indices corresponding to some vertices.
[ https://issues.apache.org/jira/browse/SPARK-24217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16469230#comment-16469230 ] Joseph K. Bradley commented on SPARK-24217: --- I don't really think this is a bug. PIC's documentation says pretty clearly that the input data has to represent a symmetric matrix, and this example seems to be failing because the input data is invalid. I do think it could be valuable to throw a better error when the input is not symmetric, though we should make sure that any check we do for this is not too expensive. > Power Iteration Clustering is not displaying cluster indices corresponding to > some vertices. > > > Key: SPARK-24217 > URL: https://issues.apache.org/jira/browse/SPARK-24217 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.4.0 >Reporter: spark_user >Priority: Major > Fix For: 2.4.0 > > > We should display prediction and id corresponding to all the nodes. > As per the definition of PIC clustering, given in the code, > PIC takes an affinity matrix between items (or vertices) as input. An > affinity matrix > is a symmetric matrix whose entries are non-negative similarities between > items. > PIC takes this matrix (or graph) as an adjacency matrix. Specifically, each > input row includes: > * {{idCol}}: vertex ID > * {{neighborsCol}}: neighbors of vertex in {{idCol}} > * {{similaritiesCol}}: non-negative weights (similarities) of edges between > the vertex > in {{idCol}} and each neighbor in {{neighborsCol}} > * *"PIC returns a cluster assignment for each input vertex."* It appends a > new column {{predictionCol}} > containing the cluster assignment in {{[0,k)}} for each row (vertex). >
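The symmetry check the comment above suggests can be sketched in plain Python over the (id, neighbors, similarities) row shape from the issue description (illustrative only; a real check inside Spark would need the cost analysis the comment asks for, since it implies a shuffle of all edges):

```python
def is_symmetric(rows):
    """rows: iterable of (id, neighbors, similarities) triples.
    Symmetric means every edge (i, j, w) is matched by (j, i, w)."""
    edges = {}
    for i, nbrs, sims in rows:
        for j, w in zip(nbrs, sims):
            edges[(i, j)] = w
    return all(edges.get((j, i)) == w for (i, j), w in edges.items())

# The reproduction data from SPARK-24213/SPARK-24217 is not symmetric:
data = [(0, [1], [0.9]), (1, [2], [0.9]), (2, [3], [0.9]),
        (3, [4], [0.1]), (4, [5], [0.9])]
assert not is_symmetric(data)

# Adding the mirrored edges makes it a valid symmetric affinity matrix:
mirrored = data + [(nbrs[0], [i], sims) for i, nbrs, sims in data]
assert is_symmetric(mirrored)
```

With the mirrored input, every vertex appears in the id column, so no cluster assignment goes missing.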
[jira] [Resolved] (SPARK-14682) Provide evaluateEachIteration method or equivalent for spark.ml GBTs
[ https://issues.apache.org/jira/browse/SPARK-14682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-14682. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 21097 [https://github.com/apache/spark/pull/21097] > Provide evaluateEachIteration method or equivalent for spark.ml GBTs > > > Key: SPARK-14682 > URL: https://issues.apache.org/jira/browse/SPARK-14682 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley >Assignee: Weichen Xu >Priority: Minor > Fix For: 2.4.0 > > > spark.mllib GradientBoostedTrees provide an evaluateEachIteration method. We > should provide that or an equivalent for spark.ml.
[jira] [Assigned] (SPARK-14682) Provide evaluateEachIteration method or equivalent for spark.ml GBTs
[ https://issues.apache.org/jira/browse/SPARK-14682?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-14682: - Assignee: Weichen Xu > Provide evaluateEachIteration method or equivalent for spark.ml GBTs > > > Key: SPARK-14682 > URL: https://issues.apache.org/jira/browse/SPARK-14682 > Project: Spark > Issue Type: New Feature > Components: ML >Reporter: Joseph K. Bradley >Assignee: Weichen Xu >Priority: Minor > > spark.mllib GradientBoostedTrees provide an evaluateEachIteration method. We > should provide that or an equivalent for spark.ml.
[jira] [Updated] (SPARK-7132) Add fit with validation set to spark.ml GBT
[ https://issues.apache.org/jira/browse/SPARK-7132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-7132: - Shepherd: Joseph K. Bradley > Add fit with validation set to spark.ml GBT > --- > > Key: SPARK-7132 > URL: https://issues.apache.org/jira/browse/SPARK-7132 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Priority: Minor > > In spark.mllib GradientBoostedTrees, we have a method runWithValidation which > takes a validation set. We should add that to the spark.ml API. > This will require a bit of thinking about how the Pipelines API should handle > a validation set (since Transformers and Estimators only take 1 input > DataFrame). The current plan is to include an extra column in the input > DataFrame which indicates whether the row is for training, validation, etc. > Goals > A [P0] Support efficient validation during training > B [P1] Support early stopping based on validation metrics > C [P0] Ensure validation data are preprocessed identically to training data > D [P1] Support complex Pipelines with multiple models using validation data > Proposal: column with indicator for train vs validation > Include an extra column in the input DataFrame which indicates whether the > row is for training or validation. Add a Param “validationFlagCol” used to > specify the extra column name. > A, B, C are easy. > D is doable. > Each estimator would need to have its validationFlagCol Param set to the same > column. > Complication: It would be ideal if we could prevent different estimators from > using different validation sets. (Joseph: There is not an obvious way IMO. > Maybe we can address this later by, e.g., having Pipelines take a > validationFlagCol Param and pass that to the sub-models in the Pipeline. > Let’s not worry about this for now.) 
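The validationFlagCol proposal in the SPARK-7132 description amounts to a row-level split driven by one boolean column. A minimal plain-Python sketch of the idea (the column name and row shape are assumptions for illustration, not the spark.ml API):

```python
def split_by_flag(rows, flag_col="validationFlag"):
    """Partition rows (dicts) into (train, validation) by a boolean column.

    Because both subsets come from the same input, any upstream feature
    preprocessing applies identically to both, which is goal C in the
    proposal (identical preprocessing for training and validation data).
    """
    train = [r for r in rows if not r[flag_col]]
    validation = [r for r in rows if r[flag_col]]
    return train, validation

rows = [{"features": [1.0], "validationFlag": False},
        {"features": [2.0], "validationFlag": True},
        {"features": [3.0], "validationFlag": False}]
train, val = split_by_flag(rows)
print(len(train), len(val))  # 2 1
```

Keeping the flag inside the DataFrame, rather than passing a second DataFrame, is what lets the fit still take a single input, as Transformers and Estimators require.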
[jira] [Assigned] (SPARK-7132) Add fit with validation set to spark.ml GBT
[ https://issues.apache.org/jira/browse/SPARK-7132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-7132: Assignee: Weichen Xu > Add fit with validation set to spark.ml GBT > --- > > Key: SPARK-7132 > URL: https://issues.apache.org/jira/browse/SPARK-7132 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Weichen Xu >Priority: Minor > > In spark.mllib GradientBoostedTrees, we have a method runWithValidation which > takes a validation set. We should add that to the spark.ml API. > This will require a bit of thinking about how the Pipelines API should handle > a validation set (since Transformers and Estimators only take 1 input > DataFrame). The current plan is to include an extra column in the input > DataFrame which indicates whether the row is for training, validation, etc. > Goals > A [P0] Support efficient validation during training > B [P1] Support early stopping based on validation metrics > C [P0] Ensure validation data are preprocessed identically to training data > D [P1] Support complex Pipelines with multiple models using validation data > Proposal: column with indicator for train vs validation > Include an extra column in the input DataFrame which indicates whether the > row is for training or validation. Add a Param “validationFlagCol” used to > specify the extra column name. > A, B, C are easy. > D is doable. > Each estimator would need to have its validationFlagCol Param set to the same > column. > Complication: It would be ideal if we could prevent different estimators from > using different validation sets. (Joseph: There is not an obvious way IMO. > Maybe we can address this later by, e.g., having Pipelines take a > validationFlagCol Param and pass that to the sub-models in the Pipeline. > Let’s not worry about this for now.) 
[jira] [Commented] (SPARK-24213) Power Iteration Clustering in the SparkML throws exception, when the ID is IntType
[ https://issues.apache.org/jira/browse/SPARK-24213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468018#comment-16468018 ] Joseph K. Bradley commented on SPARK-24213: --- Thanks for reporting this issue! There is actually a much simpler fix which we can do. Also, the existing unit tests should catch this bug, so those tests themselves should be fixed. I hope you don't mind, but I'd like to go ahead and send a patch I wrote while reviewing your PR. > Power Iteration Clustering in the SparkML throws exception, when the ID is > IntType > -- > > Key: SPARK-24213 > URL: https://issues.apache.org/jira/browse/SPARK-24213 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.4.0 >Reporter: spark_user >Priority: Major > Fix For: 2.4.0 > > > While running the code, PowerIterationClustering in spark ML throws exception. > {code:scala} > val data = spark.createDataFrame(Seq( > (0, Array(1), Array(0.9)), > (1, Array(2), Array(0.9)), > (2, Array(3), Array(0.9)), > (3, Array(4), Array(0.1)), > (4, Array(5), Array(0.9)) > )).toDF("id", "neighbors", "similarities") > val result = new PowerIterationClustering() > .setK(2) > .setMaxIter(10) > .setInitMode("random") > .transform(data) > .select("id","prediction") > {code} > {code:java} > org.apache.spark.sql.AnalysisException: cannot resolve '`prediction`' given > input columns: [id, neighbors, similarities];; > 'Project [id#215, 'prediction] > +- AnalysisBarrier > +- Project [id#215, neighbors#216, similarities#217] > +- Join Inner, (id#215 = id#234) > :- Project [_1#209 AS id#215, _2#210 AS neighbors#216, _3#211 AS > similarities#217] > : +- LocalRelation [_1#209, _2#210, _3#211] > +- Project [cast(id#230L as int) AS id#234] >+- LogicalRDD [id#230L, prediction#231], false > at > org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) > at > 
org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:88) > at > org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:85) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:289) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:288) > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
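The failure above comes down to an id-type mismatch between the input DataFrame and the prediction DataFrame at join time: the predictions carry a Long id while the input used an Int, so the join resolves against the wrong columns. As a rough stand-in (plain Python, not Spark; `join_on_id` and the sample rows are invented for illustration), the sketch below shows how a key-type mismatch silently empties an inner join, and how normalizing the key type back to the input schema's type restores it:

```python
def join_on_id(left, right):
    """Inner-join two lists of dicts on their 'id' field (toy stand-in for a DataFrame join)."""
    right_by_id = {row["id"]: row for row in right}
    return [
        {**row, **right_by_id[row["id"]]}  # merge matching rows
        for row in left
        if row["id"] in right_by_id
    ]

inputs = [{"id": 0, "neighbors": [1]}, {"id": 1, "neighbors": [2]}]
# Predictions came back with a different key type (stand-in for Long vs Int ids).
predictions = [{"id": "0", "prediction": 1}, {"id": "1", "prediction": 0}]

# Mismatched key types: nothing lines up, so the join is empty.
assert join_on_id(inputs, predictions) == []

# Fix: cast the prediction ids back to the input's id type before joining.
fixed = [{**row, "id": int(row["id"])} for row in predictions]
joined = join_on_id(inputs, fixed)
assert [r["prediction"] for r in joined] == [1, 0]
```

The "much simpler fix" mentioned in the comment presumably casts the column inside `transform` itself, so callers never see the mismatch.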
[jira] [Created] (SPARK-24212) PrefixSpan in spark.ml: user guide section
Joseph K. Bradley created SPARK-24212: - Summary: PrefixSpan in spark.ml: user guide section Key: SPARK-24212 URL: https://issues.apache.org/jira/browse/SPARK-24212 Project: Spark Issue Type: Documentation Components: Documentation, ML Affects Versions: 2.4.0 Reporter: Joseph K. Bradley See linked JIRA for the PrefixSpan API for which we need to write a user guide page. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-24145) spark.ml parity for sequential pattern mining - PrefixSpan: Python API
[ https://issues.apache.org/jira/browse/SPARK-24145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley closed SPARK-24145. - > spark.ml parity for sequential pattern mining - PrefixSpan: Python API > -- > > Key: SPARK-24145 > URL: https://issues.apache.org/jira/browse/SPARK-24145 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.3.0 >Reporter: Weichen Xu >Priority: Major > > spark.ml parity for sequential pattern mining - PrefixSpan: Python API -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-24145) spark.ml parity for sequential pattern mining - PrefixSpan: Python API
[ https://issues.apache.org/jira/browse/SPARK-24145?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-24145. --- Resolution: Duplicate > spark.ml parity for sequential pattern mining - PrefixSpan: Python API > -- > > Key: SPARK-24145 > URL: https://issues.apache.org/jira/browse/SPARK-24145 > Project: Spark > Issue Type: New Feature > Components: ML >Affects Versions: 2.3.0 >Reporter: Weichen Xu >Priority: Major > > spark.ml parity for sequential pattern mining - PrefixSpan: Python API -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20114) spark.ml parity for sequential pattern mining - PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-20114. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20973 [https://github.com/apache/spark/pull/20973] > spark.ml parity for sequential pattern mining - PrefixSpan > -- > > Key: SPARK-20114 > URL: https://issues.apache.org/jira/browse/SPARK-20114 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang >Assignee: Weichen Xu >Priority: Major > Fix For: 2.4.0 > > > Creating this jira to track the feature parity for PrefixSpan and sequential > pattern mining in Spark ml with DataFrame API. > First list a few design issues to be discussed, then subtasks like Scala, > Python and R API will be created. > # Wrapping the MLlib PrefixSpan and provide a generic fit() should be > straightforward. Yet PrefixSpan only extracts frequent sequential patterns, > which is not good to be used directly for predicting on new records. Please > read > http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/ > for some background knowledge. Thanks Philippe Fournier-Viger for providing > insights. If we want to keep using the Estimator/Transformer pattern, options > are: > #* Implement a dummy transform for PrefixSpanModel, which will not add > new column to the input DataSet. The PrefixSpanModel is only used to provide > access for frequent sequential patterns. > #* Adding the feature to extract sequential rules from sequential > patterns. Then use the sequential rules in the transform as FPGrowthModel. > The rules extracted are of the form X–> Y where X and Y are sequential > patterns. But in practice, these rules are not very good as they are too > precise and thus not noise tolerant. 
> # Different from association rules and frequent itemsets, sequential rules > can be extracted from the original dataset more efficiently using algorithms > like RuleGrowth, ERMiner. The rules are X–> Y where X is unordered and Y is > unordered, but X must appear before Y, which is more general and can work > better in practice for prediction. > I'd like to hear more from the users to see which kind of Sequential rules > are more practical. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
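The sequential rules discussed above are of the form X -> Y where X and Y are unordered itemsets but every item of X must appear before every item of Y in a sequence. A minimal check of that definition (pure Python; `rule_holds` is an illustrative helper, and this assumes the common "all of X occurs before all of Y" reading rather than any specific RuleGrowth/ERMiner formulation):

```python
def rule_holds(sequence, antecedent, consequent):
    """True if every item of `antecedent` occurs in `sequence`, and every
    item of `consequent` occurs strictly after all of them."""
    try:
        # earliest point by which every antecedent item has appeared
        cutoff = max(sequence.index(x) for x in antecedent)
    except ValueError:
        return False  # some antecedent item never occurs
    tail = sequence[cutoff + 1:]
    return all(y in tail for y in consequent)

assert rule_holds(list("abcd"), {"a", "b"}, {"d"})       # a,b precede d
assert not rule_holds(list("abcd"), {"c"}, {"a"})        # a occurs before c, not after
```

Support and confidence for such rules would then be counted over the whole sequence database, which is where the efficiency claims for RuleGrowth-style algorithms come in.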
[jira] [Resolved] (SPARK-22885) ML test for StructuredStreaming: spark.ml.tuning
[ https://issues.apache.org/jira/browse/SPARK-22885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-22885. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 20261 [https://github.com/apache/spark/pull/20261] > ML test for StructuredStreaming: spark.ml.tuning > > > Key: SPARK-22885 > URL: https://issues.apache.org/jira/browse/SPARK-22885 > Project: Spark > Issue Type: Test > Components: ML, Tests >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Weichen Xu >Priority: Major > Fix For: 2.4.0 > > > Task for adding Structured Streaming tests for all Models/Transformers in a > sub-module in spark.ml > For an example, see LinearRegressionSuite.scala in > https://github.com/apache/spark/pull/19843 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-15750) Constructing FPGrowth fails when no numPartitions specified in pyspark
[ https://issues.apache.org/jira/browse/SPARK-15750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-15750. --- Resolution: Fixed Fix Version/s: 2.4.0 Issue resolved by pull request 13493 [https://github.com/apache/spark/pull/13493] > Constructing FPGrowth fails when no numPartitions specified in pyspark > -- > > Key: SPARK-15750 > URL: https://issues.apache.org/jira/browse/SPARK-15750 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 2.0.0 >Reporter: Jeff Zhang >Assignee: Jeff Zhang >Priority: Major > Fix For: 2.4.0 > > > {code} > >>> model1 = FPGrowth.train(rdd, 0.6) > Traceback (most recent call last): > File "", line 1, in > File "/Users/jzhang/github/spark-2/python/pyspark/mllib/fpm.py", line 96, > in train > model = callMLlibFunc("trainFPGrowthModel", data, float(minSupport), > int(numPartitions)) > File "/Users/jzhang/github/spark-2/python/pyspark/mllib/common.py", line > 130, in callMLlibFunc > return callJavaFunc(sc, api, *args) > File "/Users/jzhang/github/spark-2/python/pyspark/mllib/common.py", line > 123, in callJavaFunc > return _java2py(sc, func(*args)) > File > "/Users/jzhang/github/spark-2/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > File "/Users/jzhang/github/spark-2/python/pyspark/sql/utils.py", line 79, > in deco > raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace) > pyspark.sql.utils.IllegalArgumentException: u'requirement failed: Number of > partitions must be positive but got -1' > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
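The traceback above shows the bug pattern: a Python-side sentinel default (`numPartitions = -1`) was forwarded to a backend that requires a positive partition count. A hedged sketch of the fix pattern, resolving the sentinel before the backend call (plain Python; `train_fpgrowth` and its fallback rule are hypothetical stand-ins, and the actual fix in PR 13493 may derive the default differently):

```python
def train_fpgrowth(data, min_support=0.3, num_partitions=-1):
    """Toy FPGrowth entry point illustrating sentinel-default resolution."""
    if num_partitions < 0:
        # Resolve the sentinel on the caller side instead of passing -1 through;
        # here we derive a hypothetical fallback from the data size.
        num_partitions = max(1, len(data))
    # The backend's requirement, which -1 used to violate:
    assert num_partitions > 0, "Number of partitions must be positive"
    return {"min_support": min_support, "num_partitions": num_partitions}

model = train_fpgrowth([["a", "b"], ["a", "c"]], min_support=0.6)
assert model["num_partitions"] == 2
```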
[jira] [Commented] (SPARK-24152) SparkR CRAN feasibility check server problem
[ https://issues.apache.org/jira/browse/SPARK-24152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466513#comment-16466513 ] Joseph K. Bradley commented on SPARK-24152: --- Thank you all! > SparkR CRAN feasibility check server problem > > > Key: SPARK-24152 > URL: https://issues.apache.org/jira/browse/SPARK-24152 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 2.4.0 >Reporter: Dongjoon Hyun >Assignee: Liang-Chi Hsieh >Priority: Critical > > PR builder and master branch test fails with the following SparkR error with > unknown reason. The following is an error message from that. > {code} > * this is package 'SparkR' version '2.4.0' > * checking CRAN incoming feasibility ...Error in > .check_package_CRAN_incoming(pkgdir) : > dims [product 24] do not match the length of object [0] > Execution halted > {code} > *PR BUILDER* > - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/90039/ > - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89983/ > - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/89998/ > *MASTER BRANCH* > - > https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.6/4458/ > (Fail with no failures) > This is critical because we already start to merge the PR by ignoring this > **known unknown** SparkR failure. > - https://github.com/apache/spark/pull/21175 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-24097) Instruments improvements - RandomForest and GradientBoostedTree
[ https://issues.apache.org/jira/browse/SPARK-24097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-24097: -- Shepherd: Joseph K. Bradley > Instruments improvements - RandomForest and GradientBoostedTree > --- > > Key: SPARK-24097 > URL: https://issues.apache.org/jira/browse/SPARK-24097 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.3.0 >Reporter: Weichen Xu >Priority: Major > > Instruments improvements - RandomForest and GradientBoostedTree -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-24097) Instruments improvements - RandomForest and GradientBoostedTree
[ https://issues.apache.org/jira/browse/SPARK-24097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-24097: - Assignee: Weichen Xu > Instruments improvements - RandomForest and GradientBoostedTree > --- > > Key: SPARK-24097 > URL: https://issues.apache.org/jira/browse/SPARK-24097 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.3.0 >Reporter: Weichen Xu >Assignee: Weichen Xu >Priority: Major > > Instruments improvements - RandomForest and GradientBoostedTree -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23686) Make better usage of org.apache.spark.ml.util.Instrumentation
[ https://issues.apache.org/jira/browse/SPARK-23686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16460164#comment-16460164 ] Joseph K. Bradley edited comment on SPARK-23686 at 5/2/18 12:21 AM: [~yogeshgarg] and [~WeichenXu123] made the good point that some logging occurs on executors. This brings up the question: * Should we use Instrumentation on executors? * What levels of logging should we use on executors (in MLlib algorithms)? I figure it's safe to assume that executor logs should be more for developers than for users. (Current use in MLlib seems like this, e.g., for training of trees in https://github.com/apache/spark/pull/21163 ) These all seem to be at the DEBUG level, which is not really useful for users. (UPDATED BELOW) Since it'd be handy to have prefixes on executor logs too (to link them with Estimators), let's use Instrumentation on executors. was (Author: josephkb): [~yogeshgarg] and [~WeichenXu123] made the good point that some logging occurs on executors. This brings up the question: * Should we use Instrumentation on executors? * What levels of logging should we use on executors (in MLlib algorithms)? I figure it's safe to assume that executor logs should be more for developers than for users. (Current use in MLlib seems like this, e.g., for training of trees in https://github.com/apache/spark/pull/21163 ) These all seem to be at the DEBUG level, which is not really useful for users. Given that, I recommend: * We leave Instrumentation non-Serializable to avoid use on executors * We use regular Logging on executors. Developers who are debugging algorithms will presumably be running pretty isolated tests anyways. 
> Make better usage of org.apache.spark.ml.util.Instrumentation > - > > Key: SPARK-23686 > URL: https://issues.apache.org/jira/browse/SPARK-23686 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Bago Amirbekian >Priority: Major > > This Jira is a bit high level and might require subtasks or other jiras for > more specific tasks. > I've noticed that we don't make the best usage of the instrumentation class. > Specifically sometimes we bypass the instrumentation class and use the > debugger instead. For example, > [https://github.com/apache/spark/blob/9b9827759af2ca3eea146a6032f9165f640ce152/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L143] > Also there are some things that might be useful to log in the instrumentation > class that we currently don't. For example: > number of training examples > mean/var of label (regression) > I know computing these things can be expensive in some cases, but especially > when this data is already available we can log it for free. For example, > Logistic Regression Summarizer computes some useful data including numRows > that we don't log. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-23686) Make better usage of org.apache.spark.ml.util.Instrumentation
[ https://issues.apache.org/jira/browse/SPARK-23686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16460164#comment-16460164 ] Joseph K. Bradley edited comment on SPARK-23686 at 5/1/18 9:52 PM: --- [~yogeshgarg] and [~WeichenXu123] made the good point that some logging occurs on executors. This brings up the question: * Should we use Instrumentation on executors? * What levels of logging should we use on executors (in MLlib algorithms)? I figure it's safe to assume that executor logs should be more for developers than for users. (Current use in MLlib seems like this, e.g., for training of trees in https://github.com/apache/spark/pull/21163 ) These all seem to be at the DEBUG level, which is not really useful for users. Given that, I recommend: * We leave Instrumentation non-Serializable to avoid use on executors * We use regular Logging on executors. Developers who are debugging algorithms will presumably be running pretty isolated tests anyways. was (Author: josephkb): [~yogeshgarg] made the good point that we should not convert all uses of Logging to use Instrumentation: if logging happens on executors, then we should not use the (non-serializable) Instrumentation class. E.g.: https://github.com/apache/spark/blob/6782359a04356e4cde32940861bf2410ef37f445/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L1587 Also, these instances all seem to be at the DEBUG level, which is not really useful for users. > Make better usage of org.apache.spark.ml.util.Instrumentation > - > > Key: SPARK-23686 > URL: https://issues.apache.org/jira/browse/SPARK-23686 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Bago Amirbekian >Priority: Major > > This Jira is a bit high level and might require subtasks or other jiras for > more specific tasks. > I've noticed that we don't make the best usage of the instrumentation class. 
> Specifically sometimes we bypass the instrumentation class and use the > debugger instead. For example, > [https://github.com/apache/spark/blob/9b9827759af2ca3eea146a6032f9165f640ce152/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L143] > Also there are some things that might be useful to log in the instrumentation > class that we currently don't. For example: > number of training examples > mean/var of label (regression) > I know computing these things can be expensive in some cases, but especially > when this data is already available we can log it for free. For example, > Logistic Regression Summarizer computes some useful data including numRows > that we don't log. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-15750) Constructing FPGrowth fails when no numPartitions specified in pyspark
[ https://issues.apache.org/jira/browse/SPARK-15750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-15750: - Assignee: Jeff Zhang > Constructing FPGrowth fails when no numPartitions specified in pyspark > -- > > Key: SPARK-15750 > URL: https://issues.apache.org/jira/browse/SPARK-15750 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 2.0.0 >Reporter: Jeff Zhang >Assignee: Jeff Zhang >Priority: Major > > {code} > >>> model1 = FPGrowth.train(rdd, 0.6) > Traceback (most recent call last): > File "", line 1, in > File "/Users/jzhang/github/spark-2/python/pyspark/mllib/fpm.py", line 96, > in train > model = callMLlibFunc("trainFPGrowthModel", data, float(minSupport), > int(numPartitions)) > File "/Users/jzhang/github/spark-2/python/pyspark/mllib/common.py", line > 130, in callMLlibFunc > return callJavaFunc(sc, api, *args) > File "/Users/jzhang/github/spark-2/python/pyspark/mllib/common.py", line > 123, in callJavaFunc > return _java2py(sc, func(*args)) > File > "/Users/jzhang/github/spark-2/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > File "/Users/jzhang/github/spark-2/python/pyspark/sql/utils.py", line 79, > in deco > raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace) > pyspark.sql.utils.IllegalArgumentException: u'requirement failed: Number of > partitions must be positive but got -1' > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-15750) Constructing FPGrowth fails when no numPartitions specified in pyspark
[ https://issues.apache.org/jira/browse/SPARK-15750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-15750: -- Shepherd: Joseph K. Bradley > Constructing FPGrowth fails when no numPartitions specified in pyspark > -- > > Key: SPARK-15750 > URL: https://issues.apache.org/jira/browse/SPARK-15750 > Project: Spark > Issue Type: Bug > Components: MLlib, PySpark >Affects Versions: 2.0.0 >Reporter: Jeff Zhang >Assignee: Jeff Zhang >Priority: Major > > {code} > >>> model1 = FPGrowth.train(rdd, 0.6) > Traceback (most recent call last): > File "", line 1, in > File "/Users/jzhang/github/spark-2/python/pyspark/mllib/fpm.py", line 96, > in train > model = callMLlibFunc("trainFPGrowthModel", data, float(minSupport), > int(numPartitions)) > File "/Users/jzhang/github/spark-2/python/pyspark/mllib/common.py", line > 130, in callMLlibFunc > return callJavaFunc(sc, api, *args) > File "/Users/jzhang/github/spark-2/python/pyspark/mllib/common.py", line > 123, in callJavaFunc > return _java2py(sc, func(*args)) > File > "/Users/jzhang/github/spark-2/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", > line 933, in __call__ > File "/Users/jzhang/github/spark-2/python/pyspark/sql/utils.py", line 79, > in deco > raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace) > pyspark.sql.utils.IllegalArgumentException: u'requirement failed: Number of > partitions must be positive but got -1' > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23686) Make better usage of org.apache.spark.ml.util.Instrumentation
[ https://issues.apache.org/jira/browse/SPARK-23686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16460164#comment-16460164 ] Joseph K. Bradley commented on SPARK-23686: --- [~yogeshgarg] made the good point that we should not convert all uses of Logging to use Instrumentation: if logging happens on executors, then we should not use the (non-serializable) Instrumentation class. E.g.: https://github.com/apache/spark/blob/6782359a04356e4cde32940861bf2410ef37f445/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L1587 Also, these instances all seem to be at the DEBUG level, which is not really useful for users. > Make better usage of org.apache.spark.ml.util.Instrumentation > - > > Key: SPARK-23686 > URL: https://issues.apache.org/jira/browse/SPARK-23686 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Bago Amirbekian >Priority: Major > > This Jira is a bit high level and might require subtasks or other jiras for > more specific tasks. > I've noticed that we don't make the best usage of the instrumentation class. > Specifically sometimes we bypass the instrumentation class and use the > debugger instead. For example, > [https://github.com/apache/spark/blob/9b9827759af2ca3eea146a6032f9165f640ce152/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L143] > Also there are some things that might be useful to log in the instrumentation > class that we currently don't. For example: > number of training examples > mean/var of label (regression) > I know computing these things can be expensive in some cases, but especially > when this data is already available we can log it for free. For example, > Logistic Regression Summarizer computes some useful data including numRows > that we don't log. > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
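The prefixing behavior that makes Instrumentation worth keeping on the driver can be sketched in a few lines (plain Python, purely illustrative; the real class lives in org.apache.spark.ml.util, is written in Scala, and differs in detail):

```python
import logging

class Instrumentation:
    """Driver-side logging helper (sketch): prefixes every message with the
    estimator name and a run-specific UID so log lines can be tied back to a
    particular fit() call.  Per the discussion above, an object like this is
    deliberately kept non-serializable and off the executors, which use plain
    logging instead."""

    def __init__(self, estimator_name, uid):
        self.prefix = f"{estimator_name}-{uid}: "
        self.logger = logging.getLogger(estimator_name)

    def log_info(self, msg):
        self.logger.info(self.prefix + msg)

instr = Instrumentation("RandomForestClassifier", "rfc_1234")
instr.log_info("training: numExamples=1000")  # emitted as "RandomForestClassifier-rfc_1234: training: ..."
assert instr.prefix == "RandomForestClassifier-rfc_1234: "
```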
[jira] [Updated] (SPARK-22885) ML test for StructuredStreaming: spark.ml.tuning
[ https://issues.apache.org/jira/browse/SPARK-22885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-22885: -- Shepherd: Joseph K. Bradley > ML test for StructuredStreaming: spark.ml.tuning > > > Key: SPARK-22885 > URL: https://issues.apache.org/jira/browse/SPARK-22885 > Project: Spark > Issue Type: Test > Components: ML, Tests >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Priority: Major > > Task for adding Structured Streaming tests for all Models/Transformers in a > sub-module in spark.ml > For an example, see LinearRegressionSuite.scala in > https://github.com/apache/spark/pull/19843 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22885) ML test for StructuredStreaming: spark.ml.tuning
[ https://issues.apache.org/jira/browse/SPARK-22885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-22885: - Assignee: Weichen Xu > ML test for StructuredStreaming: spark.ml.tuning > > > Key: SPARK-22885 > URL: https://issues.apache.org/jira/browse/SPARK-22885 > Project: Spark > Issue Type: Test > Components: ML, Tests >Affects Versions: 2.3.0 >Reporter: Joseph K. Bradley >Assignee: Weichen Xu >Priority: Major > > Task for adding Structured Streaming tests for all Models/Transformers in a > sub-module in spark.ml > For an example, see LinearRegressionSuite.scala in > https://github.com/apache/spark/pull/19843 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-24115) improve instrumentation for spark.ml.tuning
[ https://issues.apache.org/jira/browse/SPARK-24115?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16459033#comment-16459033 ] Joseph K. Bradley commented on SPARK-24115: --- Sounds good; go ahead. > improve instrumentation for spark.ml.tuning > --- > > Key: SPARK-24115 > URL: https://issues.apache.org/jira/browse/SPARK-24115 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.3.0 >Reporter: yogesh garg >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22210) Online LDA variationalTopicInference should use random seed to have stable behavior
[ https://issues.apache.org/jira/browse/SPARK-22210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-22210: - Assignee: Lu Wang > Online LDA variationalTopicInference should use random seed to have stable > behavior > > > Key: SPARK-22210 > URL: https://issues.apache.org/jira/browse/SPARK-22210 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang >Assignee: Lu Wang >Priority: Minor > > https://github.com/apache/spark/blob/16fab6b0ef3dcb33f92df30e17680922ad5fb672/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L582 > Gamma distribution should use random seed to have consistent behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22210) Online LDA variationalTopicInference should use random seed to have stable behavior
[ https://issues.apache.org/jira/browse/SPARK-22210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-22210: -- Shepherd: Joseph K. Bradley > Online LDA variationalTopicInference should use random seed to have stable > behavior > > > Key: SPARK-22210 > URL: https://issues.apache.org/jira/browse/SPARK-22210 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang >Priority: Minor > > https://github.com/apache/spark/blob/16fab6b0ef3dcb33f92df30e17680922ad5fb672/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L582 > Gamma distribution should use random seed to have consistent behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22210) Online LDA variationalTopicInference should use random seed to have stable behavior
[ https://issues.apache.org/jira/browse/SPARK-22210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16453228#comment-16453228 ] Joseph K. Bradley commented on SPARK-22210: --- [~lu.DB] Would you like to do this? It should be a matter of taking the "seed" Param passed to LDA and making sure it (or a seed generated from it) is passed down to this method. Thanks! > Online LDA variationalTopicInference should use random seed to have stable > behavior > > > Key: SPARK-22210 > URL: https://issues.apache.org/jira/browse/SPARK-22210 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.0 >Reporter: yuhao yang >Priority: Minor > > https://github.com/apache/spark/blob/16fab6b0ef3dcb33f92df30e17680922ad5fb672/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L582 > Gamma distribution should use random seed to have consistent behavior. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
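The requested change amounts to threading the estimator's seed Param down to the Gamma draws. A small illustration of why that yields stable behavior (plain Python stdlib; the function name and distribution parameters are invented for the sketch, not the LDAOptimizer API):

```python
import random

def variational_topic_inference(seed):
    """Sketch: derive a per-call RNG from the estimator's seed Param so the
    Gamma-distributed initialization (and hence the optimizer) is reproducible."""
    rng = random.Random(seed)
    # stand-in for the Gamma(100, 1/100) initialization used in online LDA
    return [rng.gammavariate(100.0, 1.0 / 100.0) for _ in range(3)]

# Same seed -> identical draws; different seed -> (almost surely) different draws.
assert variational_topic_inference(42) == variational_topic_inference(42)
assert variational_topic_inference(42) != variational_topic_inference(43)
```

Without the seed, each run draws from a freshly initialized global RNG, which is the unstable behavior the ticket describes.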
[jira] [Resolved] (SPARK-23824) Make impurityStats publicly accessible in ml.tree.Node
[ https://issues.apache.org/jira/browse/SPARK-23824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-23824. --- Resolution: Duplicate > Make impurityStats publicly accessible in ml.tree.Node > -- > > Key: SPARK-23824 > URL: https://issues.apache.org/jira/browse/SPARK-23824 > Project: Spark > Issue Type: Wish > Components: ML >Affects Versions: 2.1.1 >Reporter: Barry Becker >Priority: Minor > > This is minor, but it is also a very easy fix. > I would like to visualize the structure of a decision tree model, but > currently the only means of obtaining the label distribution data at each > node of the tree is hidden within each ml.tree.Node inside the impurityStats. > I'm pretty sure that the fix for this is as easy as removing the private[ml] > qualifier from occurrences of > private[ml] def impurityStats: ImpurityCalculator > and > override private[ml] val impurityStats: ImpurityCalculator > > As a workaround, I've put my class that needs access into a > org.apache.spark.ml.tree package in my own repository, but I would really > like to not have to do that. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-20114) spark.ml parity for sequential pattern mining - PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley reassigned SPARK-20114:
-----------------------------------------
    Assignee: Weichen Xu

> spark.ml parity for sequential pattern mining - PrefixSpan
> ----------------------------------------------------------
>
> Key: SPARK-20114
> URL: https://issues.apache.org/jira/browse/SPARK-20114
> Project: Spark
> Issue Type: Sub-task
> Components: ML
> Affects Versions: 2.2.0
> Reporter: yuhao yang
> Assignee: Weichen Xu
> Priority: Major
>
> Creating this JIRA to track feature parity for PrefixSpan and sequential
> pattern mining in spark.ml with the DataFrame API.
> This lists a few design issues to discuss first; subtasks for the Scala,
> Python, and R APIs will be created afterwards.
> # Wrapping the MLlib PrefixSpan and providing a generic fit() should be
> straightforward. However, PrefixSpan only extracts frequent sequential
> patterns, which are not directly usable for predicting on new records. Please
> read
> http://data-mining.philippe-fournier-viger.com/introduction-to-sequential-rule-mining/
> for some background. Thanks to Philippe Fournier-Viger for providing
> insights. If we want to keep using the Estimator/Transformer pattern, the
> options are:
> #* Implement a dummy transform for PrefixSpanModel that adds no new column
> to the input Dataset. The PrefixSpanModel would then only provide access to
> the frequent sequential patterns.
> #* Add a feature to extract sequential rules from sequential patterns, then
> use those rules in transform, as FPGrowthModel does. The rules extracted are
> of the form X -> Y, where X and Y are sequential patterns. In practice,
> however, such rules tend to be too precise and thus not noise tolerant.
> # Unlike association rules over frequent itemsets, sequential rules can be
> extracted from the original dataset more efficiently using algorithms such
> as RuleGrowth and ERMiner. These rules are of the form X -> Y, where X and Y
> are each unordered but X must appear before Y; this form is more general and
> can work better in practice for prediction.
> I'd like to hear more from users about which kind of sequential rules is
> more practical.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
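The rule form discussed above (X -> Y where X and Y are each unordered but X must appear before Y) can be sketched in a few lines. This is an illustrative, self-contained Python sketch, not Spark's API; the function name and the support/confidence definitions are assumptions chosen for clarity.

```python
def rule_support_confidence(sequences, X, Y):
    """Support and confidence of a sequential rule X -> Y, where X and Y are
    unordered itemsets and all of X must appear strictly before all of Y."""
    def x_complete_at(seq):
        needed = set(X)
        for i, item in enumerate(seq):
            needed.discard(item)
            if not needed:
                return i        # earliest index by which all of X has appeared
        return -1               # X not fully contained in seq

    n_x = n_xy = 0
    for seq in sequences:
        pos = x_complete_at(seq)
        if pos < 0:
            continue
        n_x += 1
        if set(Y) <= set(seq[pos + 1:]):   # all of Y strictly after X
            n_xy += 1
    support = n_xy / len(sequences)
    confidence = n_xy / n_x if n_x else 0.0
    return support, confidence

seqs = [["a", "b", "c", "d"], ["b", "a", "d", "c"], ["a", "c", "b"]]
support, confidence = rule_support_confidence(seqs, {"a", "b"}, {"c", "d"})
```

Note how the third sequence matches X but not Y (nothing follows "b"), which is exactly the distinction that makes such rules usable for prediction on partially observed sequences.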
[jira] [Updated] (SPARK-20114) spark.ml parity for sequential pattern mining - PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-20114:
--------------------------------------
    Shepherd: Joseph K. Bradley
[jira] [Updated] (SPARK-20114) spark.ml parity for sequential pattern mining - PrefixSpan
[ https://issues.apache.org/jira/browse/SPARK-20114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-20114:
--------------------------------------
    Target Version/s: 2.4.0
[jira] [Resolved] (SPARK-23990) Instruments logging improvements - ML regression package
[ https://issues.apache.org/jira/browse/SPARK-23990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley resolved SPARK-23990.
---------------------------------------
       Resolution: Fixed
    Fix Version/s: 2.4.0

Issue resolved by pull request 21078
[https://github.com/apache/spark/pull/21078]

> Instruments logging improvements - ML regression package
> ---------------------------------------------------------
>
> Key: SPARK-23990
> URL: https://issues.apache.org/jira/browse/SPARK-23990
> Project: Spark
> Issue Type: Sub-task
> Components: ML
> Affects Versions: 2.3.0
> Environment: Instruments logging improvements - ML regression package
> Reporter: Weichen Xu
> Assignee: Weichen Xu
> Priority: Major
> Fix For: 2.4.0
>
> Original Estimate: 120h
> Remaining Estimate: 120h
[jira] [Resolved] (SPARK-23455) Default Params in ML should be saved separately
[ https://issues.apache.org/jira/browse/SPARK-23455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley resolved SPARK-23455.
---------------------------------------
       Resolution: Fixed
    Fix Version/s: 2.4.0

Issue resolved by pull request 20633
[https://github.com/apache/spark/pull/20633]

> Default Params in ML should be saved separately
> ------------------------------------------------
>
> Key: SPARK-23455
> URL: https://issues.apache.org/jira/browse/SPARK-23455
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 2.4.0
> Reporter: Liang-Chi Hsieh
> Assignee: Liang-Chi Hsieh
> Priority: Major
> Fix For: 2.4.0
>
> We currently save ML's user-supplied params and default params as one entity
> in JSON. When loading saved models, all loaded params are set on the created
> model instances as user-supplied params.
> This causes problems: for example, if we strictly disallow certain params to
> be set at the same time, a default param can fail the param check because it
> is treated as a user-supplied param after loading.
> Loaded default params should not be set as user-supplied params; we should
> save ML default params separately in JSON.
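The separation described in SPARK-23455 can be illustrated with a minimal JSON round-trip. This is a sketch of the idea only; the key names ("paramMap", "defaultParamMap") and function names are assumptions for illustration, not the actual Spark persistence schema.

```python
import json

def save_params(user_params, default_params):
    # Persist user-supplied and default params under separate keys,
    # so a loader can tell them apart.
    return json.dumps({"paramMap": user_params,
                       "defaultParamMap": default_params})

def load_params(blob):
    doc = json.loads(blob)
    # Only paramMap entries should be re-applied as user-supplied params;
    # defaultParamMap entries are restored as defaults and therefore do not
    # trip "params set at the same time" checks after loading.
    return doc["paramMap"], doc["defaultParamMap"]

blob = save_params({"maxIter": 50}, {"regParam": 0.0})
user, defaults = load_params(blob)
```

With the single-entity layout that this issue fixes, `regParam` would come back in `user` and be treated as user-supplied.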
[jira] [Commented] (SPARK-23975) Allow Clustering to take Arrays of Double as input features
[ https://issues.apache.org/jira/browse/SPARK-23975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16450161#comment-16450161 ]

Joseph K. Bradley commented on SPARK-23975:
-------------------------------------------
I merged https://github.com/apache/spark/pull/21081 for KMeans, and [~lu.DB]
will follow up for the other algorithms.

> Allow Clustering to take Arrays of Double as input features
> ------------------------------------------------------------
>
> Key: SPARK-23975
> URL: https://issues.apache.org/jira/browse/SPARK-23975
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.3.0
> Reporter: Lu Wang
> Priority: Major
>
> Clustering algorithms should accept Arrays in addition to Vectors as input
> features. The Python interface should be updated accordingly, which would
> make PySpark much easier to use.
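Accepting arrays alongside vectors comes down to coercing the features column to one common representation before training. A minimal sketch of that coercion, with a hypothetical `FakeVector` standing in for a vector type that exposes `toArray()` (not Spark's actual implementation):

```python
def as_feature_vector(value):
    """Coerce a features-column value to a plain list of floats. Accepts
    either an array/list of doubles or a vector-like object exposing
    toArray() -- the two input forms the issue asks clustering to support."""
    if hasattr(value, "toArray"):
        value = value.toArray()
    return [float(x) for x in value]

class FakeVector:
    """Hypothetical stand-in for a Vector with a toArray() method."""
    def toArray(self):
        return (1.0, 2.0, 3.0)

dense_as_array = as_feature_vector([1, 2, 3])
dense_as_vector = as_feature_vector(FakeVector())
```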
[jira] [Assigned] (SPARK-23975) Allow Clustering to take Arrays of Double as input features
[ https://issues.apache.org/jira/browse/SPARK-23975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley reassigned SPARK-23975:
-----------------------------------------
    Assignee: Lu Wang
[jira] [Updated] (SPARK-23455) Default Params in ML should be saved separately
[ https://issues.apache.org/jira/browse/SPARK-23455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-23455:
--------------------------------------
    Target Version/s: 2.4.0
[jira] [Assigned] (SPARK-23455) Default Params in ML should be saved separately
[ https://issues.apache.org/jira/browse/SPARK-23455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley reassigned SPARK-23455:
-----------------------------------------
    Assignee: Liang-Chi Hsieh
[jira] [Commented] (SPARK-24058) Default Params in ML should be saved separately: Python API
[ https://issues.apache.org/jira/browse/SPARK-24058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448994#comment-16448994 ]

Joseph K. Bradley commented on SPARK-24058:
-------------------------------------------
CCing [~viirya] since you're the natural one to take this. Thanks!

> Default Params in ML should be saved separately: Python API
> ------------------------------------------------------------
>
> Key: SPARK-24058
> URL: https://issues.apache.org/jira/browse/SPARK-24058
> Project: Spark
> Issue Type: Improvement
> Components: ML, PySpark
> Affects Versions: 2.4.0
> Reporter: Joseph K. Bradley
> Priority: Major
>
> See [SPARK-23455] for reference. Since DefaultParamsReader has been changed
> in Scala, we must change it for Python in Spark 2.4.0 as well, in order to
> keep the two implementations in sync.
[jira] [Created] (SPARK-24058) Default Params in ML should be saved separately: Python API
Joseph K. Bradley created SPARK-24058:
--------------------------------------
             Summary: Default Params in ML should be saved separately: Python API
                 Key: SPARK-24058
                 URL: https://issues.apache.org/jira/browse/SPARK-24058
             Project: Spark
          Issue Type: Improvement
          Components: ML, PySpark
    Affects Versions: 2.4.0
            Reporter: Joseph K. Bradley
[jira] [Updated] (SPARK-23990) Instruments logging improvements - ML regression package
[ https://issues.apache.org/jira/browse/SPARK-23990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-23990:
--------------------------------------
    Shepherd: Joseph K. Bradley
[jira] [Assigned] (SPARK-23990) Instruments logging improvements - ML regression package
[ https://issues.apache.org/jira/browse/SPARK-23990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley reassigned SPARK-23990:
-----------------------------------------
    Assignee: Weichen Xu
[jira] [Resolved] (SPARK-24026) spark.ml Scala/Java API for PIC
[ https://issues.apache.org/jira/browse/SPARK-24026?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley resolved SPARK-24026.
---------------------------------------
       Resolution: Fixed
    Fix Version/s: 2.4.0

Issue resolved by pull request 21090
[https://github.com/apache/spark/pull/21090]

> spark.ml Scala/Java API for PIC
> --------------------------------
>
> Key: SPARK-24026
> URL: https://issues.apache.org/jira/browse/SPARK-24026
> Project: Spark
> Issue Type: Sub-task
> Components: ML
> Affects Versions: 2.3.0
> Reporter: Joseph K. Bradley
> Assignee: Miao Wang
> Priority: Major
> Fix For: 2.4.0
>
> See parent JIRA
[jira] [Created] (SPARK-24026) spark.ml Scala/Java API for PIC
Joseph K. Bradley created SPARK-24026:
--------------------------------------
             Summary: spark.ml Scala/Java API for PIC
                 Key: SPARK-24026
                 URL: https://issues.apache.org/jira/browse/SPARK-24026
             Project: Spark
          Issue Type: Sub-task
          Components: ML
    Affects Versions: 2.3.0
            Reporter: Joseph K. Bradley
            Assignee: Miao Wang
[jira] [Commented] (SPARK-18693) BinaryClassificationEvaluator, RegressionEvaluator, and MulticlassClassificationEvaluator should use sample weight data
[ https://issues.apache.org/jira/browse/SPARK-18693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441713#comment-16441713 ]

Joseph K. Bradley commented on SPARK-18693:
-------------------------------------------
[~imatiach] Would you mind creating JIRA subtasks so that we have 1 PR per
JIRA? That helps with tracking. Thanks!

> BinaryClassificationEvaluator, RegressionEvaluator, and
> MulticlassClassificationEvaluator should use sample weight data
> ------------------------------------------------------------------------
>
> Key: SPARK-18693
> URL: https://issues.apache.org/jira/browse/SPARK-18693
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 2.0.2
> Reporter: Devesh Parekh
> Priority: Major
>
> The LogisticRegression and LinearRegression models support training with a
> weight column, but the corresponding evaluators do not support computing
> metrics using those weights. This breaks model selection using
> CrossValidator.
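The gap described above is easy to see with a weighted metric: each example should contribute its sample weight rather than a unit count. A minimal illustrative sketch (plain Python, not the Spark evaluator API):

```python
def weighted_accuracy(labels, predictions, weights):
    """Accuracy where each example contributes its sample weight instead of
    a unit count -- the kind of metric a weight-aware evaluator computes."""
    correct = sum(w for y, p, w in zip(labels, predictions, weights) if y == p)
    return correct / sum(weights)

# Unweighted accuracy here would be 2/3; the heavy first example pulls
# the weighted accuracy up to 0.75.
acc = weighted_accuracy([1, 0, 1], [1, 1, 1], [2.0, 1.0, 1.0])
```

Without weight support in the evaluator, a model trained with a weight column gets scored by CrossValidator as if every row counted equally, which is the breakage the issue describes.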
[jira] [Commented] (SPARK-23990) Instruments logging improvements - ML regression package
[ https://issues.apache.org/jira/browse/SPARK-23990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441701#comment-16441701 ]

Joseph K. Bradley commented on SPARK-23990:
-------------------------------------------
A complication was brought up by this PR: some logging occurs in classes
which are not Estimators (WeightedLeastSquares,
IterativelyReweightedLeastSquares) and in static objects (RandomForest,
GradientBoostedTrees). These may have an Instrumentation instance available
(when used from an Estimator) or may not (when used in a unit test). Options
include:
1. Make these require Instrumentation instances. This would require slightly
awkward changes to unit tests.
2. Create something similar to Instrumentation or Logging which can store an
optional Instrumentation instance. If the Instrumentation is available, it
can log via that; otherwise, it can fall back to regular Logging.
2a. This could be a trait like Logging. This is nice in that it requires
fewer changes to existing logging code.
2b. This could be a class like Instrumentation. This is nice in that it
standardizes all of MLlib around Instrumentation instead of Logging.
I'd vote for 2b to standardize what we do in MLlib. Thoughts?
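Option 2b above can be sketched as a small wrapper class: it delegates to a real Instrumentation when one is available and falls back to ordinary logging otherwise. This is an illustrative Python sketch; the class and method names (`OptionalInstrumentation`, `logInfo`) are hypothetical stand-ins, not the actual MLlib implementation.

```python
import logging

class OptionalInstrumentation:
    """Option 2b sketched: a class like Instrumentation that wraps an
    *optional* real Instrumentation instance."""

    def __init__(self, instr=None, name="spark.ml"):
        self.instr = instr                      # real Instrumentation, or None
        self.logger = logging.getLogger(name)   # fallback logger

    def log_info(self, msg):
        if self.instr is not None:
            self.instr.logInfo(msg)             # available: called from an Estimator
        else:
            self.logger.info(msg)               # unavailable: e.g. in a unit test

# demo: a stand-in Instrumentation that records messages
class RecordingInstr:
    def __init__(self):
        self.messages = []
    def logInfo(self, msg):
        self.messages.append(msg)

instr = RecordingInstr()
OptionalInstrumentation(instr).log_info("fit() started")
OptionalInstrumentation().log_info("no Instrumentation; plain logging")
```

The same call sites then work unchanged from WeightedLeastSquares, static tree objects, and unit tests, which is the standardization argument for 2b.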
[jira] [Updated] (SPARK-22884) ML test for StructuredStreaming: spark.ml.clustering
[ https://issues.apache.org/jira/browse/SPARK-22884?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-22884:
--------------------------------------
    Shepherd: Joseph K. Bradley

> ML test for StructuredStreaming: spark.ml.clustering
> -----------------------------------------------------
>
> Key: SPARK-22884
> URL: https://issues.apache.org/jira/browse/SPARK-22884
> Project: Spark
> Issue Type: Test
> Components: ML, Tests
> Affects Versions: 2.3.0
> Reporter: Joseph K. Bradley
> Priority: Major
>
> Task for adding Structured Streaming tests for all Models/Transformers in a
> sub-module of spark.ml.
> For an example, see LinearRegressionSuite.scala in
> https://github.com/apache/spark/pull/19843
[jira] [Updated] (SPARK-8799) OneVsRestModel should extend ClassificationModel
[ https://issues.apache.org/jira/browse/SPARK-8799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-8799:
-------------------------------------
    Shepherd: Joseph K. Bradley

> OneVsRestModel should extend ClassificationModel
> -------------------------------------------------
>
> Key: SPARK-8799
> URL: https://issues.apache.org/jira/browse/SPARK-8799
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: Feynman Liang
> Priority: Minor
>
> Many parts of `OneVsRestModel` can be generalized to `ClassificationModel`.
> For example:
> * `accColName` can be used to populate `ClassificationModel#predictRaw` and
> share implementations of `transform`
> * SPARK-8092 adds `setFeaturesCol` and `setPredictionCol`, which could be
> gotten for free through subclassing
> `ClassificationModel` is the correct supertype (e.g. not `PredictionModel`)
> because the labels for a `OneVsRest` will always be discrete and finite.
[jira] [Commented] (SPARK-8799) OneVsRestModel should extend ClassificationModel
[ https://issues.apache.org/jira/browse/SPARK-8799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441207#comment-16441207 ]

Joseph K. Bradley commented on SPARK-8799:
------------------------------------------
The missing functionality was added in [SPARK-9312], but we cannot fix this
JIRA until 3.0.0 since it will require breaking APIs (changing OneVsRest's
inheritance structure and supported FeaturesType). Let's target this fix for
3.0.0, for which I'll recommend:
* Rename the current OneVsRest to GenericOneVsRest or something like that.
Have it inherit from Classifier and take a type parameter for FeaturesType.
* Add a specialization of GenericOneVsRest with fixed FeaturesType =
VectorUDT, and call this new one OneVsRest.
I _think_ that will avoid breaking most user code (but I have not thought it
through carefully).
[jira] [Updated] (SPARK-8799) OneVsRestModel should extend ClassificationModel
[ https://issues.apache.org/jira/browse/SPARK-8799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-8799:
-------------------------------------
    Target Version/s: 3.0.0
[jira] [Assigned] (SPARK-21741) Python API for DataFrame-based multivariate summarizer
[ https://issues.apache.org/jira/browse/SPARK-21741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley reassigned SPARK-21741:
-----------------------------------------
    Assignee: Weichen Xu

> Python API for DataFrame-based multivariate summarizer
> --------------------------------------------------------
>
> Key: SPARK-21741
> URL: https://issues.apache.org/jira/browse/SPARK-21741
> Project: Spark
> Issue Type: Improvement
> Components: ML, PySpark
> Affects Versions: 2.2.0
> Reporter: Yanbo Liang
> Assignee: Weichen Xu
> Priority: Major
> Fix For: 2.4.0
>
> We added a multivariate summarizer for the DataFrame API in SPARK-19634; we
> should also make PySpark support it.
[jira] [Resolved] (SPARK-21741) Python API for DataFrame-based multivariate summarizer
[ https://issues.apache.org/jira/browse/SPARK-21741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley resolved SPARK-21741.
---------------------------------------
       Resolution: Fixed
    Fix Version/s: 2.4.0

Issue resolved by pull request 20695
[https://github.com/apache/spark/pull/20695]
[jira] [Updated] (SPARK-23975) Allow Clustering to take Arrays of Double as input features
[ https://issues.apache.org/jira/browse/SPARK-23975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-23975:
--------------------------------------
    Shepherd: Joseph K. Bradley
[jira] [Resolved] (SPARK-21088) CrossValidator, TrainValidationSplit should collect all models when fitting: Python API
[ https://issues.apache.org/jira/browse/SPARK-21088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley resolved SPARK-21088.
---------------------------------------
       Resolution: Fixed
    Fix Version/s: 2.4.0

Issue resolved by pull request 19627
[https://github.com/apache/spark/pull/19627]

> CrossValidator, TrainValidationSplit should collect all models when fitting:
> Python API
> ------------------------------------------------------------------------
>
> Key: SPARK-21088
> URL: https://issues.apache.org/jira/browse/SPARK-21088
> Project: Spark
> Issue Type: Sub-task
> Components: ML, PySpark
> Affects Versions: 2.2.0
> Reporter: Joseph K. Bradley
> Assignee: Weichen Xu
> Priority: Major
> Fix For: 2.4.0
>
> In PySpark:
> * Add a parameter controlling whether to collect the full model list during
> CrossValidator/TrainValidationSplit fitting (default is off, so the change
> cannot cause OOM).
> * Add a method on CrossValidatorModel/TrainValidationSplitModel that allows
> the user to get the model list.
> * CrossValidatorModelWriter adds an "option" allowing the user to control
> whether to persist the model list to disk.
> Note: when persisting the model list, use indices as the sub-model path.
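The opt-in sub-model collection described above can be sketched with a toy fitting loop. This is an illustrative sketch of the design (collect off by default to avoid holding every model in memory), not the PySpark `collectSubModels` implementation; all names here are assumptions.

```python
def cross_validate(folds, fit, evaluate, collect_sub_models=False):
    """Toy CV loop: optionally keep every fitted sub-model. The default is
    off, so users who don't opt in pay no extra memory cost (no OOM risk)."""
    scores = []
    sub_models = [] if collect_sub_models else None
    for train, valid in folds:
        model = fit(train)
        scores.append(evaluate(model, valid))
        if collect_sub_models:
            sub_models.append(model)          # retained only on request
    best = max(range(len(scores)), key=scores.__getitem__)
    return scores, sub_models, best

# toy "model": just the sum of the training data
folds = [([1, 2], [3]), ([3, 4], [5])]
scores, subs, best = cross_validate(folds, sum, lambda m, v: m + sum(v),
                                    collect_sub_models=True)
```

Persisting `subs` by fold index mirrors the note in the description about using indices as the sub-model path.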
[jira] [Assigned] (SPARK-21088) CrossValidator, TrainValidationSplit should collect all models when fitting: Python API
[ https://issues.apache.org/jira/browse/SPARK-21088?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley reassigned SPARK-21088:
-----------------------------------------
    Assignee: Weichen Xu
[jira] [Updated] (SPARK-9312) The OneVsRest model does not provide rawPrediction
[ https://issues.apache.org/jira/browse/SPARK-9312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-9312:
-------------------------------------
    Summary: The OneVsRest model does not provide rawPrediction
    (was: The OneVsRest model does not provide confidence factor (not
    probability) along with the prediction)

> The OneVsRest model does not provide rawPrediction
> ---------------------------------------------------
>
> Key: SPARK-9312
> URL: https://issues.apache.org/jira/browse/SPARK-9312
> Project: Spark
> Issue Type: Improvement
> Components: ML, MLlib
> Affects Versions: 1.4.0, 1.4.1
> Reporter: Badari Madhav
> Assignee: Lu Wang
> Priority: Major
> Labels: features
> Fix For: 2.4.0
>
> Original Estimate: 72h
> Remaining Estimate: 72h
[jira] [Assigned] (SPARK-9312) The OneVsRest model does not provide confidence factor(not probability) along with the prediction
[ https://issues.apache.org/jira/browse/SPARK-9312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley reassigned SPARK-9312:
----------------------------------------
    Assignee: Lu Wang
[jira] [Resolved] (SPARK-9312) The OneVsRest model does not provide confidence factor(not probability) along with the prediction
[ https://issues.apache.org/jira/browse/SPARK-9312?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-9312.

Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21044 [https://github.com/apache/spark/pull/21044]

> The OneVsRest model does not provide confidence factor (not probability) along with the prediction
>
> Key: SPARK-9312
> URL: https://issues.apache.org/jira/browse/SPARK-9312
> Project: Spark
> Issue Type: Improvement
> Components: ML, MLlib
> Affects Versions: 1.4.0, 1.4.1
> Reporter: Badari Madhav
> Assignee: Lu Wang
> Priority: Major
> Labels: features
> Fix For: 2.4.0
> Original Estimate: 72h
> Remaining Estimate: 72h
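The rawPrediction this issue asks OneVsRest to expose is, in the general one-vs-rest scheme, simply the vector of per-class confidence scores from the binary classifiers, with the final label as the argmax. A minimal sketch of that scheme (not Spark's internal code):

```python
def one_vs_rest_raw_prediction(x, binary_models):
    """Assemble a rawPrediction vector from per-class binary classifiers.

    Each binary_models[k] maps a feature vector to a real-valued
    confidence for class k. The raw scores are returned alongside the
    predicted class (the index with the highest score), so callers can
    inspect confidences rather than only the hard prediction.
    """
    raw = [m(x) for m in binary_models]
    prediction = max(range(len(raw)), key=lambda k: raw[k])
    return raw, prediction
```

Exposing `raw` rather than only `prediction` is what distinguishes a rawPrediction column from a bare prediction column.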
[jira] [Resolved] (SPARK-22883) ML test for StructuredStreaming: spark.ml.feature, A-M
[ https://issues.apache.org/jira/browse/SPARK-22883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-22883.

Resolution: Fixed
Fix Version/s: 2.3.1

Issue resolved by pull request 21042 [https://github.com/apache/spark/pull/21042]

> ML test for StructuredStreaming: spark.ml.feature, A-M
>
> Key: SPARK-22883
> URL: https://issues.apache.org/jira/browse/SPARK-22883
> Project: Spark
> Issue Type: Test
> Components: ML, Tests
> Affects Versions: 2.3.0
> Reporter: Joseph K. Bradley
> Assignee: Joseph K. Bradley
> Priority: Major
> Fix For: 2.4.0, 2.3.1
>
> *For featurizers with names from A - M*
> Task for adding Structured Streaming tests for all Models/Transformers in a sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in https://github.com/apache/spark/pull/19843
[jira] [Updated] (SPARK-22883) ML test for StructuredStreaming: spark.ml.feature, A-M
[ https://issues.apache.org/jira/browse/SPARK-22883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-22883:

Fix Version/s: 2.4.0

> ML test for StructuredStreaming: spark.ml.feature, A-M
>
> Key: SPARK-22883
> URL: https://issues.apache.org/jira/browse/SPARK-22883
> Project: Spark
> Issue Type: Test
> Components: ML, Tests
> Affects Versions: 2.3.0
> Reporter: Joseph K. Bradley
> Assignee: Joseph K. Bradley
> Priority: Major
> Fix For: 2.4.0
>
> *For featurizers with names from A - M*
> Task for adding Structured Streaming tests for all Models/Transformers in a sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in https://github.com/apache/spark/pull/19843
[jira] [Updated] (SPARK-22883) ML test for StructuredStreaming: spark.ml.feature, A-M
[ https://issues.apache.org/jira/browse/SPARK-22883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-22883:

Target Version/s: 2.3.1, 2.4.0

> ML test for StructuredStreaming: spark.ml.feature, A-M
>
> Key: SPARK-22883
> URL: https://issues.apache.org/jira/browse/SPARK-22883
> Project: Spark
> Issue Type: Test
> Components: ML, Tests
> Affects Versions: 2.3.0
> Reporter: Joseph K. Bradley
> Assignee: Joseph K. Bradley
> Priority: Major
> Fix For: 2.4.0
>
> *For featurizers with names from A - M*
> Task for adding Structured Streaming tests for all Models/Transformers in a sub-module in spark.ml
> For an example, see LinearRegressionSuite.scala in https://github.com/apache/spark/pull/19843
[jira] [Resolved] (SPARK-19947) RFormulaModel always throws Exception on transforming data with NULL or Unseen labels
[ https://issues.apache.org/jira/browse/SPARK-19947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-19947.

Resolution: Fixed
Fix Version/s: 2.4.0

I'll mark this as complete. Those earlier PRs fixed some issues, and [SPARK-23562] should fix the rest.

> RFormulaModel always throws Exception on transforming data with NULL or Unseen labels
>
> Key: SPARK-19947
> URL: https://issues.apache.org/jira/browse/SPARK-19947
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 2.1.0
> Reporter: Andrey Yatsuk
> Priority: Major
> Fix For: 2.4.0
>
> I have a trained ML model and a big data table in Parquet. I want to add a new column with predicted values to this table. I can't lose any data, but the new column may contain null values.
> The RFormulaModel.fit() method creates a new StringIndexer with the default parameter (handleInvalid="error"), and VectorAssembler throws an Exception on NULL values. So I must call df.na.drop() to transform this DataFrame, which I don't want to do.
> We need to add a new parameter to RFormula, like handleInvalid in StringIndexer. Or add a transform(Seq): Vector method which the user can call as a UDF in df.withColumn("predicted", functions.callUDF(rFormulaModel::transform, Seq))
[jira] [Resolved] (SPARK-23562) RFormula handleInvalid should handle invalid values in non-string columns.
[ https://issues.apache.org/jira/browse/SPARK-23562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-23562.

Resolution: Fixed
Fix Version/s: 2.4.0

I think everything has been fixed, so I'll close this. Thanks [~yogeshgarg] and [~huaxingao]!

> RFormula handleInvalid should handle invalid values in non-string columns.
>
> Key: SPARK-23562
> URL: https://issues.apache.org/jira/browse/SPARK-23562
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 2.3.0
> Reporter: Bago Amirbekian
> Priority: Major
> Fix For: 2.4.0
>
> Currently, when handleInvalid is set to 'keep' or 'skip', this only applies to String fields. Numeric fields that are null will either cause the transformer to fail or might be null in the resulting label column.
> I'm not sure what the semantics of 'keep' might be for numeric columns with null values, but we should be able to at least support 'skip' for these types.
> --> Discussed offline: null values can be converted to NaN values for "keep"
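The three handleInvalid policies discussed in this issue, applied to a numeric column containing nulls, can be sketched in plain Python. This is an illustration of the semantics agreed above (nulls become NaN under "keep", rows are dropped under "skip", and "error" fails fast), not RFormula's actual implementation:

```python
def handle_invalid(values, policy):
    """Apply 'error'/'skip'/'keep' semantics to a numeric column where
    None stands in for a SQL null. Illustrative sketch only."""
    if policy == "error":
        # Fail fast, mirroring the default behavior the issue describes.
        if any(v is None for v in values):
            raise ValueError("null value in input column")
        return list(values)
    if policy == "skip":
        # Drop rows whose value is null.
        return [v for v in values if v is not None]
    if policy == "keep":
        # Per the discussion above: convert nulls to NaN.
        return [float("nan") if v is None else v for v in values]
    raise ValueError(f"unknown handleInvalid policy: {policy}")
```

Note that "keep" preserves row alignment with the rest of the dataset, while "skip" changes the row count, which is why the two policies are not interchangeable downstream.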
[jira] [Updated] (SPARK-23562) RFormula handleInvalid should handle invalid values in non-string columns.
[ https://issues.apache.org/jira/browse/SPARK-23562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-23562:

Shepherd: Joseph K. Bradley

> RFormula handleInvalid should handle invalid values in non-string columns.
>
> Key: SPARK-23562
> URL: https://issues.apache.org/jira/browse/SPARK-23562
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 2.3.0
> Reporter: Bago Amirbekian
> Priority: Major
> Fix For: 2.4.0
>
> Currently, when handleInvalid is set to 'keep' or 'skip', this only applies to String fields. Numeric fields that are null will either cause the transformer to fail or might be null in the resulting label column.
> I'm not sure what the semantics of 'keep' might be for numeric columns with null values, but we should be able to at least support 'skip' for these types.
> --> Discussed offline: null values can be converted to NaN values for "keep"
[jira] [Resolved] (SPARK-23944) Add Param set functions to LSHModel types
[ https://issues.apache.org/jira/browse/SPARK-23944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-23944.

Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21015 [https://github.com/apache/spark/pull/21015]

> Add Param set functions to LSHModel types
>
> Key: SPARK-23944
> URL: https://issues.apache.org/jira/browse/SPARK-23944
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 2.3.0
> Reporter: Lu Wang
> Assignee: Lu Wang
> Priority: Major
> Fix For: 2.4.0
>
> Two Param set methods (setInputCol, setOutputCol) are added to the two LSHModel types for min hash and random projections.
[jira] [Assigned] (SPARK-23944) Add Param set functions to LSHModel types
[ https://issues.apache.org/jira/browse/SPARK-23944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-23944:

Assignee: Lu Wang

> Add Param set functions to LSHModel types
>
> Key: SPARK-23944
> URL: https://issues.apache.org/jira/browse/SPARK-23944
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 2.3.0
> Reporter: Lu Wang
> Assignee: Lu Wang
> Priority: Major
>
> Two Param set methods (setInputCol, setOutputCol) are added to the two LSHModel types for min hash and random projections.
[jira] [Resolved] (SPARK-23871) add python api for VectorAssembler handleInvalid
[ https://issues.apache.org/jira/browse/SPARK-23871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-23871.

Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21003 [https://github.com/apache/spark/pull/21003]

> add python api for VectorAssembler handleInvalid
>
> Key: SPARK-23871
> URL: https://issues.apache.org/jira/browse/SPARK-23871
> Project: Spark
> Issue Type: Sub-task
> Components: ML, PySpark
> Affects Versions: 2.3.0
> Reporter: yogesh garg
> Assignee: Huaxin Gao
> Priority: Minor
> Fix For: 2.4.0
[jira] [Updated] (SPARK-23871) add python api for VectorAssembler handleInvalid
[ https://issues.apache.org/jira/browse/SPARK-23871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-23871:

Shepherd: Joseph K. Bradley

> add python api for VectorAssembler handleInvalid
>
> Key: SPARK-23871
> URL: https://issues.apache.org/jira/browse/SPARK-23871
> Project: Spark
> Issue Type: Sub-task
> Components: ML, PySpark
> Affects Versions: 2.3.0
> Reporter: yogesh garg
> Assignee: Huaxin Gao
> Priority: Minor
[jira] [Assigned] (SPARK-23871) add python api for VectorAssembler handleInvalid
[ https://issues.apache.org/jira/browse/SPARK-23871?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-23871:

Assignee: Huaxin Gao

> add python api for VectorAssembler handleInvalid
>
> Key: SPARK-23871
> URL: https://issues.apache.org/jira/browse/SPARK-23871
> Project: Spark
> Issue Type: Sub-task
> Components: ML, PySpark
> Affects Versions: 2.3.0
> Reporter: yogesh garg
> Assignee: Huaxin Gao
> Priority: Minor
[jira] [Updated] (SPARK-21856) Update Python API for MultilayerPerceptronClassifierModel
[ https://issues.apache.org/jira/browse/SPARK-21856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-21856:

Fix Version/s: 2.3.0

> Update Python API for MultilayerPerceptronClassifierModel
>
> Key: SPARK-21856
> URL: https://issues.apache.org/jira/browse/SPARK-21856
> Project: Spark
> Issue Type: New Feature
> Components: ML, PySpark
> Affects Versions: 2.3.0
> Reporter: Weichen Xu
> Assignee: Chunsheng Ji
> Priority: Minor
> Fix For: 2.3.0
>
> SPARK-12664 has exposed probability in MultilayerPerceptronClassifier, so the Python API also needs updating.
[jira] [Assigned] (SPARK-23751) Kolmogorov-Smirnoff test Python API in pyspark.ml
[ https://issues.apache.org/jira/browse/SPARK-23751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-23751:

Assignee: Weichen Xu

> Kolmogorov-Smirnoff test Python API in pyspark.ml
>
> Key: SPARK-23751
> URL: https://issues.apache.org/jira/browse/SPARK-23751
> Project: Spark
> Issue Type: New Feature
> Components: ML, PySpark
> Affects Versions: 2.4.0
> Reporter: Joseph K. Bradley
> Assignee: Weichen Xu
> Priority: Major
> Fix For: 2.4.0
>
> Python wrapper for the new DataFrame-based API for the KS test
[jira] [Resolved] (SPARK-23751) Kolmogorov-Smirnoff test Python API in pyspark.ml
[ https://issues.apache.org/jira/browse/SPARK-23751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-23751.

Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20904 [https://github.com/apache/spark/pull/20904]

> Kolmogorov-Smirnoff test Python API in pyspark.ml
>
> Key: SPARK-23751
> URL: https://issues.apache.org/jira/browse/SPARK-23751
> Project: Spark
> Issue Type: New Feature
> Components: ML, PySpark
> Affects Versions: 2.4.0
> Reporter: Joseph K. Bradley
> Assignee: Weichen Xu
> Priority: Major
> Fix For: 2.4.0
>
> Python wrapper for the new DataFrame-based API for the KS test
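For context, the quantity a one-sample Kolmogorov-Smirnov test wrapper ultimately exposes is the maximum gap between the sample's empirical CDF and a reference CDF. A minimal pure-Python sketch of that statistic against a normal reference (illustrative only; not Spark's implementation):

```python
import math

def normal_cdf(x, mean=0.0, std=1.0):
    """CDF of a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def ks_statistic(sample, cdf=normal_cdf):
    """One-sample Kolmogorov-Smirnov statistic: the largest absolute
    difference between the empirical CDF of `sample` and `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        # The ECDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, abs((i + 1) / n - cdf(x)), abs(cdf(x) - i / n))
    return d
```

For the single observation 0.0 against a standard normal, the ECDF jumps from 0 to 1 at 0 while the reference CDF is 0.5 there, so the statistic is 0.5.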
[jira] [Updated] (SPARK-23944) Add Param set functions to LSHModel types
[ https://issues.apache.org/jira/browse/SPARK-23944?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-23944:

Fix Version/s: (was: 2.4.0)

> Add Param set functions to LSHModel types
>
> Key: SPARK-23944
> URL: https://issues.apache.org/jira/browse/SPARK-23944
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 2.3.0
> Reporter: Lu Wang
> Priority: Major
>
> Two Param set methods (setInputCol, setOutputCol) are added to the two LSHModel types for min hash and random projections.
[jira] [Resolved] (SPARK-14681) Provide label/impurity stats for spark.ml decision tree nodes
[ https://issues.apache.org/jira/browse/SPARK-14681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-14681.

Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 20786 [https://github.com/apache/spark/pull/20786]

> Provide label/impurity stats for spark.ml decision tree nodes
>
> Key: SPARK-14681
> URL: https://issues.apache.org/jira/browse/SPARK-14681
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: Joseph K. Bradley
> Assignee: Weichen Xu
> Priority: Major
> Fix For: 2.4.0
>
> Currently, spark.ml decision trees provide all node info except for the aggregated stats about labels and impurities. This task is to provide those publicly. We need to choose a good API for it, so we should discuss the design on this issue before implementing it.
[jira] [Assigned] (SPARK-14681) Provide label/impurity stats for spark.ml decision tree nodes
[ https://issues.apache.org/jira/browse/SPARK-14681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-14681:

Assignee: Weichen Xu

> Provide label/impurity stats for spark.ml decision tree nodes
>
> Key: SPARK-14681
> URL: https://issues.apache.org/jira/browse/SPARK-14681
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Reporter: Joseph K. Bradley
> Assignee: Weichen Xu
> Priority: Major
>
> Currently, spark.ml decision trees provide all node info except for the aggregated stats about labels and impurities. This task is to provide those publicly. We need to choose a good API for it, so we should discuss the design on this issue before implementing it.
[jira] [Commented] (SPARK-21005) VectorIndexerModel does not prepare output column field correctly
[ https://issues.apache.org/jira/browse/SPARK-21005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16431079#comment-16431079 ] Joseph K. Bradley commented on SPARK-21005:

I don't actually see why this is a problem: If a feature is categorical, we should not silently convert it to continuous. To use a high-arity categorical feature in a decision tree, one should convert it to a different representation first, such as hashing to a set of bins with HashingTF. That said, I do think we should clarify this behavior in the VectorIndexer docstring. I know it's been a long time since you sent your PR, but would you want to update it to simply update the docs? If you're busy now, I'd be happy to take it over though. Thanks!

> VectorIndexerModel does not prepare output column field correctly
>
> Key: SPARK-21005
> URL: https://issues.apache.org/jira/browse/SPARK-21005
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 2.1.1
> Reporter: Chen Lin
> Priority: Major
>
> From my understanding of the documentation, VectorIndexer decides which features should be categorical based on the number of distinct values: features with at most maxCategories distinct values are declared categorical, while features that exceed maxCategories are declared continuous.
> Currently, VectorIndexerModel works all right on a dataset with an empty schema. However, when VectorIndexerModel transforms a dataset with `ML_ATTR` metadata, it may not output the expected result. For example, a feature with a nominal attribute whose distinct values exceed maxCategories will not be treated as a continuous feature as expected, but still as a categorical feature. This may cause all the tree-based algorithms (like Decision Tree, Random Forest, GBDT, etc.) to throw errors such as "DecisionTree requires maxBins (= $maxPossibleBins) to be at least as large as the number of values in each categorical feature, but categorical feature $maxCategory has $maxCategoriesPerFeature values. Considering remove this and other categorical features with a large number of values, or add more training examples.".
> Correct me if my understanding is wrong.
> I will submit a PR soon to resolve this issue.
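The decision rule at the heart of this issue (a feature with at most maxCategories distinct values is treated as categorical, otherwise as continuous) can be sketched in a few lines. This is an illustration of the documented rule, not Spark's implementation, and the helper name is hypothetical:

```python
def classify_features(columns, max_categories):
    """Classify each feature as categorical or continuous by counting
    its distinct values, mirroring VectorIndexer's documented rule:
    at most max_categories distinct values -> categorical.

    columns: dict mapping feature name -> list of observed values.
    """
    kinds = {}
    for name, values in columns.items():
        kinds[name] = ("categorical"
                       if len(set(values)) <= max_categories
                       else "continuous")
    return kinds
```

The bug report above concerns the case where pre-existing `ML_ATTR` metadata marks a feature as nominal even though this count-based rule would classify it as continuous.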
[jira] [Commented] (SPARK-18092) add type cast to avoid error "Column prediction must be of type DoubleType but was actually FloatType"
[ https://issues.apache.org/jira/browse/SPARK-18092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16429134#comment-16429134 ] Joseph K. Bradley commented on SPARK-18092:

Can you please add a description and make the title more specific? Currently, it's unclear what this JIRA is addressing without looking at the code. Thanks!

> add type cast to avoid error "Column prediction must be of type DoubleType but was actually FloatType"
>
> Key: SPARK-18092
> URL: https://issues.apache.org/jira/browse/SPARK-18092
> Project: Spark
> Issue Type: Bug
> Reporter: albert fang
> Priority: Major
[jira] [Updated] (SPARK-23751) Kolmogorov-Smirnoff test Python API in pyspark.ml
[ https://issues.apache.org/jira/browse/SPARK-23751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-23751:

Shepherd: Joseph K. Bradley

> Kolmogorov-Smirnoff test Python API in pyspark.ml
>
> Key: SPARK-23751
> URL: https://issues.apache.org/jira/browse/SPARK-23751
> Project: Spark
> Issue Type: New Feature
> Components: ML, PySpark
> Affects Versions: 2.4.0
> Reporter: Joseph K. Bradley
> Priority: Major
>
> Python wrapper for the new DataFrame-based API for the KS test
[jira] [Resolved] (SPARK-23859) Initial PR for Instrumentation improvements: UUID and logging levels
[ https://issues.apache.org/jira/browse/SPARK-23859?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-23859.

Resolution: Fixed
Fix Version/s: 2.4.0

Resolved with https://github.com/apache/spark/pull/20982

> Initial PR for Instrumentation improvements: UUID and logging levels
>
> Key: SPARK-23859
> URL: https://issues.apache.org/jira/browse/SPARK-23859
> Project: Spark
> Issue Type: Sub-task
> Components: ML
> Affects Versions: 2.3.0
> Reporter: Joseph K. Bradley
> Assignee: Weichen Xu
> Priority: Major
> Fix For: 2.4.0
>
> This is a subtask for an initial PR to improve MLlib's Instrumentation class for logging. It will address a couple of issues and use the changes in LogisticRegression as an example class.
> Issues:
> * The UUID is currently generated from an atomic integer. This is a problem since the integer is reset whenever a persisted Estimator is loaded on a new cluster. We should just use a random UUID to get a new UUID each time with high probability.
> * We use both Instrumentation and Logging to log stuff. Let's standardize around Instrumentation in MLlib since it can associate logs with the Estimator or Transformer which produced the logs (via a prefix with the algorithm's name or UUID).
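The first issue above (an atomic counter restarts at zero when a persisted Estimator is reloaded on a new cluster, so identifiers collide, whereas a random UUID stays unique with high probability) can be sketched as follows. The function name is hypothetical, for illustration:

```python
import uuid

def make_instrumentation_prefix(algo_name):
    """Build a log prefix from the algorithm name plus a random UUID.

    Unlike an atomic counter, which resets whenever a persisted
    Estimator is reloaded, uuid.uuid4() yields a fresh identifier on
    every call with negligible collision probability.
    """
    return f"{algo_name}-{uuid.uuid4()}"
```

Two consecutive calls produce distinct prefixes, so log lines from different fit() calls of the same algorithm can be told apart.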
[jira] [Commented] (SPARK-23686) Make better usage of org.apache.spark.ml.util.Instrumentation
[ https://issues.apache.org/jira/browse/SPARK-23686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16428612#comment-16428612 ] Joseph K. Bradley commented on SPARK-23686:

I wanted to ping some other active MLlib committers since this will change logging in MLlib. The main change will be to prefix logged messages with a string that includes a unique identifier for the algorithm. That will make it easier to associate log messages with Pipeline stages; this is hard right now, e.g., if there are multiple StringIndexers in the same Pipeline.fit() call. CC [~mlnick], [~holdenk], [~dbtsai], [~yanboliang], [~sethah]

> Make better usage of org.apache.spark.ml.util.Instrumentation
>
> Key: SPARK-23686
> URL: https://issues.apache.org/jira/browse/SPARK-23686
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 2.3.0
> Reporter: Bago Amirbekian
> Priority: Major
>
> This Jira is a bit high level and might require subtasks or other jiras for more specific tasks.
> I've noticed that we don't make the best usage of the instrumentation class. Specifically, sometimes we bypass the instrumentation class and use the debugger instead. For example, https://github.com/apache/spark/blob/9b9827759af2ca3eea146a6032f9165f640ce152/mllib/src/main/scala/org/apache/spark/ml/tree/impl/RandomForest.scala#L143
> Also there are some things that might be useful to log in the instrumentation class that we currently don't. For example:
> * number of training examples
> * mean/var of label (regression)
> I know computing these things can be expensive in some cases, but especially when this data is already available we can log it for free. For example, the Logistic Regression Summarizer computes some useful data including numRows that we don't log.
[jira] [Resolved] (SPARK-23870) Forward RFormula handleInvalid Param to VectorAssembler
[ https://issues.apache.org/jira/browse/SPARK-23870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-23870.

Resolution: Fixed
Fix Version/s: 2.4.0

Resolved via https://github.com/apache/spark/pull/20970

> Forward RFormula handleInvalid Param to VectorAssembler
>
> Key: SPARK-23870
> URL: https://issues.apache.org/jira/browse/SPARK-23870
> Project: Spark
> Issue Type: Sub-task
> Components: ML
> Affects Versions: 2.3.0
> Reporter: yogesh garg
> Assignee: yogesh garg
> Priority: Major
> Fix For: 2.4.0
[jira] [Assigned] (SPARK-23870) Forward RFormula handleInvalid Param to VectorAssembler
[ https://issues.apache.org/jira/browse/SPARK-23870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-23870:

Assignee: yogesh garg

> Forward RFormula handleInvalid Param to VectorAssembler
>
> Key: SPARK-23870
> URL: https://issues.apache.org/jira/browse/SPARK-23870
> Project: Spark
> Issue Type: Sub-task
> Components: ML
> Affects Versions: 2.3.0
> Reporter: yogesh garg
> Assignee: yogesh garg
> Priority: Major
[jira] [Updated] (SPARK-23870) Forward RFormula handleInvalid Param to VectorAssembler
[ https://issues.apache.org/jira/browse/SPARK-23870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-23870:

Fix Version/s: (was: 2.4.0)

> Forward RFormula handleInvalid Param to VectorAssembler
>
> Key: SPARK-23870
> URL: https://issues.apache.org/jira/browse/SPARK-23870
> Project: Spark
> Issue Type: Sub-task
> Components: ML
> Affects Versions: 2.3.0
> Reporter: yogesh garg
> Priority: Major
[jira] [Resolved] (SPARK-22667) Fix model-specific optimization support for ML tuning: Python API
[ https://issues.apache.org/jira/browse/SPARK-22667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-22667.

Resolution: Duplicate
Fix Version/s: 2.3.0

> Fix model-specific optimization support for ML tuning: Python API
>
> Key: SPARK-22667
> URL: https://issues.apache.org/jira/browse/SPARK-22667
> Project: Spark
> Issue Type: Improvement
> Components: ML
> Affects Versions: 2.2.0
> Reporter: Weichen Xu
> Priority: Major
> Fix For: 2.3.0
>
> Fix model-specific optimization support for ML tuning: Python API
> See the explanation here: https://docs.google.com/document/d/1xw5M4sp1e0eQie75yIt-r6-GTuD5vpFf_I6v-AFBM3M/edit