[jira] [Comment Edited] (SPARK-3383) DecisionTree aggregate size could be smaller
[ https://issues.apache.org/jira/browse/SPARK-3383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240284#comment-16240284 ] Yan Facai (颜发才) edited comment on SPARK-3383 at 11/6/17 1:28 PM: - [~WeichenXu123] Good work! I'd like to take a look if time allows. Anyway, I believe that unordered features can benefit a lot from your work. was (Author: facai): [~WeichenXu123] Good work! I'd like to take a look if time allows. Anyway, I believe that unordered features can benefit a lot from the PR. > DecisionTree aggregate size could be smaller > > > Key: SPARK-3383 > URL: https://issues.apache.org/jira/browse/SPARK-3383 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.1.0 >Reporter: Joseph K. Bradley >Priority: Minor > > Storage and communication optimization: > DecisionTree aggregate statistics could store less data (described below). > The savings would be significant for datasets with many low-arity categorical > features (binary features, or unordered categorical features). Savings would > be negligible for continuous features. > DecisionTree stores a vector sufficient statistics for each (node, feature, > bin). We could store 1 fewer bin per (node, feature): For a given (node, > feature), if we store these vectors for all but the last bin, and also store > the total statistics for each node, then we could compute the statistics for > the last bin. For binary and unordered categorical features, this would cut > in half the number of bins to store and communicate. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3383) DecisionTree aggregate size could be smaller
[ https://issues.apache.org/jira/browse/SPARK-3383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240284#comment-16240284 ] Yan Facai (颜发才) commented on SPARK-3383: [~WeichenXu123] Good work! I'd like to take a look if time allows. Anyway, I believe that unordered features can benefit a lot from the PR. > DecisionTree aggregate size could be smaller > > > Key: SPARK-3383 > URL: https://issues.apache.org/jira/browse/SPARK-3383 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.1.0 >Reporter: Joseph K. Bradley >Priority: Minor > > Storage and communication optimization: > DecisionTree aggregate statistics could store less data (described below). > The savings would be significant for datasets with many low-arity categorical > features (binary features, or unordered categorical features). Savings would > be negligible for continuous features. > DecisionTree stores a vector sufficient statistics for each (node, feature, > bin). We could store 1 fewer bin per (node, feature): For a given (node, > feature), if we store these vectors for all but the last bin, and also store > the total statistics for each node, then we could compute the statistics for > the last bin. For binary and unordered categorical features, this would cut > in half the number of bins to store and communicate. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
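For reference, a minimal sketch of the optimization described above: if per-bin sufficient statistics are stored for all but the last bin of a (node, feature) pair, the last bin can be recovered by subtracting the stored bins from the node totals. All names below are illustrative and are not Spark's internal aggregation API.

{code}
// Recover the dropped bin's class counts from the node totals.
// binStats(b)(c): count of class c in stored bin b; nodeTotal(c): count of
// class c over the whole node. Both structures are assumptions for this sketch.
def lastBinStats(binStats: Array[Array[Double]], nodeTotal: Array[Double]): Array[Double] = {
  val last = nodeTotal.clone()
  binStats.foreach { bin =>
    var c = 0
    while (c < last.length) {
      last(c) -= bin(c)
      c += 1
    }
  }
  last
}
{code}

For binary or unordered categorical features this halves the number of bins that must be stored and shuffled, at the cost of keeping one extra vector of totals per node.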
[jira] [Commented] (SPARK-3165) DecisionTree does not use sparsity in data
[ https://issues.apache.org/jira/browse/SPARK-3165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16181925#comment-16181925 ] Yan Facai (颜发才) commented on SPARK-3165: My PR has been closed because a better solution exists, so anyone who is interested is welcome to take over this JIRA. Thanks. > DecisionTree does not use sparsity in data > -- > > Key: SPARK-3165 > URL: https://issues.apache.org/jira/browse/SPARK-3165 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > Improvement: computation > DecisionTree should take advantage of sparse feature vectors. Aggregation > over training data could handle the empty/zero-valued data elements more > efficiently. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
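As a rough illustration of the improvement proposed in SPARK-3165 (not the actual DecisionTree aggregation code), an aggregation can visit only the explicitly stored entries of a sparse vector instead of every index:

{code}
import org.apache.spark.ml.linalg.{SparseVector, Vector, Vectors}

// Sum feature values by touching only the non-zero slots of a SparseVector;
// dense vectors fall back to a full scan.
def activeSum(v: Vector): Double = v match {
  case sv: SparseVector => sv.values.sum
  case dv               => dv.toArray.sum
}

// Example: only two of the 1000 entries are actually visited.
val v = Vectors.sparse(1000, Array(3, 999), Array(1.5, 2.5))
assert(activeSum(v) == 4.0)
{code}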
[jira] [Commented] (SPARK-21748) Migrate the implementation of HashingTF from MLlib to ML
[ https://issues.apache.org/jira/browse/SPARK-21748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16133947#comment-16133947 ] Yan Facai (颜发才) commented on SPARK-21748: - There seems to be something wrong with CI. The corresponding PR is not automatically linked here, https://github.com/apache/spark/pull/18902. > Migrate the implementation of HashingTF from MLlib to ML > > > Key: SPARK-21748 > URL: https://issues.apache.org/jira/browse/SPARK-21748 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.2.0 > Reporter: Yan Facai (颜发才) > > To add indexOf method for ml.feature.HashingTF, it's better to migrate the > implementation from MLLib to ML. > I have been worked on it, and I'll submit my PR later. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21748) Migrate the implementation of HashingTF from MLlib to ML
[ https://issues.apache.org/jira/browse/SPARK-21748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16133947#comment-16133947 ] Yan Facai (颜发才) edited comment on SPARK-21748 at 8/19/17 4:43 AM: -- There seems to be something wrong with CI. The corresponding PR is not automatically linked here, https://github.com/apache/spark/pull/18998. was (Author: facai): There seems to be something wrong with CI. The corresponding PR is not automatically linked here, https://github.com/apache/spark/pull/18902. > Migrate the implementation of HashingTF from MLlib to ML > > > Key: SPARK-21748 > URL: https://issues.apache.org/jira/browse/SPARK-21748 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.2.0 > Reporter: Yan Facai (颜发才) > > To add indexOf method for ml.feature.HashingTF, it's better to migrate the > implementation from MLLib to ML. > I have been worked on it, and I'll submit my PR later. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21748) Migrate the implementation of HashingTF from MLlib to ML
[ https://issues.apache.org/jira/browse/SPARK-21748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128725#comment-16128725 ] Yan Facai (颜发才) edited comment on SPARK-21748 at 8/16/17 12:38 PM: --- [~yanboliang] Thanks, Yanbo. As discussed on https://github.com/apache/spark/pull/18736, this JIRA was created. Once this work is done, SPARK-21481 will be easier. was (Author: facai): [~yanboliang] Thanks, yanbo. As discussed on https://github.com/apache/spark/pull/18736, the JIRA is created. When this work is got done, it's easier for SPARK-21481. > Migrate the implementation of HashingTF from MLlib to ML > > > Key: SPARK-21748 > URL: https://issues.apache.org/jira/browse/SPARK-21748 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.2.0 >Reporter: Yan Facai (颜发才) > > To add indexOf method for ml.feature.HashingTF, it's better to migrate the > implementation from MLLib to ML. > I have been worked on it, and I'll submit my PR later. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21748) Migrate the implementation of HashingTF from MLlib to ML
[ https://issues.apache.org/jira/browse/SPARK-21748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16128725#comment-16128725 ] Yan Facai (颜发才) commented on SPARK-21748: - [~yanboliang] Thanks, yanbo. As discussed on https://github.com/apache/spark/pull/18736, the JIRA is created. When this work is got done, it's easier for SPARK-21481. > Migrate the implementation of HashingTF from MLlib to ML > > > Key: SPARK-21748 > URL: https://issues.apache.org/jira/browse/SPARK-21748 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.2.0 >Reporter: Yan Facai (颜发才) > > To add indexOf method for ml.feature.HashingTF, it's better to migrate the > implementation from MLLib to ML. > I have been worked on it, and I'll submit my PR later. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-21748) Migrate the implementation of HashingTF from MLlib to ML
Yan Facai (颜发才) created SPARK-21748: --- Summary: Migrate the implementation of HashingTF from MLlib to ML Key: SPARK-21748 URL: https://issues.apache.org/jira/browse/SPARK-21748 Project: Spark Issue Type: Improvement Components: ML, MLlib Affects Versions: 2.2.0 Reporter: Yan Facai (颜发才) To add indexOf method for ml.feature.HashingTF, it's better to migrate the implementation from MLLib to ML. I have been worked on it, and I'll submit my PR later. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
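For context, the mllib transformer already exposes the term-to-index mapping that an `indexOf` method on the ml side would provide; a small sketch (the feature count is arbitrary):

{code}
import org.apache.spark.mllib.feature.{HashingTF => OldHashingTF}

// Today, finding the hash bucket of a term requires dropping down to mllib.
val oldTF = new OldHashingTF(1 << 18)
val bucket = oldTF.indexOf("spark")   // index of the term "spark" in the hashed space
println(s"'spark' hashes to bucket $bucket")
{code}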
[jira] [Commented] (SPARK-21690) one-pass imputer
[ https://issues.apache.org/jira/browse/SPARK-21690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122922#comment-16122922 ] Yan Facai (颜发才) commented on SPARK-21690: - Cool! Just go ahead. > one-pass imputer > > > Key: SPARK-21690 > URL: https://issues.apache.org/jira/browse/SPARK-21690 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.1 >Reporter: zhengruifeng > > {code} > val surrogates = $(inputCols).map { inputCol => > val ic = col(inputCol) > val filtered = dataset.select(ic.cast(DoubleType)) > .filter(ic.isNotNull && ic =!= $(missingValue) && !ic.isNaN) > if(filtered.take(1).length == 0) { > throw new SparkException(s"surrogate cannot be computed. " + > s"All the values in $inputCol are Null, Nan or > missingValue(${$(missingValue)})") > } > val surrogate = $(strategy) match { > case Imputer.mean => filtered.select(avg(inputCol)).as[Double].first() > case Imputer.median => filtered.stat.approxQuantile(inputCol, > Array(0.5), 0.001).head > } > surrogate > } > {code} > Current impl of {{Imputer}} process one column after after another. In this > place, we should parallelize the processing in a more efficient way. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21690) one-pass imputer
[ https://issues.apache.org/jira/browse/SPARK-21690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122904#comment-16122904 ] Yan Facai (颜发才) edited comment on SPARK-21690 at 8/11/17 6:02 AM: -- We can use `df.summary("mean")` and `df.stat.approxQuantile(df.columns, Array(0.5), 0.001)`, however, two passes are needed. How about the solution? If it's OK, I can work on it. was (Author: facai): We can use `df.summary("mean")` and `df.stat.approxQuantile(df.columns, Array(0.5), 0.001), however, two passes are needed. How about the solution? If it's OK, I can work on it. > one-pass imputer > > > Key: SPARK-21690 > URL: https://issues.apache.org/jira/browse/SPARK-21690 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.1 >Reporter: zhengruifeng > > {code} > val surrogates = $(inputCols).map { inputCol => > val ic = col(inputCol) > val filtered = dataset.select(ic.cast(DoubleType)) > .filter(ic.isNotNull && ic =!= $(missingValue) && !ic.isNaN) > if(filtered.take(1).length == 0) { > throw new SparkException(s"surrogate cannot be computed. " + > s"All the values in $inputCol are Null, Nan or > missingValue(${$(missingValue)})") > } > val surrogate = $(strategy) match { > case Imputer.mean => filtered.select(avg(inputCol)).as[Double].first() > case Imputer.median => filtered.stat.approxQuantile(inputCol, > Array(0.5), 0.001).head > } > surrogate > } > {code} > Current impl of {{Imputer}} process one column after after another. In this > place, we should parallelize the processing in a more efficient way. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21690) one-pass imputer
[ https://issues.apache.org/jira/browse/SPARK-21690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122904#comment-16122904 ] Yan Facai (颜发才) edited comment on SPARK-21690 at 8/11/17 6:02 AM: -- We can use `df.summary("mean")` and `df.stat.approxQuantile(df.columns, Array(0.5), 0.001)`. How about the solution? If it's OK, I can work on it. was (Author: facai): We can use `df.summary("mean")` and `df.stat.approxQuantile(df.columns, Array(0.5), 0.001)`, however, two passes are needed. How about the solution? If it's OK, I can work on it. > one-pass imputer > > > Key: SPARK-21690 > URL: https://issues.apache.org/jira/browse/SPARK-21690 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.1 >Reporter: zhengruifeng > > {code} > val surrogates = $(inputCols).map { inputCol => > val ic = col(inputCol) > val filtered = dataset.select(ic.cast(DoubleType)) > .filter(ic.isNotNull && ic =!= $(missingValue) && !ic.isNaN) > if(filtered.take(1).length == 0) { > throw new SparkException(s"surrogate cannot be computed. " + > s"All the values in $inputCol are Null, Nan or > missingValue(${$(missingValue)})") > } > val surrogate = $(strategy) match { > case Imputer.mean => filtered.select(avg(inputCol)).as[Double].first() > case Imputer.median => filtered.stat.approxQuantile(inputCol, > Array(0.5), 0.001).head > } > surrogate > } > {code} > Current impl of {{Imputer}} process one column after after another. In this > place, we should parallelize the processing in a more efficient way. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21690) one-pass imputer
[ https://issues.apache.org/jira/browse/SPARK-21690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122904#comment-16122904 ] Yan Facai (颜发才) commented on SPARK-21690: - We can use `df.summary("mean")` and `df.stat.approxQuantile(df.columns, Array(0.5), 0.001), however, two passes are needed. How about the solution? If it's OK, I can work on it. > one-pass imputer > > > Key: SPARK-21690 > URL: https://issues.apache.org/jira/browse/SPARK-21690 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.1 >Reporter: zhengruifeng > > {code} > val surrogates = $(inputCols).map { inputCol => > val ic = col(inputCol) > val filtered = dataset.select(ic.cast(DoubleType)) > .filter(ic.isNotNull && ic =!= $(missingValue) && !ic.isNaN) > if(filtered.take(1).length == 0) { > throw new SparkException(s"surrogate cannot be computed. " + > s"All the values in $inputCol are Null, Nan or > missingValue(${$(missingValue)})") > } > val surrogate = $(strategy) match { > case Imputer.mean => filtered.select(avg(inputCol)).as[Double].first() > case Imputer.median => filtered.stat.approxQuantile(inputCol, > Array(0.5), 0.001).head > } > surrogate > } > {code} > Current impl of {{Imputer}} process one column after after another. In this > place, we should parallelize the processing in a more efficient way. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
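A rough sketch of the single-pass direction discussed in the comments above, assuming Spark 2.2+ for the multi-column `approxQuantile` overload; it ignores the missing-value filtering that the real {{Imputer}} performs:

{code}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.avg

// Means for every input column in one job instead of one job per column.
def meansInOnePass(df: DataFrame, inputCols: Array[String]): Array[Double] = {
  val row = df.select(inputCols.map(c => avg(c)): _*).head()
  inputCols.indices.map(row.getDouble).toArray
}

// Medians for every input column in one pass via the multi-column overload.
def mediansInOnePass(df: DataFrame, inputCols: Array[String]): Array[Double] = {
  df.stat.approxQuantile(inputCols, Array(0.5), 0.001).map(_.head)
}
{code}

With either strategy, all surrogate values come back from a single job rather than one job per column.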
[jira] [Commented] (SPARK-21341) Spark 2.1.1: I want to be able to serialize wordVectors on Word2VecModel
[ https://issues.apache.org/jira/browse/SPARK-21341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16079507#comment-16079507 ] Yan Facai (颜发才) commented on SPARK-21341: - Yes, [~sowen] is right. Why not use the save and load methods instead? > Spark 2.1.1: I want to be able to serialize wordVectors on Word2VecModel > - > > Key: SPARK-21341 > URL: https://issues.apache.org/jira/browse/SPARK-21341 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.1 >Reporter: Zied Sellami > > I am using sparContext.saveAsObjectFile to save a complex object containing a > pipelineModel with a Word2Vec ML Transformer. When I load the object and call > myPipelineModel.transform, Word2VecModel raise a null pointer error on line > 292 Word2Vec.scala "wordVectors.getVectors" . I resolve the problem by > removing@transient annotation on val wordVectors and @transient lazy val on > getVectors function. > -Why this 2 val are transient ? > -Any solution to add a boolean function on the Word2Vec Transformer to force > the serialization of wordVectors. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21341) Spark 2.1.1: I want to be able to serialize wordVectors on Word2VecModel
[ https://issues.apache.org/jira/browse/SPARK-21341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16078987#comment-16078987 ] Yan Facai (颜发才) edited comment on SPARK-21341 at 7/8/17 6:28 AM: - Hi, [~zsellami]. I guess that since the wordVectors is mllib model in fact, which might be removed in the future, so it is marked private and transient. More interestingly, wordVectors are saved in data folder as dataframe, see: {code} 336 override protected def saveImpl(path: String): Unit = { 337 DefaultParamsWriter.saveMetadata(instance, path, sc) 338 339 val wordVectors = instance.wordVectors.getVectors 340 val dataSeq = wordVectors.toSeq.map { case (word, vector) => Data(word, vector) } 341 val dataPath = new Path(path, "data").toString 342 sparkSession.createDataFrame(dataSeq) 343 .repartition(calculateNumberOfPartitions) 344 .write 345 .parquet(dataPath) 346 } {code} In all, developers indeed take a try to save the wordVector, however it seems to break in pipeline as you said. So, could you give an example code to reproduce the bug? I'd like to dig deeper. was (Author: facai): Hi, [~zsellami]. I guess that since the wordVectors is mllib model in fact, which might be removed in the future, so it is marked private and transient. More interestingly, wordVectors are saved in data folder as dataframe, see: {code} 336 override protected def saveImpl(path: String): Unit = { 337 DefaultParamsWriter.saveMetadata(instance, path, sc) 338 339 val wordVectors = instance.wordVectors.getVectors 340 val dataSeq = wordVectors.toSeq.map { case (word, vector) => Data(word, vector) } 341 val dataPath = new Path(path, "data").toString 342 sparkSession.createDataFrame(dataSeq) 343 .repartition(calculateNumberOfPartitions) 344 .write 345 .parquet(dataPath) 346 } {code} In all, developer indeed take a try to save the wordVector information, however it seems to break in pipeline as you said. So, could you give a example code to reproduce the bug? I'd like to dig deeper. > Spark 2.1.1: I want to be able to serialize wordVectors on Word2VecModel > - > > Key: SPARK-21341 > URL: https://issues.apache.org/jira/browse/SPARK-21341 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.1 >Reporter: Zied Sellami > > I am using sparContext.saveAsObjectFile to save a complex object containing a > pipelineModel with a Word2Vec ML Transformer. When I load the object and call > myPipelineModel.transform, Word2VecModel raise a null pointer error on line > 292 Word2Vec.scala "wordVectors.getVectors" . I resolve the problem by > removing@transient annotation on val wordVectors and @transient lazy val on > getVectors function. > -Why this 2 val are transient ? > -Any solution to add a boolean function on the Word2Vec Transformer to force > the serialization of wordVectors. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21341) Spark 2.1.1: I want to be able to serialize wordVectors on Word2VecModel
[ https://issues.apache.org/jira/browse/SPARK-21341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16078987#comment-16078987 ] Yan Facai (颜发才) edited comment on SPARK-21341 at 7/8/17 6:28 AM: - Hi, [~zsellami]. I guess that since the wordVectors is mllib model in fact, which might be removed in the future, so it is marked private and transient. More interestingly, wordVectors are saved in data folder as dataframe, see: {code} 336 override protected def saveImpl(path: String): Unit = { 337 DefaultParamsWriter.saveMetadata(instance, path, sc) 338 339 val wordVectors = instance.wordVectors.getVectors 340 val dataSeq = wordVectors.toSeq.map { case (word, vector) => Data(word, vector) } 341 val dataPath = new Path(path, "data").toString 342 sparkSession.createDataFrame(dataSeq) 343 .repartition(calculateNumberOfPartitions) 344 .write 345 .parquet(dataPath) 346 } {code} In all, developers indeed take a try to save the wordVector, however it seems to be broken in pipeline as you said. So, could you give an example code to reproduce the bug? I'd like to dig deeper. was (Author: facai): Hi, [~zsellami]. I guess that since the wordVectors is mllib model in fact, which might be removed in the future, so it is marked private and transient. More interestingly, wordVectors are saved in data folder as dataframe, see: {code} 336 override protected def saveImpl(path: String): Unit = { 337 DefaultParamsWriter.saveMetadata(instance, path, sc) 338 339 val wordVectors = instance.wordVectors.getVectors 340 val dataSeq = wordVectors.toSeq.map { case (word, vector) => Data(word, vector) } 341 val dataPath = new Path(path, "data").toString 342 sparkSession.createDataFrame(dataSeq) 343 .repartition(calculateNumberOfPartitions) 344 .write 345 .parquet(dataPath) 346 } {code} In all, developers indeed take a try to save the wordVector, however it seems to break in pipeline as you said. So, could you give an example code to reproduce the bug? I'd like to dig deeper. > Spark 2.1.1: I want to be able to serialize wordVectors on Word2VecModel > - > > Key: SPARK-21341 > URL: https://issues.apache.org/jira/browse/SPARK-21341 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.1 >Reporter: Zied Sellami > > I am using sparContext.saveAsObjectFile to save a complex object containing a > pipelineModel with a Word2Vec ML Transformer. When I load the object and call > myPipelineModel.transform, Word2VecModel raise a null pointer error on line > 292 Word2Vec.scala "wordVectors.getVectors" . I resolve the problem by > removing@transient annotation on val wordVectors and @transient lazy val on > getVectors function. > -Why this 2 val are transient ? > -Any solution to add a boolean function on the Word2Vec Transformer to force > the serialization of wordVectors. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21341) Spark 2.1.1: I want to be able to serialize wordVectors on Word2VecModel
[ https://issues.apache.org/jira/browse/SPARK-21341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16078987#comment-16078987 ] Yan Facai (颜发才) commented on SPARK-21341: - Hi, [~zsellami]. I guess that since the wordVectors is mllib model in fact, which might be removed in the future, so it is marked private and transient. More interestingly, wordVectors are saved in data folder as dataframe, see: {code} 336 override protected def saveImpl(path: String): Unit = { 337 DefaultParamsWriter.saveMetadata(instance, path, sc) 338 339 val wordVectors = instance.wordVectors.getVectors 340 val dataSeq = wordVectors.toSeq.map { case (word, vector) => Data(word, vector) } 341 val dataPath = new Path(path, "data").toString 342 sparkSession.createDataFrame(dataSeq) 343 .repartition(calculateNumberOfPartitions) 344 .write 345 .parquet(dataPath) 346 } {code} In all, developer indeed take a try to save the wordVector information, however it seems to break in pipeline as you said. So, could you give a example code to reproduce the bug? I'd like to dig deeper. > Spark 2.1.1: I want to be able to serialize wordVectors on Word2VecModel > - > > Key: SPARK-21341 > URL: https://issues.apache.org/jira/browse/SPARK-21341 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.1 >Reporter: Zied Sellami > > I am using sparContext.saveAsObjectFile to save a complex object containing a > pipelineModel with a Word2Vec ML Transformer. When I load the object and call > myPipelineModel.transform, Word2VecModel raise a null pointer error on line > 292 Word2Vec.scala "wordVectors.getVectors" . I resolve the problem by > removing@transient annotation on val wordVectors and @transient lazy val on > getVectors function. > -Why this 2 val are transient ? > -Any solution to add a boolean function on the Word2Vec Transformer to force > the serialization of wordVectors. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
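A hedged sketch of the save/load route suggested in the comments, as an alternative to `sparkContext.saveAsObjectFile`; the model, DataFrame, and path are placeholders:

{code}
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.DataFrame

// ML persistence writes the Word2Vec word vectors to a parquet "data" folder,
// so the reloaded model can transform without relying on the @transient field.
def roundTrip(model: PipelineModel, df: DataFrame): DataFrame = {
  model.write.overwrite().save("/tmp/w2v-pipeline")
  val restored = PipelineModel.load("/tmp/w2v-pipeline")
  restored.transform(df)
}
{code}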
[jira] [Commented] (SPARK-21331) java.lang.NullPointerException for certain methods in classes of MLlib
[ https://issues.apache.org/jira/browse/SPARK-21331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16077620#comment-16077620 ] Yan Facai (颜发才) commented on SPARK-21331: - [~anirband] How about using this code? {code} val conf = new SparkConf().setAppName("MySparkApp").setMaster("local") {code} > java.lang.NullPointerException for certain methods in classes of MLlib > -- > > Key: SPARK-21331 > URL: https://issues.apache.org/jira/browse/SPARK-21331 > Project: Spark > Issue Type: Bug > Components: Build, MLlib >Affects Versions: 2.1.0 > Environment: Spark running locally on OSX at spark://127.0.0.1:7077. >Reporter: Anirban Das > > I am trying to run the following code using sbt package and sbt run. I am > getting a runtime error that seems to be a bug since the same code works > great on spark-shell with Scala. The error occurs when executing the > computeSVD line. If this line is commented out, the program works fine. I am > having similar issues for other methods for classes in MLlib as well. This > looks like a bug to me. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21331) java.lang.NullPointerException for certain methods in classes of MLlib
[ https://issues.apache.org/jira/browse/SPARK-21331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16077605#comment-16077605 ] Yan Facai (颜发才) edited comment on SPARK-21331 at 7/7/17 5:21 AM: - Hi, I run the code in description on mac, spark-2.1.1. It works out successfully without any exception. {code} scala> val svd: SingularValueDecomposition[RowMatrix, Matrix] = mat.computeSVD(2, computeU = true) 17/07/07 13:17:12 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS 17/07/07 13:17:12 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS 17/07/07 13:17:12 WARN netlib.LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK 17/07/07 13:17:12 WARN netlib.LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeRefLAPACK svd: org.apache.spark.mllib.linalg.SingularValueDecomposition[org.apache.spark.mllib.linalg.distributed.RowMatrix,org.apache.spark.mllib.linalg.Matrix] = SingularValueDecomposition(org.apache.spark.mllib.linalg.distributed.RowMatrix@5a4f7911,[13.029275535600473,5.368578733451684],-0.31278534337232633 0.3116713569157832 -0.029801450130953977 -0.17133211263608739 -0.12207248163673157 0.15256470925290191 -0.7184789931874109-0.6809628499946365 -0.60841059171993640.6217072292290715) {code} was (Author: facai): Hi, I run the code in description on mac, spark-2.1.1. It works out successfully without any exception. {block} scala> val svd: SingularValueDecomposition[RowMatrix, Matrix] = mat.computeSVD(2, computeU = true) 17/07/07 13:17:12 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS 17/07/07 13:17:12 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS 17/07/07 13:17:12 WARN netlib.LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK 17/07/07 13:17:12 WARN netlib.LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeRefLAPACK svd: org.apache.spark.mllib.linalg.SingularValueDecomposition[org.apache.spark.mllib.linalg.distributed.RowMatrix,org.apache.spark.mllib.linalg.Matrix] = SingularValueDecomposition(org.apache.spark.mllib.linalg.distributed.RowMatrix@5a4f7911,[13.029275535600473,5.368578733451684],-0.31278534337232633 0.3116713569157832 -0.029801450130953977 -0.17133211263608739 -0.12207248163673157 0.15256470925290191 -0.7184789931874109-0.6809628499946365 -0.60841059171993640.6217072292290715) {block} > java.lang.NullPointerException for certain methods in classes of MLlib > -- > > Key: SPARK-21331 > URL: https://issues.apache.org/jira/browse/SPARK-21331 > Project: Spark > Issue Type: Bug > Components: Build, MLlib >Affects Versions: 2.1.0 > Environment: Spark running locally on OSX at spark://127.0.0.1:7077. >Reporter: Anirban Das > > I am trying to run the following code using sbt package and sbt run. I am > getting a runtime error that seems to be a bug since the same code works > great on spark-shell with Scala. The error occurs when executing the > computeSVD line. If this line is commented out, the program works fine. I am > having similar issues for other methods for classes in MLlib as well. This > looks like a bug to me. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21331) java.lang.NullPointerException for certain methods in classes of MLlib
[ https://issues.apache.org/jira/browse/SPARK-21331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16077605#comment-16077605 ] Yan Facai (颜发才) commented on SPARK-21331: - Hi, I run the code in description on mac, spark-2.1.1. It works out successfully without any exception. {block} scala> val svd: SingularValueDecomposition[RowMatrix, Matrix] = mat.computeSVD(2, computeU = true) 17/07/07 13:17:12 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS 17/07/07 13:17:12 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS 17/07/07 13:17:12 WARN netlib.LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeSystemLAPACK 17/07/07 13:17:12 WARN netlib.LAPACK: Failed to load implementation from: com.github.fommil.netlib.NativeRefLAPACK svd: org.apache.spark.mllib.linalg.SingularValueDecomposition[org.apache.spark.mllib.linalg.distributed.RowMatrix,org.apache.spark.mllib.linalg.Matrix] = SingularValueDecomposition(org.apache.spark.mllib.linalg.distributed.RowMatrix@5a4f7911,[13.029275535600473,5.368578733451684],-0.31278534337232633 0.3116713569157832 -0.029801450130953977 -0.17133211263608739 -0.12207248163673157 0.15256470925290191 -0.7184789931874109-0.6809628499946365 -0.60841059171993640.6217072292290715) {block} > java.lang.NullPointerException for certain methods in classes of MLlib > -- > > Key: SPARK-21331 > URL: https://issues.apache.org/jira/browse/SPARK-21331 > Project: Spark > Issue Type: Bug > Components: Build, MLlib >Affects Versions: 2.1.0 > Environment: Spark running locally on OSX at spark://127.0.0.1:7077. >Reporter: Anirban Das > > I am trying to run the following code using sbt package and sbt run. I am > getting a runtime error that seems to be a bug since the same code works > great on spark-shell with Scala. The error occurs when executing the > computeSVD line. If this line is commented out, the program works fine. I am > having similar issues for other methods for classes in MLlib as well. This > looks like a bug to me. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
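Putting the earlier suggestion into a self-contained form (made-up data, local master), which should run under `sbt run` without depending on an external cluster:

{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.{Matrix, SingularValueDecomposition, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

object SvdCheck {
  def main(args: Array[String]): Unit = {
    // Local master, as suggested above, instead of spark://127.0.0.1:7077.
    val conf = new SparkConf().setAppName("MySparkApp").setMaster("local")
    val sc = new SparkContext(conf)
    val rows = sc.parallelize(Seq(
      Vectors.dense(1.0, 2.0, 3.0),
      Vectors.dense(4.0, 5.0, 6.0),
      Vectors.dense(7.0, 8.0, 10.0)))
    val mat = new RowMatrix(rows)
    val svd: SingularValueDecomposition[RowMatrix, Matrix] = mat.computeSVD(2, computeU = true)
    println(svd.s)   // singular values
    sc.stop()
  }
}
{code}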
[jira] [Commented] (SPARK-21306) OneVsRest Conceals Columns That May Be Relevant To Underlying Classifier
[ https://issues.apache.org/jira/browse/SPARK-21306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16075974#comment-16075974 ] Yan Facai (颜发才) commented on SPARK-21306: - [~cathalgarvey] By the way, since LogisticRegression in ml supports multi-class directly by using multinomial loss, I believe that it might be better than OneVsRest. > OneVsRest Conceals Columns That May Be Relevant To Underlying Classifier > > > Key: SPARK-21306 > URL: https://issues.apache.org/jira/browse/SPARK-21306 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.1 >Reporter: Cathal Garvey >Priority: Minor > Labels: classification, ml > > Hi folks, thanks for Spark! :) > I've been learning to use `ml` and `mllib`, and I've encountered a block > while trying to use `ml.classification.OneVsRest` with > `ml.classification.LogisticRegression`. Basically, [here in the > code|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala#L320], > only two columns are being extracted and fed to the underlying classifiers.. > however with some configurations, more than two columns are required. > Specifically: I want to do multiclass learning with Logistic Regression, on a > very imbalanced dataset. In my dataset, I have lots of imbalances, so I was > planning to use weights. I set a column, `"weight"`, as the inverse frequency > of each field, and I configured my `LogisticRegression` class to use this > column, then put it in a `OneVsRest` wrapper. > However, `OneVsRest` strips all but two columns out of a dataset before > training, so I get an error from within `LogisticRegression` that it can't > find the `"weight"` column. > It would be nice to have this fixed! I can see a few ways, but a very > conservative fix would be to include a parameter in `OneVsRest.fit` for > additional columns to `select` before passing to the underlying model. > Thanks! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21306) OneVsRest Conceals Columns That May Be Relevant To Underlying Classifier
[ https://issues.apache.org/jira/browse/SPARK-21306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16075970#comment-16075970 ] Yan Facai (颜发才) edited comment on SPARK-21306 at 7/6/17 5:40 AM: - I agree with [~mlnick]. It seems that we will get in trouble only when `weight` is expected. How about adding a `setWeightCol` like LogisticRegression? was (Author: facai): I agree with [~n...@svana.org]. It seems that we will get in trouble only when `weight` is expected. How about adding a `setWeightCol` like LogisticRegression? > OneVsRest Conceals Columns That May Be Relevant To Underlying Classifier > > > Key: SPARK-21306 > URL: https://issues.apache.org/jira/browse/SPARK-21306 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.1 >Reporter: Cathal Garvey >Priority: Minor > Labels: classification, ml > > Hi folks, thanks for Spark! :) > I've been learning to use `ml` and `mllib`, and I've encountered a block > while trying to use `ml.classification.OneVsRest` with > `ml.classification.LogisticRegression`. Basically, [here in the > code|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala#L320], > only two columns are being extracted and fed to the underlying classifiers.. > however with some configurations, more than two columns are required. > Specifically: I want to do multiclass learning with Logistic Regression, on a > very imbalanced dataset. In my dataset, I have lots of imbalances, so I was > planning to use weights. I set a column, `"weight"`, as the inverse frequency > of each field, and I configured my `LogisticRegression` class to use this > column, then put it in a `OneVsRest` wrapper. > However, `OneVsRest` strips all but two columns out of a dataset before > training, so I get an error from within `LogisticRegression` that it can't > find the `"weight"` column. > It would be nice to have this fixed! I can see a few ways, but a very > conservative fix would be to include a parameter in `OneVsRest.fit` for > additional columns to `select` before passing to the underlying model. > Thanks! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21306) OneVsRest Conceals Columns That May Be Relevant To Underlying Classifier
[ https://issues.apache.org/jira/browse/SPARK-21306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16075970#comment-16075970 ] Yan Facai (颜发才) commented on SPARK-21306: - I agree with [~n...@svana.org]. It seems that we will get in trouble only when `weight` is expected. How about adding a `setWeightCol` like LogisticRegression? > OneVsRest Conceals Columns That May Be Relevant To Underlying Classifier > > > Key: SPARK-21306 > URL: https://issues.apache.org/jira/browse/SPARK-21306 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.1.1 >Reporter: Cathal Garvey >Priority: Minor > Labels: classification, ml > > Hi folks, thanks for Spark! :) > I've been learning to use `ml` and `mllib`, and I've encountered a block > while trying to use `ml.classification.OneVsRest` with > `ml.classification.LogisticRegression`. Basically, [here in the > code|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/classification/OneVsRest.scala#L320], > only two columns are being extracted and fed to the underlying classifiers.. > however with some configurations, more than two columns are required. > Specifically: I want to do multiclass learning with Logistic Regression, on a > very imbalanced dataset. In my dataset, I have lots of imbalances, so I was > planning to use weights. I set a column, `"weight"`, as the inverse frequency > of each field, and I configured my `LogisticRegression` class to use this > column, then put it in a `OneVsRest` wrapper. > However, `OneVsRest` strips all but two columns out of a dataset before > training, so I get an error from within `LogisticRegression` that it can't > find the `"weight"` column. > It would be nice to have this fixed! I can see a few ways, but a very > conservative fix would be to include a parameter in `OneVsRest.fit` for > additional columns to `select` before passing to the underlying model. > Thanks! -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
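A short sketch combining the two suggestions above: use multinomial logistic regression directly, which already honours a weight column, while a `setWeightCol`-style param on OneVsRest is being discussed. The column names are assumptions about the caller's DataFrame:

{code}
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.sql.DataFrame

// Multiclass training without the OneVsRest wrapper, using per-row weights.
def fitWeightedMulticlass(training: DataFrame): LogisticRegressionModel = {
  new LogisticRegression()
    .setFamily("multinomial")   // multiclass via multinomial loss
    .setWeightCol("weight")     // assumed inverse-frequency weight column
    .setLabelCol("label")
    .setFeaturesCol("features")
    .fit(training)
}
{code}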
[jira] [Commented] (SPARK-21285) VectorAssembler should report the column name when data type used is not supported
[ https://issues.apache.org/jira/browse/SPARK-21285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16073125#comment-16073125 ] Yan Facai (颜发才) commented on SPARK-21285: - It seems easy, and I can work on this. > VectorAssembler should report the column name when data type used is not > supported > -- > > Key: SPARK-21285 > URL: https://issues.apache.org/jira/browse/SPARK-21285 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.1.1 >Reporter: Jacek Laskowski >Priority: Minor > > Found while answering [Why does LogisticRegression fail with > “IllegalArgumentException: > org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7”?|https://stackoverflow.com/q/44844793/1305344] > on StackOverflow. > When {{VectorAssembler}} is configured to use columns of unsupported type > only the type is printed out without the column name(s). > The column name(s) should be included too. > {code} > // label is of StringType type > val va = new VectorAssembler().setInputCols(Array("bc", "pmi", "label")) > scala> va.transform(training) > java.lang.IllegalArgumentException: Data type StringType is not supported. > at > org.apache.spark.ml.feature.VectorAssembler$$anonfun$transformSchema$1.apply(VectorAssembler.scala:121) > at > org.apache.spark.ml.feature.VectorAssembler$$anonfun$transformSchema$1.apply(VectorAssembler.scala:117) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at > org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:117) > at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74) > at > org.apache.spark.ml.feature.VectorAssembler.transform(VectorAssembler.scala:54) > ... 48 elided > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
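An illustrative sketch of the requested behaviour, not Spark's actual VectorAssembler code: include the offending column name alongside the unsupported type in the error message.

{code}
import org.apache.spark.ml.linalg.SQLDataTypes
import org.apache.spark.sql.types.{BooleanType, NumericType, StructType}

// Validate assembler inputs and name the column when a type is rejected.
def checkInputCols(schema: StructType, inputCols: Array[String]): Unit = {
  inputCols.foreach { name =>
    schema(name).dataType match {
      case _: NumericType | BooleanType =>          // supported
      case dt if dt == SQLDataTypes.VectorType =>   // vectors are supported
      case other => throw new IllegalArgumentException(
        s"Data type $other of column $name is not supported.")
    }
  }
}
{code}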
[jira] [Commented] (SPARK-21066) LibSVM load just one input file
[ https://issues.apache.org/jira/browse/SPARK-21066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16060381#comment-16060381 ] Yan Facai (颜发才) commented on SPARK-21066: - Downgrade to Trivial since `numFeatures` should work. > LibSVM load just one input file > --- > > Key: SPARK-21066 > URL: https://issues.apache.org/jira/browse/SPARK-21066 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.1 >Reporter: darion yaphet >Priority: Trivial > > Currently when we using SVM to train dataset we found the input files limit > only one . > The file store on the Distributed File System such as HDFS is split into > mutil piece and I think this limit is not necessary . > We can join input paths into a string split with comma. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-21066) LibSVM load just one input file
[ https://issues.apache.org/jira/browse/SPARK-21066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Facai (颜发才) updated SPARK-21066: Priority: Trivial (was: Major) > LibSVM load just one input file > --- > > Key: SPARK-21066 > URL: https://issues.apache.org/jira/browse/SPARK-21066 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.1 >Reporter: darion yaphet >Priority: Trivial > > Currently when we using SVM to train dataset we found the input files limit > only one . > The file store on the Distributed File System such as HDFS is split into > mutil piece and I think this limit is not necessary . > We can join input paths into a string split with comma. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21066) LibSVM load just one input file
[ https://issues.apache.org/jira/browse/SPARK-21066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16055336#comment-16055336 ] Yan Facai (颜发才) edited comment on SPARK-21066 at 6/20/17 8:27 AM: -- [~sowen] I believe the API documentation already explains this well: if unspecified or nonpositive, the number of features will be determined automatically at the cost of one additional pass. The best way to solve the problem is to fix the misleading exception message so that it suggests specifying `numFeatures`, rather than simply turning the user away. was (Author: facai): [~sowen] I believe that the API has explained well in details. If unspecified or nonpositive, the number of features will be determined automatically at the cost of one additional pass. > LibSVM load just one input file > --- > > Key: SPARK-21066 > URL: https://issues.apache.org/jira/browse/SPARK-21066 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.1 >Reporter: darion yaphet > > Currently when we using SVM to train dataset we found the input files limit > only one . > The file store on the Distributed File System such as HDFS is split into > mutil piece and I think this limit is not necessary . > We can join input paths into a string split with comma. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21066) LibSVM load just one input file
[ https://issues.apache.org/jira/browse/SPARK-21066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16055336#comment-16055336 ] Yan Facai (颜发才) commented on SPARK-21066: - [~sowen] I believe that the API has explained well in details. If unspecified or nonpositive, the number of features will be determined automatically at the cost of one additional pass. > LibSVM load just one input file > --- > > Key: SPARK-21066 > URL: https://issues.apache.org/jira/browse/SPARK-21066 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.1 >Reporter: darion yaphet > > Currently when we using SVM to train dataset we found the input files limit > only one . > The file store on the Distributed File System such as HDFS is split into > mutil piece and I think this limit is not necessary . > We can join input paths into a string split with comma. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-21066) LibSVM load just one input file
[ https://issues.apache.org/jira/browse/SPARK-21066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16055328#comment-16055328 ] Yan Facai (颜发才) commented on SPARK-21066: - Hi, [~darion] . If `numFeature` is specified, multiple files are OK. {code} val df = spark.read.format("libsvm") .option("numFeatures", "780") .load("data/mllib/sample_libsvm_data.txt") {code} see: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.source.libsvm.LibSVMDataSource > LibSVM load just one input file > --- > > Key: SPARK-21066 > URL: https://issues.apache.org/jira/browse/SPARK-21066 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.1 >Reporter: darion yaphet > > Currently when we using SVM to train dataset we found the input files limit > only one . > The file store on the Distributed File System such as HDFS is split into > mutil piece and I think this limit is not necessary . > We can join input paths into a string split with comma. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-21066) LibSVM load just one input file
[ https://issues.apache.org/jira/browse/SPARK-21066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16055328#comment-16055328 ] Yan Facai (颜发才) edited comment on SPARK-21066 at 6/20/17 8:12 AM: -- Hi, [~darion] . If `numFeatures` is specified, multiple files are OK. {code} val df = spark.read.format("libsvm") .option("numFeatures", "780") .load("data/mllib/sample_libsvm_data.txt") {code} see: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.source.libsvm.LibSVMDataSource was (Author: facai): Hi, [~darion] . If `numFeature` is specified, multiple files are OK. {code} val df = spark.read.format("libsvm") .option("numFeatures", "780") .load("data/mllib/sample_libsvm_data.txt") {code} see: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.source.libsvm.LibSVMDataSource > LibSVM load just one input file > --- > > Key: SPARK-21066 > URL: https://issues.apache.org/jira/browse/SPARK-21066 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.1 >Reporter: darion yaphet > > Currently when we using SVM to train dataset we found the input files limit > only one . > The file store on the Distributed File System such as HDFS is split into > mutil piece and I think this limit is not necessary . > We can join input paths into a string split with comma. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
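A hedged example of the workaround discussed above; the paths and feature count are placeholders. With `numFeatures` set, several LibSVM part files can be loaded in one call, since `load` accepts multiple paths:

{code}
val df = spark.read.format("libsvm")
  .option("numFeatures", "780")
  .load("hdfs:///data/train/part-00000", "hdfs:///data/train/part-00001")
{code}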
[jira] [Commented] (SPARK-20787) PySpark can't handle datetimes before 1900
[ https://issues.apache.org/jira/browse/SPARK-20787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16028253#comment-16028253 ] Yan Facai (颜发才) commented on SPARK-20787: - Just go head, [~RBerenguel] ! The issue is beyond my scope and I am not interested now. > PySpark can't handle datetimes before 1900 > -- > > Key: SPARK-20787 > URL: https://issues.apache.org/jira/browse/SPARK-20787 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0, 2.1.1 >Reporter: Keith Bourgoin > > When trying to put a datetime before 1900 into a DataFrame, it throws an > error because of the use of time.mktime. > {code} > Python 2.7.13 (default, Mar 8 2017, 17:29:55) > Type "copyright", "credits" or "license" for more information. > IPython 5.3.0 -- An enhanced Interactive Python. > ? -> Introduction and overview of IPython's features. > %quickref -> Quick reference. > help -> Python's own help system. > object? -> Details about 'object', use 'object??' for extra details. > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 17/05/17 12:45:59 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 17/05/17 12:46:02 WARN ObjectStore: Failed to get database global_temp, > returning NoSuchObjectException > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/__ / .__/\_,_/_/ /_/\_\ version 2.1.0 > /_/ > Using Python version 2.7.13 (default, Mar 8 2017 17:29:55) > SparkSession available as 'spark'. > In [1]: import datetime as dt > In [2]: > sqlContext.createDataFrame(sc.parallelize([[dt.datetime(1899,12,31)]])).count() > 17/05/17 12:46:16 ERROR Executor: Exception in task 3.0 in stage 2.0 (TID 7) > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File > "/home/kfb/src/projects/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 174, in main > process() > File > "/home/kfb/src/projects/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 169, in process > serializer.dump_stream(func(split_index, iterator), outfile) > File > "/home/kfb/src/projects/spark/python/lib/pyspark.zip/pyspark/serializers.py", > line 268, in dump_stream > vs = list(itertools.islice(iterator, batch)) > File "/home/kfb/src/projects/spark/python/pyspark/sql/types.py", line 576, > in toInternal > return tuple(f.toInternal(v) for f, v in zip(self.fields, obj)) > File "/home/kfb/src/projects/spark/python/pyspark/sql/types.py", line 576, > in > return tuple(f.toInternal(v) for f, v in zip(self.fields, obj)) > File > "/home/kfb/src/projects/spark/python/lib/pyspark.zip/pyspark/sql/types.py", > line 436, in toInternal > return self.dataType.toInternal(obj) > File > "/home/kfb/src/projects/spark/python/lib/pyspark.zip/pyspark/sql/types.py", > line 191, in toInternal > else time.mktime(dt.timetuple())) > ValueError: year out of range > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193) > at > org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:234) > at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152) > at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at 
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) >
[jira] [Comment Edited] (SPARK-19581) running NaiveBayes model with 0 features can crash the executor with D rorreGEMV
[ https://issues.apache.org/jira/browse/SPARK-19581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015345#comment-16015345 ] Yan Facai (颜发才) edited comment on SPARK-19581 at 5/26/17 9:02 AM: -- [~barrybecker4] Hi, Becker. I can't reproduce the bug on spark-2.1.1-bin-hadoop2.7. 1) For 0 size of feature, the exception is harmless. {code} val data = spark.read.format("libsvm").load("/user/facai/data/libsvm/sample_libsvm_data.txt").cache import org.apache.spark.ml.classification.NaiveBayes val model = new NaiveBayes().fit(data) import org.apache.spark.ml.linalg.{Vectors => SV} case class TestData(features: org.apache.spark.ml.linalg.Vector) val emptyVector = SV.sparse(0, Array.empty[Int], Array.empty[Double]) val test = Seq(TestData(emptyVector)).toDF scala> test.show +-+ | features| +-+ |(0,[],[])| +-+ scala> model.transform(test).show org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (vector) => vector) at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1072) ... 48 elided Caused by: java.lang.IllegalArgumentException: requirement failed: The columns of A don't match the number of elements of x. A: 692, x: 0 at scala.Predef$.require(Predef.scala:224) ... 99 more {code} 2) For 692 size of empty feature, it's OK. {code} scala> val emptyVector = SV.sparse(692, Array.empty[Int], Array.empty[Double]) emptyVector: org.apache.spark.ml.linalg.Vector = (692,[],[]) scala> val test = Seq(TestData(emptyVector)).toDF test: org.apache.spark.sql.DataFrame = [features: vector] scala> test.show +---+ | features| +---+ |(692,[],[])| +---+ scala> model.transform(test).show +---+++--+ | features| rawPrediction| probability|prediction| +---+++--+ |(692,[],[])|[-0.8407831793660...|[0.43137254901960...| 1.0| +---+++--+ {code} was (Author: facai): [~barrybecker4] Hi, Becker. I can't reproduce the bug on spark-2.1.1-bin-hadoop2.7. 1) For 0 size of feature, the exception is harmless. ```scala val data = spark.read.format("libsvm").load("/user/facai/data/libsvm/sample_libsvm_data.txt").cache import org.apache.spark.ml.classification.NaiveBayes val model = new NaiveBayes().fit(data) import org.apache.spark.ml.linalg.{Vectors => SV} case class TestData(features: org.apache.spark.ml.linalg.Vector) val emptyVector = SV.sparse(0, Array.empty[Int], Array.empty[Double]) val test = Seq(TestData(emptyVector)).toDF scala> test.show +-+ | features| +-+ |(0,[],[])| +-+ scala> model.transform(test).show org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (vector) => vector) at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1072) ... 48 elided Caused by: java.lang.IllegalArgumentException: requirement failed: The columns of A don't match the number of elements of x. A: 692, x: 0 at scala.Predef$.require(Predef.scala:224) ... 99 more ``` 2) For 692 size of empty feature, it's OK. 
```scala scala> val emptyVector = SV.sparse(692, Array.empty[Int], Array.empty[Double]) emptyVector: org.apache.spark.ml.linalg.Vector = (692,[],[]) scala> val test = Seq(TestData(emptyVector)).toDF test: org.apache.spark.sql.DataFrame = [features: vector] scala> test.show +---+ | features| +---+ |(692,[],[])| +---+ scala> model.transform(test).show +---+++--+ | features| rawPrediction| probability|prediction| +---+++--+ |(692,[],[])|[-0.8407831793660...|[0.43137254901960...| 1.0| +---+++--+ ``` > running NaiveBayes model with 0 features can crash the executor with D > rorreGEMV > > > Key: SPARK-19581 > URL: https://issues.apache.org/jira/browse/SPARK-19581 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 > Environment: spark development or standalone mode on windows or linux. >Reporter: Barry Becker >Priority: Minor > > The severity of this bug is high (because nothing should cause spark to crash > like this) but the priority may be low (because there is an easy workaround). > In our application, a user can select features
[jira] [Commented] (SPARK-20498) RandomForestRegressionModel should expose getMaxDepth in PySpark
[ https://issues.apache.org/jira/browse/SPARK-20498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025949#comment-16025949 ] Yan Facai (颜发才) commented on SPARK-20498: - [~iamshrek] Hi, Xin Ren. The task is quite easy, so if you are busy I'd be glad to work on it. Is that OK? > RandomForestRegressionModel should expose getMaxDepth in PySpark > > > Key: SPARK-20498 > URL: https://issues.apache.org/jira/browse/SPARK-20498 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.1.0 >Reporter: Nick Lothian >Assignee: Xin Ren >Priority: Minor > > Currently it isn't clear how to get the max depth of a > RandomForestRegressionModel (eg, after doing a grid search) > It is possible to call > {{regressor._java_obj.getMaxDepth()}} > but most other decision trees allow > {{regressor.getMaxDepth()}} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20787) PySpark can't handle datetimes before 1900
[ https://issues.apache.org/jira/browse/SPARK-20787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025853#comment-16025853 ] Yan Facai (颜发才) edited comment on SPARK-20787 at 5/26/17 6:03 AM: -- It seems that the exception is raised when serialization. Microsecond is used as internal representation of Timestamp in pyspark. As scala works well, hence there must be something wrong in python code. {code} scala> val df = Seq("05/26/1800 01:01:01").toDF("dt") scala> val ts = unix_timestamp($"dt","MM/dd/ HH:mm:ss").cast("timestamp") scala> df.withColumn("ts", ts).show(false) +---+-+ |dt |ts | +---+-+ |05/26/1800 01:01:01|1800-05-26 01:01:01.0| +---+-+ {code} was (Author: facai): It seems that the exception is raised when serialization. Microsecond is used as internal representation of Timestamp in pyspark. As scala works well, hence there must be something wrong in python code. scala> val df = Seq("05/26/1800 01:01:01").toDF("dt") scala> val ts = unix_timestamp($"dt","MM/dd/ HH:mm:ss").cast("timestamp") scala> df.withColumn("ts", ts).show(false) +---+-+ |dt |ts | +---+-+ |05/26/1800 01:01:01|1800-05-26 01:01:01.0| +---+-+ > PySpark can't handle datetimes before 1900 > -- > > Key: SPARK-20787 > URL: https://issues.apache.org/jira/browse/SPARK-20787 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0, 2.1.1 >Reporter: Keith Bourgoin > > When trying to put a datetime before 1900 into a DataFrame, it throws an > error because of the use of time.mktime. > {code} > Python 2.7.13 (default, Mar 8 2017, 17:29:55) > Type "copyright", "credits" or "license" for more information. > IPython 5.3.0 -- An enhanced Interactive Python. > ? -> Introduction and overview of IPython's features. > %quickref -> Quick reference. > help -> Python's own help system. > object? -> Details about 'object', use 'object??' for extra details. > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 17/05/17 12:45:59 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 17/05/17 12:46:02 WARN ObjectStore: Failed to get database global_temp, > returning NoSuchObjectException > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/__ / .__/\_,_/_/ /_/\_\ version 2.1.0 > /_/ > Using Python version 2.7.13 (default, Mar 8 2017 17:29:55) > SparkSession available as 'spark'. 
> In [1]: import datetime as dt > In [2]: > sqlContext.createDataFrame(sc.parallelize([[dt.datetime(1899,12,31)]])).count() > 17/05/17 12:46:16 ERROR Executor: Exception in task 3.0 in stage 2.0 (TID 7) > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File > "/home/kfb/src/projects/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 174, in main > process() > File > "/home/kfb/src/projects/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 169, in process > serializer.dump_stream(func(split_index, iterator), outfile) > File > "/home/kfb/src/projects/spark/python/lib/pyspark.zip/pyspark/serializers.py", > line 268, in dump_stream > vs = list(itertools.islice(iterator, batch)) > File "/home/kfb/src/projects/spark/python/pyspark/sql/types.py", line 576, > in toInternal > return tuple(f.toInternal(v) for f, v in zip(self.fields, obj)) > File "/home/kfb/src/projects/spark/python/pyspark/sql/types.py", line 576, > in > return tuple(f.toInternal(v) for f, v in zip(self.fields, obj)) > File > "/home/kfb/src/projects/spark/python/lib/pyspark.zip/pyspark/sql/types.py", > line 436, in toInternal > return self.dataType.toInternal(obj) > File > "/home/kfb/src/projects/spark/python/lib/pyspark.zip/pyspark/sql/types.py", > line 191, in toInternal > else time.mktime(dt.timet
[jira] [Commented] (SPARK-20787) PySpark can't handle datetimes before 1900
[ https://issues.apache.org/jira/browse/SPARK-20787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025853#comment-16025853 ] Yan Facai (颜发才) commented on SPARK-20787: - It seems that the exception is raised when serialization. Microsecond is used as internal representation of Timestamp in pyspark. As scala works well, hence there must be something wrong in python code. scala> val df = Seq("05/26/1800 01:01:01").toDF("dt") scala> val ts = unix_timestamp($"dt","MM/dd/ HH:mm:ss").cast("timestamp") scala> df.withColumn("ts", ts).show(false) +---+-+ |dt |ts | +---+-+ |05/26/1800 01:01:01|1800-05-26 01:01:01.0| +---+-+ > PySpark can't handle datetimes before 1900 > -- > > Key: SPARK-20787 > URL: https://issues.apache.org/jira/browse/SPARK-20787 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 2.1.0, 2.1.1 >Reporter: Keith Bourgoin > > When trying to put a datetime before 1900 into a DataFrame, it throws an > error because of the use of time.mktime. > {code} > Python 2.7.13 (default, Mar 8 2017, 17:29:55) > Type "copyright", "credits" or "license" for more information. > IPython 5.3.0 -- An enhanced Interactive Python. > ? -> Introduction and overview of IPython's features. > %quickref -> Quick reference. > help -> Python's own help system. > object? -> Details about 'object', use 'object??' for extra details. > Setting default log level to "WARN". > To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use > setLogLevel(newLevel). > 17/05/17 12:45:59 WARN NativeCodeLoader: Unable to load native-hadoop library > for your platform... using builtin-java classes where applicable > 17/05/17 12:46:02 WARN ObjectStore: Failed to get database global_temp, > returning NoSuchObjectException > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/__ / .__/\_,_/_/ /_/\_\ version 2.1.0 > /_/ > Using Python version 2.7.13 (default, Mar 8 2017 17:29:55) > SparkSession available as 'spark'. 
> In [1]: import datetime as dt > In [2]: > sqlContext.createDataFrame(sc.parallelize([[dt.datetime(1899,12,31)]])).count() > 17/05/17 12:46:16 ERROR Executor: Exception in task 3.0 in stage 2.0 (TID 7) > org.apache.spark.api.python.PythonException: Traceback (most recent call > last): > File > "/home/kfb/src/projects/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 174, in main > process() > File > "/home/kfb/src/projects/spark/python/lib/pyspark.zip/pyspark/worker.py", line > 169, in process > serializer.dump_stream(func(split_index, iterator), outfile) > File > "/home/kfb/src/projects/spark/python/lib/pyspark.zip/pyspark/serializers.py", > line 268, in dump_stream > vs = list(itertools.islice(iterator, batch)) > File "/home/kfb/src/projects/spark/python/pyspark/sql/types.py", line 576, > in toInternal > return tuple(f.toInternal(v) for f, v in zip(self.fields, obj)) > File "/home/kfb/src/projects/spark/python/pyspark/sql/types.py", line 576, > in > return tuple(f.toInternal(v) for f, v in zip(self.fields, obj)) > File > "/home/kfb/src/projects/spark/python/lib/pyspark.zip/pyspark/sql/types.py", > line 436, in toInternal > return self.dataType.toInternal(obj) > File > "/home/kfb/src/projects/spark/python/lib/pyspark.zip/pyspark/sql/types.py", > line 191, in toInternal > else time.mktime(dt.timetuple())) > ValueError: year out of range > at > org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193) > at > org.apache.spark.api.python.PythonRunner$$anon$1.(PythonRDD.scala:234) > at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152) > at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) > at > org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) > at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) > at org.apache.spar
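The comment above argues that, since Catalyst stores timestamps internally as microseconds since the Unix epoch and the Scala path handles an 1800 date fine, the failure must be in the Python serializer's use of time.mktime. As a rough JVM-side illustration (not code from the ticket), a pre-1900 instant is perfectly representable in that internal form:

{code}
import java.time.{LocalDateTime, ZoneOffset}

// 1800-05-26 01:01:01 UTC expressed as microseconds since the Unix epoch:
// a large negative Long, but well within range for the internal representation.
val micros: Long =
  LocalDateTime.of(1800, 5, 26, 1, 1, 1).toInstant(ZoneOffset.UTC).getEpochSecond * 1000000L
{code}

time.mktime, by contrast, delegates to the platform C library, which commonly rejects years before 1900, matching the "ValueError: year out of range" in the traceback above.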
[jira] [Comment Edited] (SPARK-20768) PySpark FPGrowth does not expose numPartitions (expert) param
[ https://issues.apache.org/jira/browse/SPARK-20768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015448#comment-16015448 ] Yan Facai (颜发才) edited comment on SPARK-20768 at 5/18/17 8:59 AM: -- It seems easy, I can work on it. However, I'm on holiday this weekend. Is it OK to wait one week? was (Author: facai): It seems easy, I can work on it. > PySpark FPGrowth does not expose numPartitions (expert) param > -- > > Key: SPARK-20768 > URL: https://issues.apache.org/jira/browse/SPARK-20768 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Nick Pentreath >Priority: Minor > > The PySpark API for {{FPGrowth}} does not expose the {{numPartitions}} param. > While it is an "expert" param, the general approach elsewhere is to expose > these on the Python side (e.g. {{aggregationDepth}} and intermediate storage > params in {{ALS}}) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20768) PySpark FPGrowth does not expose numPartitions (expert) param
[ https://issues.apache.org/jira/browse/SPARK-20768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015448#comment-16015448 ] Yan Facai (颜发才) commented on SPARK-20768: - It seems easy, I can work on it. > PySpark FPGrowth does not expose numPartitions (expert) param > -- > > Key: SPARK-20768 > URL: https://issues.apache.org/jira/browse/SPARK-20768 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Nick Pentreath >Priority: Minor > > The PySpark API for {{FPGrowth}} does not expose the {{numPartitions}} param. > While it is an "expert" param, the general approach elsewhere is to expose > these on the Python side (e.g. {{aggregationDepth}} and intermediate storage > params in {{ALS}}) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20768) PySpark FPGrowth does not expose numPartitions (expert) param
[ https://issues.apache.org/jira/browse/SPARK-20768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015437#comment-16015437 ] Yan Facai (颜发才) commented on SPARK-20768: - Hi, I'm a newbie. `numPartitions` already appears in the PySpark code below; could you explain in more detail? Thanks. ```python def __init__(self, minSupport=0.3, minConfidence=0.8, itemsCol="items", predictionCol="prediction", numPartitions=None): ``` > PySpark FPGrowth does not expose numPartitions (expert) param > -- > > Key: SPARK-20768 > URL: https://issues.apache.org/jira/browse/SPARK-20768 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Affects Versions: 2.2.0 >Reporter: Nick Pentreath >Priority: Minor > > The PySpark API for {{FPGrowth}} does not expose the {{numPartitions}} param. > While it is an "expert" param, the general approach elsewhere is to expose > these on the Python side (e.g. {{aggregationDepth}} and intermediate storage > params in {{ALS}}) -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
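For reference, the Scala estimator already exposes the expert param that the PySpark wrapper is being asked to mirror. A minimal spark-shell sketch (the toy dataset here is made up for illustration):

{code}
import org.apache.spark.ml.fpm.FPGrowth
import spark.implicits._

// Toy transactions; "items" must be an array column.
val itemsDF = Seq(
  Seq("a", "b", "c"),
  Seq("a", "b"),
  Seq("a")
).toDF("items")

val fp = new FPGrowth()
  .setItemsCol("items")
  .setMinSupport(0.3)
  .setMinConfidence(0.8)
  .setNumPartitions(10)  // the expert param that the PySpark API should also expose
val model = fp.fit(itemsDF)
{code}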
[jira] [Commented] (SPARK-19581) running NaiveBayes model with 0 features can crash the executor with D rorreGEMV
[ https://issues.apache.org/jira/browse/SPARK-19581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16015345#comment-16015345 ] Yan Facai (颜发才) commented on SPARK-19581: - [~barrybecker4] Hi, Becker. I can't reproduce the bug on spark-2.1.1-bin-hadoop2.7. 1) For 0 size of feature, the exception is harmless. ```scala val data = spark.read.format("libsvm").load("/user/facai/data/libsvm/sample_libsvm_data.txt").cache import org.apache.spark.ml.classification.NaiveBayes val model = new NaiveBayes().fit(data) import org.apache.spark.ml.linalg.{Vectors => SV} case class TestData(features: org.apache.spark.ml.linalg.Vector) val emptyVector = SV.sparse(0, Array.empty[Int], Array.empty[Double]) val test = Seq(TestData(emptyVector)).toDF scala> test.show +-+ | features| +-+ |(0,[],[])| +-+ scala> model.transform(test).show org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (vector) => vector) at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1072) ... 48 elided Caused by: java.lang.IllegalArgumentException: requirement failed: The columns of A don't match the number of elements of x. A: 692, x: 0 at scala.Predef$.require(Predef.scala:224) ... 99 more ``` 2) For 692 size of empty feature, it's OK. ```scala scala> val emptyVector = SV.sparse(692, Array.empty[Int], Array.empty[Double]) emptyVector: org.apache.spark.ml.linalg.Vector = (692,[],[]) scala> val test = Seq(TestData(emptyVector)).toDF test: org.apache.spark.sql.DataFrame = [features: vector] scala> test.show +---+ | features| +---+ |(692,[],[])| +---+ scala> model.transform(test).show +---+++--+ | features| rawPrediction| probability|prediction| +---+++--+ |(692,[],[])|[-0.8407831793660...|[0.43137254901960...| 1.0| +---+++--+ ``` > running NaiveBayes model with 0 features can crash the executor with D > rorreGEMV > > > Key: SPARK-19581 > URL: https://issues.apache.org/jira/browse/SPARK-19581 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 > Environment: spark development or standalone mode on windows or linux. >Reporter: Barry Becker >Priority: Minor > > The severity of this bug is high (because nothing should cause spark to crash > like this) but the priority may be low (because there is an easy workaround). > In our application, a user can select features and a target to run the > NaiveBayes inducer. If columns have too many values or all one value, they > will be removed before we call the inducer to create the model. As a result, > there are some cases, where all the features may get removed. When this > happens, executors will crash and get restarted (if on a cluster) or spark > will crash and need to be manually restarted (if in development mode). > It looks like NaiveBayes uses BLAS, and BLAS does not handle this case well > when it is encountered. I emits this vague error : > ** On entry to DGEMV parameter number 6 had an illegal value > and terminates. > My code looks like this: > {code} >val predictions = model.transform(testData) // Make predictions > // figure out how many were correctly predicted > val numCorrect = predictions.filter(new Column(actualTarget) === new > Column(PREDICTION_LABEL_COLUMN)).count() > val numIncorrect = testRowCount - numCorrect > {code} > The failure is at the line that does the count, but it is not the count that > causes the problem, it is the model.transform step (where the model contains > the NaiveBayes classifier). 
> Here is the stack trace (in development mode): > {code} > [2017-02-13 06:28:39,946] TRACE evidence.EvidenceVizModel$ [] > [akka://JobServer/user/context-supervisor/sql-context] - done making > predictions in 232 > ** On entry to DGEMV parameter number 6 had an illegal value > ** On entry to DGEMV parameter number 6 had an illegal value > ** On entry to DGEMV parameter number 6 had an illegal value > [2017-02-13 06:28:40,506] ERROR .scheduler.LiveListenerBus [] > [akka://JobServer/user/context-supervisor/sql-context] - SparkListenerBus has > already stopped! Dropping event SparkListenerSQLExecutionEnd(9,1486996120505) > [2017-02-13 06:28:40,506] ERROR .scheduler.LiveListenerBus [] > [akka://JobServer/user/context-supervi
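Given the reproduction above, where a zero-length vector only trips the requirement check rather than crashing the executor, a simple caller-side guard is to drop rows whose feature vectors do not match the size the model was trained on. A minimal sketch, assuming a fitted NaiveBayesModel named {{model}} and a test DataFrame {{testData}} with a {{features}} column:

{code}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.{col, udf}

// Keep only rows whose feature vector length matches what the fitted model expects.
val expectedSize = model.numFeatures
val sizeMatches = udf { v: Vector => v != null && v.size == expectedSize }
val safeTest = testData.filter(sizeMatches(col("features")))
val predictions = model.transform(safeTest)
{code}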
[jira] [Commented] (SPARK-19581) running NaiveBayes model with 0 features can crash the executor with D rorreGEMV
[ https://issues.apache.org/jira/browse/SPARK-19581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16001987#comment-16001987 ] Yan Facai (颜发才) commented on SPARK-19581: - [~barrybecker4] Could you give a sample code? > running NaiveBayes model with 0 features can crash the executor with D > rorreGEMV > > > Key: SPARK-19581 > URL: https://issues.apache.org/jira/browse/SPARK-19581 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 > Environment: spark development or standalone mode on windows or linux. >Reporter: Barry Becker >Priority: Minor > > The severity of this bug is high (because nothing should cause spark to crash > like this) but the priority may be low (because there is an easy workaround). > In our application, a user can select features and a target to run the > NaiveBayes inducer. If columns have too many values or all one value, they > will be removed before we call the inducer to create the model. As a result, > there are some cases, where all the features may get removed. When this > happens, executors will crash and get restarted (if on a cluster) or spark > will crash and need to be manually restarted (if in development mode). > It looks like NaiveBayes uses BLAS, and BLAS does not handle this case well > when it is encountered. I emits this vague error : > ** On entry to DGEMV parameter number 6 had an illegal value > and terminates. > My code looks like this: > {code} >val predictions = model.transform(testData) // Make predictions > // figure out how many were correctly predicted > val numCorrect = predictions.filter(new Column(actualTarget) === new > Column(PREDICTION_LABEL_COLUMN)).count() > val numIncorrect = testRowCount - numCorrect > {code} > The failure is at the line that does the count, but it is not the count that > causes the problem, it is the model.transform step (where the model contains > the NaiveBayes classifier). > Here is the stack trace (in development mode): > {code} > [2017-02-13 06:28:39,946] TRACE evidence.EvidenceVizModel$ [] > [akka://JobServer/user/context-supervisor/sql-context] - done making > predictions in 232 > ** On entry to DGEMV parameter number 6 had an illegal value > ** On entry to DGEMV parameter number 6 had an illegal value > ** On entry to DGEMV parameter number 6 had an illegal value > [2017-02-13 06:28:40,506] ERROR .scheduler.LiveListenerBus [] > [akka://JobServer/user/context-supervisor/sql-context] - SparkListenerBus has > already stopped! Dropping event SparkListenerSQLExecutionEnd(9,1486996120505) > [2017-02-13 06:28:40,506] ERROR .scheduler.LiveListenerBus [] > [akka://JobServer/user/context-supervisor/sql-context] - SparkListenerBus has > already stopped! Dropping event > SparkListenerStageCompleted(org.apache.spark.scheduler.StageInfo@1f6c4a29) > [2017-02-13 06:28:40,508] ERROR .scheduler.LiveListenerBus [] > [akka://JobServer/user/context-supervisor/sql-context] - SparkListenerBus has > already stopped! 
Dropping event > SparkListenerJobEnd(12,1486996120507,JobFailed(org.apache.spark.SparkException: > Job 12 cancelled because SparkContext was shut down)) > [2017-02-13 06:28:40,509] ERROR .jobserver.JobManagerActor [] > [akka://JobServer/user/context-supervisor/sql-context] - Got Throwable > org.apache.spark.SparkException: Job 12 cancelled because SparkContext was > shut down > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:808) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:806) > at scala.collection.mutable.HashSet.foreach(HashSet.scala:78) > at > org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:806) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1668) > at org.apache.spark.util.EventLoop.stop(EventLoop.scala:83) > at > org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1587) > at > org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1826) > at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1283) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1825) > at > org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:581) > at > org.apache.spark.util.SparkShutdownHook
[jira] [Commented] (SPARK-20526) Load doesn't work in PCAModel
[ https://issues.apache.org/jira/browse/SPARK-20526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15992367#comment-15992367 ] Yan Facai (颜发才) commented on SPARK-20526: - Could you give some sample code? > Load doesn't work in PCAModel > -- > > Key: SPARK-20526 > URL: https://issues.apache.org/jira/browse/SPARK-20526 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.1.0 > Environment: Windows >Reporter: Hayri Volkan Agun > Original Estimate: 336h > Remaining Estimate: 336h > > An error occurs while loading a PCAModel; the saved model doesn't load. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-16957) Use weighted midpoints for split values.
[ https://issues.apache.org/jira/browse/SPARK-16957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15990195#comment-15990195 ] Yan Facai (颜发才) edited comment on SPARK-16957 at 4/30/17 11:28 AM: --- To match the other libraries, we use mean value for now and decide later to make it weighted. [~srowen] [~sethah] For more details see the discuss in Github PR 17556: https://github.com/apache/spark/pull/17556 was (Author: facai): To match the other libraries, we use mean value for now and decide later to make it weighted. [~srowen] [~sethah] > Use weighted midpoints for split values. > > > Key: SPARK-16957 > URL: https://issues.apache.org/jira/browse/SPARK-16957 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Vladimir Feinberg >Priority: Trivial > > We should be using weighted split points rather than the actual continuous > binned feature values. For instance, in a dataset containing binary features > (that are fed in as continuous ones), our splits are selected as {{x <= 0.0}} > and {{x > 0.0}}. For any real data with some smoothness qualities, this is > asymptotically bad compared to GBM's approach. The split point should be a > weighted split point of the two values of the "innermost" feature bins; e.g., > if there are 30 {{x = 0}} and 10 {{x = 1}}, the above split should be at > {{0.75}}. > Example: > {code} > +++-+-+ > |feature0|feature1|label|count| > +++-+-+ > | 0.0| 0.0| 0.0| 23| > | 1.0| 0.0| 0.0|2| > | 0.0| 0.0| 1.0|2| > | 0.0| 1.0| 0.0|7| > | 1.0| 0.0| 1.0| 23| > | 0.0| 1.0| 1.0| 18| > | 1.0| 1.0| 1.0|7| > | 1.0| 1.0| 0.0| 18| > +++-+-+ > DecisionTreeRegressionModel (uid=dtr_01ae90d489b1) of depth 2 with 7 nodes > If (feature 0 <= 0.0) >If (feature 1 <= 0.0) > Predict: -0.56 >Else (feature 1 > 0.0) > Predict: 0.29333 > Else (feature 0 > 0.0) >If (feature 1 <= 0.0) > Predict: 0.56 >Else (feature 1 > 0.0) > Predict: -0.29333 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16957) Use weighted midpoints for split values.
[ https://issues.apache.org/jira/browse/SPARK-16957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15990195#comment-15990195 ] Yan Facai (颜发才) commented on SPARK-16957: - To match the other libraries, we use mean value for now and decide later to make it weighted. [~srowen] [~sethah] > Use weighted midpoints for split values. > > > Key: SPARK-16957 > URL: https://issues.apache.org/jira/browse/SPARK-16957 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Vladimir Feinberg >Priority: Trivial > > We should be using weighted split points rather than the actual continuous > binned feature values. For instance, in a dataset containing binary features > (that are fed in as continuous ones), our splits are selected as {{x <= 0.0}} > and {{x > 0.0}}. For any real data with some smoothness qualities, this is > asymptotically bad compared to GBM's approach. The split point should be a > weighted split point of the two values of the "innermost" feature bins; e.g., > if there are 30 {{x = 0}} and 10 {{x = 1}}, the above split should be at > {{0.75}}. > Example: > {code} > +++-+-+ > |feature0|feature1|label|count| > +++-+-+ > | 0.0| 0.0| 0.0| 23| > | 1.0| 0.0| 0.0|2| > | 0.0| 0.0| 1.0|2| > | 0.0| 1.0| 0.0|7| > | 1.0| 0.0| 1.0| 23| > | 0.0| 1.0| 1.0| 18| > | 1.0| 1.0| 1.0|7| > | 1.0| 1.0| 0.0| 18| > +++-+-+ > DecisionTreeRegressionModel (uid=dtr_01ae90d489b1) of depth 2 with 7 nodes > If (feature 0 <= 0.0) >If (feature 1 <= 0.0) > Predict: -0.56 >Else (feature 1 > 0.0) > Predict: 0.29333 > Else (feature 0 > 0.0) >If (feature 1 <= 0.0) > Predict: 0.56 >Else (feature 1 > 0.0) > Predict: -0.29333 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
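To make the 30/10 example in the description concrete: the current code splits at the raw bin value 0.0, the plain midpoint of the two bin values is 0.5 (the "mean value" the comment above says was adopted to match other libraries), and one consistent reading of the description's weighted variant puts the threshold at 0.75. A tiny illustrative sketch, not code from the PR:

{code}
// Split-threshold candidates for 30 samples at x = 0.0 and 10 samples at x = 1.0.
val (x0, n0) = (0.0, 30.0)
val (x1, n1) = (1.0, 10.0)

val midpoint = (x0 + x1) / 2.0                  // 0.5  (unweighted mean of the two bin values)
val weighted = (n0 * x1 + n1 * x0) / (n0 + n1)  // 0.75 (each value weighted by the other bin's count)
{code}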
[jira] [Comment Edited] (SPARK-20199) GradientBoostedTreesModel doesn't have Column Sampling Rate Paramenter
[ https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15984161#comment-15984161 ] Yan Facai (颜发才) edited comment on SPARK-20199 at 4/26/17 6:11 AM: -- The work is easy, however **public method** is added and some adjustments are needed in inner implementation. Hence, I suggest to delay it until at least one of Committers agree to shepherd the issue. I have two questions: 1. For both GBDT and RandomForest share the attribute, we can pull `featureSubsetStrategy` parameter up to either TreeEnsembleParams or DecisionTreeParams. Which one is appropriate? 2. Is it right to add new parameter `featureSubsetStrategy` to Strategy class? Or add it to DecisionTree's train method? was (Author: facai): The work is easy, however Public method is added and some adjustments are needed in inner implementation. Hence, I suggest to delay it until one expert agree to shepherd the issue. I have two questions: 1. For both GBDT and RandomForest share the attribute, we can pull `featureSubsetStrategy` parameter up to either TreeEnsembleParams or DecisionTreeParams. Which one is appropriate? 2. Is it right to add new parameter `featureSubsetStrategy` to Strategy class? Or add it to DecisionTree's train method? > GradientBoostedTreesModel doesn't have Column Sampling Rate Paramenter > --- > > Key: SPARK-20199 > URL: https://issues.apache.org/jira/browse/SPARK-20199 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: pralabhkumar > > Spark GradientBoostedTreesModel doesn't have Column sampling rate parameter > . This parameter is available in H2O and XGBoost. > Sample from H2O.ai > gbmParams._col_sample_rate > Please provide the parameter . -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20199) GradientBoostedTreesModel doesn't have Column Sampling Rate Paramenter
[ https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15984161#comment-15984161 ] Yan Facai (颜发才) commented on SPARK-20199: - The work is easy; however, a public method has to be added and some adjustments are needed in the internal implementation. Hence, I suggest delaying it until an expert agrees to shepherd the issue. I have two questions: 1. Since both GBDT and RandomForest share the attribute, we can pull the `featureSubsetStrategy` parameter up to either TreeEnsembleParams or DecisionTreeParams. Which one is appropriate? 2. Is it right to add a new `featureSubsetStrategy` parameter to the Strategy class, or should it be added to DecisionTree's train method? > GradientBoostedTreesModel doesn't have Column Sampling Rate Paramenter > --- > > Key: SPARK-20199 > URL: https://issues.apache.org/jira/browse/SPARK-20199 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: pralabhkumar > > Spark GradientBoostedTreesModel doesn't have Column sampling rate parameter > . This parameter is available in H2O and XGBoost. > Sample from H2O.ai > gbmParams._col_sample_rate > Please provide the parameter . -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
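For comparison, RandomForestClassifier already exposes column sampling through its featureSubsetStrategy param; the request here amounts to giving the GBT estimators the same setter. In the sketch below the RandomForest call is real API, while the GBT call is hypothetical for the 2.1.0 version discussed here (it only exists in later Spark releases) and is shown purely to illustrate the proposed shape:

{code}
import org.apache.spark.ml.classification.{GBTClassifier, RandomForestClassifier}

// Existing API: per-split column sampling on random forests.
val rf = new RandomForestClassifier()
  .setFeatureSubsetStrategy("sqrt")

// Proposed for GBTs (hypothetical relative to 2.1.0): the same knob,
// pulled up to the shared tree-ensemble params as discussed in the comment above.
val gbt = new GBTClassifier()
  .setFeatureSubsetStrategy("0.8")
{code}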
[jira] [Commented] (SPARK-16957) Use weighted midpoints for split values.
[ https://issues.apache.org/jira/browse/SPARK-16957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15980268#comment-15980268 ] Yan Facai (颜发才) commented on SPARK-16957: - [~vlad.feinberg] Hi, I found that R's gbm uses mean value, instead of weighted mean. Hence the first phrase is removed in the description. > Use weighted midpoints for split values. > > > Key: SPARK-16957 > URL: https://issues.apache.org/jira/browse/SPARK-16957 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Vladimir Feinberg >Priority: Trivial > > We should be using weighted split points rather than the actual continuous > binned feature values. For instance, in a dataset containing binary features > (that are fed in as continuous ones), our splits are selected as {{x <= 0.0}} > and {{x > 0.0}}. For any real data with some smoothness qualities, this is > asymptotically bad compared to GBM's approach. The split point should be a > weighted split point of the two values of the "innermost" feature bins; e.g., > if there are 30 {{x = 0}} and 10 {{x = 1}}, the above split should be at > {{0.75}}. > Example: > {code} > +++-+-+ > |feature0|feature1|label|count| > +++-+-+ > | 0.0| 0.0| 0.0| 23| > | 1.0| 0.0| 0.0|2| > | 0.0| 0.0| 1.0|2| > | 0.0| 1.0| 0.0|7| > | 1.0| 0.0| 1.0| 23| > | 0.0| 1.0| 1.0| 18| > | 1.0| 1.0| 1.0|7| > | 1.0| 1.0| 0.0| 18| > +++-+-+ > DecisionTreeRegressionModel (uid=dtr_01ae90d489b1) of depth 2 with 7 nodes > If (feature 0 <= 0.0) >If (feature 1 <= 0.0) > Predict: -0.56 >Else (feature 1 > 0.0) > Predict: 0.29333 > Else (feature 0 > 0.0) >If (feature 1 <= 0.0) > Predict: 0.56 >Else (feature 1 > 0.0) > Predict: -0.29333 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-16957) Use weighted midpoints for split values.
[ https://issues.apache.org/jira/browse/SPARK-16957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Facai (颜发才) updated SPARK-16957: Description: We should be using weighted split points rather than the actual continuous binned feature values. For instance, in a dataset containing binary features (that are fed in as continuous ones), our splits are selected as {{x <= 0.0}} and {{x > 0.0}}. For any real data with some smoothness qualities, this is asymptotically bad compared to GBM's approach. The split point should be a weighted split point of the two values of the "innermost" feature bins; e.g., if there are 30 {{x = 0}} and 10 {{x = 1}}, the above split should be at {{0.75}}. Example: {code} +++-+-+ |feature0|feature1|label|count| +++-+-+ | 0.0| 0.0| 0.0| 23| | 1.0| 0.0| 0.0|2| | 0.0| 0.0| 1.0|2| | 0.0| 1.0| 0.0|7| | 1.0| 0.0| 1.0| 23| | 0.0| 1.0| 1.0| 18| | 1.0| 1.0| 1.0|7| | 1.0| 1.0| 0.0| 18| +++-+-+ DecisionTreeRegressionModel (uid=dtr_01ae90d489b1) of depth 2 with 7 nodes If (feature 0 <= 0.0) If (feature 1 <= 0.0) Predict: -0.56 Else (feature 1 > 0.0) Predict: 0.29333 Else (feature 0 > 0.0) If (feature 1 <= 0.0) Predict: 0.56 Else (feature 1 > 0.0) Predict: -0.29333 {code} was: Just like R's gbm, we should be using weighted split points rather than the actual continuous binned feature values. For instance, in a dataset containing binary features (that are fed in as continuous ones), our splits are selected as {{x <= 0.0}} and {{x > 0.0}}. For any real data with some smoothness qualities, this is asymptotically bad compared to GBM's approach. The split point should be a weighted split point of the two values of the "innermost" feature bins; e.g., if there are 30 {{x = 0}} and 10 {{x = 1}}, the above split should be at {{0.75}}. Example: {code} +++-+-+ |feature0|feature1|label|count| +++-+-+ | 0.0| 0.0| 0.0| 23| | 1.0| 0.0| 0.0|2| | 0.0| 0.0| 1.0|2| | 0.0| 1.0| 0.0|7| | 1.0| 0.0| 1.0| 23| | 0.0| 1.0| 1.0| 18| | 1.0| 1.0| 1.0|7| | 1.0| 1.0| 0.0| 18| +++-+-+ DecisionTreeRegressionModel (uid=dtr_01ae90d489b1) of depth 2 with 7 nodes If (feature 0 <= 0.0) If (feature 1 <= 0.0) Predict: -0.56 Else (feature 1 > 0.0) Predict: 0.29333 Else (feature 0 > 0.0) If (feature 1 <= 0.0) Predict: 0.56 Else (feature 1 > 0.0) Predict: -0.29333 {code} > Use weighted midpoints for split values. > > > Key: SPARK-16957 > URL: https://issues.apache.org/jira/browse/SPARK-16957 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Vladimir Feinberg >Priority: Trivial > > We should be using weighted split points rather than the actual continuous > binned feature values. For instance, in a dataset containing binary features > (that are fed in as continuous ones), our splits are selected as {{x <= 0.0}} > and {{x > 0.0}}. For any real data with some smoothness qualities, this is > asymptotically bad compared to GBM's approach. The split point should be a > weighted split point of the two values of the "innermost" feature bins; e.g., > if there are 30 {{x = 0}} and 10 {{x = 1}}, the above split should be at > {{0.75}}. 
> Example: > {code} > +++-+-+ > |feature0|feature1|label|count| > +++-+-+ > | 0.0| 0.0| 0.0| 23| > | 1.0| 0.0| 0.0|2| > | 0.0| 0.0| 1.0|2| > | 0.0| 1.0| 0.0|7| > | 1.0| 0.0| 1.0| 23| > | 0.0| 1.0| 1.0| 18| > | 1.0| 1.0| 1.0|7| > | 1.0| 1.0| 0.0| 18| > +++-+-+ > DecisionTreeRegressionModel (uid=dtr_01ae90d489b1) of depth 2 with 7 nodes > If (feature 0 <= 0.0) >If (feature 1 <= 0.0) > Predict: -0.56 >Else (feature 1 > 0.0) > Predict: 0.29333 > Else (feature 0 > 0.0) >If (feature 1 <= 0.0) > Predict: 0.56 >Else (feature 1 > 0.0) > Predict: -0.29333 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20081) RandomForestClassifier doesn't seem to support more than 100 labels
[ https://issues.apache.org/jira/browse/SPARK-20081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15976098#comment-15976098 ] Yan Facai (颜发才) commented on SPARK-20081: - By the way, for StringIndexer, numerical label column will be cast to string and resort by count. if you had known of all possible labels and want to use custom order (say, lexicographical or natural order) , it's better to construct StringIndexerModel by yourself. > RandomForestClassifier doesn't seem to support more than 100 labels > --- > > Key: SPARK-20081 > URL: https://issues.apache.org/jira/browse/SPARK-20081 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.1.0 > Environment: Java >Reporter: Christian Reiniger > > When feeding data with more than 100 labels into RanfomForestClassifer#fit() > (from java code), I get the following error message: > {code} > Classifier inferred 143 from label values in column > rfc_df0e968db9df__labelCol, but this exceeded the max numClasses (100) > allowed to be inferred from values. > To avoid this error for labels with > 100 classes, specify numClasses > explicitly in the metadata; this can be done by applying StringIndexer to the > label column. > {code} > Setting "numClasses" in the metadata for the label column doesn't make a > difference. Looking at the code, this is not surprising, since > MetadataUtils.getNumClasses() ignores this setting: > {code:language=scala} > def getNumClasses(labelSchema: StructField): Option[Int] = { > Attribute.fromStructField(labelSchema) match { > case binAttr: BinaryAttribute => Some(2) > case nomAttr: NominalAttribute => nomAttr.getNumValues > case _: NumericAttribute | UnresolvedAttribute => None > } > } > {code} > The alternative would be to pass a proper "maxNumClasses" parameter to the > classifier, so that Classifier#getNumClasses() allows a larger number of > auto-detected labels. However, RandomForestClassifer#train() calls > #getNumClasses without the "maxNumClasses" parameter, causing it to use the > default of 100: > {code:language=scala} > override protected def train(dataset: Dataset[_]): > RandomForestClassificationModel = { > val categoricalFeatures: Map[Int, Int] = > MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol))) > val numClasses: Int = getNumClasses(dataset) > // ... > {code} > My scala skills are pretty sketchy, so please correct me if I misinterpreted > something. But as it seems right now, there is no way to learn from data with > more than 100 labels via RandomForestClassifier. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
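As a concrete version of the suggestion above (building the indexer model yourself when the full label set and the desired order are already known), something like the following works on the Scala side; the label values here are placeholders:

{code}
import org.apache.spark.ml.feature.StringIndexerModel

// Construct a StringIndexerModel with an explicit, pre-ordered label list
// (index 0 = "class_000", index 1 = "class_001", ...) instead of letting
// StringIndexer re-derive the labels ordered by frequency.
val allLabels: Array[String] = (0 until 143).map(i => f"class_$i%03d").toArray
val labelIndexer = new StringIndexerModel(allLabels)
  .setInputCol("label")
  .setOutputCol("indexedLabel")
val indexed = labelIndexer.transform(df)  // df is assumed to have a string "label" column
{code}

The indexed column then carries nominal metadata for all 143 values, so downstream classifiers no longer have to infer the class count from the data.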
[jira] [Commented] (SPARK-20199) GradientBoostedTreesModel doesn't have Column Sampling Rate Paramenter
[ https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15969622#comment-15969622 ] Yan Facai (颜发才) commented on SPARK-20199: - ping [~jkbreuer] [~sethah] [~mengxr]. Which one is better? > GradientBoostedTreesModel doesn't have Column Sampling Rate Paramenter > --- > > Key: SPARK-20199 > URL: https://issues.apache.org/jira/browse/SPARK-20199 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: pralabhkumar > > Spark GradientBoostedTreesModel doesn't have Column sampling rate parameter > . This parameter is available in H2O and XGBoost. > Sample from H2O.ai > gbmParams._col_sample_rate > Please provide the parameter . -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20081) RandomForestClassifier doesn't seem to support more than 100 labels
[ https://issues.apache.org/jira/browse/SPARK-20081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15967347#comment-15967347 ] Yan Facai (颜发才) edited comment on SPARK-20081 at 4/13/17 9:42 AM: -- Yes, you should use `builder.putLong("num_vals", numClasses).putString("type", "nominal")`. A little hacky, and it might not work. I am not familiar with Metadata and Attribute class at present. Some experts perhaps have a better solution, unfortunately, I have no idea. If you like to dig deeper, more details to see: org.apache.spark.ml.attribute.Attribute org.apache.spark.ml.attribute.NominalAttribute Use StringIndexer with your label column should work well, which take care of itself, I guess. was (Author: facai): Yes, you should use `builder.putLong("num_vals", numClasses).putString("type", "nominal")`. A little hacky, and it might not work. I am not familiar with Metadata and Attribute class at present. Some experts perhaps have a better solution, unfortunately, I have no idea. Use StringIndexer with your label column should work well, which take care of itself, I guess. > RandomForestClassifier doesn't seem to support more than 100 labels > --- > > Key: SPARK-20081 > URL: https://issues.apache.org/jira/browse/SPARK-20081 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.1.0 > Environment: Java >Reporter: Christian Reiniger > > When feeding data with more than 100 labels into RanfomForestClassifer#fit() > (from java code), I get the following error message: > {code} > Classifier inferred 143 from label values in column > rfc_df0e968db9df__labelCol, but this exceeded the max numClasses (100) > allowed to be inferred from values. > To avoid this error for labels with > 100 classes, specify numClasses > explicitly in the metadata; this can be done by applying StringIndexer to the > label column. > {code} > Setting "numClasses" in the metadata for the label column doesn't make a > difference. Looking at the code, this is not surprising, since > MetadataUtils.getNumClasses() ignores this setting: > {code:language=scala} > def getNumClasses(labelSchema: StructField): Option[Int] = { > Attribute.fromStructField(labelSchema) match { > case binAttr: BinaryAttribute => Some(2) > case nomAttr: NominalAttribute => nomAttr.getNumValues > case _: NumericAttribute | UnresolvedAttribute => None > } > } > {code} > The alternative would be to pass a proper "maxNumClasses" parameter to the > classifier, so that Classifier#getNumClasses() allows a larger number of > auto-detected labels. However, RandomForestClassifer#train() calls > #getNumClasses without the "maxNumClasses" parameter, causing it to use the > default of 100: > {code:language=scala} > override protected def train(dataset: Dataset[_]): > RandomForestClassificationModel = { > val categoricalFeatures: Map[Int, Int] = > MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol))) > val numClasses: Int = getNumClasses(dataset) > // ... > {code} > My scala skills are pretty sketchy, so please correct me if I misinterpreted > something. But as it seems right now, there is no way to learn from data with > more than 100 labels via RandomForestClassifier. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
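If you prefer attaching the metadata directly rather than hand-writing builder keys like num_vals as mentioned above, the ML attribute classes can build it for you. A small sketch (the column name and class count are assumptions for illustration):

{code}
import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.sql.functions.col

// Declare the label column as nominal with a known number of values, so that
// MetadataUtils.getNumClasses can read numClasses back instead of inferring it.
val labelMeta = NominalAttribute.defaultAttr
  .withName("label")
  .withNumValues(143)
  .toMetadata()
val dfWithMeta = df.withColumn("label", col("label").as("label", labelMeta))
{code}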
[jira] [Comment Edited] (SPARK-20081) RandomForestClassifier doesn't seem to support more than 100 labels
[ https://issues.apache.org/jira/browse/SPARK-20081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15967347#comment-15967347 ] Yan Facai (颜发才) edited comment on SPARK-20081 at 4/13/17 9:40 AM: -- Yes, you should use `builder.putLong("num_vals", numClasses).putString("type", "nominal")`. A little hacky, and it might not work. I am not familiar with Metadata and Attribute class at present. Some experts perhaps have a better solution, unfortunately, I have no idea. Use StringIndexer with your label column should work well, which take care of itself, I guess. was (Author: facai): Yes, you should use `builder.putLong("num_vals", numClasses).putString("type", "nominal")`. A little hacky, and it might not work. I am not familiar with Metadata and Attribute class at present. Some experts perhaps have a better solution, unfortunately, I have no idea. Use StringIndexer with your label column should work well, which take care of itself, I guess. > RandomForestClassifier doesn't seem to support more than 100 labels > --- > > Key: SPARK-20081 > URL: https://issues.apache.org/jira/browse/SPARK-20081 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.1.0 > Environment: Java >Reporter: Christian Reiniger > > When feeding data with more than 100 labels into RanfomForestClassifer#fit() > (from java code), I get the following error message: > {code} > Classifier inferred 143 from label values in column > rfc_df0e968db9df__labelCol, but this exceeded the max numClasses (100) > allowed to be inferred from values. > To avoid this error for labels with > 100 classes, specify numClasses > explicitly in the metadata; this can be done by applying StringIndexer to the > label column. > {code} > Setting "numClasses" in the metadata for the label column doesn't make a > difference. Looking at the code, this is not surprising, since > MetadataUtils.getNumClasses() ignores this setting: > {code:language=scala} > def getNumClasses(labelSchema: StructField): Option[Int] = { > Attribute.fromStructField(labelSchema) match { > case binAttr: BinaryAttribute => Some(2) > case nomAttr: NominalAttribute => nomAttr.getNumValues > case _: NumericAttribute | UnresolvedAttribute => None > } > } > {code} > The alternative would be to pass a proper "maxNumClasses" parameter to the > classifier, so that Classifier#getNumClasses() allows a larger number of > auto-detected labels. However, RandomForestClassifer#train() calls > #getNumClasses without the "maxNumClasses" parameter, causing it to use the > default of 100: > {code:language=scala} > override protected def train(dataset: Dataset[_]): > RandomForestClassificationModel = { > val categoricalFeatures: Map[Int, Int] = > MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol))) > val numClasses: Int = getNumClasses(dataset) > // ... > {code} > My scala skills are pretty sketchy, so please correct me if I misinterpreted > something. But as it seems right now, there is no way to learn from data with > more than 100 labels via RandomForestClassifier. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20081) RandomForestClassifier doesn't seem to support more than 100 labels
[ https://issues.apache.org/jira/browse/SPARK-20081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15967347#comment-15967347 ] Yan Facai (颜发才) edited comment on SPARK-20081 at 4/13/17 9:40 AM: -- Yes, you should use `builder.putLong("num_vals", numClasses).putString("type", "nominal")`. A little hacky, and it might not work. I am not familiar with Metadata and Attribute class at present. Some experts perhaps have a better solution, unfortunately, I have no idea. Use StringIndexer with your label column should work well, which take care of itself, I guess. was (Author: facai): Yes, you should use `builder.putLong("num_vals", numClasses)`. A little hacky, and it might not work. I am not familiar with Metadata and Attribute class at present. Some experts perhaps have a better solution, unfortunately, I have no idea. Use StringIndexer with your label column should work well, which take care of itself, I guess. > RandomForestClassifier doesn't seem to support more than 100 labels > --- > > Key: SPARK-20081 > URL: https://issues.apache.org/jira/browse/SPARK-20081 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.1.0 > Environment: Java >Reporter: Christian Reiniger > > When feeding data with more than 100 labels into RanfomForestClassifer#fit() > (from java code), I get the following error message: > {code} > Classifier inferred 143 from label values in column > rfc_df0e968db9df__labelCol, but this exceeded the max numClasses (100) > allowed to be inferred from values. > To avoid this error for labels with > 100 classes, specify numClasses > explicitly in the metadata; this can be done by applying StringIndexer to the > label column. > {code} > Setting "numClasses" in the metadata for the label column doesn't make a > difference. Looking at the code, this is not surprising, since > MetadataUtils.getNumClasses() ignores this setting: > {code:language=scala} > def getNumClasses(labelSchema: StructField): Option[Int] = { > Attribute.fromStructField(labelSchema) match { > case binAttr: BinaryAttribute => Some(2) > case nomAttr: NominalAttribute => nomAttr.getNumValues > case _: NumericAttribute | UnresolvedAttribute => None > } > } > {code} > The alternative would be to pass a proper "maxNumClasses" parameter to the > classifier, so that Classifier#getNumClasses() allows a larger number of > auto-detected labels. However, RandomForestClassifer#train() calls > #getNumClasses without the "maxNumClasses" parameter, causing it to use the > default of 100: > {code:language=scala} > override protected def train(dataset: Dataset[_]): > RandomForestClassificationModel = { > val categoricalFeatures: Map[Int, Int] = > MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol))) > val numClasses: Int = getNumClasses(dataset) > // ... > {code} > My scala skills are pretty sketchy, so please correct me if I misinterpreted > something. But as it seems right now, there is no way to learn from data with > more than 100 labels via RandomForestClassifier. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20081) RandomForestClassifier doesn't seem to support more than 100 labels
[ https://issues.apache.org/jira/browse/SPARK-20081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15967347#comment-15967347 ] Yan Facai (颜发才) commented on SPARK-20081: - Yes, you should use `builder.putLong("num_vals", numClasses)`. A little hacky, and it might not work. I am not familiar with Metadata and Attribute class at present. Some experts perhaps have a better solution, unfortunately, I have no idea. Use StringIndexer with your label column should work well, which take care of itself, I guess. > RandomForestClassifier doesn't seem to support more than 100 labels > --- > > Key: SPARK-20081 > URL: https://issues.apache.org/jira/browse/SPARK-20081 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.1.0 > Environment: Java >Reporter: Christian Reiniger > > When feeding data with more than 100 labels into RanfomForestClassifer#fit() > (from java code), I get the following error message: > {code} > Classifier inferred 143 from label values in column > rfc_df0e968db9df__labelCol, but this exceeded the max numClasses (100) > allowed to be inferred from values. > To avoid this error for labels with > 100 classes, specify numClasses > explicitly in the metadata; this can be done by applying StringIndexer to the > label column. > {code} > Setting "numClasses" in the metadata for the label column doesn't make a > difference. Looking at the code, this is not surprising, since > MetadataUtils.getNumClasses() ignores this setting: > {code:language=scala} > def getNumClasses(labelSchema: StructField): Option[Int] = { > Attribute.fromStructField(labelSchema) match { > case binAttr: BinaryAttribute => Some(2) > case nomAttr: NominalAttribute => nomAttr.getNumValues > case _: NumericAttribute | UnresolvedAttribute => None > } > } > {code} > The alternative would be to pass a proper "maxNumClasses" parameter to the > classifier, so that Classifier#getNumClasses() allows a larger number of > auto-detected labels. However, RandomForestClassifer#train() calls > #getNumClasses without the "maxNumClasses" parameter, causing it to use the > default of 100: > {code:language=scala} > override protected def train(dataset: Dataset[_]): > RandomForestClassificationModel = { > val categoricalFeatures: Map[Int, Int] = > MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol))) > val numClasses: Int = getNumClasses(dataset) > // ... > {code} > My scala skills are pretty sketchy, so please correct me if I misinterpreted > something. But as it seems right now, there is no way to learn from data with > more than 100 labels via RandomForestClassifier. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19141) VectorAssembler metadata causing memory issues
[ https://issues.apache.org/jira/browse/SPARK-19141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15967246#comment-15967246 ] Yan Facai (颜发才) commented on SPARK-19141: - `VectorAssembler` will create attribute (name) for each bit in Vector to be merged. If the dimensions is too large, metadata can run out memory, though Vector is sparse indeed. I believe the issue is critical, since `VectorAssembler` is the main method for constructing feature vector. > VectorAssembler metadata causing memory issues > -- > > Key: SPARK-19141 > URL: https://issues.apache.org/jira/browse/SPARK-19141 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 1.6.0, 2.0.0, 2.1.0 > Environment: Windows 10, Ubuntu 16.04.1, Scala 2.11.8, Spark 1.6.0, > 2.0.0, 2.1.0 >Reporter: Antonia Oprescu > > VectorAssembler produces unnecessary metadata that overflows the Java heap in > the case of sparse vectors. In the example below, the logical length of the > vector is 10^6, but the number of non-zero values is only 2. > The problem arises when the vector assembler creates metadata (ML attributes) > for each of the 10^6 slots, even if this metadata didn't exist upstream (i.e. > HashingTF doesn't produce metadata per slot). Here is a chunk of metadata it > produces: > {noformat} > {"ml_attr":{"attrs":{"numeric":[{"idx":0,"name":"HashedFeat_0"},{"idx":1,"name":"HashedFeat_1"},{"idx":2,"name":"HashedFeat_2"},{"idx":3,"name":"HashedFeat_3"},{"idx":4,"name":"HashedFeat_4"},{"idx":5,"name":"HashedFeat_5"},{"idx":6,"name":"HashedFeat_6"},{"idx":7,"name":"HashedFeat_7"},{"idx":8,"name":"HashedFeat_8"},{"idx":9,"name":"HashedFeat_9"},...,{"idx":100,"name":"Feat01"}]},"num_attrs":101}} > {noformat} > In this lightweight example, the feature size limit seems to be 1,000,000 > when run locally, but this scales poorly with more complicated routines. With > a larger dataset and a learner (say LogisticRegression), it maxes out > anywhere between 10k and 100k hash size even on a decent sized cluster. > I did some digging, and it seems that the only metadata necessary for > downstream learners is the one indicating categorical columns. Thus, I > thought of the following possible solutions: > 1. Compact representation of ml attributes metadata (but this seems to be a > bigger change) > 2. Removal of non-categorical tags from the metadata created by the > VectorAssembler > 3. An option on the existent VectorAssembler to skip unnecessary ml > attributes or create another transformer altogether > I would happy to take a stab at any of these solutions, but I need some > direction from the Spark community. 
> {code:title=VABug.scala |borderStyle=solid} > import org.apache.spark.SparkConf > import org.apache.spark.ml.feature.{HashingTF, VectorAssembler} > import org.apache.spark.sql.SparkSession > object VARepro { > case class Record(Label: Double, Feat01: Double, Feat02: Array[String]) > def main(args: Array[String]) { > val conf = new SparkConf() > .setAppName("Vector assembler bug") > .setMaster("local[*]") > val spark = SparkSession.builder.config(conf).getOrCreate() > import spark.implicits._ > val df = Seq(Record(1.0, 2.0, Array("4daf")), Record(0.0, 3.0, > Array("a9ee"))).toDS() > val numFeatures = 1000 > val hashingScheme = new > HashingTF().setInputCol("Feat02").setOutputCol("HashedFeat").setNumFeatures(numFeatures) > val hashedData = hashingScheme.transform(df) > val vectorAssembler = new > VectorAssembler().setInputCols(Array("HashedFeat","Feat01")).setOutputCol("Features") > val processedData = vectorAssembler.transform(hashedData).select("Label", > "Features") > processedData.show() > } > } > {code} > *Stacktrace from the example above:* > Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit > exceeded > at > org.apache.spark.ml.attribute.NumericAttribute.copy(attributes.scala:272) > at > org.apache.spark.ml.attribute.NumericAttribute.withIndex(attributes.scala:215)
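Until one of the three options above lands, a user-side workaround is to strip the assembled column's ML attribute metadata before handing it to a learner, at the cost of losing any categorical-feature information it carried. A minimal sketch building on the VABug.scala example:

{code}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.Metadata

// Replace the huge per-slot attribute metadata on "Features" with empty metadata.
// Only do this when the downstream estimator does not rely on categorical info.
val slimData = processedData.withColumn(
  "Features", col("Features").as("Features", Metadata.empty))
{code}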
[jira] [Comment Edited] (SPARK-19141) VectorAssembler metadata causing memory issues
[ https://issues.apache.org/jira/browse/SPARK-19141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15967246#comment-15967246 ] Yan Facai (颜发才) edited comment on SPARK-19141 at 4/13/17 7:42 AM: -- `VectorAssembler` will create attribute (name) for each bit in Vector. If the dimensions is too large, metadata can run out memory, though Vector is sparse indeed. I believe the issue is critical, since `VectorAssembler` is the main method for constructing feature vector. was (Author: facai): `VectorAssembler` will create attribute (name) for each bit in Vector to be merged. If the dimensions is too large, metadata can run out memory, though Vector is sparse indeed. I believe the issue is critical, since `VectorAssembler` is the main method for constructing feature vector. > VectorAssembler metadata causing memory issues > -- > > Key: SPARK-19141 > URL: https://issues.apache.org/jira/browse/SPARK-19141 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 1.6.0, 2.0.0, 2.1.0 > Environment: Windows 10, Ubuntu 16.04.1, Scala 2.11.8, Spark 1.6.0, > 2.0.0, 2.1.0 >Reporter: Antonia Oprescu > > VectorAssembler produces unnecessary metadata that overflows the Java heap in > the case of sparse vectors. In the example below, the logical length of the > vector is 10^6, but the number of non-zero values is only 2. > The problem arises when the vector assembler creates metadata (ML attributes) > for each of the 10^6 slots, even if this metadata didn't exist upstream (i.e. > HashingTF doesn't produce metadata per slot). Here is a chunk of metadata it > produces: > {noformat} > {"ml_attr":{"attrs":{"numeric":[{"idx":0,"name":"HashedFeat_0"},{"idx":1,"name":"HashedFeat_1"},{"idx":2,"name":"HashedFeat_2"},{"idx":3,"name":"HashedFeat_3"},{"idx":4,"name":"HashedFeat_4"},{"idx":5,"name":"HashedFeat_5"},{"idx":6,"name":"HashedFeat_6"},{"idx":7,"name":"HashedFeat_7"},{"idx":8,"name":"HashedFeat_8"},{"idx":9,"name":"HashedFeat_9"},...,{"idx":100,"name":"Feat01"}]},"num_attrs":101}} > {noformat} > In this lightweight example, the feature size limit seems to be 1,000,000 > when run locally, but this scales poorly with more complicated routines. With > a larger dataset and a learner (say LogisticRegression), it maxes out > anywhere between 10k and 100k hash size even on a decent sized cluster. > I did some digging, and it seems that the only metadata necessary for > downstream learners is the one indicating categorical columns. Thus, I > thought of the following possible solutions: > 1. Compact representation of ml attributes metadata (but this seems to be a > bigger change) > 2. Removal of non-categorical tags from the metadata created by the > VectorAssembler > 3. An option on the existent VectorAssembler to skip unnecessary ml > attributes or create another transformer altogether > I would happy to take a stab at any of these solutions, but I need some > direction from the Spark community. 
> {code:title=VABug.scala |borderStyle=solid} > import org.apache.spark.SparkConf > import org.apache.spark.ml.feature.{HashingTF, VectorAssembler} > import org.apache.spark.sql.SparkSession > object VARepro { > case class Record(Label: Double, Feat01: Double, Feat02: Array[String]) > def main(args: Array[String]) { > val conf = new SparkConf() > .setAppName("Vector assembler bug") > .setMaster("local[*]") > val spark = SparkSession.builder.config(conf).getOrCreate() > import spark.implicits._ > val df = Seq(Record(1.0, 2.0, Array("4daf")), Record(0.0, 3.0, > Array("a9ee"))).toDS() > val numFeatures = 1000 > val hashingScheme = new > HashingTF().setInputCol("Feat02").setOutputCol("HashedFeat").setNumFeatures(numFeatures) > val hashedData = hashingScheme.transform(df) > val vectorAssembler = new > VectorAssembler().setInputCols(Array("HashedFeat","Feat01")).setOutputCol("Features") > val processedData = vectorAssembler.transform(hashedData).select("Label", > "Features") > processedData.show() > } > } > {code}
[jira] [Commented] (SPARK-20081) RandomForestClassifier doesn't seem to support more than 100 labels
[ https://issues.apache.org/jira/browse/SPARK-20081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15967204#comment-15967204 ] Yan Facai (颜发才) commented on SPARK-20081: - How about adding a `setNumClass` to shortcut infer process, like LibSVMDataSource's option numFeatures? > RandomForestClassifier doesn't seem to support more than 100 labels > --- > > Key: SPARK-20081 > URL: https://issues.apache.org/jira/browse/SPARK-20081 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.1.0 > Environment: Java >Reporter: Christian Reiniger > > When feeding data with more than 100 labels into RanfomForestClassifer#fit() > (from java code), I get the following error message: > {code} > Classifier inferred 143 from label values in column > rfc_df0e968db9df__labelCol, but this exceeded the max numClasses (100) > allowed to be inferred from values. > To avoid this error for labels with > 100 classes, specify numClasses > explicitly in the metadata; this can be done by applying StringIndexer to the > label column. > {code} > Setting "numClasses" in the metadata for the label column doesn't make a > difference. Looking at the code, this is not surprising, since > MetadataUtils.getNumClasses() ignores this setting: > {code:language=scala} > def getNumClasses(labelSchema: StructField): Option[Int] = { > Attribute.fromStructField(labelSchema) match { > case binAttr: BinaryAttribute => Some(2) > case nomAttr: NominalAttribute => nomAttr.getNumValues > case _: NumericAttribute | UnresolvedAttribute => None > } > } > {code} > The alternative would be to pass a proper "maxNumClasses" parameter to the > classifier, so that Classifier#getNumClasses() allows a larger number of > auto-detected labels. However, RandomForestClassifer#train() calls > #getNumClasses without the "maxNumClasses" parameter, causing it to use the > default of 100: > {code:language=scala} > override protected def train(dataset: Dataset[_]): > RandomForestClassificationModel = { > val categoricalFeatures: Map[Int, Int] = > MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol))) > val numClasses: Int = getNumClasses(dataset) > // ... > {code} > My scala skills are pretty sketchy, so please correct me if I misinterpreted > something. But as it seems right now, there is no way to learn from data with > more than 100 labels via RandomForestClassifier. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-20081) RandomForestClassifier doesn't seem to support more than 100 labels
[ https://issues.apache.org/jira/browse/SPARK-20081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15967196#comment-15967196 ] Yan Facai (颜发才) edited comment on SPARK-20081 at 4/13/17 6:48 AM: -- [~creinig] Christian, RandomForestClassifier use numClass to calculate memory space needed. As far as I know, numClass is inferred by `getNumClass` of Classifier, and now explicitly `setNumClass` is missing. I don't know whether it is in the future plan. Moreover, NominalAttribute is private[ml]. It seems that we cannot modify metadata outside. But, you can use `StringIndexer` to transform your label column, and StringIndexer will help you construct correct metadata (nomAttr.getNumValues). Its usage see: http://spark.apache.org/docs/latest/ml-features.html#stringindexer The solution is a little tricky. How about it? ping [~josephkb] was (Author: facai): [~creinig] Christian, RandomForestClassifier use numClass to calculate memory space needed. As far as I know, numClass is inferred by `getNumClass` of Classifier, and now explicitly `setNumClass` is missing. I don't know whether it is in the future plan. In fact, you can use `StringIndexer` to transform your label column, and StringIndexer will help you construct correct metadata (nomAttr.getNumValues). Its usage see: http://spark.apache.org/docs/latest/ml-features.html#stringindexer The solution is a little tricky. How about it? ping [~josephkb] > RandomForestClassifier doesn't seem to support more than 100 labels > --- > > Key: SPARK-20081 > URL: https://issues.apache.org/jira/browse/SPARK-20081 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.1.0 > Environment: Java >Reporter: Christian Reiniger > > When feeding data with more than 100 labels into RanfomForestClassifer#fit() > (from java code), I get the following error message: > {code} > Classifier inferred 143 from label values in column > rfc_df0e968db9df__labelCol, but this exceeded the max numClasses (100) > allowed to be inferred from values. > To avoid this error for labels with > 100 classes, specify numClasses > explicitly in the metadata; this can be done by applying StringIndexer to the > label column. > {code} > Setting "numClasses" in the metadata for the label column doesn't make a > difference. Looking at the code, this is not surprising, since > MetadataUtils.getNumClasses() ignores this setting: > {code:language=scala} > def getNumClasses(labelSchema: StructField): Option[Int] = { > Attribute.fromStructField(labelSchema) match { > case binAttr: BinaryAttribute => Some(2) > case nomAttr: NominalAttribute => nomAttr.getNumValues > case _: NumericAttribute | UnresolvedAttribute => None > } > } > {code} > The alternative would be to pass a proper "maxNumClasses" parameter to the > classifier, so that Classifier#getNumClasses() allows a larger number of > auto-detected labels. However, RandomForestClassifer#train() calls > #getNumClasses without the "maxNumClasses" parameter, causing it to use the > default of 100: > {code:language=scala} > override protected def train(dataset: Dataset[_]): > RandomForestClassificationModel = { > val categoricalFeatures: Map[Int, Int] = > MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol))) > val numClasses: Int = getNumClasses(dataset) > // ... > {code} > My scala skills are pretty sketchy, so please correct me if I misinterpreted > something. But as it seems right now, there is no way to learn from data with > more than 100 labels via RandomForestClassifier. 
-- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
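A short sketch of the StringIndexer workaround suggested in the comment above (the column names "label" and "features" and the `trainingData` DataFrame are assumptions for the sketch): StringIndexer attaches NominalAttribute metadata with the number of distinct labels, so the classifier no longer has to infer numClasses from the values and hit the 100-class cap.
{code:language=scala}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.StringIndexer

// Index the raw label column; the fitted indexer records the number of label values
// in the output column's metadata (nomAttr.getNumValues).
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")

val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("features")

// getNumClasses reads the metadata written by StringIndexer, so more than 100 classes work.
val model = new Pipeline().setStages(Array(labelIndexer, rf)).fit(trainingData)
{code}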
[jira] [Updated] (SPARK-20081) RandomForestClassifier doesn't seem to support more than 100 labels
[ https://issues.apache.org/jira/browse/SPARK-20081?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Facai (颜发才) updated SPARK-20081: Component/s: ML > RandomForestClassifier doesn't seem to support more than 100 labels > --- > > Key: SPARK-20081 > URL: https://issues.apache.org/jira/browse/SPARK-20081 > Project: Spark > Issue Type: Bug > Components: ML, MLlib >Affects Versions: 2.1.0 > Environment: Java >Reporter: Christian Reiniger > > When feeding data with more than 100 labels into RanfomForestClassifer#fit() > (from java code), I get the following error message: > {code} > Classifier inferred 143 from label values in column > rfc_df0e968db9df__labelCol, but this exceeded the max numClasses (100) > allowed to be inferred from values. > To avoid this error for labels with > 100 classes, specify numClasses > explicitly in the metadata; this can be done by applying StringIndexer to the > label column. > {code} > Setting "numClasses" in the metadata for the label column doesn't make a > difference. Looking at the code, this is not surprising, since > MetadataUtils.getNumClasses() ignores this setting: > {code:language=scala} > def getNumClasses(labelSchema: StructField): Option[Int] = { > Attribute.fromStructField(labelSchema) match { > case binAttr: BinaryAttribute => Some(2) > case nomAttr: NominalAttribute => nomAttr.getNumValues > case _: NumericAttribute | UnresolvedAttribute => None > } > } > {code} > The alternative would be to pass a proper "maxNumClasses" parameter to the > classifier, so that Classifier#getNumClasses() allows a larger number of > auto-detected labels. However, RandomForestClassifer#train() calls > #getNumClasses without the "maxNumClasses" parameter, causing it to use the > default of 100: > {code:language=scala} > override protected def train(dataset: Dataset[_]): > RandomForestClassificationModel = { > val categoricalFeatures: Map[Int, Int] = > MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol))) > val numClasses: Int = getNumClasses(dataset) > // ... > {code} > My scala skills are pretty sketchy, so please correct me if I misinterpreted > something. But as it seems right now, there is no way to learn from data with > more than 100 labels via RandomForestClassifier. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20081) RandomForestClassifier doesn't seem to support more than 100 labels
[ https://issues.apache.org/jira/browse/SPARK-20081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15967196#comment-15967196 ] Yan Facai (颜发才) commented on SPARK-20081: - [~creinig] Christian, RandomForestClassifier use numClass to calculate memory space needed. As far as I know, numClass is inferred by `getNumClass` of Classifier, and now explicitly `setNumClass` is missing. I don't know whether it is in the future plan. In fact, you can use `StringIndexer` to transform your label column, and StringIndexer will help you construct correct metadata (nomAttr.getNumValues). Its usage see: http://spark.apache.org/docs/latest/ml-features.html#stringindexer The solution is a little tricky. How about it? ping [~josephkb] > RandomForestClassifier doesn't seem to support more than 100 labels > --- > > Key: SPARK-20081 > URL: https://issues.apache.org/jira/browse/SPARK-20081 > Project: Spark > Issue Type: Bug > Components: MLlib >Affects Versions: 2.1.0 > Environment: Java >Reporter: Christian Reiniger > > When feeding data with more than 100 labels into RanfomForestClassifer#fit() > (from java code), I get the following error message: > {code} > Classifier inferred 143 from label values in column > rfc_df0e968db9df__labelCol, but this exceeded the max numClasses (100) > allowed to be inferred from values. > To avoid this error for labels with > 100 classes, specify numClasses > explicitly in the metadata; this can be done by applying StringIndexer to the > label column. > {code} > Setting "numClasses" in the metadata for the label column doesn't make a > difference. Looking at the code, this is not surprising, since > MetadataUtils.getNumClasses() ignores this setting: > {code:language=scala} > def getNumClasses(labelSchema: StructField): Option[Int] = { > Attribute.fromStructField(labelSchema) match { > case binAttr: BinaryAttribute => Some(2) > case nomAttr: NominalAttribute => nomAttr.getNumValues > case _: NumericAttribute | UnresolvedAttribute => None > } > } > {code} > The alternative would be to pass a proper "maxNumClasses" parameter to the > classifier, so that Classifier#getNumClasses() allows a larger number of > auto-detected labels. However, RandomForestClassifer#train() calls > #getNumClasses without the "maxNumClasses" parameter, causing it to use the > default of 100: > {code:language=scala} > override protected def train(dataset: Dataset[_]): > RandomForestClassificationModel = { > val categoricalFeatures: Map[Int, Int] = > MetadataUtils.getCategoricalFeatures(dataset.schema($(featuresCol))) > val numClasses: Int = getNumClasses(dataset) > // ... > {code} > My scala skills are pretty sketchy, so please correct me if I misinterpreted > something. But as it seems right now, there is no way to learn from data with > more than 100 labels via RandomForestClassifier. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20199) GradientBoostedTreesModel doesn't have Column Sampling Rate Paramenter
[ https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15966926#comment-15966926 ] Yan Facai (颜发才) commented on SPARK-20199: - It's not hard, and I can work on it. However, there are two possible solutions: 1. Add a `setFeatureSubsetStrategy` method to DecisionTree, so that GBT creates each DecisionTree through this method, with code like `val dt = new DecisionTreeRegressor().setFeatureSubsetStrategy(xxx)`. 2. Add a `featureSubsetStrategy` param to the `train` method of DecisionTree; this is the minimal change. Which one is better? I prefer the first. > GradientBoostedTreesModel doesn't have Column Sampling Rate Paramenter > --- > > Key: SPARK-20199 > URL: https://issues.apache.org/jira/browse/SPARK-20199 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: pralabhkumar > > Spark GradientBoostedTreesModel doesn't have Column sampling rate parameter > . This parameter is available in H2O and XGBoost. > Sample from H2O.ai > gbmParams._col_sample_rate > Please provide the parameter . -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
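A hypothetical sketch of option 1 from the comment above; `setFeatureSubsetStrategy` does not exist on the single-tree estimator at the time of this comment and is shown only as the proposed shape of the API.
{code:language=scala}
import org.apache.spark.ml.regression.DecisionTreeRegressor

// Proposed (not yet existing) setter on the single-tree API; "all" would keep the
// current behavior, while e.g. "sqrt" or a fraction would subsample columns per tree.
val dt = new DecisionTreeRegressor()
  .setMaxDepth(5)
  .setFeatureSubsetStrategy("sqrt")

// GBT would then build each boosting iteration from such a tree instead of
// hardcoding featureSubsetStrategy = "all" internally.
{code}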
[jira] [Commented] (SPARK-20199) GradientBoostedTreesModel doesn't have Column Sampling Rate Paramenter
[ https://issues.apache.org/jira/browse/SPARK-20199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15965296#comment-15965296 ] Yan Facai (颜发才) commented on SPARK-20199: - Yes, as [~pralabhkumar] said, DecisionTree hardcodes featureSubsetStrategy. How about adding setFeatureSubsetStrategy for DecisionTree? > GradientBoostedTreesModel doesn't have Column Sampling Rate Paramenter > --- > > Key: SPARK-20199 > URL: https://issues.apache.org/jira/browse/SPARK-20199 > Project: Spark > Issue Type: Improvement > Components: ML, MLlib >Affects Versions: 2.1.0 >Reporter: pralabhkumar >Priority: Minor > > Spark GradientBoostedTreesModel doesn't have Column sampling rate parameter > . This parameter is available in H2O and XGBoost. > Sample from H2O.ai > gbmParams._col_sample_rate > Please provide the parameter . -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-3383) DecisionTree aggregate size could be smaller
[ https://issues.apache.org/jira/browse/SPARK-3383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15963741#comment-15963741 ] Yan Facai (颜发才) edited comment on SPARK-3383 at 4/12/17 2:02 AM: - I think the task contains two subtask: 1. separate `split` with `bin`: Now for each categorical feature, there is 1 bin per split. That's said, for N categories, the communicate cost is 2^(N-1) - 1 bins. However, if we only get stats for each category, and construct splits finally. Namely, 1 bin per category. The communicate cost is N bins. 2. As said in Description, store all but the last bin, and also store the total statistics for each node. The communicate cost will be N-1 bins. I have a question: 1. why unordered features only are allowed in multiclass classification? was (Author: facai): I think the task contains two subtask: 1. separate `split` with `bin`: Now for each categorical feature, there is 1 bin per split. That's said, for N categories, the communicate cost is 2^{N-1} - 1 bins. However, if we only get stats for each category, and construct splits finally. Namely, 1 bin per category. The communicate cost is N bins. 2. As said in Description, store all but the last bin, and also store the total statistics for each node. The communicate cost will be N-1 bins. I have a question: 1. why unordered features only are allowed in multiclass classification? > DecisionTree aggregate size could be smaller > > > Key: SPARK-3383 > URL: https://issues.apache.org/jira/browse/SPARK-3383 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.1.0 >Reporter: Joseph K. Bradley >Priority: Minor > > Storage and communication optimization: > DecisionTree aggregate statistics could store less data (described below). > The savings would be significant for datasets with many low-arity categorical > features (binary features, or unordered categorical features). Savings would > be negligible for continuous features. > DecisionTree stores a vector sufficient statistics for each (node, feature, > bin). We could store 1 fewer bin per (node, feature): For a given (node, > feature), if we store these vectors for all but the last bin, and also store > the total statistics for each node, then we could compute the statistics for > the last bin. For binary and unordered categorical features, this would cut > in half the number of bins to store and communicate. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3383) DecisionTree aggregate size could be smaller
[ https://issues.apache.org/jira/browse/SPARK-3383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15965263#comment-15965263 ] Yan Facai (颜发才) commented on SPARK-3383: How about this idea? 1. We use `bin` to represent a value: a quantized value for a continuous feature and a category for a discrete feature. In DTStatsAggregator, only collect stats per bin. At this stage, all operations are the same, no matter whether the feature is continuous or discrete. 2. In `binsToBestSplit`, + a continuous / ordered discrete feature has N bins, from which we construct N - 1 splits; + an unordered discrete feature has N bins, from which we construct all possible combinations, namely 2^(N-1) - 1 splits. 3. In `binsToBestSplit`, collect all splits and calculate their impurity, then order the splits and find the best one. > DecisionTree aggregate size could be smaller > > > Key: SPARK-3383 > URL: https://issues.apache.org/jira/browse/SPARK-3383 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.1.0 >Reporter: Joseph K. Bradley >Priority: Minor > > Storage and communication optimization: > DecisionTree aggregate statistics could store less data (described below). > The savings would be significant for datasets with many low-arity categorical > features (binary features, or unordered categorical features). Savings would > be negligible for continuous features. > DecisionTree stores a vector sufficient statistics for each (node, feature, > bin). We could store 1 fewer bin per (node, feature): For a given (node, > feature), if we store these vectors for all but the last bin, and also store > the total statistics for each node, then we could compute the statistics for > the last bin. For binary and unordered categorical features, this would cut > in half the number of bins to store and communicate. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
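The arithmetic behind the bin-versus-split distinction in the comment above, as a plain Scala sketch with no Spark APIs: aggregating one bin per category costs N statistics vectors, while aggregating one bin per unordered split costs 2^(N-1) - 1.
{code:language=scala}
// For a categorical feature with k values:
//   - 1 bin per category        -> k sufficient-statistics vectors to aggregate/communicate
//   - 1 bin per unordered split -> (2^(k-1) - 1) vectors, one per candidate subset
def binsPerCategory(k: Int): Int = k
def unorderedSplits(k: Int): Int = (1 << (k - 1)) - 1

(2 to 10).foreach { k =>
  println(s"k=$k: per-category bins=${binsPerCategory(k)}, unordered splits=${unorderedSplits(k)}")
}
// k=3 -> 3 vs 3; k=10 -> 10 vs 511
{code}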
[jira] [Comment Edited] (SPARK-10788) Decision Tree duplicates bins for unordered categorical features
[ https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15965244#comment-15965244 ] Yan Facai (颜发才) edited comment on SPARK-10788 at 4/12/17 1:35 AM: -- [~josephkb] As categories A, B and C are independent, why not collect statistics only for each category? I mean 1 bin per category, instead of 1 bin per split. Splits are calculated in the last step in `binsToBestSplit`. So the communication cost is N bins. was (Author: facai): [~josephkb] As categories A, B and C are independent, why not collect statistics only for each category? Splits are calculated in the last step in `binsToBestSplit`. So the communication cost is N bins. > Decision Tree duplicates bins for unordered categorical features > > > Key: SPARK-10788 > URL: https://issues.apache.org/jira/browse/SPARK-10788 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Seth Hendrickson >Priority: Minor > Fix For: 2.0.0 > > > Decision trees in spark.ml (RandomForest.scala) communicate twice as much > data as needed for unordered categorical features. Here's an example. > Say there are 3 categories A, B, C. We consider 3 splits: > * A vs. B, C > * A, B vs. C > * A, C vs. B > Currently, we collect statistics for each of the 6 subsets of categories (3 * > 2 = 6). However, we could instead collect statistics for the 3 subsets on > the left-hand side of the 3 possible splits: A and A,B and A,C. If we also > have stats for the entire node, then we can compute the stats for the 3 > subsets on the right-hand side of the splits. In pseudomath: {{stats(B,C) = > stats(A,B,C) - stats(A)}}. > We should eliminate these extra bins within the spark.ml implementation since > the spark.mllib implementation will be removed before long (and will instead > call into spark.ml). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
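A tiny illustration of the subtraction in the issue description, stats(B,C) = stats(A,B,C) - stats(A), using per-class label counts as the sufficient statistics; the array representation is an assumption for the sketch, not the aggregator's actual layout.
{code:language=scala}
// Right-hand-side stats recovered from the node total and the left-hand-side bin.
def rightStats(parentStats: Array[Double], leftStats: Array[Double]): Array[Double] =
  parentStats.zip(leftStats).map { case (p, l) => p - l }

val statsABC = Array(40.0, 25.0) // label counts for the whole node (categories A, B, C)
val statsA   = Array(12.0, 9.0)  // label counts for the left side of split A vs. B,C
val statsBC  = rightStats(statsABC, statsA) // Array(28.0, 16.0)
{code}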
[jira] [Commented] (SPARK-10788) Decision Tree duplicates bins for unordered categorical features
[ https://issues.apache.org/jira/browse/SPARK-10788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15965244#comment-15965244 ] Yan Facai (颜发才) commented on SPARK-10788: - [~josephkb] As categories A, B and C are independent, why not collect statistics only for cateogry? Splits are calculated in the last step in `binsToBestSplit`. So communication cost is N bins. > Decision Tree duplicates bins for unordered categorical features > > > Key: SPARK-10788 > URL: https://issues.apache.org/jira/browse/SPARK-10788 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Seth Hendrickson >Priority: Minor > Fix For: 2.0.0 > > > Decision trees in spark.ml (RandomForest.scala) communicate twice as much > data as needed for unordered categorical features. Here's an example. > Say there are 3 categories A, B, C. We consider 3 splits: > * A vs. B, C > * A, B vs. C > * A, C vs. B > Currently, we collect statistics for each of the 6 subsets of categories (3 * > 2 = 6). However, we could instead collect statistics for the 3 subsets on > the left-hand side of the 3 possible splits: A and A,B and A,C. If we also > have stats for the entire node, then we can compute the stats for the 3 > subsets on the right-hand side of the splits. In pseudomath: {{stats(B,C) = > stats(A,B,C) - stats(A)}}. > We should eliminate these extra bins within the spark.ml implementation since > the spark.mllib implementation will be removed before long (and will instead > call into spark.ml). -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3383) DecisionTree aggregate size could be smaller
[ https://issues.apache.org/jira/browse/SPARK-3383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15963741#comment-15963741 ] Yan Facai (颜发才) commented on SPARK-3383: I think the task contains two subtask: 1. separate `split` with `bin`: Now for each categorical feature, there is 1 bin per split. That's said, for N categories, the communicate cost is 2^{N-1} - 1 bins. However, if we only get stats for each category, and construct splits finally. Namely, 1 bin per category. The communicate cost is N bins. 2. As said in Description, store all but the last bin, and also store the total statistics for each node. The communicate cost will be N-1 bins. I have a question: 1. why unordered features only are allowed in multiclass classification? > DecisionTree aggregate size could be smaller > > > Key: SPARK-3383 > URL: https://issues.apache.org/jira/browse/SPARK-3383 > Project: Spark > Issue Type: Improvement > Components: MLlib >Affects Versions: 1.1.0 >Reporter: Joseph K. Bradley >Priority: Minor > > Storage and communication optimization: > DecisionTree aggregate statistics could store less data (described below). > The savings would be significant for datasets with many low-arity categorical > features (binary features, or unordered categorical features). Savings would > be negligible for continuous features. > DecisionTree stores a vector sufficient statistics for each (node, feature, > bin). We could store 1 fewer bin per (node, feature): For a given (node, > feature), if we store these vectors for all but the last bin, and also store > the total statistics for each node, then we could compute the statistics for > the last bin. For binary and unordered categorical features, this would cut > in half the number of bins to store and communicate. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-16957) Use weighted midpoints for split values.
[ https://issues.apache.org/jira/browse/SPARK-16957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15960177#comment-15960177 ] Yan Facai (颜发才) commented on SPARK-16957: - I think it is helpful for small datasets, while negligible for large datasets. The task is easy. However, is it needed? If the issue is shepherded, I'd like to work on it. > Use weighted midpoints for split values. > > > Key: SPARK-16957 > URL: https://issues.apache.org/jira/browse/SPARK-16957 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Vladimir Feinberg >Priority: Trivial > > Just like R's gbm, we should be using weighted split points rather than the > actual continuous binned feature values. For instance, in a dataset > containing binary features (that are fed in as continuous ones), our splits > are selected as {{x <= 0.0}} and {{x > 0.0}}. For any real data with some > smoothness qualities, this is asymptotically bad compared to GBM's approach. > The split point should be a weighted split point of the two values of the > "innermost" feature bins; e.g., if there are 30 {{x = 0}} and 10 {{x = 1}}, > the above split should be at {{0.75}}. > Example: > {code} > +++-+-+ > |feature0|feature1|label|count| > +++-+-+ > | 0.0| 0.0| 0.0| 23| > | 1.0| 0.0| 0.0|2| > | 0.0| 0.0| 1.0|2| > | 0.0| 1.0| 0.0|7| > | 1.0| 0.0| 1.0| 23| > | 0.0| 1.0| 1.0| 18| > | 1.0| 1.0| 1.0|7| > | 1.0| 1.0| 0.0| 18| > +++-+-+ > DecisionTreeRegressionModel (uid=dtr_01ae90d489b1) of depth 2 with 7 nodes > If (feature 0 <= 0.0) >If (feature 1 <= 0.0) > Predict: -0.56 >Else (feature 1 > 0.0) > Predict: 0.29333 > Else (feature 0 > 0.0) >If (feature 1 <= 0.0) > Predict: 0.56 >Else (feature 1 > 0.0) > Predict: -0.29333 > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
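One possible reading of the weighted split point in the issue description, chosen only because it reproduces the 0.75 from the 30-versus-10 example; whether this matches gbm's exact rule is an assumption.
{code:language=scala}
// Threshold between the two innermost bin values, weighted by the opposite bin's count:
// 30 samples at 0.0 and 10 samples at 1.0 give (30 * 1.0 + 10 * 0.0) / 40 = 0.75.
def weightedSplit(vLeft: Double, nLeft: Long, vRight: Double, nRight: Long): Double =
  (nLeft * vRight + nRight * vLeft) / (nLeft + nRight)

weightedSplit(0.0, 30L, 1.0, 10L) // 0.75, instead of the current threshold 0.0 (or a plain midpoint 0.5)
{code}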
[jira] [Commented] (SPARK-3159) Check for reducible DecisionTree
[ https://issues.apache.org/jira/browse/SPARK-3159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15948754#comment-15948754 ] Yan Facai (颜发才) commented on SPARK-3159: [~josephkb] Hi, is this JIRA still needed? I'd like to work on it. How about adding check and reduce methods on the LearningNode object? > Check for reducible DecisionTree > > > Key: SPARK-3159 > URL: https://issues.apache.org/jira/browse/SPARK-3159 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > Improvement: test-time computation > Currently, pairs of leaf nodes with the same parent can both output the same > prediction. This happens since the splitting criterion (e.g., Gini) is not > the same as prediction accuracy/MSE; the splitting criterion can sometimes be > improved even when both children would still output the same prediction > (e.g., based on the majority label for classification). > We could check the tree and reduce it if possible after training. > Note: This happens with scikit-learn as well. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
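A hypothetical sketch of the check-and-reduce pass on a toy node type (the real LearningNode/Node classes are private[ml], so the types here are stand-ins): after training, any internal node whose children are leaves with identical predictions collapses into a single leaf.
{code:language=scala}
sealed trait ToyNode { def prediction: Double }
case class Leaf(prediction: Double) extends ToyNode
case class Internal(prediction: Double, left: ToyNode, right: ToyNode) extends ToyNode

// Bottom-up reduction: reduce children first, then merge leaf pairs that predict the same value.
def reduce(node: ToyNode): ToyNode = node match {
  case Internal(p, l, r) =>
    (reduce(l), reduce(r)) match {
      case (Leaf(lp), Leaf(rp)) if lp == rp => Leaf(lp) // the split did not change the prediction
      case (nl, nr)                         => Internal(p, nl, nr)
    }
  case leaf => leaf
}

// reduce(Internal(0.6, Leaf(1.0), Leaf(1.0))) == Leaf(1.0)
{code}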
[jira] [Comment Edited] (SPARK-20043) CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" on ML random forest and decision. Only "gini" and "entropy" (in lower case) are accept
[ https://issues.apache.org/jira/browse/SPARK-20043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15939571#comment-15939571 ] Yan Facai (颜发才) edited comment on SPARK-20043 at 3/24/17 2:15 AM: -- The bug can be reproduced. I'd like to work on it. was (Author: facai): The bug can be reproduced by: ```scala test("cross validation with decision tree") { val dt = new DecisionTreeClassifier() val dtParamMaps = new ParamGridBuilder() .addGrid(dt.impurity, Array("Gini", "Entropy")) .build() val eval = new BinaryClassificationEvaluator val cv = new CrossValidator() .setEstimator(dt) .setEstimatorParamMaps(dtParamMaps) .setEvaluator(eval) .setNumFolds(3) val cvModel = cv.fit(dataset) // copied model must have the same paren. val cv2 = testDefaultReadWrite(cvModel, testParams = false) } ``` I'd like to work on it. > CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" > on ML random forest and decision. Only "gini" and "entropy" (in lower case) > are accepted > > > Key: SPARK-20043 > URL: https://issues.apache.org/jira/browse/SPARK-20043 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Zied Sellami > Labels: starter > > I saved a CrossValidatorModel with a decision tree and a random forest. I use > Paramgrid to test "gini" and "entropy" impurity. CrossValidatorModel are not > able to load the saved model, when impurity are written not in lowercase. I > obtain an error from Spark "impurity Gini (Entropy) not recognized. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20043) CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" on ML random forest and decision. Only "gini" and "entropy" (in lower case) are accepted
[ https://issues.apache.org/jira/browse/SPARK-20043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15939571#comment-15939571 ] Yan Facai (颜发才) commented on SPARK-20043: - The bug can be reproduced by: {code:language=scala} test("cross validation with decision tree") { val dt = new DecisionTreeClassifier() val dtParamMaps = new ParamGridBuilder() .addGrid(dt.impurity, Array("Gini", "Entropy")) .build() val eval = new BinaryClassificationEvaluator val cv = new CrossValidator() .setEstimator(dt) .setEstimatorParamMaps(dtParamMaps) .setEvaluator(eval) .setNumFolds(3) val cvModel = cv.fit(dataset) // copied model must have the same parent. val cv2 = testDefaultReadWrite(cvModel, testParams = false) } {code} I'd like to work on it. > CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" > on ML random forest and decision. Only "gini" and "entropy" (in lower case) > are accepted > > > Key: SPARK-20043 > URL: https://issues.apache.org/jira/browse/SPARK-20043 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Zied Sellami > Labels: starter > > I saved a CrossValidatorModel with a decision tree and a random forest. I use > Paramgrid to test "gini" and "entropy" impurity. CrossValidatorModel are not > able to load the saved model, when impurity are written not in lowercase. I > obtain an error from Spark "impurity Gini (Entropy) not recognized. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-20043) CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" on ML random forest and decision. Only "gini" and "entropy" (in lower case) are
[ https://issues.apache.org/jira/browse/SPARK-20043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Facai (颜发才) updated SPARK-20043: Comment: was deleted (was: [~zsellami] could you give an example of your code? I try to reproduce the bug, ```scala val dt = new DecisionTreeRegressor() val paramMaps = new ParamGridBuilder() .addGrid(dt.impurity, Array("Gini", "Entropy")) .build() ``` however, IiiegalArgumentException is thrown as Gini is not a valid parameter.) > CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" > on ML random forest and decision. Only "gini" and "entropy" (in lower case) > are accepted > > > Key: SPARK-20043 > URL: https://issues.apache.org/jira/browse/SPARK-20043 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Zied Sellami > Labels: starter > > I saved a CrossValidatorModel with a decision tree and a random forest. I use > Paramgrid to test "gini" and "entropy" impurity. CrossValidatorModel are not > able to load the saved model, when impurity are written not in lowercase. I > obtain an error from Spark "impurity Gini (Entropy) not recognized. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20043) CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" on ML random forest and decision. Only "gini" and "entropy" (in lower case) are accepted
[ https://issues.apache.org/jira/browse/SPARK-20043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15939560#comment-15939560 ] Yan Facai (颜发才) commented on SPARK-20043: - [~zsellami] Could you give an example of your code? I tried to reproduce the bug with {code:language=scala} val dt = new DecisionTreeRegressor() val paramMaps = new ParamGridBuilder() .addGrid(dt.impurity, Array("Gini", "Entropy")) .build() {code} however, an IllegalArgumentException is thrown because Gini is not a valid parameter. > CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" > on ML random forest and decision. Only "gini" and "entropy" (in lower case) > are accepted > > > Key: SPARK-20043 > URL: https://issues.apache.org/jira/browse/SPARK-20043 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Zied Sellami > Labels: starter > > I saved a CrossValidatorModel with a decision tree and a random forest. I use > Paramgrid to test "gini" and "entropy" impurity. CrossValidatorModel are not > able to load the saved model, when impurity are written not in lowercase. I > obtain an error from Spark "impurity Gini (Entropy) not recognized. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20043) CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" on ML random forest and decision. Only "gini" and "entropy" (in lower case) are accepted
[ https://issues.apache.org/jira/browse/SPARK-20043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15939533#comment-15939533 ] Yan Facai (颜发才) commented on SPARK-20043: - Perhaps it's better to convert the impurity value to lowercase when the setter method is invoked. > CrossValidatorModel loader does not recognize impurity "Gini" and "Entropy" > on ML random forest and decision. Only "gini" and "entropy" (in lower case) > are accepted > > > Key: SPARK-20043 > URL: https://issues.apache.org/jira/browse/SPARK-20043 > Project: Spark > Issue Type: Bug > Components: ML >Affects Versions: 2.1.0 >Reporter: Zied Sellami > Labels: starter > > I saved a CrossValidatorModel with a decision tree and a random forest. I use > Paramgrid to test "gini" and "entropy" impurity. CrossValidatorModel are not > able to load the saved model, when impurity are written not in lowercase. I > obtain an error from Spark "impurity Gini (Entropy) not recognized. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
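A minimal sketch of the normalization idea above (a hypothetical helper, not the actual Spark code): lowercase and validate the impurity string as soon as it is set, so values written by ParamGridBuilder as "Gini" / "Entropy" round-trip through save/load exactly like their lowercase forms.
{code:language=scala}
import java.util.Locale

// Hypothetical helper; in practice this logic would live in the impurity Param/setter.
def normalizeImpurity(value: String): String = {
  val v = value.toLowerCase(Locale.ROOT)
  require(Set("gini", "entropy", "variance").contains(v), s"impurity $value not recognized")
  v
}

normalizeImpurity("Gini")    // "gini"
normalizeImpurity("Entropy") // "entropy"
{code}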
[jira] [Commented] (SPARK-3728) RandomForest: Learn models too large to store in memory
[ https://issues.apache.org/jira/browse/SPARK-3728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15936208#comment-15936208 ] Yan Facai (颜发才) commented on SPARK-3728: RandomForest already uses a stack to hold nodes, as [~jgfidelis] said before. However, all trees are still kept in memory; see `topNodes`. Perhaps writing trees to disk is still needed if too many trees are trained. > RandomForest: Learn models too large to store in memory > --- > > Key: SPARK-3728 > URL: https://issues.apache.org/jira/browse/SPARK-3728 > Project: Spark > Issue Type: Improvement > Components: MLlib >Reporter: Joseph K. Bradley >Priority: Minor > > Proposal: Write trees to disk as they are learned. > RandomForest currently uses a FIFO queue, which means training all trees at > once via breadth-first search. Using a FILO queue would encourage the code > to finish one tree before moving on to new ones. This would allow the code > to write trees to disk as they are learned. > Note: It would also be possible to write nodes to disk as they are learned > using a FIFO queue, once the example--node mapping is cached [JIRA]. The > [Sequoia Forest package]() does this. However, it could be useful to learn > trees progressively, so that future functionality such as early stopping > (training fewer trees than expected) could be supported. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org