[jira] [Updated] (SPARK-21268) Move center calculations to a distributed map in KMeans
[ https://issues.apache.org/jira/browse/SPARK-21268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Guillaume Dardelet updated SPARK-21268:
---------------------------------------
    Description: 

While monitoring the performance of my algorithm with the Spark UI, I noticed a "collectAsMap" operation that was executed hundreds of times at every iteration of KMeans:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L295

It would work just as well to perform the following "foreach" on the RDD itself, which would slightly improve performance.

Edit: Per Sean Owen's recommendation, the scal() call and the VectorWithNorm creation should be computed in a distributed map before the collectAsMap.

  was:
While monitoring the performance of my algorithm with the Spark UI, I noticed a "collectAsMap" operation that was executed hundreds of times at every iteration of KMeans:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L295

It would work just as well to perform the following "foreach" on the RDD itself, which would slightly improve performance.

Edit:

> Move center calculations to a distributed map in KMeans
> ---
>
>                 Key: SPARK-21268
>                 URL: https://issues.apache.org/jira/browse/SPARK-21268
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1
>            Reporter: Guillaume Dardelet
>            Priority: Trivial
>              Labels: beginner, easyfix, newbie
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> While monitoring the performance of my algorithm with the Spark UI, I noticed a "collectAsMap" operation that was executed hundreds of times at every iteration of KMeans:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L295
> It would work just as well to perform the following "foreach" on the RDD itself, which would slightly improve performance.
> Edit: Per Sean Owen's recommendation, the scal() call and the VectorWithNorm creation should be computed in a distributed map before the collectAsMap.
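For context, a minimal Scala sketch of the proposed change, not the actual patch: it assumes the surrounding scope of KMeans.runAlgorithm (where Spark's private BLAS.scal helper and VectorWithNorm class are visible), and the names `contribs` and `mergeContribs` stand in for the per-center (vector sum, point count) aggregation already present around line 295 of KMeans.scala:

{code}
// Before: (sum, count) pairs are collected to the driver, which then
// divides each sum and computes each norm in a serial loop.
//
//   val totalContribs = contribs.reduceByKey(mergeContribs).collectAsMap()
//   val newCenters = totalContribs.map { case (j, (sum, count)) =>
//     scal(1.0 / count, sum)          // runs on the driver
//     (j, new VectorWithNorm(sum))    // runs on the driver
//   }

// After (sketch): the division and the VectorWithNorm construction run in
// a distributed map, so only the finished centers reach the driver.
val newCenters = contribs
  .reduceByKey(mergeContribs)
  .mapValues { case (sum, count) =>
    scal(1.0 / count, sum)    // executors scale each summed vector in place
    new VectorWithNorm(sum)   // executors compute the norm as well
  }
  .collectAsMap()
{code}

The amount of data shipped to the driver is essentially unchanged (one vector per center either way), so the gain is only the removed driver-side loop, which is why the improvement is expected to be slight.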
[jira] [Updated] (SPARK-21268) Move center calculations to a distributed map in KMeans
[ https://issues.apache.org/jira/browse/SPARK-21268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Guillaume Dardelet updated SPARK-21268:
---------------------------------------
    Summary: Move center calculations to a distributed map in KMeans  (was: Move some calculations to a distributed map in KMeans)

> Move center calculations to a distributed map in KMeans
> ---
>
>                 Key: SPARK-21268
>                 URL: https://issues.apache.org/jira/browse/SPARK-21268
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1
>            Reporter: Guillaume Dardelet
>            Priority: Trivial
>              Labels: beginner, easyfix, newbie
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> While monitoring the performance of my algorithm with the Spark UI, I noticed a "collectAsMap" operation that was executed hundreds of times at every iteration of KMeans:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L295
> It would work just as well to perform the following "foreach" on the RDD itself, which would slightly improve performance.
> Edit:
[jira] [Updated] (SPARK-21268) Move some calculations to a distributed map in KMeans
[ https://issues.apache.org/jira/browse/SPARK-21268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Guillaume Dardelet updated SPARK-21268:
---------------------------------------
    Description: 

While monitoring the performance of my algorithm with the Spark UI, I noticed a "collectAsMap" operation that was executed hundreds of times at every iteration of KMeans:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L295

It would work just as well to perform the following "foreach" on the RDD itself, which would slightly improve performance.

Edit:

  was:
While monitoring the performance of my algorithm with the Spark UI, I noticed a "collectAsMap" operation that was executed hundreds of times at every iteration of KMeans:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L295

It would work just as well to perform the following "foreach" on the RDD itself, which would slightly improve performance.

Thoughts?

> Move some calculations to a distributed map in KMeans
> ---
>
>                 Key: SPARK-21268
>                 URL: https://issues.apache.org/jira/browse/SPARK-21268
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1
>            Reporter: Guillaume Dardelet
>            Priority: Trivial
>              Labels: beginner, easyfix, newbie
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> While monitoring the performance of my algorithm with the Spark UI, I noticed a "collectAsMap" operation that was executed hundreds of times at every iteration of KMeans:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L295
> It would work just as well to perform the following "foreach" on the RDD itself, which would slightly improve performance.
> Edit:
[jira] [Updated] (SPARK-21268) Move some calculations to a distributed map in KMeans
[ https://issues.apache.org/jira/browse/SPARK-21268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Guillaume Dardelet updated SPARK-21268:
---------------------------------------
    Summary: Move some calculations to a distributed map in KMeans  (was: Redundant collectAsMap in KMeans)

> Move some calculations to a distributed map in KMeans
> ---
>
>                 Key: SPARK-21268
>                 URL: https://issues.apache.org/jira/browse/SPARK-21268
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1
>            Reporter: Guillaume Dardelet
>            Priority: Trivial
>              Labels: beginner, easyfix, newbie
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> While monitoring the performance of my algorithm with the Spark UI, I noticed a "collectAsMap" operation that was executed hundreds of times at every iteration of KMeans:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L295
> It would work just as well to perform the following "foreach" on the RDD itself, which would slightly improve performance.
> Thoughts?
[jira] [Commented] (SPARK-21268) Redundant collectAsMap in KMeans
[ https://issues.apache.org/jira/browse/SPARK-21268?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070177#comment-16070177 ]

Guillaume Dardelet commented on SPARK-21268:
--------------------------------------------

OK, I understand why it is useful now, thank you. As for performing the scal() call and the newCenter creation, should I create a new issue or reference this one in the pull request?

> Redundant collectAsMap in KMeans
> ---
>
>                 Key: SPARK-21268
>                 URL: https://issues.apache.org/jira/browse/SPARK-21268
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1
>            Reporter: Guillaume Dardelet
>            Priority: Trivial
>              Labels: beginner, easyfix, newbie
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> While monitoring the performance of my algorithm with the Spark UI, I noticed a "collectAsMap" operation that was executed hundreds of times at every iteration of KMeans:
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L295
> It would work just as well to perform the following "foreach" on the RDD itself, which would slightly improve performance.
> Thoughts?
[jira] [Created] (SPARK-21268) Redundant collectAsMap in KMeans
Guillaume Dardelet created SPARK-21268:
---------------------------------------

             Summary: Redundant collectAsMap in KMeans
                 Key: SPARK-21268
                 URL: https://issues.apache.org/jira/browse/SPARK-21268
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
    Affects Versions: 2.1.1, 2.1.0, 2.0.2, 2.0.1, 2.0.0
            Reporter: Guillaume Dardelet
            Priority: Trivial

While monitoring the performance of my algorithm with the Spark UI, I noticed a "collectAsMap" operation that was executed hundreds of times at every iteration of KMeans:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala#L295

It would work just as well to perform the following "foreach" on the RDD itself, which would slightly improve performance.

Thoughts?
[jira] [Comment Edited] (SPARK-12606) Scala/Java compatibility issue Re: how to extend java transformer from Scala UnaryTransformer ?
[ https://issues.apache.org/jira/browse/SPARK-12606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15955191#comment-15955191 ]

Guillaume Dardelet edited comment on SPARK-12606 at 6/1/17 9:55 AM:
--------------------------------------------------------------------

I had the same issue in Scala and I solved it by overloading the constructor so that it initialises the UID.

The error comes from the initialisation of the parameter "inputCol". You get "null__inputCol" because your class didn't have a uid yet when the parameter was initialised.

Therefore, instead of

{code}
class Lemmatizer extends UnaryTransformer[String, String, Lemmatizer] {
  override val uid: String = Identifiable.randomUID("lemmatizer")
  protected def createTransformFunc: String => String = ???
  protected def outputDataType: DataType = StringType
}
{code}

Do this:

{code}
class Lemmatizer(override val uid: String) extends UnaryTransformer[String, String, Lemmatizer] {
  def this() = this( Identifiable.randomUID("lemmatizer") )
  protected def createTransformFunc: String => String = ???
  protected def outputDataType: DataType = StringType
}
{code}

was (Author: panoramix):
I had the same issue in Scala and I solved it by overloading the constructor so that it initialises the UID.

The error comes from the initialisation of the parameter "inputCol". You get "null__inputCol" because your class didn't have a uid yet when the parameter was initialised.

Therefore, instead of

{code}
class Lemmatizer extends UnaryTransformer[String, String, Lemmatizer] {
  override val uid: String = Identifiable.randomUID("lemmatizer")
  protected def createTransformFunc: String) => String = ???
  protected def outputDataType: DataType = StringType
}
{code}

Do this:

{code}
class Lemmatizer(override val uid: String) extends UnaryTransformer[String, String, Lemmatizer] {
  def this() = this( Identifiable.randomUID("lemmatizer") )
  protected def createTransformFunc: String) => String = ???
  protected def outputDataType: DataType = StringType
}
{code}

> Scala/Java compatibility issue Re: how to extend java transformer from Scala UnaryTransformer?
> ---
>
>                 Key: SPARK-12606
>                 URL: https://issues.apache.org/jira/browse/SPARK-12606
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 1.5.2
>         Environment: Java 8, Mac OS, Spark-1.5.2
>            Reporter: Andrew Davidson
>              Labels: transformers
>
> Hi Andy,
> I suspect that you hit the Scala/Java compatibility issue; I can also reproduce it, so could you file a JIRA to track this issue?
> Yanbo
>
> 2016-01-02 3:38 GMT+08:00 Andy Davidson:
> I am trying to write a trivial transformer to use in my pipeline. I am using Java and Spark 1.5.2. It was suggested that I use the Tokenizer.scala class as an example. This should be very easy, however I do not understand Scala and I am having trouble debugging the following exception.
> Any help would be greatly appreciated.
> Happy New Year
> Andy
>
> java.lang.IllegalArgumentException: requirement failed: Param null__inputCol does not belong to Stemmer_2f3aa96d-7919-4eaa-ad54-f7c620b92d1c.
>         at scala.Predef$.require(Predef.scala:233)
>         at org.apache.spark.ml.param.Params$class.shouldOwn(params.scala:557)
>         at org.apache.spark.ml.param.Params$class.set(params.scala:436)
>         at org.apache.spark.ml.PipelineStage.set(Pipeline.scala:37)
>         at org.apache.spark.ml.param.Params$class.set(params.scala:422)
>         at org.apache.spark.ml.PipelineStage.set(Pipeline.scala:37)
>         at org.apache.spark.ml.UnaryTransformer.setInputCol(Transformer.scala:83)
>         at com.pws.xxx.ml.StemmerTest.test(StemmerTest.java:30)
>
> public class StemmerTest extends AbstractSparkTest {
>     @Test
>     public void test() {
>         Stemmer stemmer = new Stemmer()
>             .setInputCol("raw")    // line 30
>             .setOutputCol("filtered");
>     }
> }
>
> /**
>  * @see spark-1.5.1/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala
>  * @see https://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/
>  * @see http://www.tonytruong.net/movie-rating-prediction-with-apache-spark-and-hortonworks/
>  *
>  * @author andrewdavidson
>  */
> public class Stemmer extends UnaryTransformer<List<String>, List<String>, Stemmer> implements Serializable {
>     static Logger logger = LoggerFactory.getLogger(Stemmer.class);
>     private static final long serialVersionUID = 1L;
>     private static final ArrayType inputType = DataTypes.createArrayType(DataTypes.StringType, true);
>     private final String uid = Stemmer.class.getSimpleName() + "_" + UUID.randomUUID().toString();
>
>     @Override
>     public
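For completeness, a self-contained Scala sketch of the corrected pattern from the comment above. The lemmatizer name and the identity transform are placeholders, and the copy override is an added assumption (some Spark versions require subclasses to provide it; where UnaryTransformer already defines it, the override is harmless):

{code}
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{DataType, StringType}

// The uid is a primary-constructor parameter, so it is assigned before
// UnaryTransformer initialises its Params; the no-arg auxiliary
// constructor generates a random uid for everyday use.
class Lemmatizer(override val uid: String)
    extends UnaryTransformer[String, String, Lemmatizer] {

  def this() = this(Identifiable.randomUID("lemmatizer"))

  // Placeholder for real lemmatization logic.
  override protected def createTransformFunc: String => String = identity

  override protected def outputDataType: DataType = StringType

  // Assumption: needed on some Spark versions; delegates to defaultCopy.
  override def copy(extra: ParamMap): Lemmatizer = defaultCopy(extra)
}

// Because the params are now created with a real uid instead of null,
// the ownership check inside setInputCol passes:
val lemmatizer = new Lemmatizer()
  .setInputCol("raw")
  .setOutputCol("filtered")
{code}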
[jira] [Comment Edited] (SPARK-12606) Scala/Java compatibility issue Re: how to extend java transformer from Scala UnaryTransformer ?
[ https://issues.apache.org/jira/browse/SPARK-12606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15955191#comment-15955191 ]

Guillaume Dardelet edited comment on SPARK-12606 at 4/4/17 2:27 PM:
--------------------------------------------------------------------

I had the same issue in Scala and I solved it by overloading the constructor so that it initialises the UID.

The error comes from the initialisation of the parameter "inputCol". You get "null__inputCol" because your class didn't have a uid yet when the parameter was initialised.

Therefore, instead of

{code:scala}
class Lemmatizer extends UnaryTransformer[String, String, Lemmatizer] {
  override val uid: String = Identifiable.randomUID("lemmatizer")
  protected def createTransformFunc: String => String = ???
  protected def outputDataType: DataType = StringType
}
{code}

Do this:

class Lemmatizer(override val uid: String) extends UnaryTransformer[String, String, Lemmatizer] {
  def this() = this( Identifiable.randomUID("lemmatizer") )
  protected def createTransformFunc: String => String = ???
  protected def outputDataType: DataType = StringType
}

was (Author: panoramix):
I had the same issue in Scala and I solved it by overloading the constructor so that it initialises the UID.

The error comes from the initialisation of the parameter "inputCol". You get "null__inputCol" because your class didn't have a uid yet when the parameter was initialised.

Therefore, instead of

class Lemmatizer extends UnaryTransformer[String, String, Lemmatizer] {
  override val uid: String = Identifiable.randomUID("lemmatizer")
  protected def createTransformFunc: String) => String = ???
  protected def outputDataType: DataType = StringType
}

Do this:

class Lemmatizer(override val uid: String) extends UnaryTransformer[String, String, Lemmatizer] {
  def this() = this( Identifiable.randomUID("lemmatizer") )
  protected def createTransformFunc: String) => String = ???
  protected def outputDataType: DataType = StringType
}

> Scala/Java compatibility issue Re: how to extend java transformer from Scala UnaryTransformer?
> ---
>
>                 Key: SPARK-12606
>                 URL: https://issues.apache.org/jira/browse/SPARK-12606
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 1.5.2
>         Environment: Java 8, Mac OS, Spark-1.5.2
>            Reporter: Andrew Davidson
>              Labels: transformers
>
> Hi Andy,
> I suspect that you hit the Scala/Java compatibility issue; I can also reproduce it, so could you file a JIRA to track this issue?
> Yanbo
>
> 2016-01-02 3:38 GMT+08:00 Andy Davidson:
> I am trying to write a trivial transformer to use in my pipeline. I am using Java and Spark 1.5.2. It was suggested that I use the Tokenizer.scala class as an example. This should be very easy, however I do not understand Scala and I am having trouble debugging the following exception.
> Any help would be greatly appreciated.
> Happy New Year
> Andy
>
> java.lang.IllegalArgumentException: requirement failed: Param null__inputCol does not belong to Stemmer_2f3aa96d-7919-4eaa-ad54-f7c620b92d1c.
>         at scala.Predef$.require(Predef.scala:233)
>         at org.apache.spark.ml.param.Params$class.shouldOwn(params.scala:557)
>         at org.apache.spark.ml.param.Params$class.set(params.scala:436)
>         at org.apache.spark.ml.PipelineStage.set(Pipeline.scala:37)
>         at org.apache.spark.ml.param.Params$class.set(params.scala:422)
>         at org.apache.spark.ml.PipelineStage.set(Pipeline.scala:37)
>         at org.apache.spark.ml.UnaryTransformer.setInputCol(Transformer.scala:83)
>         at com.pws.xxx.ml.StemmerTest.test(StemmerTest.java:30)
>
> public class StemmerTest extends AbstractSparkTest {
>     @Test
>     public void test() {
>         Stemmer stemmer = new Stemmer()
>             .setInputCol("raw")    // line 30
>             .setOutputCol("filtered");
>     }
> }
>
> /**
>  * @see spark-1.5.1/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala
>  * @see https://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/
>  * @see http://www.tonytruong.net/movie-rating-prediction-with-apache-spark-and-hortonworks/
>  *
>  * @author andrewdavidson
>  */
> public class Stemmer extends UnaryTransformer<List<String>, List<String>, Stemmer> implements Serializable {
>     static Logger logger = LoggerFactory.getLogger(Stemmer.class);
>     private static final long serialVersionUID = 1L;
>     private static final ArrayType inputType = DataTypes.createArrayType(DataTypes.StringType, true);
>     private final String uid = Stemmer.class.getSimpleName() + "_" + UUID.randomUUID().toString();
>
>     @Override
>     public String uid() {
>         return u
[jira] [Comment Edited] (SPARK-12606) Scala/Java compatibility issue Re: how to extend java transformer from Scala UnaryTransformer ?
[ https://issues.apache.org/jira/browse/SPARK-12606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15955191#comment-15955191 ]

Guillaume Dardelet edited comment on SPARK-12606 at 4/4/17 2:27 PM:
--------------------------------------------------------------------

I had the same issue in Scala and I solved it by overloading the constructor so that it initialises the UID.

The error comes from the initialisation of the parameter "inputCol". You get "null__inputCol" because your class didn't have a uid yet when the parameter was initialised.

Therefore, instead of

{code}
class Lemmatizer extends UnaryTransformer[String, String, Lemmatizer] {
  override val uid: String = Identifiable.randomUID("lemmatizer")
  protected def createTransformFunc: String => String = ???
  protected def outputDataType: DataType = StringType
}
{code}

Do this:

{code}
class Lemmatizer(override val uid: String) extends UnaryTransformer[String, String, Lemmatizer] {
  def this() = this( Identifiable.randomUID("lemmatizer") )
  protected def createTransformFunc: String => String = ???
  protected def outputDataType: DataType = StringType
}
{code}

was (Author: panoramix):
I had the same issue in Scala and I solved it by overloading the constructor so that it initialises the UID.

The error comes from the initialisation of the parameter "inputCol". You get "null__inputCol" because your class didn't have a uid yet when the parameter was initialised.

Therefore, instead of

{code:scala}
class Lemmatizer extends UnaryTransformer[String, String, Lemmatizer] {
  override val uid: String = Identifiable.randomUID("lemmatizer")
  protected def createTransformFunc: String) => String = ???
  protected def outputDataType: DataType = StringType
}
{code}

Do this:

class Lemmatizer(override val uid: String) extends UnaryTransformer[String, String, Lemmatizer] {
  def this() = this( Identifiable.randomUID("lemmatizer") )
  protected def createTransformFunc: String) => String = ???
  protected def outputDataType: DataType = StringType
}

> Scala/Java compatibility issue Re: how to extend java transformer from Scala UnaryTransformer?
> ---
>
>                 Key: SPARK-12606
>                 URL: https://issues.apache.org/jira/browse/SPARK-12606
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 1.5.2
>         Environment: Java 8, Mac OS, Spark-1.5.2
>            Reporter: Andrew Davidson
>              Labels: transformers
>
> Hi Andy,
> I suspect that you hit the Scala/Java compatibility issue; I can also reproduce it, so could you file a JIRA to track this issue?
> Yanbo
>
> 2016-01-02 3:38 GMT+08:00 Andy Davidson:
> I am trying to write a trivial transformer to use in my pipeline. I am using Java and Spark 1.5.2. It was suggested that I use the Tokenizer.scala class as an example. This should be very easy, however I do not understand Scala and I am having trouble debugging the following exception.
> Any help would be greatly appreciated.
> Happy New Year
> Andy
>
> java.lang.IllegalArgumentException: requirement failed: Param null__inputCol does not belong to Stemmer_2f3aa96d-7919-4eaa-ad54-f7c620b92d1c.
>         at scala.Predef$.require(Predef.scala:233)
>         at org.apache.spark.ml.param.Params$class.shouldOwn(params.scala:557)
>         at org.apache.spark.ml.param.Params$class.set(params.scala:436)
>         at org.apache.spark.ml.PipelineStage.set(Pipeline.scala:37)
>         at org.apache.spark.ml.param.Params$class.set(params.scala:422)
>         at org.apache.spark.ml.PipelineStage.set(Pipeline.scala:37)
>         at org.apache.spark.ml.UnaryTransformer.setInputCol(Transformer.scala:83)
>         at com.pws.xxx.ml.StemmerTest.test(StemmerTest.java:30)
>
> public class StemmerTest extends AbstractSparkTest {
>     @Test
>     public void test() {
>         Stemmer stemmer = new Stemmer()
>             .setInputCol("raw")    // line 30
>             .setOutputCol("filtered");
>     }
> }
>
> /**
>  * @see spark-1.5.1/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala
>  * @see https://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/
>  * @see http://www.tonytruong.net/movie-rating-prediction-with-apache-spark-and-hortonworks/
>  *
>  * @author andrewdavidson
>  */
> public class Stemmer extends UnaryTransformer<List<String>, List<String>, Stemmer> implements Serializable {
>     static Logger logger = LoggerFactory.getLogger(Stemmer.class);
>     private static final long serialVersionUID = 1L;
>     private static final ArrayType inputType = DataTypes.createArrayType(DataTypes.StringType, true);
>     private final String uid = Stemmer.class.getSimpleName() + "_" + UUID.randomUUID().toString();
>
>     @Override
>     public Strin
[jira] [Commented] (SPARK-12606) Scala/Java compatibility issue Re: how to extend java transformer from Scala UnaryTransformer ?
[ https://issues.apache.org/jira/browse/SPARK-12606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15955191#comment-15955191 ]

Guillaume Dardelet commented on SPARK-12606:
--------------------------------------------

I had the same issue in Scala and I solved it by overloading the constructor so that it initialises the UID.

The error comes from the initialisation of the parameter "inputCol". You get "null__inputCol" because your class didn't have a uid yet when the parameter was initialised.

Therefore, instead of

class Lemmatizer extends UnaryTransformer[String, String, Lemmatizer] {
  override val uid: String = Identifiable.randomUID("lemmatizer")
  protected def createTransformFunc: String => String = ???
  protected def outputDataType: DataType = StringType
}

Do this:

class Lemmatizer(override val uid: String) extends UnaryTransformer[String, String, Lemmatizer] {
  def this() = this( Identifiable.randomUID("lemmatizer") )
  protected def createTransformFunc: String => String = ???
  protected def outputDataType: DataType = StringType
}

> Scala/Java compatibility issue Re: how to extend java transformer from Scala UnaryTransformer?
> ---
>
>                 Key: SPARK-12606
>                 URL: https://issues.apache.org/jira/browse/SPARK-12606
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 1.5.2
>         Environment: Java 8, Mac OS, Spark-1.5.2
>            Reporter: Andrew Davidson
>              Labels: transformers
>
> Hi Andy,
> I suspect that you hit the Scala/Java compatibility issue; I can also reproduce it, so could you file a JIRA to track this issue?
> Yanbo
>
> 2016-01-02 3:38 GMT+08:00 Andy Davidson:
> I am trying to write a trivial transformer to use in my pipeline. I am using Java and Spark 1.5.2. It was suggested that I use the Tokenizer.scala class as an example. This should be very easy, however I do not understand Scala and I am having trouble debugging the following exception.
> Any help would be greatly appreciated.
> Happy New Year
> Andy
>
> java.lang.IllegalArgumentException: requirement failed: Param null__inputCol does not belong to Stemmer_2f3aa96d-7919-4eaa-ad54-f7c620b92d1c.
>         at scala.Predef$.require(Predef.scala:233)
>         at org.apache.spark.ml.param.Params$class.shouldOwn(params.scala:557)
>         at org.apache.spark.ml.param.Params$class.set(params.scala:436)
>         at org.apache.spark.ml.PipelineStage.set(Pipeline.scala:37)
>         at org.apache.spark.ml.param.Params$class.set(params.scala:422)
>         at org.apache.spark.ml.PipelineStage.set(Pipeline.scala:37)
>         at org.apache.spark.ml.UnaryTransformer.setInputCol(Transformer.scala:83)
>         at com.pws.xxx.ml.StemmerTest.test(StemmerTest.java:30)
>
> public class StemmerTest extends AbstractSparkTest {
>     @Test
>     public void test() {
>         Stemmer stemmer = new Stemmer()
>             .setInputCol("raw")    // line 30
>             .setOutputCol("filtered");
>     }
> }
>
> /**
>  * @see spark-1.5.1/mllib/src/main/scala/org/apache/spark/ml/feature/Tokenizer.scala
>  * @see https://chimpler.wordpress.com/2014/06/11/classifiying-documents-using-naive-bayes-on-apache-spark-mllib/
>  * @see http://www.tonytruong.net/movie-rating-prediction-with-apache-spark-and-hortonworks/
>  *
>  * @author andrewdavidson
>  */
> public class Stemmer extends UnaryTransformer<List<String>, List<String>, Stemmer> implements Serializable {
>     static Logger logger = LoggerFactory.getLogger(Stemmer.class);
>     private static final long serialVersionUID = 1L;
>     private static final ArrayType inputType = DataTypes.createArrayType(DataTypes.StringType, true);
>     private final String uid = Stemmer.class.getSimpleName() + "_" + UUID.randomUUID().toString();
>
>     @Override
>     public String uid() {
>         return uid;
>     }
>
>     /*
>     override protected def validateInputType(inputType: DataType): Unit = {
>         require(inputType == StringType, s"Input type must be string type but got $inputType.")
>     }
>     */
>     @Override
>     public void validateInputType(DataType inputTypeArg) {
>         String msg = "inputType must be " + inputType.simpleString() + " but got " + inputTypeArg.simpleString();
>         assert (inputType.equals(inputTypeArg)) : msg;
>     }
>
>     @Override
>     public Function1<List<String>, List<String>> createTransformFunc() {
>         // http://stackoverflow.com/questions/6545066/using-scala-from-java-passing-functions-as-parameters
>         Function1<List<String>, List<String>> f = new AbstractFunction1<List<String>, List<String>>() {
>             public List<String> apply(List<String> words) {
>                 for (String word : words) {
>                     logger.error("AEDWIP input word: {}", word);
>                 }
>                 return wor